
Lexical Approaches to Backoff in

Statistical Parsing

Corrin Lakeland

Sapere Aude

a thesis submitted for the degree of

Doctor of Philosophy

at the University of Otago, Dunedin,

New Zealand.

31 May 2005


Abstract

This thesis is an investigation of methods for improving the accuracy of a sta-

tistical parser. A statistical parser uses a probabilistic grammar derived from a

training corpus of hand-parsed sentences. The grammar is represented as a set of

constructions — in a simple case these might be context-free rules. The probabil-

ity of each construction in the grammar is then estimated by counting its relative

frequency in the corpus.

A crucial problem when building a probabilistic grammar is to select an appro-

priate level of granularity for describing the constructions being learned. The

more constructions we include in our grammar, the more sophisticated a model

of the language we produce. However, if too many different constructions are

included, then our corpus is unlikely to contain reliable information about the

relative frequency of many constructions.

In existing statistical parsers two main approaches have been taken to choosing

an appropriate granularity. In a non-lexicalised parser constructions are speci-

fied as structures involving particular parts-of-speech, thereby abstracting over

individual words. Thus, in the training corpus two syntactic structures involving

the same parts-of-speech but different words would be treated as two instances

of the same event. In a lexicalised grammar the assumption is that the individ-

ual words in a sentence carry information about its syntactic analysis over and

above what is carried by its part-of-speech tags. Lexicalised grammars have the

potential to provide extremely detailed syntactic analyses; however, Zipf’s law

makes it hard for such grammars to be learned.

In this thesis, we propose a method for optimising the trade-off between infor-

mative and learnable constructions in statistical parsing. We implement a gram-

mar which works at a level of granularity in between single words and parts-

of-speech, by grouping words together using unsupervised clustering based on

bigram statistics. We begin by implementing a statistical parser to serve as the

basis for our experiments. The parser, based on that of Michael Collins (1999),

contains a number of new features of general interest. We then implement a

model of word clustering, which we believe is the first to deliver vector-based

word representations for an arbitrarily large lexicon. Finally, we describe a series

of experiments in which the statistical parser is trained using categories based on

these word representations.


Acknowledgements

It is difficult to overstate my gratitude to my Ph.D. supervisor, Dr. Alistair Knott.

While I always enjoyed working on my thesis, it was your drive that pushed me

to complete it. Especially throughout my writing period, you provided encour-

agement, sound advice, and an infinite supply of patience. I would never have

finished without you.

I would also like to thank my assistant supervisor, Dr. Richard O’Keefe. No

matter how difficult my problem, you always helped me out. To my surrogate

supervisor, Dr. Peter Andreae. You accepted me turning up in your office as

another student to look after, and still provided encouragement, guidance and

direction with good humour. Similarly, to the many people who helped me when

I was out of my depth: Simon McCallum, Dr. Anthony Robins, Dr. Marcus Frean,

Dr. Michael Albert, Dr. Karsten Worm and Dr. Joshua Goodman.

To my wife, Andrea, for always being there but never asking why it was taking

so long, or when I would start earning some money... I love you. And to my

parents, who put up with me saying “I’ll finish next year” more times than I care

to recount, while providing encouragement and support, quite literally at the

end by accepting the job of last-minute thesis editors with far more grace than

I would have.

To Nathan Rountree and Dr. Janet Rountree, for support and many shared meals

over the years. For showing that after spending a week in the debugger it is

great to just sit back, forget about my thesis, and enjoy a glass of wine. Also,

to Nathan, for the many hours we chatted about data mining and other areas of

computer science, and to Janet for being the only example I know of someone

who actually finished a Ph.D.

I am indebted to the many students who kept me company as I whiled away the

years: Peter Vlugter, Richard Mansfield, Andrew Webb, Samson De Jager, Mike

Liddle, Tom Eastman, Robin Sheat, Pont Lurcock, Chris Monteith, and Alicia

Monteith. Without the tea parties, innumerable games of croquet, cricket, xrisk,

and petanque I would have tired of the student life years ago. Of course, we

didn't only play games: I'm sure the snack-box really did need a MySQL database

backend, though it currently escapes me why we all sat down one day and wrote

a Markov-based sentence generator.


Contents

Contents
List of Tables
List of Figures

1 Introduction
   1.1 What is parsing for?
   1.2 Statistical parsing and its problems
   1.3 Main aims of the thesis
   1.4 Overview of the thesis

2 Statistical parsing
   2.1 Deterministic grammars
      2.1.1 Lexical heads
      2.1.2 HPSG and subcategorisation lists
   2.2 Deterministic parsing algorithms
      2.2.1 Chart parsing
      2.2.2 Problems with deterministic parsing
   2.3 Probabilistic grammars and corpus-based NLP
      2.3.1 Building a corpus of hand-parsed sentences
      2.3.2 Evaluating parser performance
   2.4 Probabilistic grammar formalisms
      2.4.1 Lexical semantics in probabilistic grammars
      2.4.2 Black et al.
      2.4.3 Exhaustive grammars: Bod and Scha's approach
      2.4.4 Klein and Manning's statistical parser
   2.5 Backoff, interpolation and smoothing
      2.5.1 Backoff and interpolation
      2.5.2 Smoothing
      2.5.3 Combined interpolation and smoothing techniques
   2.6 Probabilistic parsing algorithms
      2.6.1 Inside and outside probabilities
      2.6.2 Parsing as state space navigation
      2.6.3 Viterbi optimisation
   2.7 Three statistical parsers
      2.7.1 Klein and Manning's statistical parser
      2.7.2 Bod's statistical parser
      2.7.3 Collins' statistical parser
   2.8 Summary and future direction

3 A description of Collins' parser
   3.1 Collins' grammar formalism and probability model
      3.1.1 New nonterminal categories: NPB and TOP
      3.1.2 A distance metric: adjacency and verbs
      3.1.3 Preprocessing the Penn treebank
      3.1.4 Collins' event representation
      3.1.5 Backoff and interpolation
      3.1.6 Smoothing
   3.2 Collins' parsing algorithm
      3.2.1 Dependency productions, and the use of a reference grammar
      3.2.2 Unary productions
      3.2.3 Search strategy in Collins' parsing algorithm
      3.2.4 Summary of Collins' parsing algorithm

4 A reimplementation of Collins' parser
   4.1 The complexity of Collins' parsing algorithm
   4.2 Implementation of the treebank preprocessor
   4.3 Implementation of the probability model
   4.4 Implementation of a POS tagger
      4.4.1 The relationship between POS tagging and lexicalised statistical parsing
      4.4.2 Part of speech tagging using hidden Markov models
      4.4.3 Implementation Details
      4.4.4 Results
   4.5 Implementation of the chart
   4.6 Implementing add singles stops and beam search
      4.6.1 Beam Search
      4.6.2 Skiplists for implementing beam search
   4.7 Some software engineering lessons learned
      4.7.1 Programming languages for statistical parsing
      4.7.2 Revision control
      4.7.3 Efficiency and debuggability
      4.7.4 Debugging methodology and test suites
      4.7.5 Naming of variables and parameters
   4.8 Results of the parser
      4.8.1 A re-evaluation of Collins' parser: precision and recall
      4.8.2 Evaluation of my preprocessor and parser: precision and recall
      4.8.3 The complexity of Collins' and my parsers
      4.8.4 Evaluation of my parser with my new POS tagger
      4.8.5 An analysis of the errors in Collins' parser
   4.9 Summary

5 Thesaurus-based word representation
   5.1 An example of the benefits of grouping similar words
   5.2 Criteria for semantic relatedness measures
      5.2.1 Attention to infrequently occurring words
      5.2.2 Multidimensional representations of word semantics
   5.3 A survey of approaches for computing semantic similarity between words
      5.3.1 Hand-generated thesauri: WordNet and Roget
      5.3.2 Unsupervised methods for thesaurus generation
      5.3.3 Finch
      5.3.4 Brown et al.
      5.3.5 Smrz and Rychly
      5.3.6 Lin
      5.3.7 Elman/Miikkulainen/Liddle
      5.3.8 Bengio
      5.3.9 Honkela (Self Organising Maps)
      5.3.10 Joachims (Support Vector Machines)
      5.3.11 Schutze
   5.4 Summary

6 A derivation of word vectors
   6.1 Obtaining a training corpus: Tipster and Gutenberg
   6.2 Preparing the corpus for clustering
      6.2.1 Processing the corpus
      6.2.2 Off-the-shelf tools for clustering: a brief survey
      6.2.3 Dealing with large matrices
   6.3 An implementation of Schutze's algorithm for word clustering
      6.3.1 Building a table of bigram counts
      6.3.2 Normalising the bigram table
      6.3.3 The PCA algorithm
   6.4 Tuning the clustering process
      6.4.1 Evaluation methodology
      6.4.2 Dimensions of the bigram matrix
      6.4.3 Normalising bigram vectors
      6.4.4 Choice of feature words
      6.4.5 Window size
      6.4.6 Iterated clustering
      6.4.7 Integrating POS tag representations
      6.4.8 Windows revisited
   6.5 Results
      6.5.1 Results for the first four thousand words
      6.5.2 Results for the second four thousand words
      6.5.3 Results for the last four thousand words
   6.6 Summary

7 Improving backoff using word representations
   7.1 Feasibility study: Noise in backoff
   7.2 Parsing by grouping nearest-neighbour words
      7.2.1 Integrating neighbours in parsing
      7.2.2 Reversing the neighbours
      7.2.3 How to select a group of neighbours for a word
      7.2.4 Avoiding swamping counts
      7.2.5 Summary
      7.2.6 Results and discussion
   7.3 Parsing using a neural network probability model
   7.4 Cascade Correlation
   7.5 Testing the vector representation of words
      7.5.1 Mapping words to words
      7.5.2 Evaluation
   7.6 A vector representation of tags
   7.7 A vector representation of nonterminals
   7.8 Neural network design
      7.8.1 Training data
      7.8.2 Neural network parameters
   7.9 Training the tag network
      7.9.1 The initial tag network
      7.9.2 Network architecture
      7.9.3 Training data
      7.9.4 Conclusion
   7.10 Training the other networks
      7.10.1 Training the prior network
      7.10.2 Training the top network
      7.10.3 Training the unary network
      7.10.4 Training the subcategorisation network
      7.10.5 Training the dependency network
   7.11 Final evaluation

8 Conclusion
   8.1 Summary
      8.1.1 Implementing Collins' 1997 parser
      8.1.2 Word and nonterminal representations
      8.1.3 Experiments in using word vectors for backoff
   8.2 Further work
      8.2.1 Reimplementing Collins
      8.2.2 Word vectors
      8.2.3 Backoff
      8.2.4 Using Maximum Entropy methods instead of a neural network
      8.2.5 Using a different parser
   8.3 Concluding remarks

References

A Tags and Nonterminals used
   A.1 Tags
   A.2 Nonterminals

B Code specifications for my parser
   B.1 Data structures
   B.2 The node data structure
   B.3 The beam data structure

C Relevant source code
   C.1 Build script
   C.2 R scripts
   C.3 Funnelweb code
   C.4 Processing the treebank
      C.4.1 Transforming the corpus
      C.4.2 Deriving a grammar
   C.5 Processing bigrams
      C.5.1 Counting bigrams
      C.5.2 Scaling bigrams

Glossary

Index

List of Tables

2.1 Context-free rules needed to parse The cat sat on the big brown mat.
2.2 A context-sensitive rule using features for number agreement. Number is a variable that denotes singular and plural.
2.3 Some example (simple) HPSG rules
2.4 Frequencies of rules in the simple corpus
2.5 Probabilities of rules in the simple corpus
2.6 Klein and Manning's parsing accuracy and grammar size for different model complexities
3.1 Collins' unary and subcat events
3.2 Collins' dependency events
3.3 Collins' TOP events
3.4 Collins' Prior events
4.1 Part of Speech event representation
4.2 Actual position of the tag that should be in first position
4.3 Results from Collins' 1997 parser including my code hooks
4.4 My evaluation of the parser in Collins' thesis (Collins, 1999)
4.5 Results from my parser using Collins' preprocessor
4.6 Results from my parser using Collins' output as a gold standard
4.7 A selection of correctly parsed sentences
4.8 A selection of poorly parsed sentences
5.1 An example of bigram counts
6.1 Bigram counts in the process of being normalised
6.2 Parameters chosen for the generation of word vectors
6.3 A sample of nearest-neighbour words from the first four thousand words
6.4 A sample of nearest-neighbour words from the second four thousand words
6.5 A sample of nearest-neighbour words from the last four thousand words
7.1 Performance of Collins' 1996 parser over Section 23 before and after integrating neighbour information
7.2 Performance of Collins' 1996 parser over a sub-corpus of two hundred sentences containing rare verbs, before and after integrating tag information
7.3 Performance of Collins' 1999 parser over Section 23 before and after integrating neighbour information
7.4 Performance of Collins' 1999 parser over a sub-corpus of two hundred sentences containing rare verbs, before and after integrating tag information
7.5 Learned mapping of words to words from four hundred words
7.6 Evaluation of network generalisation after learning from the first four hundred words
7.7 Hand encoded categories for POS tags
7.8 Tagger accuracy as hidden units are added to the neural network
7.9 Tagger accuracy as extra training data is provided to the neural network
7.10 Tagger accuracy as extra training data is incrementally provided to the neural network
7.11 Performance of different taggers using half unique and half duplicate training data
A.1 Tags related to symbols
A.2 POS tags used for nouns
A.3 POS tags used for verbs
A.4 POS tags used for adjectives
A.5 POS tags used for pronouns
A.6 Other POS tags
A.7 The main nonterminal categories
B.1 Data structure for phrases
B.2 Member functions for phrases
B.3 High level API for the beam

List of Figures

1.1 An example parse tree
2.1 Example parse tree
2.2 A chart parser after parsing the large can. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.
2.3 A chart parser after parsing the large can can hold. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.
2.4 Some analyses which a wide-coverage grammar should include
2.5 A dubious parse by a deterministic grammar
2.6 Example of Probabilistic CFG rules
2.7 Example sentence from the Penn treebank
2.8 Two phrases showing that WSJ phrases contain little attachment information
2.9 A simple corpus of hand-parsed sentences
2.10 Parse trees with associated prior and conditional probabilities for John saw Mary
2.11 Syntactically valid but unlikely parse of "The man saw the dog with the telescope."
2.12 Likely parse of "The man saw the dog with the telescope."
2.13 Head and sibling productions
2.14 A simple lexicalised parse tree
2.15 Sample representation of "with a list" in the HBG model, taken from Black, Jelinek, Lafferty, Magerman, Mercer, and Roukos (1992)
2.16 Sample DOP grammar for a tiny corpus
2.17 Partial parse showing the different areas examined by the inside and the outside probabilities
2.18 Two alternate interpretations of saw the girl with the telescope, showing the effect of the Viterbi optimisation
2.19 Using DOP to parse Mary likes
2.20 Idealised pseudocode for Bod's statistical parser
3.1 Conversion from a WSJ style tree to head driven
3.2 Collins' representation of a left production event
3.3 Collins' smoothing function as implemented
3.4 Simplest possible chart parser pseudocode
3.5 Pseudocode for combine
3.6 Simple example showing how an (incomplete) VP can have a (complete) NP-C added as a right sibling.
3.7 Pseudocode for joining two edges (dependency events).
3.8 Simple example showing the steps in building an NP constituent the man.
3.9 Pseudocode for add singles. The previous parent is demoted to a head, and new parents are generated.
3.10 Pseudocode for add stop
3.11 Pseudocode for add singles stops
3.12 Collins' parsing algorithm
4.1 Simplified data flow diagram for my implementation of Collins' parser
4.2 Actual high-level code for preprocessing the treebank
4.3 Pseudocode to implement Magerman's headword algorithm
4.4 The tagger's control structure
4.5 Histogram of the probability assigned to the correct tag, in the cases where the tagger chooses a wrong tag as best
4.6 Code for a beam search
4.7 A simple skiplist showing the first sixteen items
4.8 Code to find the highest node in a skiplist with priority ≤ n
4.9 Code for inserting a node into a skiplist
4.10 Time taken by the skiplist to insert random elements with different beam sizes
4.11 Scatter-plot of time taken by my parser to parse sentences of different lengths
4.12 Scatter-plot of log(time) versus log(sentence length) — the gradient is the parser's complexity
4.13 Parsing accuracy versus sentence length.
4.14 Two parse trees showing that changing 'bore' to 'fool' corrects the parse.
4.15 Rank of the sentence's least frequent head word versus parse accuracy
5.1 A figure from Finch's thesis showing the internal structure from several parts of the dendrogram
5.2 Finch's dendrogram generation algorithm
5.3 Sample clusters from Brown, deSouza, Mercer, Pietra, and Lai's algorithm
5.4 Pseudocode of Smrz and Rychly's clustering algorithm
5.5 Analysis of the weights in Elman's network, showing the linguistic knowledge which had been learned
5.6 Liddle's network architecture
5.7 Clusters of Liddle's output
5.8 A sample of Honkela's word map
5.9 Two-dimensional version of Schutze's output
6.1 Graph of the frequency of every word in the WSJ against that word's frequency in T/G
6.2 Pseudocode to count all co-occurrences in the corpus
6.3 Word dendrogram with RMS scaling
6.4 Word dendrogram with log applied to all counts before processing
6.5 Dendrogram from iterating SVD four times
6.6 Dendrogram where POS tags are used as extra features
6.7 Dendrogram using the final parameters (a window of twenty words and tag information).
7.1 Graph of noise against parser accuracy
7.2 Graph of the log of a word's frequency versus the distance to its nearest neighbour
7.3 Dendrogram of tag representation
7.4 Dendrogram of nonterminals produced using only unsupervised training
7.5 Hand encoded representation of nonterminals
7.6 Dendrogram of the representation of nonterminals
7.7 Probability of different outputs from genprob, after outputs of zero are excluded
7.8 Plot of errors in the tag network against units used
7.9 Scatter plot of output in the tag network using six hundred hidden units against genprob's output
7.10 Graph of the training set and the test set error as hidden nodes are added to the tag network
7.11 Scatterplot of output from the tag network against genprob's output, using just twenty hidden nodes
7.12 Density plot of output from the tag network against genprob's output, after twenty hidden nodes have been added
7.13 Scatter plot of output from the tag network against genprob's output, using unique training data
7.14 Scatter plot of output from the tag network against genprob's output, using a mix of unique and duplicate training data
7.15 Scatter plot of output from the unary network against genprob's output, using 100k training patterns and eighty hidden nodes
7.16 Scatter plot of output from the unary network against genprob's output, using the network trained directly on the raw event file
7.17 Scatter plot of output from the subcat network, trained with ten thousand events with ten hidden units
8.1 Data flow diagram for the entire thesis (simplified)
B.1 Data flow diagram of the parser
B.2 Class structure of the parser

Chapter 1

Introduction

Computational linguistics has undergone something of a revolution in the last fifteen years.

Until recently, computational linguists were primarily grammar writers, building systems to

process natural language by formulating grammatical rules themselves and implementing

these rules in specialised high level programming languages. This kind of work still goes

on; see for example Copestake and Flickinger (2000); Ginzburg and Sag (2000). However,

there is a new dominant paradigm in natural language processing, which involves the ap-

plication of statistical techniques to learn appropriate rules from large corpora of examples.

The emphasis has moved from directly implementing systems with knowledge of language

to implementing systems which can acquire such knowledge, by various supervised and

unsupervised learning techniques.

This thesis is about statistical natural language processing (statistical NLP). It focuses on

one statistical NLP technique in particular, namely statistical parsing — i.e. finding the most

probable syntactic analysis of an input sentence. I will begin by motivating and introducing

statistical parsing in Sections 1.1 and 1.2. In Section 1.3, I outline the goals of the thesis, and

in Section 1.4 I give an overview of the thesis chapter by chapter.

1.1 What is parsing for?

Parsing is such a well accepted topic in NLP that it is easy to think of it as an end in its own

right. However, a parse tree is not intrinsically useful; we only need it as a means to other

ends. It is worth thinking about what these ends are, because they give us useful guidelines

about the kind of parsers we want.

To begin at the beginning, we need a definition of parsing. Parsing is the determina-

tion of the syntactic structure of a sentence. What is a syntactic structure? A useful way of

answering this question is to make reference to a grammar: the abstract mechanism which

is able to generate every sentence in the language under investigation (and only these sen-


tences). The syntactic structure of a sentence is a description of how this grammar gener-

ated this particular sentence. The generative process is a recursive one, and thus the syn-

tactic structure of a sentence is a hierarchical, tree-like structure. We assume that a set of

context-free rules can be used to generate all and only the well-formed sentences in a lan-

guage (Chomsky, 1965). A context-free grammar rule looks something like S→NP, VP. This

means that a sentence (S) can be decomposed into a noun phrase (NP) followed by a verb

phrase (VP). Similarly, the rule VP → Verb, NP means a verb phrase can consist of a verb

followed by another noun phrase. An example of a noun phrase is “The cat”, an example of

a verb is “chased”, and an example of another noun phrase is “a rat”. Sometimes the first

noun phrase is referred to as the subject, while the second is referred to as the object. The

rules involved in this example can be expressed together by drawing a tree, as shown in

Figure 1.1.¹

[S [NP [Det The] [Noun cat]]
   [VP [Verb chased] [NP [Det a] [Noun rat]]]]

Figure 1.1: An example parse tree

In this tree the parent node is decomposed into its child nodes. To go from

a grammar to a parse tree, all the computer has to do is try every combination of grammar

rules and see if they match the input sentence. This can be made moderately efficient with

very little effort, as will be discussed in Section 2.2.
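To make this concrete, the following sketch (Python, with an invented toy grammar; this is not code from the thesis) implements the brute-force strategy just described: every expansion of every rule, and every way of splitting the words among a rule's symbols, is tried until a combination matches.

    # A minimal sketch of brute-force CFG recognition over a toy grammar.
    # Both the grammar and the code are invented for illustration only.
    GRAMMAR = {
        "S":    [["NP", "VP"]],
        "NP":   [["Det", "Noun"]],
        "VP":   [["Verb", "NP"]],
        "Det":  [["the"], ["a"]],
        "Noun": [["cat"], ["rat"]],
        "Verb": [["chased"]],
    }

    def derives(category, words):
        """True if `category` can generate exactly the word sequence `words`."""
        if category not in GRAMMAR:              # a terminal (an actual word)
            return list(words) == [category]
        return any(matches(rhs, words) for rhs in GRAMMAR[category])

    def matches(rhs, words):
        """Try every way of splitting `words` among the symbols of one rule."""
        if not rhs:
            return not words
        first, rest = rhs[0], rhs[1:]
        return any(derives(first, words[:i]) and matches(rest, words[i:])
                   for i in range(len(words) + 1))

    print(derives("S", "the cat chased a rat".split()))    # True
    print(derives("S", "cat the chased rat a".split()))    # False

A chart parser, introduced in Section 2.2, arrives at the same answer without repeatedly re-analysing the same spans of words.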

So, what is the point in computing a structure such as this one? As I have already said,

parsing is not an end in itself. The reason for deriving a sentence structure is that it

is required for the process of sentence interpretation; the syntactic structure of a sentence

is essentially a set of instructions for how to derive the meaning of the sentence from the

meanings of its individual words. Once we can compute sentence meanings, the list of

applications is very large — for instance, natural language interfaces, machine translation

systems, intelligent web browsers.

An important point about most example applications of parsing is that a parser is useless

unless it has good coverage. The parser has to be able to deal with a fairly high proportion of

the user’s input sentences if it is going to be a useful tool. In practice most early work in com-

putational linguistics concentrated on building grammars and parsers for small fragments

of natural language. They did not lend themselves well to use in practical applications. It is
only quite recently that the creation of wide coverage grammars has been seen as a realistic
goal.

¹ This thesis will not include descriptions of grammatical terms in the text, but such terms are defined in Appendix A on page 221.

Wide coverage grammars and the problem of ambiguity

The grammars discussed above are the kind that have been used in computational linguis-

tics from the introduction of computer processing of language in the 1950s to the early 1990s.

The problem with such grammars does not surface until one attempts to write a full gram-

mar for a real language. Real language usage includes a huge number of infrequently oc-

curring syntactic constructions and the grammar builder has to include these constructions

if they want their grammar to be able to parse most sentences. However, these constructions

very quickly lead to many incorrect parses for a sentence. Consider the following example

(taken from Lee (2004)):

(1.1) At last, a computer that understands you like your mother.
(1985 McDonnell-Douglas ad)

There are at least three different but valid interpretations of this sentence:

1. There is a computer which understands you as well as your mother understands you.

2. There is a computer which understands that you like your mother.

3. There is a computer which understands you as well as it understands your mother.

The problem here is that the input is ambiguous but that humans are so good at resolving

ambiguity they hardly notice. In reality the problem is less obvious forms of ambiguity,

which will be discussed in Section 2.2.2, but for now it is sufficient to state that ambiguity

is the problem. It seems that it is impossible to develop a grammar that can understand a

large chunk of the English language, without also producing many incorrect interpretations

for valid sentences.

To address this problem, linguists have applied various strategies. One ap-

proach is to add semantics to grammars and use a database of domain knowledge to try and

infer which parses of a sentence are semantically implausible. For instance, the message in

the third interpretation of Example 1.1 above is semantically rather anomalous. However,

this solution demands very good symbolic knowledge-bases and a good theorem prover,

as well as a good theory of compositional semantics; it is essentially ‘AI-complete’. An-

other approach is to use discourse context to help resolve the sentence’s ambiguity. For

example, if Example 1.1 appears in a context where it has just been mentioned that you like

your mother, this would lend support to reading 2. However, context can never provide a

complete source of information about how to correctly read a sentence.


1.2 Statistical parsing and its problems

Another approach to disambiguation is to annotate the grammar rules with frequency infor-

mation. To return to Example 1.1, it might be the case that the third parse given above uses

rules which are significantly less frequently applied than the other two parses. In this case,

we can reason that this parse is less likely to be the correct one. This frequency information

could be obtained, with some degree of accuracy, by simply asking the linguists making the

rules to estimate how often each rule occurs. Just as a parse tree records which grammatical rules were applied together, the frequencies of these rules can be combined in this probabilistic grammar to give an estimate of the likelihood of any given interpretation. The

advantage of this method is that the parser would favour obscure grammatical structures

over no interpretation, and favour common structures over obscure structures. This should

lead to the parser picking the same interpretation as a human.
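As a hypothetical illustration of how rule frequencies would be used, the sketch below scores a candidate parse by multiplying the estimated probabilities of the rules it uses; all the numbers are invented, and lexical (word-level) probabilities are left out for brevity.

    # Invented rule probabilities, purely for illustration.  Each value stands
    # for P(right-hand side | left-hand-side category), however estimated.
    RULE_PROB = {
        ("S",  ("NP", "VP")):           1.0,
        ("NP", ("Det", "Noun")):        0.7,
        ("NP", ("Det", "Adj", "Noun")): 0.3,
        ("VP", ("Verb", "NP")):         0.6,
        ("VP", ("Verb",)):              0.4,
    }

    def parse_probability(rules_used):
        """Multiply the probabilities of every rule used in a candidate parse."""
        p = 1.0
        for rule in rules_used:
            p *= RULE_PROB[rule]
        return p

    # One parse of "the cat chased a rat" uses S -> NP VP, NP -> Det Noun
    # (twice) and VP -> Verb NP:
    candidate = [("S", ("NP", "VP")), ("NP", ("Det", "Noun")),
                 ("VP", ("Verb", "NP")), ("NP", ("Det", "Noun"))]
    print(parse_probability(candidate))      # 1.0 * 0.7 * 0.6 * 0.7 = 0.294

Multiplying in this way treats each rule application as independent of the others, which is precisely the assumption questioned in the next paragraph.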

However, the approach still has a number of problems. Firstly, all humans, including

linguists, are notoriously bad at estimating probabilities — if this wasn’t so then casinos

would be far less successful. It soon became necessary to derive the probabilities from their

actual frequency of occurrence in a corpus. Secondly, the way probabilities are combined is

normally by multiplying them, just like everywhere else in statistics. But this presupposes

that the two grammatical constructions being combined are independent. However, this is
almost never the case: when we see certain words or phrases we are primed for certain

other phrases, simply because they frequently occur together. The grammatical rules need

to be modified so that their probabilities are dependent on surrounding context. Unfortu-

nately the second problem compounds the first. It means the corpus from which frequencies

are derived has to be big enough not only to include every type of grammatical structure

occurring multiple times, but also to include these structures in every possible

different context as well. Of course, corpora containing this much information are simply

not available. The problem of gathering an adequate corpus is compounded by Zipf’s law,

which states that “the nth most frequent word occurs roughly 1/n times the frequency of

the most frequent word” (Li, 1992). This means we have excellent counts for a few words,

but that assembling a corpus with adequate counts for rare words is essentially impossible.

The challenge for a statistical parser is to simplify its grammar rules and its demands for context just enough to make maximum use of the data it has available.
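To see how quickly Zipf's law starves rare words of data, the short sketch below (illustrative only; the corpus figures are invented) estimates the count of the nth most frequent word under the 1/n relationship just quoted.

    # Illustrative only: if word frequencies follow Zipf's law, the nth most
    # frequent word occurs roughly 1/n times as often as the most frequent word.
    def zipf_count(rank, most_frequent_count):
        return most_frequent_count / rank

    top_count = 60000      # suppose the most frequent word occurs 60,000 times
    for rank in (1, 100, 1000, 10000, 100000):
        print(rank, zipf_count(rank, top_count))
    # 1 -> 60000.0, 100 -> 600.0, 1000 -> 60.0, 10000 -> 6.0, 100000 -> 0.6
    # A word of rank 100,000 is expected to occur less than once, so contexts
    # involving it are effectively never observed.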

Some statistical parsers do not simplify the grammar rules at all, and instead rely on very

good estimation functions. Others are fully configurable in how much context information

they use. It is also interesting to compare some of the earlier statistical parsers to the latest

statistical parsers. The earlier models used less context because they had a smaller corpus

to train with. Overall, it cannot be denied that this trade-off is one of the most important

issues in statistical parsing.


1.3 Main aims of the thesis

There are two main aims. The first is to develop a statistical parser which is easily con-

figurable, and whose code is readable, to serve as the basis for new experiments. Many

existing statistical parsers are not distributed (for example, Bod and Scha (1996), Goodman

(1998), Magerman (1996)), and those which are distributed are often heavily optimised at

the expense of readability and modifiability (for example, Collins (1999)).

The second (and primary) aim of the thesis is to find some way of extending a statistical

parser to address the critical problem of statistical parsing just mentioned — the trade-off

between the frequency of grammatical structures in a training corpus, and their linguistic

usefulness in a parser. The main idea I pursue involves generalising between words so

that high-frequency words will share some counts with low frequency words, and later

extending this idea to generalising between any related events.

1.4 Overview of the thesis

Chapter 2 reviews previous work in statistical parsing. I begin by briefly considering pars-

ing in general, what it is for, and briefly introduce the ‘classical’ non-statistical paradigm

of grammars and parsing. After this I introduce the statistical parsing paradigm, and

summarise what I take to be the key challenges for current work in this field. Hav-

ing summarised the non-statistical approaches I then describe their statistical counter-

parts. Here I introduce the system I will focus on in my thesis, Collins’ 1997 statistical

parser, and situate it in the context of other statistical parsers.

Chapter 3 describes Collins’ statistical parser in detail. This chapter serves two purposes:

firstly, it provides the background necessary to understand my reimplementation of

Collins in Chapter 4, but equally importantly it provides a description of the operation

of Collins’ parser which in places extends that given by Collins himself.

Chapter 4 describes my reimplementation of Collins’ parser. All aspects of the system are

examined, including data structures, algorithms and any major design decisions and

their impact. An analysis of the performance of Collins’ parser and its reimplementa-

tion is given. The conclusion drawn is that Collins’ parser has a number of problems

that can be reduced by a different algorithm for backoff of rare events.

Chapter 5 discusses the problem of word representation, from the perspective of statistical

parsing. It begins by examining the properties of a good word representation from the

perspective of statistical parsing. Considering these properties, a number of different

techniques are examined and I conclude that the most suitable for my purposes is one

developed by Hinrich Schutze.


Chapter 6 describes my implementation of Schutze’s approach. It begins by explaining how

I modified Schutze’s approach to support a large lexicon. However, the majority of

this chapter summarises the work in adjusting the algorithm, its parameters, and its

input data to produce word vectors that are useful in a statistical parser. Specifically,

I found previous research concentrated on developing good representations for com-

mon words, whereas our concern here is in developing a good representation for rare

words.

Chapter 7 starts with the word representations derived in Chapter 6 and looks at how they

can be integrated into the statistical parser. As a first approach, they are integrated by

creating a new intermediate level of backoff, where words are grouped into clusters

with similar syntax and semantics. Due to the limitations in this approach, a much

more ambitious approach is proposed, involving the use of a large neural network to

compute event probabilities. The chapter develops an input representation suitable

for training a neural network, demonstrates the feasibility of the approach by training

a neural network based tagger and then works through the development of neural

networks for several different types of probability calculations. At the end a complete

system is presented that uses a neural network hybrid for the probability model, and

the performance of this new hybrid model is analysed.

Chapter 8 summarises the results that have been found and discusses what has been learned

along with possible directions for further research.


Chapter 2

Statistical parsing

This chapter is a survey of current work in statistical parsing, presented in a textbook style.

In Sections 2.3 to 2.6 I examine the design and implementation issues present in building

a statistical parser. This covers classical statistical techniques such as backoff, as well as the

standard algorithms such as Earley parsing in their simplest form. I then describe some in-

fluential statistical parsers, concluding with an overview of the parser developed by Michael

Collins (1997) which forms the basis for my own implementation.

2.1 Deterministic grammars

Determining the meaning of a natural language sentence is commonly assumed to involve

determining its grammatical structure. In this section, we discuss the traditional concep-

tion of grammars within computational linguistics, in order to motivate the presentation of

statistical grammars, and to introduce some core linguistic concepts.

A grammar is in essence a set of rules, which operate together to specify the space of

well-formed sentences in a language. A well-formed sentence is one which a native speaker

of the language accepts as being part of the language. To take some clear examples, Sen-

tence 2.1 is a well-formed English sentence, whereas Sentence 2.2 is ill-formed.

(2.1) The cat sat on the big brown mat.

(2.2) *¹ Mat brown big the on sat cat the.

Grammars typically operate on the assumption that the well-formedness of a sentence is

defined recursively in terms of the well-formedness of its constituent parts. The assumption

is generally that sentences are hierarchical entities, which can be described using trees. The

leaf nodes of these trees are the individual words in the sentence, and the non-terminal
nodes are sequences of adjacent words which ‘group together’ particularly closely, known
as phrases. An example of a syntactic tree is given in Figure 2.1.

¹ By linguistic convention, * denotes a grammatically ill-formed sentence and ? denotes a dubious sentence.

[S [NP [Det The] [N' [Noun cat]]]
   [VP [VP [Verb sat]]
       [PP [Prep on]
           [NP [Det the]
               [N' [Adj big] [N' [Adj brown] [N' [Noun mat]]]]]]]]

Figure 2.1: Example parse tree

There are several criteria for these groupings — for instance, in Figure 2.1, the phrase the

big brown mat can be replaced by a single word it and still be well formed. Some common

nonterminals are noun phrases, verb phrases, prepositional phrases and sentences, although

the exact terms vary depending on the formalism being used. These will be abbreviated to

NP, VP, PP, S in future and a comprehensive list is given in Table A.7. For a comprehen-

sive introduction to syntactic analysis which motivates these nonterminals, see for example

Haegeman (1991).

The simplest grammars are collections of context-free grammar (CFG) rules. A context-

free rule is a rule that specifies how a complex phrase decomposes into simpler phrases. For

example, to describe the structure of Example 2.1 above, we would use the set of rules given

in Table 2.1.

S → NP VP          Noun → cat, mat, ...
NP → Det N'        Det → the, ...
N' → Adj N'        Adj → big, brown, ...
N' → Noun          Prep → on, ...
VP → VP PP         Verb → sat, ...
VP → Verb
PP → Prep NP

Table 2.1: Context-free rules needed to parse The cat sat on the big brown mat.


Context sensitive grammar: CFG with features

In modern grammar formalisms, phrases are not atomic entities, but are parametrised using

features. For instance, our grammar needs to be able to distinguish between singular and

plural noun phrases and verb phrases, so as to prevent number mismatches such as the

following:

(2.3) * The dogs is happy.

To enforce agreement, we can say that the phrases NP and VP are each annotated with a

feature called number, which can take the alternative values singular or plural. As well as

number features, NPs and VPs need to be annotated with several other agreement features,

such as person and gender. We can then write CFG rules which use variable binding to

enforce number agreement, as in Table 2.2.

S → NP(Number) VP(Number)        Noun → cat, mat, ...
NP(Number) → Det N'(Number)      Det → the, ...
N'(Number) → Adj N'(Number)      Adj → big, brown, ...
N'(Number) → Noun(Number)        Prep → on, ...
VP(Number) → VP(Number) PP       Verb → sat, ...
VP(Number) → Verb(Number)
PP → Prep NP

Table 2.2: A context-sensitive rule using features for number agreement.
Number is a variable that denotes singular and plural.
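A hypothetical sketch of the variable binding in Table 2.2 (illustrative code only, not the formalism used later in the thesis): the S rule applies only when the NP and the VP can bind Number to the same value.

    # Illustrative sketch of number agreement via a shared variable.
    # A phrase is represented as (category, {feature: value}).
    def apply_s_rule(np, vp):
        """S -> NP(Number) VP(Number): the rule applies only if both children
        agree on the value bound to Number."""
        if np[0] != "NP" or vp[0] != "VP":
            return None
        if np[1]["number"] != vp[1]["number"]:
            return None                                 # *The dogs is happy.
        return ("S", {"number": np[1]["number"]})

    the_dogs  = ("NP", {"number": "plural"})
    is_happy  = ("VP", {"number": "singular"})
    are_happy = ("VP", {"number": "plural"})
    print(apply_s_rule(the_dogs, are_happy))   # ('S', {'number': 'plural'})
    print(apply_s_rule(the_dogs, is_happy))    # None: the number features clash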

2.1.1 Lexical heads

Naturally, the agreement features specified on an NP or VP phrase have to come from some-

where. The assumption is that they come from a lexical item within the phrase. Another key

concept for modern grammar formalisms is the idea that every phrase has a lexical head or

headword. The lexical head of a phrase is the word within the phrase from which its key

syntactic features (such as number) are inherited. Intuitively, the head of a phrase is the

word which contributes the core of its syntactic and semantic characteristics. For instance,

the semantics of the noun phrase the big brown mat will be taken to be primarily a function

of the semantics of its lexical head mat. Decomposing the phrase one level at a time, we

should say more properly that the head of the NP is its N’ child, the head of the N’ child is

its N’ child, and the head of this N’ child is the noun mat. For an in-depth discussion of the

role of heads in modern deterministic grammars, see Pollard and Sag (1986). The notion of a lexical head will be of crucial importance in our discussion of statistical grammars.
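Head percolation can be sketched as a small recursive procedure. The head-child table below is hypothetical (the parser in later chapters uses Magerman-style head rules), but it shows the idea: descend into the head child at every level until a word is reached.

    # Illustrative sketch of finding a phrase's lexical head.  A phrase is
    # (category, [children]); a word is a plain string.  HEAD_CHILD lists the
    # preferred head-child categories for each phrase type (hypothetical table).
    HEAD_CHILD = {"S": ["VP"], "NP": ["N'"], "N'": ["N'", "Noun"],
                  "VP": ["Verb"], "PP": ["Prep"]}

    def head_word(tree):
        if isinstance(tree, str):                  # reached an actual word
            return tree
        category, children = tree
        for wanted in HEAD_CHILD.get(category, []):
            for child in children:
                child_category = child if isinstance(child, str) else child[0]
                if child_category == wanted:
                    return head_word(child)
        return head_word(children[-1])             # fallback: rightmost child

    big_brown_mat = ("NP", [("Det", ["the"]),
                            ("N'", [("Adj", ["big"]),
                                    ("N'", [("Adj", ["brown"]),
                                            ("N'", [("Noun", ["mat"])])])])])
    print(head_word(big_brown_mat))                # mat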

2.1.2 HPSG and subcategorisation lists

Agreement features are just one kind of grammatical feature used to annotate phrases. An-

other important kind of feature is used to distinguish between verbs which select for dif-

ferent complements. For example, the verbs chase and hiccup differ in that chase requires a

direct object to be specified, while hiccup does not:

(2.4) John chased the dog.

(2.5) ? John chased.

And vice versa:

(2.6) John hiccuped.

(2.7) ? John hiccuped the dog.

To represent different verb types, we could simply use different atomic phrase types

(e.g. ‘trans verb’ and ‘intrans verb’). However, there are many properties of verbs which do

not depend on their pattern of complements, such as agreement features. Consequently, it

makes sense to specify a verb’s complements as a feature.

This feature is termed the verb’s subcategorisation list: it is basically an ordered se-

quence of the complements which a verb must accept in order to produce a complete VP.

For example, the subcategorisation (subcat) list of the verb chase would be the list [NP] ,

and the subcategorisation list of hiccup would be the empty list [] ; the subcategorisation list

of a complex verb like put or introduce would be the list [NP, PP] . Subcategorisation lists

considerably reduce the number of rules needed within a grammar. Instead of needing one

rule for each verb type, we can now have a single rule with a recursive structure. This is one

of the key innovations in a grammatical formalism known as Head-Driven Phrase Structure

Grammar (HPSG, Pollard and Sag (1986)). HPSG also has the concept of lexical heads as

just described. Given these two concepts it would be reasonable to write HPSG–like rules

such as those in Table 2.3, although real HPSG rules tend to be extremely complex.

In these rules, lexical items carry with them a subcat list specifying the complements

which they require. For instance, the verb sat has a subcat list containing a PP. Rather than

a rule explicitly allowing a VP to be made up of a verb like sat and a PP, there is a general

recursive rule (the last line) that splits the verb’s subcat list into a head and a tail; the head

appears to the right of the verb, and the tail is the new subcat list of the parent node. Note

that a similar treatment has been used for PPs: the lexical item on has a subcat list specifying that its complement is an NP.

S → NP, V([])
NP → Det, N
sat → V(subcat[PP])
on → P(subcat[NP])
the → Det
cat → N
mat → N
P(Tail) → P([Head|Tail]), Head
V(Tail) → V([Head|Tail]), Head

Table 2.3: Some example (simple) HPSG rules

The two recursive rules are almost identical; in HPSG, these rules are replaced with a general rule of the form X(Tail) → X([Head|Tail]), Head, which works for all cases where a word's complements appear to its right.
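To make the mechanics of the recursive rule concrete, here is a minimal sketch in Python (the toy lexicon and function names are invented for illustration; this shows the general idea behind X(Tail) → X([Head|Tail]), Head rather than being a faithful HPSG implementation):

```python
# Illustrative sketch of subcat-list consumption. The lexicon is invented for
# the example; each loop iteration applies the recursive rule once, consuming
# the head of the subcat list.

LEXICON = {
    "sat": ("V", ["PP"]),       # sat needs a PP to form a complete VP
    "on": ("P", ["NP"]),        # on needs an NP to form a complete PP
    "chased": ("V", ["NP"]),
    "hiccuped": ("V", []),      # hiccup takes no complements
}

def saturate(word, complements):
    """Combine a word with the complement categories found to its right.

    Returns the saturated category (e.g. 'V([])') if the subcat list is
    exactly consumed, otherwise None."""
    category, subcat = LEXICON[word]
    remaining = list(subcat)
    for comp in complements:
        if not remaining or remaining[0] != comp:
            return None              # unexpected or superfluous complement
        remaining.pop(0)             # consume the head of the subcat list
    return f"{category}([])" if not remaining else None

print(saturate("chased", ["NP"]))    # 'V([])' -- a saturated (transitive) verb
print(saturate("hiccuped", ["NP"]))  # None -- hiccup rejects a direct object
print(saturate("sat", ["PP"]))       # 'V([])' -- sat plus its PP is complete
```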

We will not be making any further reference to HPSG. But subcategorisation lists are

another grammatical notion which will be extensively used in our discussion of statistical

parsing in Section 2.3.

2.2 Deterministic parsing algorithms

Having decided on a grammar formalism to use, the next stage is to process sentences using

this formalism. Firstly we need to categorise the input, a process known as part-of-speech

(POS) tagging, and then we need to combine the categories into larger phrases using a

parsing algorithm.

A parsing algorithm takes a grammar and an input sentence, and produces a parse tree

such as in Figure 2.1. The algorithm can be either goal driven (top-down) where it starts

with a special nonterminal in the grammar (i.e. “S”) and searches all the trees the grammar

can produce to find the words, or data driven (bottom-up) where it starts with the words

and searches all the phrases that can be produced. Either way, the parser maps between

grammar rules and sentences. When more than one mapping exists, the parser is typically designed to return all possible mappings; in this case the parse is ambiguous, and the problem of selecting the correct parse is known as resolving ambiguity.

These multiple mappings are extremely common in natural language. For example, in

The man saw the girl with a telescope, who has the telescope? Additionally, many words such

as fly can act either as a noun or a verb. Because the human brain is extremely good at

resolving ambiguity, the ambiguity is rarely noticed, although exceptions can be constructed, such as the sentence: A list of lecturers broken down by age and sex will be posted in the

lobby. Deterministic grammars do not have this capacity for resolving ambiguity and simply

pass the problem on to the next stage in the process.

2.2.1 Chart parsing

A standard mechanism for generating parse trees is chart parsing. A chart parser works

by considering every subsequence of words in the input sentence, finding every possible

phrase in each of these subsequences. (Because every sequence is considered only once,

a chart parser avoids the expensive backtracking involved in more straightforward search

algorithms.) For any real grammar, the time spent analysing unimportant subsequences is much less than the time that would otherwise be spent backtracking. Additionally, a chart parser is better suited to parsing naturally occurring text because it can skip over errors and still find the largest possible

‘chunks’.

The core of a chart parser is the chart data structure. This is a multidimensional array,

indexed by the start and end of each phrase. (A phrase is often referred to as an arc in chart

terminology.) Each element in the array is called an edge. Edges can be either complete or

incomplete. An incomplete edge (also called an active edge) is one that corresponds to a

phrase not all of which has yet been found. For example, after seeing the, a chart parser will

use the rule NP → Det, N to generate an incomplete edge for this word labeled with the

category NP, with the Det marked as already found and the N on the list of phrases still to be

found. By contrast, complete edges (also known as inactive edges) are phrases the parser

has found. This is illustrated in Figure 2.2 which shows a chart after parsing the large can.

Incomplete edges are at the top of the chart and use the symbol ◦ to denote the division

between the parsed section and the section the parser is looking for.

The chart parsing algorithm, called the Earley algorithm (Earley, 1970), is sufficiently

complex to warrant explanation. Parsing is performed using two functions: extend com-

bines an incomplete edge with a complete edge to form a new edge which may now be

complete; and parse takes newly completed edges and finds grammar rules that they can

start, generating incomplete edges. This is all better understood by way of example. Figure

2.2 shows the parser’s state after the large can. Now consider the next input word can. This

is first tagged as possibly a noun or a verb, with both alternatives entered into the chart as

complete edges. Next the parser looks for rules that can be started with the newly com-

pleted edges as well as incomplete edges that can be extended with the newly completed

edges. The whole process is repeated for hold and the result is given in Figure 2.3. Read-

ers interested in a gentler introduction to chart parsing are referred to Allen (1995) or most

introductory artificial intelligence texts.
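As a concrete illustration, the sketch below implements a small bottom-up chart parser with complete and incomplete edges in Python. The grammar, lexicon and bookkeeping are invented for the example and deliberately simplified compared to a full Earley implementation; it is meant only to show how edges are created and extended.

```python
# Minimal bottom-up chart parser sketch. Grammar and lexicon are invented toy
# examples (loosely mirroring "the large can can hold"), not a real system.

from collections import namedtuple

# An edge covers words [start, end) with category `cat`; `found` holds the
# children recognised so far and `needed` the categories still required.
# An edge with an empty `needed` tuple is complete, otherwise it is incomplete.
Edge = namedtuple("Edge", "cat start end found needed")

GRAMMAR = {                      # parent -> list of possible right-hand sides
    "S": [["NP", "VP"]],
    "NP": [["D", "N"], ["D", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"], ["AUX", "VP"]],
}
LEXICON = {                      # word -> possible categories
    "the": ["D"], "large": ["ADJ"], "can": ["N", "V", "AUX"], "hold": ["V"],
}

def parse(words):
    chart = []                   # every edge found so far, stored exactly once

    def add(edge):
        if edge in chart:
            return
        chart.append(edge)
        if edge.needed:          # incomplete: try to extend with complete edges
            for other in list(chart):
                extend(edge, other)
        else:                    # complete: start new rules, extend old actives
            for parent, rhss in GRAMMAR.items():
                for rhs in rhss:
                    if rhs[0] == edge.cat:
                        extend(Edge(parent, edge.start, edge.start, (), tuple(rhs)), edge)
            for other in list(chart):
                extend(other, edge)

    def extend(active, complete):
        # Combine an incomplete edge with an adjacent complete edge.
        if active.needed and not complete.needed \
           and active.end == complete.start and active.needed[0] == complete.cat:
            add(Edge(active.cat, active.start, complete.end,
                     active.found + (complete,), active.needed[1:]))

    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            add(Edge(cat, i, i + 1, (word,), ()))   # complete lexical edges

    return [e for e in chart if e.cat == "S" and not e.needed
            and e.start == 0 and e.end == len(words)]

print(len(parse("the large can can hold".split())))  # prints 1: one S analysis
```

Because every edge is stored in the chart exactly once, each phrase is built only once, which is what removes the need for backtracking.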

Figure 2.2: A chart parser after parsing the large can. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.

Figure 2.3: A chart parser after parsing the large can can hold. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.

2.2.2 Problems with deterministic parsing

Until recently, most deterministic grammars were fairly small, in terms of both the number of lexical items and the number of grammatical rules. They were frequently designed to handle a limited set of constructions very well. However, such parsers tended to fail miserably when tested on real corpora because they did not have broad coverage.

It might be thought that the solution to the coverage problem is to augment the number

of rules and words in the grammar. This does indeed result in better coverage; however, a new problem now arises, as mentioned in Section 1.1: a proliferation of ambiguity.

To illustrate this problem, begin by assuming we need our wide-coverage grammar to

deal with the following perfectly reasonable sentences:

(2.8) Here is the equation for a Laplace transform

(2.9) Mustang Sally walked into the bar.

(2.10) Who volunteers to mark the assignment? Me.

We therefore need rules to generate the analyses shown in Figure 2.4.

However, if the grammar contains these rules (all of them fairly unusual), then when we try to parse an ordinary sentence like John saw Mary, we end up with an entirely spurious parse, as shown in Figure 2.5.

Figure 2.4: Some analyses which a wide-coverage grammar should include

Figure 2.5: A dubious parse by a deterministic grammar

In the spurious parse (shown in Figure 2.5), 'John saw Mary' is interpreted as a single-word answer to a question, referring to a strange character called 'John saw Mary'. John saw is an N, just like Laplace transform, and John saw Mary is then analogous to Mustang Sally. Any

wide-coverage grammar is bound to contain massive ambiguity.

Some grammarians have dismissed this argument by saying disambiguation is a post-processing phase and not part of parsing, but this dismissal is weak: the problem of am-

biguity is so severe that assuming an oracle for disambiguation is basically giving up on

doing syntax altogether. The deterministic grammar has not found the correct parse of the

sentence, it has just rejected the obviously wrong parses. Sophisticated linguistic parsers

may yield fascinating results about the structure of the language, but they are useless for

parsing real unrestricted natural language input.

2.3 Probabilistic grammars and corpus-based NLP

Probabilistic grammars are a method of overcoming the ambiguity inherent in deterministic

grammars. Instead of enumerating all grammar rules as if they were equal, rules are an-

notated with their relative frequency. This means the parser can state not only that a rule

matches, but can also sort alternative parses by their likelihood of being correct. This in-built

mechanism for ambiguity resolution is a huge advantage over deterministic grammars. Ad-

ditionally, the improved measurement of ambiguity in probabilistic grammars means that


probabilistic grammars can be much, much bigger than deterministic ones. For instance, a

probabilistic grammar is allowed to have a very large number of very specific rules describ-

ing how an NP can be formed. This will often result in a large number of spurious parses of

a given sentence, but this does not matter, provided we have chosen suitable probabilities

for the rules in the grammar, because we can be confident the spurious parses will have low

probabilities.

The first probabilistic grammars were written by hand in a similar way to deterministic

grammars. The only difference was that every grammar rule had an associated probabil-

ity. These probabilities were derived by educated guesses on the part of the grammar en-

coders. The grammars were called probabilistic context free grammars or PCFG (Booth and

Thompson, 1973) and had rules of the form given in Figure 2.6. (The probabilities are of a

VP constituent being rewritten as VT NP and as V0 NP are 0.8 and 0.2 respectively in this

case, these two alternatives exhaust the possible ways VP can be expanded, because they

sum to 1.) Analogous extensions to CFG, TAG, XBAR and many other formalisms were also

developed. Conceptually, parsing using a grammar uses exactly the same process as with

deterministic grammars. The only difference is that since every rewrite rule has an associ-

ated probability, the probability of the derivation can be obtained by multiplying together

the probability of all of the rules used.

VP→ VT NP (0.8)

VP→ V0 (0.2)

Figure 2.6: Example of Probabilistic CFG rules

2.3.1 Building a corpus of hand-parsed sentences

Adding probability information to a grammar by hand-annotating every grammar rule with

a likelihood is prone to many errors, as human annotators are poor at guessing probabilities.

In addition, building a probabilistic grammar by hand does not take full advantage of the

fact that probabilistic grammars can be extremely large. Rather than guessing probability

information it is more natural to derive the probabilities from a corpus of annotated text.

Probabilistic grammars can be annotated with context information and semantic require-

ments in the same way as deterministic grammars.

Perhaps the most interesting example of an early probabilistic grammar was History

Based Grammar (HBG) by Black et al. (1992). This was the first project to bridge the gap

between manually guessed probabilities and automatic probability generation because as

part of developing the parser, Black et al. developed a large corpus of parsed text from

which they extracted probabilities. HBG uses a feature based grammar with twenty-one


features, where the features are enumerated sets with an average of eight different values.

For example, ‘past’ is a valid value for the ‘tense-aspect’ feature.

While this approach leads to a fairly good statistical parser, it requires a huge amount of

effort to build. More work is needed to build the corpus than to build the parser. The corpus

needs to be large enough to have statistically significant frequencies. An obvious solution

to this is to make generating the corpus an entirely separate project which can then be used

by a statistical parser. This is the goal of the Penn treebank project (Marcus, Santorini, and

Marcinkiewicz, 1993). The Penn treebank was the second large corpus developed (the first

was the Lancaster corpus used by Black et al.). The Penn treebank is essentially the only

treebank available for building a parser in English and so any deficiencies in it will lead to

deficiencies common to all statistical parsers. It is therefore particularly important to discuss

its peculiarities.

Building a corpus such as the Penn Treebank is a huge undertaking. The corpus must

be large in order to obtain accurate statistics, and while statistical methods are robust to

errors in their training data, they are much less robust to errors in the testing data. This is

a major problem if the corpus is to be used to compare the accuracy of different parsers.

Additionally, the people who built the corpus did not know what the best grammatical for-

malism would be for a statistical parser so they had to make the corpus independent of

the grammatical formalism being used. Figure 2.7 gives an example tree from the treebank

demonstrating the minimal syntactic information present. There are many good reasons for

this, one of them being not knowing which formalism would be best, as was just mentioned.

However it also leads to a number of problems as there is very little information that every

formalism agrees on. For example, GB and HPSG both contain a lot of attachment informa-

tion but the information they contain is quite different and so the corpus contains none of

this information.

Another side effect of this representation problem is that the trees are

very flat, as demonstrated in Figure 2.8. While this does not prevent a parser obtaining good

results when compared to the test corpus, it may mean that such a parser is still not good

enough for many uses. The section on evaluation (Section 2.3.2) examines this point.

Because of the huge amount of work required to build an accurate corpus, the corpus

is relatively small. It contains a total of around fifty thousand sentences, amounting to roughly a million words. Compared to the corpora used in unsupervised learning, which typically run to

millions of sentences, this is really tiny. Furthermore, it is based solely on a number of years’

worth of articles from the Wall Street Journal (the WSJ). This restriction to the domain of

carefully edited discussions of financial data means a good statistical parser is only going

to perform well on other Wall Street Journal style sentences. No solutions to this have been

advanced in the literature, perhaps because everybody is still concerned with getting a good

statistical parser. However, it is a big problem which will need to be addressed at some stage. Because the Penn treebank is so closely tied to the Wall Street Journal, I will usually refer to it as the WSJ.

Figure 2.7: Example sentence from the Penn treebank

Figure 2.8: Two phrases showing that WSJ phrases contain little attachment information

2.3.2 Evaluating parser performance

Once you decide to take real natural language corpora seriously, you need to have in place

some proper quantitative measures for evaluating the performance of a parser on a given

corpus. The goal is to minimise the overall error, which differs significantly from previous

work in linguistics where the aim was to correctly parse a few very complex sentences.

The most obvious metric for determining a parser’s performance is the percentage of

sentences it gets completely correct, called the exact match. This method does not work

very well in practice because the current generation of parsers get all but the very simple

sentences slightly wrong, and so maximising this metric becomes a task of getting simple

structures right at the expense of complex structures, hardly a laudable goal.

An alternative method is to measure the percentage of phrases the parser gets right.

This is a good metric in that a high score implies a better parser but it is surprisingly diffi-

cult to formalise how incorrect phrases are scored. For example, what if the parser finds a

large phrase from the treebank but not the component phrases, or if it finds the component

phrases but incorrectly labels the parent? The most common method used is to split the

parser’s accuracy into precision and recall. Precision is the percentage of phrases found by

the parser which are in the ‘correct’ analysis of a sentence, while recall is the percentage of

phrases in the ‘correct’ analysis of the sentence which are found by the parser. Perhaps the

best way of illustrating the difference is with an extreme example: a parser that labels every-

thing as a phrase has perfect recall but terrible precision while a parser that labels nothing

as a phrase has perfect precision but terrible recall. For some tasks one metric is more useful

than the other, but in general they are weighted equally. Most current parsers score about

the same in both metrics.
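As a small worked sketch (the gold and predicted constituent sets below are invented, not drawn from the WSJ), labelled precision and recall over constituent spans can be computed as follows:

```python
# Illustrative sketch of labelled precision and recall for a single sentence.
# Constituents are (label, start, end) spans; the data here is invented.

def precision_recall(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    correct = gold & predicted                 # constituents the parser got right
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5), ("PP", 3, 5)}
pred = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 3), ("PP", 3, 5)}

p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")     # precision=0.80 recall=0.80
```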

The precision/recall metric is not without problems. One problem is that the scores are

too high given the low overall standard of parsers. The best parsers at the moment score

around 85% precision and recall, but this means that a parse of a fifteen word sentence with

around 20 constituents will probably have around three errors in it. Another problem is

that a higher average score does not necessarily mean an empirically better parser. Because

average scores are relatively high, a single sentence parsed very badly will significantly de-

crease results. Obtaining a high score becomes a problem of tweaking the parser to correctly

handle the strange cases in the treebank such as sentences with unusual punctuation or not

ending in a full stop. The difference between 75% and 80% on ordinary sentences is quickly

lost if the parser performs poorly on these edge cases. One excellent way of avoiding this


problem, which I have not seen mentioned in the literature so far, would be to use median

precision/recall scores instead of mean scores.

If we are trying to obtain a precision/recall above 85%, it is sensible to ask how much

higher it is possible to go. After all, 85% sounds quite accurate. The answer is not yet known

but appears to be slightly above 90% since this is the accuracy that can be obtained by hand-

picking between a selection of automatically derived parses. Additionally, the ability of

parsers to generalise to genres of text different from those contained in the training data is

expected to be quite low. So, while we may have 85% when testing in the same genre, it will

be some time before this is achieved in different genres.

One final problem with the high precision/recall scores is that they force parsers to tightly

conform to the WSJ representation. This is a problem because the representation does not

include enough information to be useful for many tasks. A third metric for evaluating parser

accuracy is the number of crossing brackets found when comparing the parser’s output to

the correct parse. This metric has the advantage that over- or under-specific constituents

do not get penalised, and somewhat alleviates the problem of tight conformity to the WSJ

annotations.

Another aspect of a parser’s performance that is hardly mentioned in the literature is

parsing speed. Many cited uses of parsers, such as automatic translation and summarisa-

tion, require the parser to parse around twenty words per second, yet one of the best parsers

requires an hour to parse twenty words (Bod and Scha, 1996). Even very fast parsers such

as Collins (1999) are unable to parse very large sentences in a reasonable length of time.

This area may become more interesting in the future but in this thesis I will be focusing on

precision and recall rather than parsing time.

2.4 Probabilistic grammar formalisms

The first step in generating a probabilistic grammar is to count all the events that occur in the

training corpus. An event can be understood as some component of a parse tree. If we are

trying to learn a context-free grammar, the most obvious events to count are productions;

i.e. individual applications of context-free rules. For instance, assume we are building a

probabilistic context-free grammar to disambiguate a sentence with spurious ambiguity,

such as the sentence John saw Mary discussed in Section 2.2.2. If we take a mini-corpus of

sentences such as those in Figure 2.9, we can estimate the probabilities of rules by counting

the number of occurrences of each rule. The frequencies are shown in Table 2.4. (Note that

an unusual rule like PN→ N, PN is relatively rare.)

Figure 2.9: A simple corpus of hand-parsed sentences

To estimate the probability of a complete parse tree from this frequency information, we need to break the parse tree in question into its constituent events. Any tree can be

thought of as a set of productions, but crucially, these productions are not fully independent

of one another; the children of the highest production in the tree determine what the parents

are for the next productions down, and so on recursively. What we need, therefore, is the

conditional probability of each production given the occurrence of its parent. To estimate

this we simply count the number of times the rule is applied in the corpus, and divide by

the number of times the rule’s parent occurs. The conditional probabilities derived by this

method are given in Table 2.5.² (We also need the prior probability of the node at the root of

the parse tree being the root of a tree, which for our corpus is 1 for S, and 0 for every other

phrase.) Given that we are working with a context-free grammar, in which the way a node is

expanded does not depend on the context in which it appears, the probability of a complete

parse tree is then simply the product of all of these probabilities. For the two interpretations

of John saw Mary, these probabilities are given in Figure 2.10.

These are sufficient to strongly prefer the left-hand parse over the spurious parse on the

right.
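To make the computation concrete, the following sketch estimates the conditional rule probabilities from the counts in Table 2.4 and multiplies them over the rule applications in the two parses of John saw Mary (the correct parse, and the spurious one in which John saw Mary forms a single NP). It is only an illustration of the relative-frequency calculation, not part of any parser described later in the thesis.

```python
# Sketch: estimate P(rule | parent) by relative frequency (counts from
# Table 2.4) and score the two parses of "John saw Mary" from Figure 2.10.

from collections import defaultdict
from math import prod

counts = {                        # rule -> frequency in the mini-corpus
    ("S", ("NP", "VP")): 3, ("S", ("NP",)): 1,
    ("NP", ("PN",)): 6, ("NP", ("Det", "N")): 1,
    ("VP", ("VT", "NP")): 3,
    ("N", ("PN", "N")): 1, ("N", ("mustang",)): 1, ("N", ("saw",)): 1,
    ("VT", ("saw",)): 1, ("Det", ("a",)): 1,
    ("PN", ("Mary",)): 2, ("PN", ("John",)): 1, ("PN", ("N", "PN")): 1,
}

parent_totals = defaultdict(int)
for (parent, _), c in counts.items():
    parent_totals[parent] += c

def p(rule):
    parent, _ = rule
    return counts[rule] / parent_totals[parent]   # relative-frequency estimate

# Rule applications in each parse tree (root prior P(S) = 1 in this corpus).
correct = [("S", ("NP", "VP")), ("NP", ("PN",)), ("PN", ("John",)),
           ("VP", ("VT", "NP")), ("VT", ("saw",)),
           ("NP", ("PN",)), ("PN", ("Mary",))]
spurious = [("S", ("NP",)), ("NP", ("PN",)), ("PN", ("N", "PN")),
            ("N", ("PN", "N")), ("PN", ("John",)), ("N", ("saw",)),
            ("PN", ("Mary",))]

print(f"correct parse:  {prod(p(r) for r in correct):.4f}")   # about 0.069
print(f"spurious parse: {prod(p(r) for r in spurious):.5f}")  # about 0.00074
```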

2.4.1 Lexical semantics in probabilistic grammars

While simple probabilistic context-free grammars such as that just described are very help-

ful in giving preferences for commonly used productions, there are many cases where a

sentence has alternative readings which both involve common productions. For instance,

consider the sentence “The man saw the dog with the telescope”, for which two alternative parses are given in Figures 2.11 and 2.12.

Footnote 2: Henceforth, we will often leave it implicit that probabilities are being estimated from relative frequencies, where this is obvious from context.

Rule            Frequency in corpus
S → NP, VP      3
S → NP          1
NP → PN         6
NP → Det, N     1
VP → VT, NP     3
N → PN, N       1
N → mustang     1
N → saw         1
VT → saw        1
det → a         1
PN → Mary       2
PN → John       1
PN → N, PN      1

Table 2.4: Frequencies of rules in the simple corpus

Both of these syntactic structures are very common and while a PCFG would be able to

select one over the other, it would be expected to give both an approximately equal weight-

ing. One possible solution to this problem is to include information about the lexical items

in the sentence in the phrases involved in its analysis. Intuitively, we expect events of seeing

to frequently involve telescopes, while we expect dogs infrequently to have telescopes. The

HPSG notion of a lexical head is useful in spelling out this intuition. We expect a VP headed

by the verb saw to be quite frequently modified by a PP involving the word telescope in a

representative corpus, while we expect an NP headed by dog only rarely to be modified by

a PP involving the word telescope in such a corpus.

How can we modify our grammar to include the appropriate lexical information? A use-

ful solution, also originally proposed by Black et al. (1992), basically involves a huge increase

in the number of phrases in the grammar. Instead of simply having a phrase NP, we need

one phrase for each possible headword of an NP: that is, NP-headed-by-dog, NP-headed-

by-telescope, and so on. At this point, unfortunately, we are faced with a data sparseness

problem: we are unlikely to find sufficient counts for individual productions, even with a

very big corpus. The problem is partly due to Zipf’s law; most words in the language only

occur very infrequently, so most grammatical categories, when tagged with an open-class

headword, will be fairly rare. The problem is compounded by the fact that many grammars

allow a node to take several children. If each child is already rare, then the combination of n such children will be exponentially so. With low counts, we cannot be confident in the probabilities we derive, as we discuss in detail in Section 2.5.

Rule            Frequency in corpus    Estimated P(Rule | Parent)
S → NP, VP      3                      3/4 = .75
S → NP          1                      1/4 = .25
NP → PN         6                      6/7 = .86
NP → Det, N     1                      1/7 = .14
VP → VT, NP     3                      1
N → PN, N       1                      1/3 = .33
N → mustang     1                      1/3 = .33
N → saw         1                      1/3 = .33
VT → saw        1                      1
det → a         1                      1
PN → Mary       2                      2/4 = .5
PN → John       1                      1/4 = .25
PN → N, PN      1                      1/4 = .25

Table 2.5: Probabilities of rules in the simple corpus

Figure 2.10: Parse trees with associated prior and conditional probabilities for John saw Mary (the correct parse on the left, the spurious parse on the right)

Figure 2.11: Syntactically valid but unlikely parse of “The man saw the dog with the telescope.”

Figure 2.12: Likely parse of “The man saw the dog with the telescope.”

The problem of Zipf's law and the problem of multiple children need to be addressed in

different ways. Very few solutions have been proposed for the former problem; in fact this

thesis will focus largely on the problems caused by Zipf’s law. The latter problem can be

addressed by finding a way of splitting a parse tree into events that are smaller than single

context-free rule applications. The rest of this section will discuss how this can be done.

One idea, originally proposed by Magerman (1995), is to break each single rule appli-

cation into several components: a head production which takes a phrase and generates its

head constituent, and a set of sibling productions which take a phrase and its head con-

stituent, and generate the remaining child constituents, either to the left or the right of the

head. The occurrence of a parent node decomposing into a set of children is now represented

using the kinds of events shown in Figure 2.13.

Parent → Head        Parent → Head . . . Right sibling        Parent → Left sibling . . . Head

Figure 2.13: Head and sibling productions

The conditional probabilities we are interested in are the probability of a head constituent

given its parent (for a head production) and the probability of a sibling constituent given its

parent and its head (for a sibling production). These probabilities can be estimated from

relative frequencies of events, as described in Section 2.4. The events this time are not pro-

ductions of context-free rules but partial descriptions of such productions:

P(Head | Parent) = Count(Head, Parent) / Count(Parent)                                  (2.1)

P(Left | Head, Parent) = Count(Left, Head, Parent) / Count(Head, Parent)                (2.2)

P(Right | Head, Parent) = Count(Right, Head, Parent) / Count(Head, Parent)              (2.3)

The notation here needs some explanation. Taking Equation 2.1 as an example, if you

know the parent and you are trying to derive the probability for a given head, you can

estimate the probability by counting the number of times that head occurs as the head of

that parent, and dividing by the total number of times that parent occurs.
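A relative-frequency estimator for Equation 2.1 can be sketched as follows; the list of (parent, head) events is invented for illustration, and Equations 2.2 and 2.3 would be estimated in the same way from (sibling, head, parent) triples.

```python
# Sketch of Equation 2.1: P(Head | Parent) estimated by relative frequency.
# The (parent, head) events below are invented for illustration only.

from collections import Counter

events = [("S", "VP"), ("S", "VP"), ("S", "NP"),
          ("NP", "N"), ("NP", "N"), ("VP", "V")]

pair_counts = Counter(events)                             # Count(Head, Parent)
parent_counts = Counter(parent for parent, _ in events)   # Count(Parent)

def p_head(head, parent):
    return pair_counts[(parent, head)] / parent_counts[parent]

print(p_head("VP", "S"))   # 2/3
print(p_head("NP", "S"))   # 1/3
```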

If we move to a lexicalised grammar, the data sparseness problem due to Zipf’s law

is now reduced; head productions only involve one lexical item, and sibling productions

only involve two. A concern when shifting to an HPSG-like approach is that the parser will lose

dependency information. In a PCFG approach, you list all the arguments when you list a


grammar rule, but in the HPSG approach, you add the arguments one at a time. How do

you ensure only the right number of arguments are assigned? For instance, consider the

fragment kicked the ball the table: In PCFG this can be quickly rejected as a VP because kicked

only takes one argument, while the naive HPSG-style treatment already described would accept either, and therefore both, of ball and table as arguments to kicked. Magerman's solution to this is to split

arguments into two classes, adjacent arguments and non-adjacent arguments.

Of course, the above equations do not yet actually refer to words. To introduce words

into these equations, we should first introduce some new terminology. Consider a con-

stituent C1, which decomposes into a head constituent Chead and a left sibling constituent

Csib, as shown in Figure 2.14.

Figure 2.14: A simple lexicalised parse tree: C1 = S[chased] covering The cat chased a mouse, with left sibling constituent Csib = NP[cat] (The cat) and head constituent Chead = VP[chased] (chased a mouse)

When we refer to the ‘head’ of C1, we could be referring to the entire constituent Chead,

or to the label of Chead (i.e. VP), or to the head word of Chead (i.e. chased). Similarly, when we

refer to the ‘parent’ of Chead, we could be referring to the whole constituent C1, or to the label

of C1 (i.e. S), or to the headword of C1 (i.e. chased). To disambiguate, we will say that Chead

is the head constituent of C1, VP is its head nonterminal label, and chased is its headword,

while C1 is the parent constituent of Chead, S is its parent nonterminal, and chased is its

parent headword. (Since there is some redundancy in recording the headword of a head

constituent and its parent constituent, we do not in fact need to record this latter piece of

information.) We abbreviate ‘head constituent’ as H, ‘head nonterminal’ as Hnt, ‘headword’

as Hw, ‘parent constituent’ as P, ‘parent nonterminal’ as Pnt. We likewise abbreviate ‘left

sibling constituent’ as L, ‘left sibling nonterminal’ as LNT, and ‘left sibling headword’ as Lw,

and similarly for right siblings. To estimate the probability of lexicalised productions, we

can now use the modified equations given below.

P(HNT | PNT, Hw) = Count(HNT, PNT, Hw) / Count(PNT, Hw)                                 (2.4)

P(LNT, Lw | HNT, Hw, PNT) = Count(LNT, Lw, HNT, Hw, PNT) / Count(HNT, Hw, PNT)          (2.5)

P(RNT, Rw | HNT, Hw, PNT) = Count(RNT, Rw, HNT, Hw, PNT) / Count(HNT, Hw, PNT)          (2.6)

These equations can be applied to the ambiguous sentence we started with, The man saw

the dog with the telescope. Recall that we are looking for a way of preferring the parse in

Figure 2.12 over that in Figure 2.11. Informally, we need to find that a PP

headed by telescope is more likely to occur as the right sibling of a VP headed by saw than as

the right sibling of an NP headed by dog. More formally, substituting the actual heads and

parents into Equation 2.6 leads to the calculations given in Equations 2.7 and 2.8.

P(PP, telescope | VP, saw, VP) = Count(PP, telescope, VP, saw, VP) / Count(VP, saw, VP)    (2.7)

P(PP, telescope | NP, dog, NP) = Count(PP, telescope, NP, dog, NP) / Count(NP, dog, NP)    (2.8)

It is now reasonable to expect the correct parse to have a higher probability.³
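Purely to show the shape of this comparison, the sketch below plugs invented counts into Equations 2.7 and 2.8 (a real corpus such as the WSJ would not necessarily contain such events at all, which is exactly the sparseness problem discussed in Section 2.5):

```python
# Hypothetical counts plugged into Equations 2.7 and 2.8; the numbers are
# invented for illustration, not drawn from any real corpus.
count_pp_telescope_under_vp_saw = 12    # Count(PP, telescope, VP, saw, VP)
count_vp_saw = 400                      # Count(VP, saw, VP)
count_pp_telescope_under_np_dog = 1     # Count(PP, telescope, NP, dog, NP)
count_np_dog = 300                      # Count(NP, dog, NP)

p_attach_to_vp = count_pp_telescope_under_vp_saw / count_vp_saw    # Eq. 2.7
p_attach_to_np = count_pp_telescope_under_np_dog / count_np_dog    # Eq. 2.8
print(p_attach_to_vp > p_attach_to_np)  # True: prefer attaching the PP to the VP
```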

Having discussed the theory behind probabilistic grammars it is now possible to exam-

ine how this has been used by a few real probabilistic parsers. We will discuss Black et al.

(1992), Bod and Scha (1996) and Klein and Manning (2003).

2.4.2 Black et al.

Black et al.’s history based grammar (HBG) differs significantly from the generative gram-

mars that are now standard. The first major difference is that it imposes a fixed derivation

order which essentially requires the leftmost derivation to be expanded. This requirement

on the derivation order means that every grammar rule can only depend on things derived

before (that is, to the left of it). For instance, in parsing the aggressive potato, the internal frame

will store a representation for the aggressive and attempt to coerce it with a representation of

potato. This differs from other formalisms such as HPSG where potato will be derived first

since it is the head of the phrase, and then aggressive will be coerced into the phrase headed

by potato.

The grammar rules themselves are simple context-sensitive rules, which Black et al.

refers to as context-free with features. Like HPSG, each state is represented as a frame

containing syntactic and semantic information. The grammar rules therefore show how frequently a given frame combines with other frames. An example showing what the grammar stores is given in Figure 2.15.

Footnote 3: We have taken a couple of liberties in this example. Firstly, we assume a corpus with relatively frequent mentions of the concepts see, dog or telescope. The WSJ is certainly not such a corpus! Secondly, we are assuming that the head of a PP is a noun, to allow for telescope to be the head of a PP. In most NLP work, the head of a PP is the preposition. However, there is a huge debate in linguistics about what the 'correct' headwords are for different constituents. Here our sole concern is to give a simple example of how lexical items can help in ambiguity resolution. The idea of multiple 'headwords' is explored more fully by Bod and Scha, discussed here in Section 2.4.3.

R: P1      Syn: PP   Sem: With-Data   H1: list   H2: with
R: NBAR4   Syn: NP   Sem: Data        H1: list   H2: a
R: N1      Syn: N    Sem: Data        H1: list   H2: *

Figure 2.15: Sample representation of "with a list" in the HBG model, taken from Black et al. (1992)

There are two significant advantages in having only one derivation. Firstly, it makes it

much easier to avoid spending CPU cycles deriving the same structure more than once, much the same benefit as

using a chart parser. Secondly, writing the probability model is significantly simpler than for

PCFG. The probability for every production is based on the probability for each data field,

so for instance list includes the information that it is a noun and has a semantic role of data.

In matching, the probability for these two parameters is derived separately and combined

using a form of Naive Bayes.

In terms of exactly what to store in each event, Black et al. experimented with several dif-

ferent models but eventually chose P(Syn, Sem, R, H1, H2 | Syn_p, Sem_p, R_p, I_pc, H1_p, H2_p). This

equation means that a mix of syntax and semantics is used at every stage. One point that

was particularly clever was the use of two headwords. It is generally accepted that the head-

word of a prepositional phrase is the preposition rather than the PP’s complement. Ignoring

the linguistic justification and choosing the complement instead will lead to problems when

the parent of the PP does not match the preposition. However, the initial motivation I made

for statistical parsers was the sentence the boy saw the girl in the park with a telescope. In this

sentence we noticed that attaching the PP with a telescope to saw makes sense because saw

hopefully co-occurs frequently with telescope. It is unreasonable to assume saw will co-occur

with with more frequently than girl will co-occur with with. The addition of an auxiliary


headword solves this problem nicely.

As already mentioned in Section 2.3.1 Black et al.’s system was the first to make use of a

treebank. Black et al. used a five thousand sentence treebank taken from technical manuals

called the Lancaster treebank which was created by IBM specifically for this project. The use

of a treebank was a huge departure from previous work and marks the real birth of statistical

parsing; all the other parsers examined here also use a treebank. However, Black et al. did

not use a treebank in the same way as later parsers. The grammar rules were still derived

by hand, whereas in later systems they are automatically extracted from the treebank.

Overall, HBG performed very well. Since we are discussing the grammar rather than

the parser I will not give detailed performance figures; but roughly speaking Black et al.

was able to achieve equivalent performance to a parser based on a PCFG grammar which

was trained on a corpus twice the size, and achieved thirty percent fewer errors than a hand-

encoded PCFG.

2.4.3 Exhaustive grammars: Bod and Scha’s approach

Bod and Scha developed a parser for Scha’s data-orientated parsing (DOP) (Bod and Scha,

1996). There are two major additions to the literature in Bod and Scha’s grammatical for-

malism. The first is the idea that the training treebank is not just used to train the grammar,

it is the grammar. Black et al. also used a treebank, but Bod and Scha were the first to derive

the grammatical rules from the treebank instead of just extracting frequency counts. The

second extension was moving away from the idea that a linguist can tell what is important

in a parse tree and instead leaving this job to the parser.

Simple context-free grammar rules do not contain any defeasible information while more

complex formalisms such as HPSG frequently do. Because probabilistic grammars include

information that does not have to be satisfied, it is natural to view probabilistic grammars

as a very fine grained form of defeasible reasoning. For example the first probabilistic gram-

mars only differed by including low-probability rules for structures a deterministic gram-

mar would probably reject. Next, Black et al. included a number of features, all of which are

defeasible; and more recently Collins includes around a dozen parameters (Collins, 1999)

that are better described as guiding the search space than being defeasible features because

they are so easily contradicted. Despite publishing in 1996, Bod has taken this trend to its

logical conclusion in data-orientated parsing (DOP, or TREE-DOP) (Bod and Scha, 1996). In

this formalism the grammar is replaced with the entire training corpus.

Simply using the corpus itself as the grammar would make it impossible to parse novel sentences. Instead, Bod and Scha store every possible fragment (subtree) of every tree in the corpus.

With this representation it is possible to find the number of trees in the grammar that match

a given parse structure and the number that cannot be matched. In this way every structure in the corpus lends weight to similar interpretations. For example, Figure 2.16 shows a

complete grammar formed from a corpus of three sentences.

Figure 2.16: Sample DOP grammar for a tiny corpus

An obvious complication with this grammatical representation is how to store it. For

example, the Penn treebank contains fifty thousand sentences with an average of twenty-five words and forty phrases per sentence. Given that a tree of n nodes can be decomposed into on the order of 2^n subtrees, this gives around 10^7 subtrees per sentence, or over 10^12 subtrees in the training corpus. Even if hard disk sizes continue to increase at the same rate, it will take years before

a typical research workstation has this sort of capacity. Bod and Scha have not yet come up with an efficient solution to this; their current approach is simply to throw away random grammar rules until the grammar comes down to a manageable size. This topic,

and ways of resolving it, will be discussed in Section 2.7.2.
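To give a feel for the combinatorics, the following sketch counts the fragments of a single invented parse tree, under one common convention: each child of a fragment is either cut off at its root label or expanded further, and a word is never separated from its tag. This is an illustration of the counting argument only, not Bod and Scha's implementation.

```python
# Sketch: count the DOP fragments (subtrees) of one parse tree.
# A tree is (label, children); a word is a bare string. The tree is invented.

def fragments_rooted_at(node):
    _, children = node
    total = 1
    for child in children:
        if isinstance(child, str):      # a word: always kept with its tag
            continue
        total *= 1 + fragments_rooted_at(child)   # either cut here, or expand
    return total

def all_fragments(node):
    _, children = node
    return fragments_rooted_at(node) + sum(
        all_fragments(c) for c in children if not isinstance(c, str))

tree = ("S", [("NP", ["John"]),
              ("VP", [("V", ["likes"]), ("NP", ["Mary"])])])
print(all_fragments(tree))   # 17 fragments from this one small tree
```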

2.4.4 Klein and Manning’s statistical parser

A recent parser developed by Klein and Manning (2003) is particularly interesting because,

despite forgoing a lexicalised grammar, it achieves extremely high accuracy. Initial exper-

iments with PCFG by e.g. Black et al. had shown that a lexicalised probabilistic grammar

leads to significantly improved parsing accuracy (as was discussed in the last section). Before Klein and Manning's results were released, statistical parsing had become synonymous with lex-

icalised statistical parsing. However, people building statistical parsers invariably included

more information than just lexical items. For example, distinctions between verb adjunct

and verb arguments, information about semantic role of PPs, and subcategorisation lists

were commonplace. What Klein and Manning asked is: How well does a statistical parser

perform without lexical information, but with this extra information? The answer is, almost

as well as it would if it also had the lexical information. This result is extremely significant;

apart from the implications in linguistics, almost all the computational complexities in a sta-

tistical parser are side-effects of the lexicalised grammar. So it is worth examining Klein and

Manning’s grammar to see how the result was achieved.

Klein and Manning’s approach was to implement a basic PCFG and then to add informa-

tion to this grammar, measuring the improvements in the parser’s accuracy. The first step is

to include the parent nonterminal in attachments, so that for example when attaching the to

cat, the parent of NP is considered part of the grammar rule. This was motivated by noting

that a subject noun phrase is nine times more likely than an object noun phrase to expand as

just a pronoun. The second step is to shift to an approach which has been described in this

thesis as HPSG-like, where the head is generated first and the left and right siblings are then

generated. Since Klein and Manning’s approach is significantly different to HPSG, they in-

stead use the term ‘Markovize’ to describe the transformation. This step was motivated by

concerns that the grammar strongly disfavoured syntactic structures it had not seen during

training. One particularly nice observation of Klein and Manning’s was that these two steps

can be generalised by talking about the amount of vertical context (parents, grandparents,

etc.), and horizontal context (number of siblings) to use at once, so that a grammar formal-

ism could be described simply by saying v = 2, h = 1 to mean that the parent is used and one previous sibling is remembered. Under this formalism, Klein and Manning note that Collins' parser could

be approximately represented by v = 2, h = 1. Klein and Manning investigated performance

with a number of parameters as shown in Table 2.6.

                              Horizontal Markov Order
Vertical Order           h = 0     h = 1     h ≤ 2     h = 2     h = ∞
v = 1, No annotation     72.27     72.5      73.46     72.96     72.62
                         854       3119      3863      6207      9657
v ≤ 2, Some parents      74.75     77.42     77.77     77.50     76.91
                         2285      6564      7619      11398     14247
v = 2, Parents           74.68     77.42     77.81     77.50     76.81
                         2984      7312      8367      12132     14666
v = 3, All GParents      76.74     79.18     79.74     79.07     78.72
                         7797      15740     16994     22886     22002

Table 2.6: Klein and Manning's parsing accuracy and grammar size for different model complexities (in each cell, the upper figure is the accuracy and the lower figure is the grammar size)

By noting that the first cell in this table is 72%, we can see that Klein and Manning were able to improve the parser's accuracy by almost ten percent. However, 79% is still far from

state-of-the-art. Klein and Manning then investigated a very large number of improvements

which were applied cumulatively. As a random example, giving percentage signs their own

tag instead of sharing the symbol tag leads to four percent fewer errors. Klein and Man-

ning used a total of fifteen such improvements to get a cumulative performance improve-

ment of another ten percent, to 87%. The four most significant new features were TAG-PA

(providing the grammar with the parent’s POS tag), SPLIT-IN (passing the IN tag to its

parent), DOMINATES-VERB (set to true if the child includes a verb), and RIGHT-REC-NP (for NPs with a recursive NP on the right). RIGHT-REC-NP is a simple distance met-

ric — it is designed to discourage attachments that are excessively large. Further details

of these metrics and the others used can be found in Klein and Manning’s source code at

http://nlp.stanford.edu/downloads/lex-parser.shtml.

A similar approach was taken by Bikel (2004), which contains a more detailed analysis of

the factors underlying the success of Collins’ parser. Essentially, the conclusion is the same:

the crucial factors are not word representations, but clever preprocessing.

2.5 Backoff, interpolation and smoothing

Before we discuss how these probabilistic grammars can be used in parsing, it is important

to discuss some practicalities concerning how the probabilities are derived. Ideally they

would be produced by counting events in the training corpus as has already been described.

However, the training corpus is finite, and it may not contain exactly the events we need in

order to compute the probabilities for a new sentence being parsed. What if the event did


not occur frequently enough for counting its occurrences to be an accurate sample, or didn’t

occur at all? This problem is particularly difficult when the event representation is complex

because this will decrease counts, as mentioned in the previous section.

To illustrate the problem, consider parsing our example sentence The man saw the dog with

a telescope. As discussed in Section 2.4.1, we have to decide whether to attach the PP with a

telescope to the NP the dog or the VP saw the dog. If we are using a lexicalised probabilistic

grammar, we can estimate the probabilities of these two events by the occurrence counts

in the corpus for PPs headed by telescope attaching to NPs headed by dog or VPs headed

by saw, as shown in Equations 2.7 and 2.8. If we are using the WSJ as our training corpus

then unfortunately there are no such events; in fact there are only a few adjacency events for

telescopes (Hubble, space, was, instrument and the). Since telescope was never associated with

saw or dog in the corpus, the estimated probability of either attachment is zero. The problem

is that the corpus is not large enough to contain useful counts for every type of event. Even

if the previous example had occurred in the corpus, the single occurrence would not have

been enough to give an accurate probability.

There are two separate kinds of solution to this problem. One approach is to simplify

the events being looked up until they are general enough for there to be sufficient counts of

similar events in the corpus. At this point the relative frequencies in the corpus become an

accurate estimate of the correct probability of the event. This process is called backoff. A

second approach is to look for ways of determining how much we can trust the counts of any

given event in our training corpus. (If we trust them completely then it will be impossible

to generate any novel structures since they will all have a probability of zero.) In general,

the fewer instances of an event there are, the less we will trust the estimate derived from the

corpus. We need to work out how we can derive an estimate of the true frequency from the

counted frequency. This process is called smoothing.⁴

2.5.1 Backoff and interpolation

Let’s return to our example sentence The man saw the dog with the telescope. As just noted, we

need to look for PPs headed by telescope attaching to NPs headed by dog or VPs headed by

saw, and the problem is there are none in the corpus. Of course, if we were working with a

grammar that did not contain any lexical heads, then there would be no problem; there are

thousands of cases where PPs attach to both NPs and VPs and so the probability estimate

is very good. Thus we can back off by deciding to throw away lexical heads altogether,

4The terms ‘smoothing’ and ‘backoff’ are used in different ways by different writers. For instance, Niesler

uses the term discounting where I use the term smoothing. The definitions I have just given will be used in the

remainder of this text, except when I refer to the titles of existing techniques such as ‘Katz smoothing’ (which in

my terms is actually a combination of smoothing and backoff techniques).

33

and thereby derive a reliable estimate from the corpus. On the other hand, the backed-

off grammar will not be as sensitive in choosing the correct parse, precisely because it has

thrown away useful information about how to do this. The problem here is how far to back

off. Solving this problem is one of the most important issues in statistical parsing.

A first step towards a solution is to represent events at several different levels of granu-

larity. For instance, we might use lexical information when counts are high and then discard

it for a less accurate model when counts are low. Having decided the different levels, the

next step is to combine their probability estimates into a single probability. This process is

known as interpolation.

n-gram models

In order to describe backoff and interpolation formally, it is useful to think about a simple

example domain. While we have so far been thinking about sentences as hierarchical syn-

tactic structures, it is quite common in statistical NLP to think about a sentence simply as a

sequence of words. In this situation, the context of a given word in a sentence is simply the

sequence of words which precedes it. Formally, we would write the probability of the i'th word in the sentence as shown in Equation 2.9. In this equation, and elsewhere in the thesis, the notation w_i refers to the i'th word in the sentence, and w_1^{i-1} refers to the sequence of words from the first word to the (i-1)'th word.

P(w_i | w_1^{i-1})                                                          (2.9)

To estimate the probability of a word appearing after a given sequence of words, we can

then simply count how many times the word appeared after this sequence divided by the

number of times the sequence itself occurred. This equation is not very useful in practice

because it makes it impossible to derive probabilities for words appearing in novel contexts.

To resolve this, we approximate the context by the last few words. This is known as an

n-gram approximation, where n is the number of words making up the context. Using this

idea, an n-gram approximation of this equation is given in Equation 2.10.

P(w_i | w_1^{i-1}) ≈ P(w_i | w_{i-n+1}^{i-1})                               (2.10)

Essentially, we are assuming that the probability of seeing wi is independent of the words

seen much earlier in the sentence. Careful independence assumptions always underlie the

process of backoff: when we decide to throw away some piece of knowledge about an event

to increase counts, we are always assuming that this component is independent of the aspect

we are interested in.

By varying n we can sacrifice counts for greater discriminating power. So, for instance,

a unigram model would give the probability of the current word, regardless of context; a


bigram model takes the previous word into account to predict the likelihood of the current

word, while a trigram model takes into account the two previous words.
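A minimal sketch of these maximum-likelihood n-gram estimates, using an invented toy text:

```python
# Sketch: relative-frequency (maximum-likelihood) n-gram estimates.
# The toy training text is invented for illustration.

from collections import Counter

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def p_ngram(words, n, word, context):
    """Estimate P(word | context), where context is the previous n-1 words."""
    assert len(context) == n - 1
    numer = ngram_counts(words, n)[tuple(context) + (word,)]
    denom = ngram_counts(words, n - 1)[tuple(context)] if n > 1 else len(words)
    return numer / denom if denom else 0.0

text = "the cat sat on the mat and the cat slept".split()
print(p_ngram(text, 1, "cat", []))               # unigram: 2/10
print(p_ngram(text, 2, "cat", ["the"]))          # bigram: 2/3
print(p_ngram(text, 3, "sat", ["the", "cat"]))   # trigram: 1/2
```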

When using an n-gram model, we sometimes need to estimate the probability of a com-

mon event, and we sometimes need to estimate the probability of a rare event. In the former

case, we would like to derive our estimate by using a fairly large value of n, because there

will be high counts of the relevant events in the corpus. In the latter case, we would prefer

n to be small, since data will be sparser. To allow both situations, it makes sense to be able

to operate at several different levels of backoff, as mentioned in the previous section, by

counting events with several different values of n. A first step could be to adjust the model

complexity based on the counts, using something like Equation 2.11.

P(w_i) = P(w_i | w_{i-n+1}^{i-1})    for high counts
P(w_i) = P(w_i | w_{i-n+2}^{i-1})    otherwise                              (2.11)

This equation has the advantage that it is extremely simple, but it has several problems.

Firstly the use of two subequations like this makes it extremely difficult to ensure the prob-

ability distribution sums to one, and secondly it seems wasteful to discount the probability

estimate produced using the small to medium number of counts and only use the backed-off

model.

Interpolation using n-grams

To address both these problems, Equation 2.11 is never used in practice. Instead we inter-

polate between these models as shown in Equation 2.12.

P(w_i) = λ P(w_i | w_{i-n+1}^{i-1}) + (1 - λ) P(w_i | w_{i-n+2}^{i-1})      (2.12)

In this equation, λ is a value between zero and one which determines how much weight-

ing to give to the more complex model; when the counts are high, λ will be near one. The

equation for actually computing λ is not presented here since there is no single standard

equation. One key idea is that this approximation can be applied recursively, so the (n-1)-gram model can be simplified to an (n-2)-gram model, and so on. While in theory we could reduce the

complexity of the model very slowly using this method, in practice the number of probabil-

ity estimates is tightly constrained. Parsing time and memory usage are directly dependent

on the number of backoff levels because each level of backoff requires storing a set of events

covering the whole training corpus. So going from one level of backoff to two will almost

double parsing time and memory requirements.
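A sketch of Equation 2.12 for the bigram/unigram case is given below; the toy text is invented and λ is fixed by hand, whereas a real system would estimate it, typically from held-out data. When the more specific counts are unreliable, λ is pushed towards zero and the estimate degrades gracefully to the simpler model.

```python
# Sketch of simple linear interpolation (Equation 2.12) between a bigram and a
# unigram estimate. The toy text is invented and lambda is fixed by hand.

from collections import Counter

text = "the cat sat on the mat and the cat slept".split()
unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

def p_unigram(w):
    return unigrams[w] / len(text)

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev, lam=0.7):
    # lam near 1 trusts the more specific (bigram) model; near 0 backs off.
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

print(p_interpolated("cat", "the"))   # 0.7*(2/3) + 0.3*(2/10) ≈ 0.527
print(p_interpolated("dog", "the"))   # unseen: 0.3 * P(dog), which is 0.0 here
```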


Deciding which terms to treat as independent

In the previous equations, the n-grams were used to compute the probability of a word. In

this case, it is obvious that close words are a better predictor of the likelihood of the current

word than distant words, and so the simplified models simply discard the distant words. Let

us now return to the scenario where we are estimating the probability of whole grammatical

constructions. In this scenario, the order in which terms should be discarded is much less

obvious. For instance, should we discard the word first, or the POS tag? This problem will

be discussed in Chapter 7.

Another closely related point is that in parsing we frequently work with conditional

probability statements with more than one term on the left. Recall for instance that in Equa-

tion 2.6 we are computing the probability of seeing a constituent with a given nonterminal

category and a given headword in a particular syntactic context. (Thus in our PP attachment

example we are computing the probability of seeing a PP headed by the word telescope in

various contexts.) By specifying that we are computing both terms at once, we are assuming

that these terms are dependent on each other. That is, we cannot reduce the probability to a

simple product:

P(RNT, Rw | HNT, Hw, PNT) ≠ P(RNT | HNT, Hw, PNT) × P(Rw | HNT, Hw, PNT)    (2.13)

This may seem obvious — the word and the nonterminal are clearly related — but en-

forcing their dependence causes problems. For instance, if the word was never seen with

this nonterminal in the training corpus then the probability of the pair in any attachment

would be undefined. What we would like to do is break the dependency assumption when

the counts are too low, much as we discarded extra information from the more complex

models. The approach taken is by noting the basic Equation 2.14 from statistics.

P(a, b) = P(a|b)P(b) (2.14)

This equation says we can split any joint event into a conditional probability multiplied by a marginal probability. An important point for later equations is that we can have extra conditioning terms alongside the b (for instance, P(a, b | c) = P(a | b, c) P(b | c)) without affecting the equation at all. Using Equation 2.14, we can replace the incorrect Equation 2.13

with a corrected version as Equation 2.15

P(R_NT, R_w | H_NT, H_w, P_NT) = P(R_NT | R_w, H_NT, H_w, P_NT) × P(R_w | H_NT, H_w, P_NT)      (2.15)

Initially, this has not gained us anything: Equation 2.15 still involves the calculation of counts(R_w, H_NT, H_w, P_NT), and it was the lack of reliable counts for this term that led us down this path in the first place. However, the simplification techniques just discussed can now


be applied to each of these terms separately. Because of this, we can assume independence

as necessary to increase counts.
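As an illustration of Equations 2.14 and 2.15 (and not of any specific parser's parameterisation), the sketch below estimates the joint dependency term as a product of two conditional factors, each of which backs off through its own, independently chosen sequence of coarser contexts whenever its counts are too low. The field orderings and the count threshold are assumptions made for the example.

def backed_off_estimate(outcome, contexts, counts, min_count=5):
    """Relative-frequency estimate of P(outcome | context), taken from the most
    specific context (contexts are ordered specific -> general) whose total
    count reaches min_count."""
    for ctx in contexts:
        seen = counts.get(ctx, {})
        total = sum(seen.values())
        if total >= min_count:
            return seen.get(outcome, 0) / total
    return 0.0  # no level had reliable counts

def p_dependency(r_nt, r_w, h_nt, h_w, p_nt, nt_counts, w_counts):
    """P(R_NT, R_w | H_NT, H_w, P_NT) decomposed as in Equation 2.15:
    P(R_NT | R_w, H_NT, H_w, P_NT) * P(R_w | H_NT, H_w, P_NT),
    with each factor backed off separately."""
    p_nonterminal = backed_off_estimate(
        r_nt, [(r_w, h_nt, h_w, p_nt), (h_nt, h_w, p_nt), (h_nt, p_nt)], nt_counts)
    p_word = backed_off_estimate(
        r_w, [(h_nt, h_w, p_nt), (h_nt, p_nt), ()], w_counts)
    return p_nonterminal * p_word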

2.5.2 Smoothing

In any empirical modelling it is quite normal to distinguish between the model and the population it is estimated from. For example, if a die is rolled twice and scores a six both times

then a model of the die will say it always rolls six, yet it is entirely possible the die is fair and

the double six is a coincidence. At the same time we cannot conclude the die is fair because it

may be weighted, though by throwing the die more often we can gain increased confidence

in the model. Similarly in probabilistic grammars, just because an event has never occurred

before does not make it impossible. The probability estimate must balance the distribution

of language it has seen with an acceptance that it has only seen a small and probably biased

subset of possible sentences.

Maximum Likelihood estimation

The principle of maximum likelihood estimation (MLE) can be summarised by the state-

ment: Find the parameters that make the observed data most likely. If we treat the corpus

as our source of probabilities then we can derive a probability model from it directly. However, if we instead treat the corpus as a random sample of the true language, then we maximise the probability both of seeing our corpus and of seeing the input sentence.

In practice, it is impossible to compute MLE for complex domains and so most practical

techniques involve approximations. This usually comes down to the same steps as generat-

ing a probability with an additional tweak to provide some counts for events that were not

seen during training. There are a few approaches to approximating MLE; in this thesis I will

introduce two.

Add-one smoothing

A very simple way to account for the finite size of the corpus is simply to add one to every count. For every event that occurs zero times, this means we treat it as

if it occurred once. This approach works surprisingly well, and is included in many other

approximations of MLE.
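A minimal sketch of add-one smoothing, assuming the space of possible outcomes is known in advance:

from collections import Counter

def add_one_probs(observations, outcome_space):
    """Add one to every count, including outcomes never observed,
    then normalise to a probability distribution."""
    counts = Counter(observations)
    total = len(observations) + len(outcome_space)
    return {o: (counts[o] + 1) / total for o in outcome_space}

# e.g. add_one_probs(["NP", "NP", "VP"], ["NP", "VP", "PP"])
#      -> {"NP": 0.5, "VP": 1/3, "PP": 1/6}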

Good-Turing estimate

Good-Turing is a technique to address the problem of low frequency events by estimating

the frequency of events that were never seen during training (Gale and Sampson, 1995).

To alleviate the problem of events which did not occur during training, the Good-Turing


estimate says we should replace the counts for how often an event actually occurred by an

estimate of how often it occurred. More precisely, it says:

r* = (r + 1) × E(n_{r+1}) / E(n_r)                                          (2.16)

In this equation, r is the number of times an event occurred during training, and r∗ is our

reestimation of r. n_k is the number of events that occurred k times in the training corpus, and the function E smooths n_r, since actual values of n_r are subject to a lot of noise, especially for

large r. Since this equation adjusts the number of times an event occurs, this will affect the

probability of that event. Of course, this is our intention.

So what does this equation mean? The r + 1 term adds one to the frequency of every

event, in an analogous manner to add-one smoothing. The total number of events that

could be expected to have been seen is E(n_0), which we approximate by E(n_1), and then further approximate by n_1. Since we have given each of these n_1 unseen events a frequency of 1, we must subtract n_1 events to keep the probability model summing to one, and since we do not know where to subtract the events from, we distribute the subtractions across all events. Since n_{r+1}/n_r will tend to one for high r, this term emphasises subtractions from

infrequent events.
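The reestimate of Equation 2.16 can be applied to a table of raw event counts as in the sketch below; for simplicity the raw frequencies of frequencies n_r are used directly rather than a smoothed E(n_r), which, as noted above, is only reasonable for small r.

from collections import Counter

def good_turing_adjusted_counts(event_counts):
    """Replace each raw count r by r* = (r + 1) * n_{r+1} / n_r (Equation 2.16),
    using unsmoothed frequencies of frequencies."""
    n = Counter(event_counts.values())          # n[r] = number of events seen r times
    adjusted = {}
    for event, r in event_counts.items():
        if n[r + 1] > 0:
            adjusted[event] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[event] = r                 # no data to reestimate; keep the raw count
    return adjusted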

2.5.3 Combined interpolation and smoothing techniques

We are now in a position to cover two backoff techniques. The techniques that will be ex-

amined are perhaps the most popular in the literature, Jelinek-Mercer smoothing and Katz

smoothing.

Jelinek-Mercer Smoothing

Jelinek and Mercer have developed a model that uses a Maximum Likelihood estimate in-

stead of the Good-Turing estimate just discussed. Jelinek’s equation is presented in Equation

2.17. As with the equations in Section 2.5.2, this equation is presented as an estimation of

the probability of a word in terms of n-grams.

P_interp(w_i | w_{i−n+1}^{i−1}) ≜ λ_{w_{i−n+1}^{i−1}} P_ML(w_i | w_{i−n+1}^{i−1})
                                  + (1 − λ_{w_{i−n+1}^{i−1}}) P_interp(w_i | w_{i−n+2}^{i−1})      (2.17)

The key innovation in this equation is the use of recursion to combine backoff and

smoothing. The probability of a target word given a context of n− 1 preceding words is

first computed using some approximation to MLE (in the above equation, P_ML(w_i | w_{i−n+1}^{i−1})).

This probability is then interpolated with a recursive call to the same equation with n− 1.


Essentially, this equation is identical to Equation 2.12 except that, instead of simply looking up the counts for each level, the maximum likelihood estimate is used, and that Jelinek gives a specific function for computing the interpolating term λ. The actual algorithm

used to compute λ is quite complex and precise details will not be presented here (see Jelinek

and Mercer (1980)). Like Good-Turing they look for events that occurred the same number

of times as the event being estimated, and they bucket neighbouring counts together to cope

with insufficient training data.

Katz smoothing

While Jelinek-Mercer builds on MLE, Katz smoothing is a direct extension of the Good-

Turing estimate to interpolate between different n-grams. It was developed and described

by Katz (1987), although the equations in this thesis are based on a slightly modified form

by Wu and Zheng (2000). Equation 2.18 shows how a probability is computed using Katz

smoothing, for an n-gram model.

P_katz(w_i | w_{i−(n−1)}^{i−1}) ≜ λ(w_i, w_{i−(n−1)}^{i−1}) P_GT(w_i | w_{i−(n−1)}^{i−1})
                                  + (1 − λ(w_i, w_{i−(n−1)}^{i−1})) P_katz(w_i | w_{i−(n−2)}^{i−1})      (2.18)

The similarity between this equation and Equation 2.17 is obvious. Apart from the use of a different lambda function, the only difference is the use of P_GT, the Good-Turing estimate, instead of P_ML. Katz's lambda term is significantly simpler than Jelinek's; it is assumed to be

one (no interpolation) if there are any counts. A much more complete introduction to Katz

smoothing can be found in a number of references, such as Chen and Rosenfeld (2000).

Comparison of backoff techniques

Having covered two of the standard methods for combining interpolation and smoothing,

it seems reasonable to compare them and choose the best. Curiously, this is a step most

people in the literature skip, simply stating for instance that they are using Jelinek-Mercer

smoothing, Katz smoothing, Witten-Bell smoothing, or Wu’s enhanced version of Katz, etc.

The best comparison of the different methods is provided by Chen and Goodman (1996).

They found that both Jelinek-Mercer and Katz smoothing perform well, with Katz smooth-

ing performing slightly better on more complex models with more training data. They also

demonstrate several other smoothing algorithms which perform better; these will not be

covered here since we are discussing well-established algorithms.


2.6 Probabilistic parsing algorithms

The parsing algorithm most commonly used in statistical parsers is heavily based on the

standard chart parsing algorithm already discussed in Section 2.2. There are a number of

extensions needed to this algorithm in order to support statistical parsing. The most ob-

vious one is that, given every edge in the chart is associated with a probability, how can

a probabilistic grammar be incorporated into a chart parser? For one thing, a chart parser

is (almost always) a bottom-up parsing algorithm, while the statistical grammars we have

been working with use a top-down probability model. In fact there is no contradiction here,

but it is worth spelling out how the parser and the probability model work together. Recall

that in chart parsing we begin by creating an edge for each word in the input string, then we

derive all combinations of edges recursively, computing all possible parses of all substrings

of the input string, with the set of parses of the full input string being computed last. In

a statistical grammar with a top-down probability model, we compute the probabilities of

child nodes given parent nodes — for example the probability that a certain parent node

expands as a subtree with a certain node as head child and certain other nodes as left and

right sisters of this head. If a top-down probability model is used in a chart parser, this

means that the edges in the chart are all generated bottom-up, in the usual chart-parser way,

but that the probability of each of these edges is computed by beginning at its root node.

Some extensions, which are needed to adapt a regular chart parsing algorithm to a prob-

abilistic grammar, are specific to the particular grammar formalism we are using. (For in-

stance, if we adopt an HPSG-based formalism, we need to modify the operations for cre-

ating and combining edges; see Section 3.2 for a discussion of these extensions.) However,

the main extensions which are needed result from the fact that statistical grammars derived

from large corpora are typically far too big to permit the complete set of possible parses to

be derived for each constituent. Chart parsing has a complexity of O(n3m2) where n is the

sentence length and m is the number of rules in the grammar; if there are lots of grammar

rules then parsing is impossible for ordinary-sized input sentences. The solution is to move

away from computing every possible parse of every possible substring of the input. Clearly

what we want to do in practice is to throw away some of the edges with low probability.

This operation can be interpreted using the metaphor of heuristic search in classical AI, or

using the related metaphor of efficient graph search from computer science, or using the

slightly different metaphor of Markov modelling from probability theory. In practice, sta-

tistical parsers typically use algorithms ‘inspired by’ these metaphors, rather than precise

implementations, but it is worth presenting the underlying theory clearly before discussing

approximations in real implementations. Before we discuss these approaches, however, we

first need to say a little more about the notion of ‘the probability of an edge’, because this

term can actually be understood in two separate ways.


2.6.1 Inside and outside probabilities

When we say that the probability of a constituent is p, what does this mean exactly? One

thing it could mean is: the probability that the substring spanned by the constituent has this

parse is p. This is known as the inside probability of a constituent. There are a couple of

things to note here. Firstly, this probability is completely independent of probabilities of

structures elsewhere in the sentence; it is a statement about the substring in question, and

nothing else. Secondly, p is the probability of this interpretation of the substring as opposed

to another interpretation. The probabilities of all possible interpretations of the substring

will sum to 1.

There is a problem in using inside probabilities by themselves as a guide for deciding

which edges to focus on in order to reduce the complexity of parsing. Consider a substring

which splits a sentence in an unnatural way; e.g. the boy saw. This substring is unlikely to

feature as a constituent in the final parse. But it might nonetheless have some possible parses;

for instance, it might have a single parse as an NP (analogous to the band saw). In fact, the

inside probability for this parse will be very high; since we are just looking for parses of this substring, and there is only one such parse, its inside probability will actually be 1. Conversely, if a substring corresponds to a more natural constituent of the sentence, then it will

probably have several possible parses; for instance saw the girl can be analysed in several

ways. The inside probabilities of each of these parses must still sum to 1, and each will thus

be correspondingly lower. This is unfortunate; it has the effect that the probability of the best

parse of a good candidate substring will probably be lower than that of a bad one. Naturally,

this problem will resolve itself when we try to combine these edges with other edges. the

boy saw parsed as an NP will combine very badly with other edges in the chart, and result

in edges with very low probabilities, whereas the parses of saw the girl will combine more

successfully. However, during parsing, the spuriously high probability of constituents like

the boy saw, which are locally likely but globally implausible, could easily lead us to wasting

time. This is especially true if we are implementing a heuristic which leads us to concentrate

on the locally most likely edges.

Fortunately, there is another way of looking at the probability of a given parse of a sub-

string: the probability of this parse appearing as part of the final parse of the sentence. This

is known as the outside probability of the parse. This alternative interpretation nicely resolves the problem just described, but it has the inverse problem. Say

S→ NP, VP is the most common rule application found in the training corpus. If we favour

edges with high outside probabilities, we will tend to interpret every edge as S→ NP, VP,

regardless of what material it actually contains. As with inside probabilities, this problem

is resolved at the end of parsing, but will easily lead to us wasting time trying to generate

constituents which are locally unlikely but fit well on a global scale.


In practice, when we are parsing, we want to use both the inside and the outside prob-

abilities of the edges we generate to help us focus on the most likely edges. If we have

a bottom-up parser, we can compute accurate estimates of the inside probabilities of the

edges we generate, but we have to estimate outside probabilities more crudely, because we

do not yet know the global structure of the input string. If we have a top-down parser, we

begin by computing hypotheses about the global structure of the input string, and so we can

compute accurate estimates of the outside probabilities of edges, but we have to use cruder

estimates of inside probabilities, because we have not yet generated their internal structure.

The cruder estimates in each case can simply be derived by looking at the relative frequency

of suitable events in the whole training corpus. For instance, if we have a bottom-up chart

parser, we can estimate the outside probability of an edge whose parent node is P simply by

counting the number of Ps in the whole corpus and dividing by the total number of nodes

in the corpus.

The difference between inside and outside probabilities is illustrated for saw the girl in

Figure 2.17.

[Figure 2.17: Partial parse showing the different areas examined by the inside and the outside probabilities. The smaller, dark-shaded triangle is the constituent saw the girl, labelled VP(. . .); the larger, light-shaded triangle is the surrounding S(. . .).]

In this figure, the smaller triangle spans the constituent saw the girl, and it has the analysis VP(. . .). The inside probability of this constituent is the probability of VP(. . .)

being the appropriate interpretation of these three words (the area with dark shading). The

outside probability of the constituent is the probability of VP(. . .) appearing in this position

in a parse of the rest of the sentence (the area with light shading).

Using this terminology it is much easier to follow what happens to the probabilities

of edges during chart parsing. Since we start with single words, each edge will have an

inside probability of one, and an outside probability estimated simply by counting the rel-

ative frequency of the word in the whole training corpus. As the words are expanded into

constituents, their inside probabilities will always decrease while their outside probabili-

ties will typically increase. Inside probabilities are strictly decreasing — the probability of a

combined edge will be the product of the two edges which form it and the probability of this

combination, and since all probabilities are bounded by one, this must be less than or equal


to the lowest of the three terms. However, the outside probability is not accumulated with

each edge; it must be recomputed after each combination and may be larger or smaller than

its children’s outside probability. Since combined edges span more of the sentence, leaving

less unknown, they tend to have a higher outside probability. Of course, the logic in this

paragraph could easily be reversed if the parser started with the root node of the tree. In

that case the outside probabilities would be accumulated while inside probabilities would

have to be recomputed at each step.

2.6.2 Parsing as state space navigation

This thesis largely ignores the theoretical foundations of computer language processing,

they have simply been taken for granted. However, there are some optimisations to the

parsing process that are best explained by examining the foundations of parsing and so

these will be briefly discussed here. We have already discussed the idea that a grammar can

be viewed as a set of phrase-structure rules. These rules can also be viewed as transforming

from one state into another. Formally, we can define a grammar as a four-tuple consisting

of a start state S, an input string i, a set of transformation rules t, and at least one goal state x. Given this representation, parsing is the process of starting at the start state and applying transformations from t until a goal state x is reached.

Parsing with hypergraphs

Having formally defined parsing, representing it as a walk in a graph is relatively simple: the application of one of the transformation rules outlined above (the grammar rules) corresponds to following an edge. Further, probabilistic parsing can be represented by putting a cost on the arcs in the graph. Under this formalism, if a node can be reached then the corresponding string can be parsed, and the shortest path corresponds to the best parse.

The advantage of looking at parsing this way is that there are a large number of algo-

rithms for graph processing that can then be applied to parsing. For instance, applying

Dijkstra’s algorithm leads to a probabilistic chart parser. This approach has been examined

extensively by Goodman (1998) and also by Klein and Manning (2001b, and other papers).

Parsing and A∗ search

Just as with graphs, it is relatively simple to define the problem of finding the most probable

parse (or, a good parse) of an input string as heuristic search, in the classical AI sense of the

word. Search involves systematically generating a set of possible states of a domain, starting

from an initial state and attempting to produce a goal state. Each state can be expanded, to

produce new states. In a simple top-down parser, each state is a partially-built parse tree;


each state can be expanded by choosing a node in this tree and applying all possible rules

in the grammar to grow this node.5 In a chart parser, states of the search space are not

stored explicitly, but the set of states on the fringe of the search space (i.e. the set of nodes

in the search space which have yet to be expanded) can be viewed as the set of possible

combinations of edges which span the complete input string. Expanding a state on the fringe

can still be quite clearly modelled as creating all possible edges that span a new substring,

using the existing edges in the chart.

In AI search, different routes to a goal state can have different costs. There is a straight-

forward analogy of cost in probabilistic parsing; the cost of a complete parse is simply an

inverse function of its probability, so that the highest probability parse has the lowest cost

and vice versa. A heuristic search is used in cases where the search tree is too big to be gen-

erated in its entirety. This is clearly the case in our probabilistic parsing scenario. The idea

is to estimate the cost of intermediate states in the search space, and first expand those with

the lowest cost. This is known as best first search. One useful algorithm from classical AI is

the A∗ search strategy which is a form of best-first search in which the heuristic evaluation

function is broken into two parts — one part is the cost of getting to the current state, and

the second part is an estimate of the cost of reaching the goal state from the current state. If

two conditions are met, it can be shown that an A∗ search is guaranteed to terminate, and to

find the shortest-cost path to the goal first.

The first condition is that all path costs are non-zero. This corresponds to the requirement

that all grammar productions have an inside probability strictly less than 1. Given a suitable

smoothing model, this will indeed be the case.

The second condition is that the heuristic evaluation function is optimistic about the cost

of reaching the goal from the current state. This means that the actual cost of reaching the

goal from the current state can never be less than our estimate of the cost. It is possible to

use the inside and outside probabilities so that they meet the requirements of an A∗ heuris-

tic. The inside probability accurately represents the cost to the current state while, at least

theoretically, the outside probability accurately represents the cost to get from the current

state to the goal. In practice the method used to evaluate the outside probability is just an

estimate, but it is possible to write this estimate so that it is always optimistic. For instance,

Klein and Manning (2002) developed such a parser.
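The sketch below illustrates the agenda mechanics just described; it is not Klein and Manning's parser. Edges are assumed to carry an inside probability, and the outside estimate is any optimistic (never pessimistic) probability of the edge fitting into a complete parse, for example the corpus relative frequency of the edge's parent nonterminal; all probabilities are assumed to be strictly between zero and one.

import heapq, math

def astar_parse(initial_edges, expand, is_goal, outside_estimate):
    """Best-first (A*-style) chart exploration. `expand(edge)` returns new edges
    derivable from `edge`, each carrying an inside probability; `outside_estimate`
    returns an optimistic probability of the edge fitting into a full parse."""
    agenda = []
    for edge in initial_edges:
        cost = -(math.log(edge.inside_prob) + math.log(outside_estimate(edge)))
        heapq.heappush(agenda, (cost, id(edge), edge))
    while agenda:
        _, _, edge = heapq.heappop(agenda)
        if is_goal(edge):
            return edge                      # first goal popped has the lowest cost
        for new_edge in expand(edge):
            cost = -(math.log(new_edge.inside_prob) +
                     math.log(outside_estimate(new_edge)))
            heapq.heappush(agenda, (cost, id(new_edge), new_edge))
    return None

Because the cost is the negative log of the combined probability and the outside estimate is optimistic, the first complete parse removed from the agenda is the most probable one, in line with the two A* conditions above.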

5When we talk about ‘parent nodes’ and ‘child nodes’ in a parsing algorithm, there is a potential ambiguity,

since we could be talking about relationships in the search space, or in the syntactic structures being built. I will

restrict the terms ‘parent’ and ‘child’ to refer only to relationships in syntactic structures.


2.6.3 Viterbi optimisation

It is also possible to construe probabilistic parsing as a walk through a Markov model. A

Markov model specifies a set of states, and a transition probability between each pair of

states. In the parsing scenario, states are partial parses (just as they were when parsing was

interpreted as AI search). The transition probability between two states is the probability

of applying a grammar rule which makes the transition between the two states. Thinking

of probabilistic parsing in this way allows us to use standard techniques from probability

theory for reducing the complexity of probabilistic reasoning tasks.

The AI search algorithms described in Section 2.6.2 are still intended to find every pos-

sible parse of the input sentence. All A∗ search does is provide a useful ordering on nodes

to be expanded so that the most probable complete parse is the one found first. However,

we are frequently not interested in all parses of a sentence, but only in the single most likely

parse. The Viterbi algorithm is a general statistical technique for finding the most likely sequence of states in a Markov model (Viterbi, 1967). When applied to parsing, the algorithm states we

can discard any interpretation which we know will not form part of the final parse (Manning

and Schutze, 1999). Formally, we have previously been attempting to compute the probabil-

ity of different parse trees given some input sentence, i.e. P(T|S). The Viterbi optimisation states that we only need to find the most likely parse tree, argmax_T P(T|S).

The most obvious optimisation from the Viterbi algorithm is to remove ‘duplicate’ con-

stituents. Recall that when the parser considers how a given constituent might combine with

other constituents in the chart, it does so on the basis of the parent node of this constituent.

The parent node carries some information about the internal structure of the constituent (for

example, its headword and head constituent), but by no means everything. This means that

it is possible that the chart parser produces two edges which have different internal struc-

tures, but identical parent nodes. Since they are different, these edges will almost certainly

have different inside probabilities, but since they happen to have the same parameters, the

probability model will always give the same probability for any further grammatical pro-

ductions in which they appear. The probability of the full trees created from these new

productions is the inside probability of their constituents times the probability of the combi-

nation, so the constituent which had the highest inside probability to begin with will always

result in trees with higher probability than the constituent with the lower inside probability.

If we assume the goal of parsing is simply to find the single best parse, we can discard the

less likely subtree and know for certain that we are not discarding the best parse. This op-

timisation occurs very frequently in practice, because often local ambiguity relates to how

constituents are attached, but the final generated constituent must always be the same type.

For instance, consider Figure 2.18.

This figure shows how the phrase saw the girl with the telescope has two possible interpretations.

[Figure 2.18: Two alternate interpretations of saw the girl with the telescope (with probabilities 0.003 and 0.001), showing the effect of the Viterbi optimisation.]

Both of these interpretations have the same general structure; they are both VPs

headed by saw that do not need any more arguments. Since they are equivalent in the hy-

pothetical grammar that has been used throughout this section, the lower probability inter-

pretation will be discarded. If a different grammatical formalism was used which included

more internal information, such as Black et al.’s with its h2 or Bod’s which includes the entire

subtree, then the less likely interpretation could not be discarded.
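A sketch of the duplicate-removal step just described, under the assumption that each edge exposes a hypothetical signature() method returning exactly the fields the probability model can condition on (parent nonterminal, headword, head tag and so on); the edge representation itself is illustrative.

def add_edge(chart, edge):
    """Keep at most one edge per (span, signature): the one whose inside
    probability is highest. Discarding the others cannot remove the best parse,
    because the probability model treats edges with equal signatures identically."""
    key = (edge.start, edge.end, edge.signature())
    best = chart.get(key)
    if best is None or edge.inside_prob > best.inside_prob:
        chart[key] = edge
        return True       # edge kept, and may later be combined further
    return False          # dominated by an existing edge; discard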

Another optimisation caused by the Viterbi algorithm is to note that if no constituent

ends at a particular location, there is no point looking for constituents starting at this location

since they cannot possibly be used to span the whole input.

2.7 Three statistical parsers

Having discussed statistical parsing in abstract terms, I will now describe three real sta-

tistical parsers in order to show how the techniques interact. The parsers which will be

described are Klein and Manning (2003) because it is a high-performance non-lexicalised

statistical parser, Bod and Scha (1996) because it shows how a large amount of information

can be used, and Collins (1999) because it strikes a nice balance between theory and prac-

tice. Magerman’s parser SPATTER will not be discussed because it is a precursor of Collins’

system, and has been superseded by it. We focus here on complete systems, rather than on

grammatical formalisms, some of which have already been discussed.

2.7.1 Klein and Manning’s statistical parser

Recall that Klein and Manning’s parser is unlexicalised, and obtains its high performance

instead through the use of features such as whether the attachment is internal or external,

if it dominates a verb, contains a gap, and so on. Another interesting decision Klein and

Manning made was to use no smoothing or interpolation. This is only possible because of

the unlexicalised grammar, though even so it is a little surprising. Because of this, Klein and Manning had to be careful about the cost each extension imposed on the number of counts, just as lexicalised grammars must, and so they found the best cell in Table 2.6 was v ≤ 2, h ≤ 2 rather than the locally better performing v = 3, h ≤ 2. This same trade-off is ex-

tremely common throughout statistical grammars — extra information is usually useful, but

it results in decreased counts, which prevents other information being considered. A backoff

strategy that could intelligently choose which information to use would be extremely useful

both to lexicalised statistical parsers, and to Klein and Manning’s parser.

Final results from Klein and Manning's paper are very good: precision and recall are 86.3% and 85.1% respectively. So, despite being unlexicalised, Klein and Manning obtained

higher accuracy than Collins’ 1996 lexicalised statistical parser.

2.7.2 Bod’s statistical parser

To recall from Section 2.4.3, Bod and Scha’s grammatical representation stores a sample of

every permutation of every training tree from the corpus. Bod’s parser is called TREE-DOP.6

Like most statistical parsers, the parsing process is derived from chart parsing and involves

viewing the trees as rewrite rules.

Having built the grammar, TREE-DOP is able to parse by choosing trees from the rule

book that are likely to match the sentence. At each stage of the parsing process it will have

a ‘partial parse’ representing the sentence which has been matched so far. This, and the

current input word, are matched against trees in the rule book.

If only one tree matches the current partial parse then this is selected as the final parse.

Typically more than one tree matches the input word and the tree which is selected is the

one which has occurred most frequently with this input word. In Figure 2.19 the word Mary

is being added to a rule in the rule book to derive the partial parse shown. One nice feature

of this approach is that whole phrases such as idioms are easily recognised because their

tree matches the input perfectly.

[Figure 2.19: Using DOP to parse Mary likes. A tree fragment S(NP, VP(V likes, NP)) is composed with the lexical item Mary, which fills the open NP slot to give the partial parse S(NP Mary, VP(V likes, NP)).]

Parsing the input produces all parse trees, which Bod and Scha (1996) refer to as the

derivation forest. The actual process is shown in Figure 2.20. Most applications are only

6‘DOP’ stands for ‘data-oriented parsing’.


repeat until one derivation is clearly best
    for k := 1 to n do
        for i := 0 to n - k do
            for chart-entry(i, i+k) do
                for each root-node X do
                    select at random a subderivation of root-node X
                    eliminate the other subderivations
    add derivation to possibilities

Figure 2.20: Idealised pseudocode for Bod's statistical parser

interested in the most likely parse tree, so Bod would like to use some sort of Viterbi op-

timisation. However, his state-space is too large to search for the optimal solution as we

discussed above. Instead TREE-DOP uses Monte-Carlo sampling (Hastings, 1970) to esti-

mate the most likely parse in O(n3) time. Monte-Carlo sampling involves deriving multiple

parses from the derivation forest and selecting the most commonly occurring parse. The

weak law of large numbers guarantees that the most likely parse will be the one most likely

to be selected by Monte-Carlo sampling. Unfortunately, sampling requires an extremely large number of samples to be reliable, which makes parsing using TREE-DOP impossibly slow.

One solution to the slow parsing time was proposed by Goodman (1996) in which Bod’s

trees are transformed into equivalent PCFG rules. Goodman’s reimplementation is five hun-

dred times faster than Bod’s version (Goodman, 1996). However, Bod takes issue with call-

ing Goodman’s approach equivalent, stating that it does not find the globally most likely

parse (Bod, 1996); a revised version is presented in Scha and Bod (2003).

Because of the slow parsing time, Bod was unable to provide precision and recall figures

derived from exactly the same testing corpus as used by other researchers. On the sub-

corpus he used, he obtained 80% precision and recall.

2.7.3 Collins’ statistical parser

Michael Collins’ parser (Collins, 1996) has been an extremely influential contribution to the

field of statistical parsing. Since 1996 it has been the best performing statistical parsing

system, with an initial precision and recall of 86.3% and 85.8% respectively. Later improvements in (1997) and (1999) kept it ahead with a precision and recall of 88.7% and 88.6%. Collins' system

is essentially an extension of Magerman’s approach, and it performs better due to a number

of tweaks such as BaseNPs.

I will base my own implementation on Collins’ system, since it remains the state-of-the-

art statistical parser. A detailed overview of Collins’ system will be given in Chapter 3.


2.8 Summary and future direction

We have seen that statistical parsing is the way to go in natural language processing. It

provides much more flexibility, without the brittleness associated with deterministic pars-

ing. However, there are some questions about where statistical parsing should be headed.

The performance of current statistical parsers appears to be levelling off, with the difference between a generic implementation and the best implementation amounting to only a four percent reduction in errors.

Further, Klein and Manning have shown that the approach being taken for lexical informa-

tion is not working — including lexical information leads to far more work and complexity,

yet is providing us with only a tiny reduction in errors. Despite this apparent problem,

there has been no serious work yet to shift statistical parsers out of the field of carefully

edited financial text and into useful domains.

The performance of current statistical parsing systems

A simple PCFG parser obtains about seventy-five percent accuracy, while a top statistical

parser obtains a little over eighty-five percent. However, since Collins released his first

model in 1996, the state of the art has only improved about one and a half percent. To put

this in real terms, the best statistical parsers make about three parsing errors on a typical

input sentence, and parse perhaps twenty percent of sentences with no errors. Further-

more, these disappointing figures are when the parser is both trained and tested on care-

fully edited financial text. It is extremely unlikely that these performance figures would

be maintained if we were to test on less carefully edited text such as the Usenet group

biz.marketplace.investing or text outside the WSJ domain, such as an arts maga-

zine.

There are a number of approaches that can be taken to improving parser performance.

Perhaps the most logical next step is to concentrate on increasing parsers' ability to generalise to new domains of text. Later, we could look into methods requiring less supervision so that they could be trained on raw text. This is all uncharted territory: there are no corpora available in other fields for cross-validation, and there are not even any metrics for

measuring parser generalisation. The golden rule has been to always train on the same kind

of corpus as testing. Regardless of the approach taken, it is clear that something has to be

done. So this thesis will start with an existing state-of-the-art parser, and then look at ways

in which it can be improved with different domains in mind.

Software engineering is important

A number of statistical parsers have been written, and the source code for at least three is

available as free software. However, they are all documented at a theoretical level rather


than an implementation level. Since much of the work in designing and building a lexi-

calised statistical parser relates to implementation details, this presents a significant hurdle

to anybody attempting to write a new statistical parser.

To address this, the next two chapters describe in considerable detail exactly how Collins’

lexicalised statistical parser was implemented. It is hoped that this will give readers not only enough information to re-implement the parser, but also a description of the techniques and software engineering issues that are important in implementing a statistical parser.

What is the future of lexicalised parsing?

Klein and Manning note that most improvements are not lexical in origin; in experiments

disabling lexical information for Collins’ parser, only a tiny reduction of model accuracy

was seen. However, the intuitive appeal of lexical information remains; the difficulty is in

combating Zipf's law: the WSJ corpus is simply too small to have useful lexical counts.

The main goal of my thesis is to combat Zipf by looking at ways of clustering words

into categories. It is expected that when we start using word categories instead of words,

we will have sufficient counts for useful statistics while retaining useful lexical information.

This is the topic of Chapter 6. Another important aspect is looking at how these clusters

should be used in backoff. The linear nature of deciding what to back off next is wasteful of

useful information, and a smarter approach with possible applications in almost all areas of

probabilistic modelling will be discussed in Chapter 7.


Chapter 3

A description of Collins’ parser

In order to address the issues identified in the previous chapter, I need a parser to experi-

ment with. At the start of my project, there were no publicly available lexicalised statistical

parsers, so I had to write my own. I decided to reimplement an existing system, rather

than design my own parser from scratch. An existing parser provides a good guide: we can

expect that a parser designed along the same lines should be able to achieve similar levels

of performance. As for the choice of a parser to implement, I decided to reimplement the

parser developed by Michael Collins, simply because this parser is the one which performs

best. A specific version of Collins’ model had to be implemented, and Model 2 from Collins

(1997) was chosen as striking the best balance between complexity and accuracy.1

This chapter provides a detailed overview of Collins’ grammar formalism and parser.

The chapter serves two purposes. Firstly, it provides the background necessary to under-

stand my reimplementation of Collins in Chapter 4. Secondly, and equally importantly, it

provides a description of Collins’ system which in places is more detailed than that given

by Collins himself. Collins’ own publications about his system tend to concentrate on moti-

vations for the design of his system, and do not always provide a good introduction to the

system for programmers intending to work with the code, or make alterations. My descrip-

tion of Collins will focus more on these latter issues.

3.1 Collins’ grammar formalism and probability model

Collins' grammar is an HPSG-inspired approach very similar to the examples given in Section

2.4.1. This means that it is a top-down model (although the parser is bottom-up) in which

the head of each category is generated first and then constituents to the left and right are

added in any order. It differs principally by modifying the training data to make it more suitable for training a statistical parser, and also by supporting subcategorisation frames.

1The addition of gapping (Model 3 from the same paper) led to increased complexity without increasing accuracy significantly. Later, Collins (1999) revised his model but it was decided that the benefits this provided, particularly in punctuation, were not sufficient to justify any modifications.

3.1.1 New nonterminal categories: NPB and TOP

Perhaps the most obvious addition in Collins’ grammar is the creation of two abstract non-

terminal categories, one for non-recursive noun phrases and one for a distinguished top-

level sentence phrase.

A base NP (NPB) is an NP that does not include any NPs as children. The advantage

of distinguishing these from recursive NPs is that they tend to have very different usage

patterns. Consider the NP shown below, and the NPB it contains:

[NP [NPB Pierre Vinken] , [ADJP 61 years old]]

An NP which has an NP embedded inside it is likely to take arguments such as ADJP,

while the NPB is much more likely to decompose directly into proper or common nouns.

Similarly, there is room for improvement with the sentence category (S). Along with

SBAR, it is frequently used in the WSJ for recursive sentence substructures. This leads to

several problems with using S to terminate the tree. Firstly, it complicates prior probabilities

because we cannot assume the prior probability of S heading a tree is one. Secondly, S may

be generated with subcategorisation frames (see Section 3.1.2) which are then impossible to

discharge for the root node. While neither problem is critical, Collins created a new root

node which he calls TOP to avoid these problems.

3.1.2 A distance metric: adjacency and verbs

Recall that with the HPSG grammar model, each modifier to a head is modelled as a separate

event. However, languages, and especially English, tend to have strong preferences for

modification by nearby words; a model which fails to take this into account will perform

poorly. A distance metric is an attempt to encode the preference for nearby attachments

without significantly reducing counts.

In Collins’ theory, the distance metric is implemented as a number of simple heuristics.

The first two heuristics are tests to see if the words are adjacent, and if there are any verbs in

between. These heuristics are easy to motivate by noting that three quarters of attachments

are to neighbouring words and only one in twenty attachments has an intermediate verb

(Klein and Manning, 2003). An advantage of these heuristics over more sophisticated ones is that they can each be represented using a single bit, so each will only halve the expected number of counts for each event. There are two other heuristics used by Collins: two further bits

to represent the presence of coordination or punctuation, and subcategorisation frames to

represent a nonterminal’s complements. These will be discussed next.
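The two distance bits can be computed from the material between the head and the proposed modifier, as in the following sketch; the tag set and the representation of the intervening material are assumptions of the example rather than details of Collins' implementation.

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def distance_flags(intervening_tags):
    """Return the two distance heuristics as booleans:
    is the new dependent adjacent to the head, and does a verb intervene?"""
    adjacent = len(intervening_tags) == 0
    verb_in_between = any(tag in VERB_TAGS for tag in intervening_tags)
    return adjacent, verb_in_between

# e.g. distance_flags([]) -> (True, False); distance_flags(["NN", "VBD"]) -> (False, True)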

Coordination and Punctuation

Coordination and punctuation are treated as a special case by Collins’ model. In particular,

the model as implemented strips punctuation out of the sentence before parsing and adds it

back in when printing the final parse. That is not to say it has no effect on the parser; instead

a Boolean flag is set whenever an operation would cross over certain types of punctuation

so that the only counts used are events from the corpus in which this flag is also set. The

handling of coordination is similar, except that occasionally and is not tagged as coordina-

tion, and so it is not stripped out of the sentence but instead carefully skipped whenever it

is tagged as coordination.

Subcategorisation frames, complements and adjuncts

As already discussed in Section 2.1.2, head words carry with them a subcategorisation list

specifying what complements they require. For instance, John chased sounds a little odd as

a complete sentence, whereas John chased the dog is natural. Collins’ approach to subcat lists

is to include a (typically empty) bag of nonterminals that each constituent requires to be

complete. A bag is used instead of a set because a few words, such as thought, require two

arguments of the same type. In order to reduce its impact on counts, only a few classes of

nonterminal are counted.
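A subcategorisation bag of this kind can be represented as a multiset, so that a requirement for two complements of the same type is not collapsed into one. The sketch below uses Python's Counter as the bag and the '-C' complement-marking convention described below; the function itself is only illustrative.

from collections import Counter

def discharge(subcat_bag, child_label):
    """Remove one occurrence of child_label from the bag if the child is a
    complement; return the updated bag, or None if the attachment is invalid."""
    if not child_label.endswith("-C"):
        return subcat_bag                 # adjuncts do not touch the subcat bag
    if subcat_bag[child_label] == 0:
        return None                       # complement not expected here: reject
    remaining = subcat_bag.copy()
    remaining[child_label] -= 1
    return remaining

# e.g. discharge(Counter({"NP-C": 2}), "NP-C") -> Counter({"NP-C": 1})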

Sometimes the phrases that modify a word are optional. For instance, in John chased the

dog in the park, the PP in the park need not have been provided. Such phrases are called

adjuncts.2 There is no distinction between complements and adjuncts in the Penn treebank.

However, Collins uses a simple algorithm to preprocess the treebank to distinguish between

complements and adjuncts. The preprocessor turns every NP node into either an NP-A

(denoting an adjunct NP), or an NP-C (denoting a complement NP). This algorithm will be

discussed in Section 3.1.3.

3.1.3 Preprocessing the Penn treebank

The Penn treebank was discussed in Section 2.3.1. As well as the syntactic markup already

discussed, the treebank markup attaches a significant amount of semantic information to

nonterminals, such as -LOC for locative; nonetheless it does not include all the fields which

are required by Collins' probability model. For instance, head words are not explicitly identified as such. To address this issue, the very first step in parsing is to transform the distributed WSJ into Collins' format — deleting extra semantic information while adding headwords, arguments, NPB and TOP. After adding headwords and the other fields, the final preprocessing step is to convert the treebank into an HPSG-style event file. Figure 3.1 shows how a simple tree can be converted into an event.

2The term argument will be used to denote both complements and adjuncts.

[Figure 3.1: Conversion from a WSJ style tree to head driven form. The rule S → NP VP PP becomes the event {Parent: S, Head: VP, Left: [NP], Right: [PP]}.]

While this figure shows a conversion of PCFG into head driven, it makes a number of

assumptions that must be resolved in writing an algorithm to perform the conversion. For

example, how did we decide the VP was the head child? And how do we decide which

arguments are complements?

As with the code for the parser, it is worth describing the preprocessing in more detail

than a high level algorithm. Even after releasing his code, Collins did not release his pre-

processor to perform these tasks.3 The pseudocode provided by Collins left a number of

decisions undocumented and impossible to reproduce; one example is the question of under which circumstances it applies to prepositions. Because of this question, and similar ques-

tions, my reimplementation of Collins’ preprocessor will be discussed in Section 4.2.

3.1.4 Collins’ event representation

So far we have described events in slightly abstract terms such as a tree being attached to

the left of the head phrase. At some point it is necessary to discuss exactly how trees are

represented so that events in the treebank can be counted. It is also necessary to describe

how events are simplified when counts are too low, and how the counts are combined to

give a probability. After this we can derive the probability of any given tree.

All events in Collins’ model are productions associated with a reference parent. There

are three types of production: a unary production is the generation of the head constituent

of the parent, a dependency production is the generation of a left or right sibling of the head

of the parent, and a subcategorisation production is the generation of a subcategorisation

frame for the head of the parent. To explain Collins’ representation of trees, we once again

refer to a simple example tree — see Figure 3.2. In this figure, only the nodes with arrows take part in evaluating the probability.

3Collins recently placed a note on his website stating that Bikel's parser can perform the necessary preprocessing (Bikel, 2005).

[Figure 3.2: Collins' representation of a left production event, for the tree S(NP-C(DT The, NN cat), VP(VB chased, NP-C(DT a, NN mouse))). The arrows pick out the parent (P = S), the head nonterminal (H = VP), the head word (w = chased) and head tag (t = VB), the left constituent's nonterminal, word and tag (L_NT = NP-C, L_w, L_t), the distance Δ (adjacency = true, verb = false), and the left subcat frame LC = {NP-C}.]

This figure describes the fields used to represent the dependency event of an NP-C at-

taching as a left sister to the head VP of a parent S node. Collins stores this event by en-

coding the terms pointed to with arrows in the figure. At this point, the event simply is the

co-occurrence of these values for these data fields. To represent a unary production, such

as the generation of VP from the parent S node, we only need the terms on the right-hand

side of Figure 3.2. To represent a subcat production, such as the generation of the VP’s left

subcategorisation frame (in this case, a bag containing one item, NP-C), we use the same

terms as the unary event, plus one additional field: a bag of nonterminals.

Having specified exactly which parts of a tree to use, we can now say exactly how to com-

pute the probability of unary attachments (Equation 3.1), subcategorisation frames (Equa-

tion 3.2) and dependencies (Equation 3.3).

P_unary(H | P, w, t)                                        (3.1)

P_subcat(LC | H, P, w, t)                                   (3.2)

P_dep(L_NT, L_w, L_t | P, H, w, t, Δ, LC)                   (3.3)
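To make the parameterisation concrete, the following sketch encodes the head fields from Figure 3.2 as a small record and expresses Equations 3.1 to 3.3 as (outcome, context) lookups. The model.lookup interface is a placeholder for whatever smoothed probability store is used; it is not part of Collins' code.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Head:
    parent: str       # P, e.g. "S"
    head_nt: str      # H, e.g. "VP"
    word: str         # w, the head word
    tag: str          # t, the head POS tag

def p_unary(model, h: Head) -> float:
    # P(H | P, w, t)                                          (Equation 3.1)
    return model.lookup(outcome=h.head_nt, context=(h.parent, h.word, h.tag))

def p_subcat(model, h: Head, left_subcat: Tuple[str, ...]) -> float:
    # P(LC | H, P, w, t)                                      (Equation 3.2)
    return model.lookup(outcome=left_subcat,
                        context=(h.head_nt, h.parent, h.word, h.tag))

def p_dep(model, h: Head, l_nt, l_w, l_t, delta, left_subcat) -> float:
    # P(L_NT, L_w, L_t | P, H, w, t, Delta, LC)               (Equation 3.3)
    return model.lookup(outcome=(l_nt, l_w, l_t),
                        context=(h.parent, h.head_nt, h.word, h.tag,
                                 delta, left_subcat))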

3.1.5 Backoff and interpolation

As in all statistical parsers, deriving the probabilities just mentioned is complicated by low

counts. To address this, events are grouped at different levels of backoff, as discussed in

Section 2.5.

Collins' parser uses three levels of backoff for most events. A complete list of the prob-

abilities derived, at all three levels of backoff, is given in Tables 3.1 and 3.2, with the more

55

unusual events in Tables 3.3 and 3.4. Columns in the table describe the different event types

which are computed. (I have omitted the right events, which are symmetrical to their left

counterparts.) Cells in a given column specify probability estimates for the event at a given

level of backoff.

Table 3.1 contains backoff strategies for unary and subcat productions. Note that as the back-

off level is increased, we throw away elements on the right-hand side of the conditional

probability terms.

Backoff level    Unary              Left Subcat
1                P(H | P, w, t)     P(LC | P, H, w, t)
2                P(H | P, t)        P(LC | P, H, t)
3                P(H | P)           P(LC | P, H)

Table 3.1: Collins' unary and subcat events

For dependency events, given in Table 3.2, the situation is somewhat more complicated.

Firstly, a dependency event includes both the left word and the head word, so Zipf’s law

plays an even greater role in reducing counts. To address this issue, Collins splits depen-

dency events into two separate events whose probabilities are multiplied together. Recall

from Equation 2.14 that we can use conditional independence to assume part of the depen-

dency is independent of the other part. We can then back off each part independently. A

second complication in dependency events is the addition of the Boolean flags representing

coordination and punctuation (see Section 3.1.2), represented in Table 3.2 by the terms c and

p.4

Backoff level    Left1                                        Left2
1                P(L_NT, L_t, c, p | P, H, w, t, Δ, LC)       P(L_w | L_NT, L_t, c, p, P, H, w, t, Δ, LC)
2                P(L_NT, L_t, c, p | P, H, t, Δ, LC)          P(L_w | L_t, c, p, P, H, t, Δ, LC)
3                P(L_NT, L_t, c, p | P, H, Δ, LC)             P(L_w | L_t)

Table 3.2: Collins' dependency events

There are several other probabilities that Collins derives. Firstly the probability of stop-

ping parsing (generating TOP) is a variation on unary productions, presumably because

Collins found the basic model produced TOP nodes incorrectly. Again, these probabilities

are computed using two separate events. The backoff strategies are given in Table 3.3. (Note

4Previous versions of Collins' model do not include these at all, and later versions appear to place these inside the distance metric, and generate them as a separate event altogether.

Backoff level    TOP1                   TOP2
1                P(H, t | P = TOP)      P(w | H, t, P = TOP)
2                P(H, t)                P(w | H, t)

Table 3.3: Collins' TOP events

that at level 2 backoff of TOP1, we end up using an unconditional probability to estimate a

conditional probability.)

Secondly, outside probabilities, which are used as heuristics during parsing (discussed

in Section 2.6.1) also need to be computed. Collins does not provide details for how these

probabilities are derived, but by examining the output of Collins’ parser, I have derived

Table 3.4. Note that we again use two sub-events to derive these probabilities. Prior2 in this

case is actually an unconditional probability, which at level 2 is approximated using another

unconditional probability.

Backoff level    Prior1           Prior2
1                P(H | w, t)      P(w, t)
2                P(H | t)         P(w)

Table 3.4: Collins' Prior events

3.1.6 Smoothing

The previous section shows how a probability estimate is derived for each event type at each of the three levels of backoff. The next problem is to combine the estimates at the different levels of backoff

into a single probability estimate. The process of smoothing has been discussed in Section

2.5.2. To recap briefly: multiple values are combined using a weighted average based on the

confidence in the probabilities, and counts are modified slightly to provide a few counts for

events unseen during training. The smoothing algorithm that Collins implements appears

to be different to the one discussed in Collins (1997). The implemented version will be

presented in Figure 3.3. Here we will discuss the equations that the paper presents.

Equation 3.4 is used to combine the probabilities.

p = λ1 e1 + (1 − λ1)(λ2 e2 + (1 − λ2) e3)                                  (3.4)

In this equation, λi is the weighting at backoff level i, and ei is the probability estimate at level i. This equation simply says to compute a weighted average; the interesting part is in the computation of λi.



The equation Collins uses to compute λi appears to have changed throughout his work.

For instance, the equation given in (Collins, 1996) is presented in Equation 3.5; Collins (1997)

does not state how λi is computed but, as shown in Figure 3.3, it differs from the 1996 implementation; finally, Collins (1999) describes several methods, including Equation

3.5.

λ1 = δ1 / (δ1 + 1)          λ2 = (δ2 + δ3) / (δ2 + δ3 + 1)                 (3.5)

In these equations, δi is the denominator of the corresponding count, that is, the number of times the conditioning context at that level was observed. Collins notes that these equations

correctly increase λi as the denominator for the more specific event increases, but they are

not as sophisticated as the approaches discussed in Section 2.5.2.
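Putting Equations 3.4 and 3.5 together gives the following sketch, assuming the three estimates e1..e3 and their denominators δ1..δ3 have already been computed; it follows the published equations rather than the reverse-engineered implementation shown below in Figure 3.3.

def combine_backoff_levels(e1, e2, e3, d1, d2, d3):
    """Weighted combination of three backoff-level estimates (Equation 3.4),
    with the weights computed from the denominator counts (Equation 3.5)."""
    lam1 = d1 / (d1 + 1.0)
    lam2 = (d2 + d3) / (d2 + d3 + 1.0)
    return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)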

In implementing Collins’ model as just described, it seemed the probabilities being gen-

erated always differed slightly from those produced by Collins’ parser. Eventually this was

found to be because the smoothing algorithm described in the thesis (Collins, 1999) differed

from that implemented in the parser that Collins released. I do not know whether Collins

forgot he changed his code when he wrote that section of the thesis, or if the implementation

was based on an older algorithm which was published elsewhere, or something else, but the

actual implemented version was reverse–engineered. It is presented in Figure 3.3.

e = ε
for i = 3 to 1
    bot = 5 * u_i
    top = bot * e + f_i
    bot = bot * f_i
    e = top / bot
endfor

Figure 3.3: Collins' smoothing function as implemented

3.2 Collins’ parsing algorithm

Collins’ parser is a statistical chart parser, along the same lines as Magerman (1995), and

those described in the previous chapter. Like any chart parser, constituents or phrases are

found bottom-up and all phrases are found regardless of their position in the sentence. Re-

call from Section 2.6 that using an HPSG style grammar requires some modifications of the

chart parsing algorithm. We are now in a position to discuss these modifications.


As described in Section 2.2.1, a chart parser is essentially three nested for loops. A

greatly simplified version is presented in Figure 3.4. (Note that no chart parser would be

implemented quite this inefficiently; much of the rest of this chapter will discuss optimisa-

tions.)

for start = 0 to length
    for end = start + 1 to length
        for split = start + 1 to end
            left = get_edges_spanning(start, split)
            right = get_edges_spanning(split + 1, end)
            combine(left, right)
        endfor
    endfor
endfor

Figure 3.4: Simplest possible chart parser pseudocode

The key to this algorithm is the combine function, which is given in Figure 3.5. combine takes two sets of edges, and adds a new set of edges produced from these edges to the chart.

combine(left, right)
    foreach (l in left)
        foreach (r in right) {
            joined_edge = join_two_edges(l, r);
            expanded_edges = add_singles_stops(joined_edge);
            chart.add(expanded_edges);
        }

Figure 3.5: Pseudocode for combine

This happens in two steps. Firstly, we attempt to join every pair of left and right edges

to create a set of compound edges, using dependency productions (in the sense defined in

Section 3.1.4). Secondly, we expand each of these compound edges by adding parent nodes,

using unary and subcat productions (again as defined in Section 3.1.4). These two operations

will be described in turn in the next two subsections.

3.2.1 Dependency productions, and the use of a reference grammar

The core operation in the parser is the joining of two edges. In an HPSG approach, one edge is

the parent which means that its head will be the head of the new phrase, and the other edge


is a child. The child edge must be complete (it cannot be enlarged after it is grafted into the

parent), while the parent edge must be incomplete (because it grows in this operation). A

simple figure showing the operation is presented in Figure 3.6.

    join_two_edges( (VP (V likes)), (NP-C Mary) )  →  (VP (V likes) (NP-C Mary))

Figure 3.6: Simple example showing how an (incomplete) VP can have a (complete) NP-C added as a right sibling.

One aspect of the join operation that is significantly different with a head-driven gram-

mar is that siblings can be added either to the left or to the right of the head. This contrasts

with normal chart parsing where the grammar is normally written to be either left-recursive

or right-recursive, but not both.

Pragmatically, we can note that every join operation takes a complete edge and an in-

complete edge, so there is no point trying to join two complete edges. Having noted this,

we can split the left and right edge sets based on whether or not they are complete and

then we only have to merge the complete left edges with the incomplete right edges (the

left edges become children), followed by the operation of merging the incomplete left edges

with the complete right edges (the left edges become parents). This split halves the num-

ber of compatibility checks we have to make. Collins refers to adding a sibling to the left

of the head as join_2_edges_precede and adding a sibling to the right of the head as join_2_edges_follow.

Either way, the core operation is the creation of a new edge. This new edge is identical to

the old parent except that it has another child. The probability of the new edge also has to be

calculated. It will be equal to the probability of the parent multiplied by the probability that

this child node is added in. The probability of the child node being added can be computed

by looking up the probability of the appropriate dependency production in the probability

model discussed in Section 3.1.4.

There is one additional modification necessary to support subcategorisation frames. If

the head nonterminal of the child is marked as a complement then it must be removed

from the edge’s subcategorisation bag. If it is not present in the bag, then the new edge

is invalid and can be rejected. This whole operation actually works out to very little code,

and Figure 3.7 presents it in pseudocode form. The get_dep_prob function is presented to make it clear how the data fields in a node become the parameters in looking up the event probability. (Probabilities in the pseudocode are accumulated with ‘+=’; this is consistent with their being stored as log probabilities, so that addition corresponds to the multiplication described above.) This code shows an important optimisation: if the new edge has an invalid

subcategorisation frame, then it is immediately rejected. This operation is optional; without


    join_two_edges_follow(parent, child) {
        new_edge = new edge(parent);                    /* Copy the parent */
        new_edge.add_child(child);
        retval = new_edge.rc.remove(child.parent);
        if (retval == k_failure) return NULL;
        new_edge.prob += get_dep_prob(parent, child, k_follow);
        return new_edge;
    }

    get_dep_prob(parent, child, direction) {
        delta = make_delta(parent, child, direction);
        return prob(child.parent, child.headtag, child.headword,
                    parent.parent, parent.headnt, parent.headword,
                    parent.headtag, delta, new_edge.rc);
    }

Figure 3.7: Pseudocode for joining two edges (dependency events).

it the probability of zero would be generated (since this frame has never been seen) and the

edge would ultimately be rejected, but by shortcutting the expensive probability generation,

we save a lot of time. There is another optimisation not shown in which we can also skip the

probability calculation and immediately reject the edge, and that is when the combination

of nonterminals being considered was never seen in the training corpus. To implement

this, we need to build a grammar of all combinations of nonterminals seen in the corpus:

if the combination being considered ‘violates’ this grammar, the edge is not created. One

way of understanding this operation is in relation to backoff events. Recall that at backoff

level three, events are basically combinations of nonterminals. So the decision to filter using

a grammar of nonterminals effectively means that at level three, we assume there are no

unseen events. If we implemented this principle explicitly, we would build the edge being

considered and then assign it a probability of zero; again, consulting an explicit grammar

just gets rid of the edge at an earlier stage.
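
As an illustration of how such a filter can be organised, the following C++ sketch stores the nonterminal combinations seen in training in a hash set and rejects anything absent from it. The class and method names are invented, and the packing of the three nonterminal identifiers into a single 64-bit key assumes each identifier fits in 20 bits; the real grammar file and lookup differ in detail.

    #include <cstdint>
    #include <unordered_set>

    // A 'reference grammar' of nonterminal combinations observed in training.
    // Combinations absent from the set were never seen in the corpus, so the
    // corresponding edge is rejected before any probability is computed.
    class ReferenceGrammar {
    public:
        void add(int parent_nt, int head_nt, int sibling_nt) {
            seen_.insert(pack(parent_nt, head_nt, sibling_nt));
        }
        bool allows(int parent_nt, int head_nt, int sibling_nt) const {
            return seen_.count(pack(parent_nt, head_nt, sibling_nt)) != 0;
        }
    private:
        // Pack three identifiers into one key (each assumed to fit in 20 bits).
        static std::uint64_t pack(int a, int b, int c) {
            return (static_cast<std::uint64_t>(a) << 40) |
                   (static_cast<std::uint64_t>(b) << 20) |
                    static_cast<std::uint64_t>(c);
        }
        std::unordered_set<std::uint64_t> seen_;
    };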

Finally, it is worth stating explicitly that all of these operations create new edges rather

than modifying existing edges; and we most certainly do not discard edges just because we

have used them to create another. Competition between the new edges and the existing

edges is how ambiguity is resolved. For instance, when we say The child cannot be enlarged

after it is grafted into the parent, it is likely that elsewhere in the chart is an incomplete version

of the child which will in due course be enlarged, marked as complete and then grafted with


the parent. To permit modification of children would lead to unnecessary duplication of

work.

3.2.2 Unary productions

The combine algorithm is intended to merge increasingly large spans of the sentence. How-

ever, in an HPSG approach we need an additional step to ‘grow’ individual edges upwards

by giving them parent edges. To see why this is necessary, consider the steps involved in

parsing the man as a noun-phrase. First we tag the as a determiner, and man as a noun. If we

next join the to the left of man then we would build a two-word noun, not a noun phrase. In-

stead we should first create a unary production for man of an incomplete noun-phrase. Later

we can attach the to the left end of this noun-phrase, to create the correct interpretation. This

example is illustrated in Figure 3.8.

    (DT the) (NN man)   →   (DT the) (NP (NN man))   →   (NP (DT the) (NN man))

Figure 3.8: Simple example showing the steps in building an NP constituent the man.

The function Collins uses for unary productions is called add_singles. It works by

creating all possible parents for the input edge and assigning a probability to each. As with

dependency events, unlikely edges are rejected and a grammar is used to avoid generating

impossible edges; pseudocode is presented in Figure 3.9.

It was noted above that the man noun-phrase was incomplete. But how do we know

this? And in sentences such as dog bites man — common in newspapers — there are no other

words in the phrase, so man forms a complete noun-phrase. This ambiguity is handled by also considering the possibility that we should stop expanding the phrase. The function Collins uses to create a new complete edge for each incomplete edge is called add_stop. (Note that the concept of complete nodes is also used outside unary productions.) This function is actually implemented

as two dependency events, as shown in the pseudocode in Figure 3.10.

Frequently multiple unary productions will be required without any intervening combine

operations. The most obvious example is with noun-phrases because of Collins’ NPB cat-

egory, but co-ordination and TOP provide other examples. This process of interleaving add_singles with add_stop is called add_singles_stops. Pseudocode for this process is presented in Figure 3.11.


    add_singles(child) {
        foreach (parent possible_parents(child.parent)) {
            foreach (lc possible_lc(child.parent, parent)) {
                foreach (rc possible_rc(child.parent, parent)) {
                    new_edge = new edge(child);
                    new_edge.stop = false;
                    new_edge.headnt = new_edge.parent;
                    new_edge.parent = parent;
                    new_edge.lc = lc;
                    new_edge.rc = rc;
                    new_edge.prob += get_unary_prob(new_edge);
                    new_edge.prob += get_subcat_prob(new_edge, k_left);
                    new_edge.prob += get_subcat_prob(new_edge, k_right);
                    result_edges.add_one(new_edge);
                }
            }
        }
        return result_edges;
    }

Figure 3.9: Pseudocode for add_singles. The previous parent is demoted to a head, and new parents are generated.

    add_stop(edge) {
        new_edge = new edge(edge);
        new_edge.stop = true;
        new_edge.prob += get_dep_prob(new_edge, stop_edge, k_precede);
        new_edge.prob += get_dep_prob(new_edge, stop_edge, k_follow);
        return new_edge;
    }

Figure 3.10: Pseudocode for add_stop


    add_singles_stops(edges, depth) {
        if (depth == 0) return edges;
        foreach (e edges)
            edgeset.add_many(add_singles(e));
        foreach (e edgeset)
            edgeset2.add_one(add_stop(e));
        return add_singles_stops(edgeset2, depth - 1);
    }

Figure 3.11: Pseudocode for add_singles_stops

3.2.3 Search strategy in Collins’ parsing algorithm

It is clear that the above processes for generating new edges can result in a very large number

of edges being produced. In particular, the call to add_singles results in three nested for

loops, so calling it inside what is essentially another for loop to generate recursive unary

productions results in an algorithm with a high degree of complexity.

To constrain the state-space expansion which add_singles_stops creates, Collins men-

tions that he makes use of a beam search. Beam search is a variation on best-first search in

which the list of nodes in the search graph which have to be expanded (referred to as the

fringe in Section 2.6.2) is kept ordered by decreasing heuristic value, and truncated to a

fixed maximum length after each new node expansion (Ney, Mergel, Noll, and Paeseler,

1992). (The resulting truncated list of nodes is termed the beam.) Restricting the number

of nodes in this way enables beam search to avoid the combinatorial explosion of breadth-

first search. The goal of a beam search is to constrain the search space to a manageable size

without a significant loss in accuracy, by throwing away the nodes with the lowest heuristic

scores no matter when they were generated during the search.

The problem of developing a good heuristic is more serious for beam search than other

search algorithms. The algorithm discards any elements that are locally given a low prob-

ability, which means that they will not be part of the overall solution even if the locally

likely solution is later rejected. This contrasts with normal search techniques that would

come back to the locally unlikely solution after considering alternatives. The heuristic which

Collins implements is, naturally enough, related to the probability of nodes. But recall from

Section 2.6.1 that we can assess the probability of a construction both using the inside prob-

ability that it is locally a likely production to apply, and the outside probability that it fea-

tures in a complete parse tree. Collins’ heuristic value for a node during parsing is simply

the product of its inside and outside probabilities (estimated as shown in Section 3.1.5).

Despite the attractions of beam search, there are some difficulties when it is applied to


lexicalised statistical parsing. The problem is that there are a very large number of new

nodes created by the innermost operation in the parsing algorithm, add_singles_stops.

And it happens that most of these nodes have very similar heuristic evaluations, especially

at higher backoff levels. The effect of this is either that all of these nodes fall off the end of the

beam, because they are not likely enough, or that the nodes generated ‘swamp’ the beam,

and throw everything else off, discarding alternative explanations before the local situation

has a chance to be tested over the whole sentence. What we would like is to select a range

of the most likely nodes generated by add_singles_stops to be considered for further processing by combine. Collins’ solution is effectively to implement two beam searches: one locally within add_singles_stops, and one globally within combine.5

3.2.4 Summary of Collins’ parsing algorithm

We have now covered the entire parsing process, but in a piecemeal fashion. Figure 3.12

(loosely based on page 189 of Collins’ thesis) presents the same operations we have already

described but as a single coherent function. This is the pseudocode that will be referred to

when discussing implementation details.

Very briefly, in words, here is what happens. We initialise the chart with a set of edges,

each of which is one word. Then, we attempt to join every pair of adjacent edges using a

dependency production, to create complex edges. We add each of these edges to the chart

twice, once as an unfinished edge (which needs more daughters) and once as a finished edge

(all of whose daughters have been found). We then expand all the new finished edges using

unary productions (considering both a single unary production and chains of two or three

unary productions). The resulting edges are added to the chart. Then we iterate by again

attempting to join every pair of adjacent edges (except those pairs that have already been

processed). Search heuristics are applied in two places: firstly, only a selection of the most

likely unary productions are generated; secondly, we only store the most likely edges for

each span of the input string. The algorithm finishes when there are no more pairs of edges

to consider. If there are edges spanning the whole input string whose parent nonterminal is

TOP, the one with highest probability is returned as the parse of the string. If not, the parse

fails. This is the essence of Collins’ parsing algorithm.

5 In fact, Collins only uses the term ‘beam search’ to refer to the local search strategy. To constrain the number

of nodes generated by the combine operation, he uses an algorithm for constraining the maximum number of

edges for each span of the input string. But again, he uses the same heuristic, related to the probability of nodes.

Effectively, he implements a number of independent beam searches at this outer level, one for each span in the

input string.


    edge parse(sentence) {
        initialise(sentence);                          // init all word/tag pairs
        for span = 2 to n {                            // n = number of words in the sentence
            for start = 1 to n - span + 1 {
                end = start + span - 1;
                for split = start to end - 1 {         // combine pairs of edges
                    foreach e1 in incomplete_chart[start][split]
                        foreach e2 in complete_chart[split+1][end] such that
                                check_grammar(e1, e2, follow) {
                            e3 = join_two_phrases(e1, e2, follow);
                            set1.insert(e3);
                            for i = 1 to max_unary {   // add_singles_stops
                                incomplete_chart.insert(set1);
                                foreach e in set1 {
                                    set2 = add_stops(e);
                                    complete_chart.insert(set2);
                                }
                                foreach e in set2
                                    foreach P in nonterminals    // (in the grammar)
                                        foreach LC in subcat     // (in the grammar)
                                            foreach RC in subcat // (in the grammar)
                                                set1.insert(add_singles(e, P));
                            }
                        }
                    foreach e1 in complete_chart[start][split]
                        foreach e2 in incomplete_chart[split+1][end] such that
                                check_grammar(e1, e2, precede) {
                            e3 = join_two_phrases(e1, e2, precede);
                            set1 = {e3};
                            ...                        // add_singles_stops as described above
                        }
                }
            }
        }
        return complete_chart[0][n].best("TOP")
    }

Figure 3.12: Collins’ parsing algorithm


Chapter 4

A reimplementation of Collins’ parser

This chapter describes my reimplementation of Collins’ statistical parser. The goal of the

chapter is to document my system at a level of detail which permits a precise reimplemen-

tation, including descriptions of pseudocode and data structures.

It is impossible to present this information at the appropriate level of detail for every

reader; a much shorter version of this chapter has been published (Lakeland and Knott,

2004). In the other direction, additional technical information is presented in Appendix B,

which includes a more precise definition of the data flow, data structures and class hierarchy

as actually implemented. Naturally, the ultimate reference for any implementation is the

source code itself. This is too large to include in its entirety, but key portions (with a link to

the rest) are included in Appendix C.

Much of the published work on statistical parsing describes systems at a fairly high level

of abstraction — but many of the challenges in implementing a lexicalised statistical parser

relate to implementation-level details. This is an area which has not been well covered in the

literature on lexicalised statistical parsing, and yet I would estimate that most of the work

involved in building a lexicalised parser is in addressing the software engineering issues

which arise. Before continuing, I will briefly motivate this idea.

4.1 The complexity of Collins’ parsing algorithm

From reading Collins’ pseudocode in Figure 3.12, it is hard to understand why implement-

ing a statistical parser is difficult. However, the problems become clear when counting the loops: Collins’ algorithm iterates over every possible span, start and split, which is O(n³); within this loop it iterates over every edge on the chart, which has an unclear asymptotic complexity but is at least O(m) (where m is the number of rules in the grammar); and within this loop it iterates over every unary rule in the grammar, again O(m). This results in a final time complexity of at least O(n³m²), making sentences unparsable without a supercomputer.


Similar problems occur throughout statistical Computational Linguistics. Consider a de-

scription of Finch’s unsupervised thesaurus generation algorithm (Finch, 1993): “Count the

co-occurrences of each word with every other word and merge words with similar counts”.

Or again, consider Bod’s pseudocode in Figure 2.20 (page 48): all of these algorithms are

simple to describe, but hard to implement efficiently.

The problem is not restricted to time complexity. Memory usage is also unacceptable

with a naive implementation of such algorithms. For instance, many potential phrases need

to be stored for a long time before they can be rejected. Another area where memory usage

is a problem is the storage of training data. Consider how many events need to be derived

from the training corpus if we are building a lexicalised probabilistic grammar. There are

many words in the language, and (naturally) many more pairs of words, and thus a huge

number of events involving pairs of words in specific grammatical constructions.

Efficiency is not just an important factor in the parser we finally produce; it is also cru-

cially important during the process of developing the parser. Debugging a system which

takes half an hour to parse a sentence is an extremely tortuous process. This means that the

program must be efficient from the very outset, rather than developed without concern for

efficiency and then optimised afterwards.

The conclusion from these considerations is that the difficulties in statistical parsing are

not so much related to linguistics or artificial intelligence as to software engineering. How

can a system reject incorrect interpretations before they swamp resources, and how can it

handle so much data efficiently? The remainder of this chapter will discuss not just the

final implementation of the parser but also the process that was used in developing this

implementation. It is hoped that the former will be of use to people interested in the details

of Collins’ parser, while the latter will be of use to people intending to write a statistical parser or

similar system.

A data flow diagram for my reimplementation of the entire parsing system, encompass-

ing both the preprocessor for the WSJ and the parser itself, is given in Figure 4.1.

The first step is to convert the WSJ treebank into an event file suitable for fast com-

putation of probabilities, and a grammar file used for pruning the search space. During

actual parsing, a sentence is input to the system, and a part-of-speech tagger is used to

initialise the chart. The parsing algorithm is then executed. It is important to note that there

are two separate loops occurring in the parsing algorithm. A large loop (shown in blue) cor-

responds to the combine operation given in the pseudocode in Figure 3.12, with the chart

passing edges to the parser’s control structure, which then joins and expands them, and

puts the result back into the chart. Within this loop is a smaller loop (shown in green) cor-

responding to add_singles_stops in the pseudocode in Figure 3.12. This loop expands


individual edges, adding all plausible parents.

Figure 4.1: Simplified data flow diagram for my implementation of

Collins’ parser

It is in these two loops that the system spends most of its time. Therefore much of

implementing the parser comes down to implementing these two loops efficiently.

In the remainder of this chapter, I describe the components of this diagram in more

detail. Section 4.2 describes my implementation of the preprocessor; Section 4.3 describes

my implementation of Collins’ probability model; Section 4.4 describes my implementation

of a (new) part-of-speech tagger using this probability model; and Section 4.5 describes my

implementation of the ‘chart’ data structure. Within the parser, there is nothing especially

interesting in the control structure implementing the combine loop; it very closely follows

the pseudocode already presented in Figure 3.12. However, it is worth discussing beam

search in more detail; Section 4.6 describes my implementation of the beam search function

from add_singles_stops. In Section 4.7, I provide some practical software engineering

advice about how to go about building a lexicalised statistical parser. (This mostly concerns

debugging and program verification, and it was mostly learned the hard way.) Finally, in

Section 4.8, I present the results of my parser.


4.2 Implementation of the treebank preprocessor

Before any parser can be implemented, the WSJ must first be converted from a corpus of

trees to a simple enumeration of events.

The WSJ as distributed by the LDC is a large collection of parse trees. These trees do not

include information about headwords, or about complements or adjuncts, both of which are

fundamental to Collins’ approach. The trees also include a lot of information, such as the

semantic locative mark, which would be useful in parsing, but leaving it in reduces counts by

too much. So the first step in preprocessing the treebank is to add the information Collins’

model needs, and delete any extra information. The second step is to transform the trees

into a flat file of events, more suitable for insertion into a hash table.

Transforming the treebank into Collins’ model turns out to be harder to do accurately

than it appears, although most of the problems faced were due to Collins under-specifying his

model. The transformation code is implemented in Lisp because it seemed a natural choice

for tree processing, although Perl or Python would also be good choices. The high-level

code is given in Figure 4.2. This shows that the steps can be performed independently and

so can also be discussed in isolation.

    (defun process (tree)
      (output-for-collins 0
        (add-headword
          (first (add-npb
                   (add-complement
                     (first (drop-none
                              (convert-to-numbers tree)))))))))

Figure 4.2: Actual high-level code for preprocessing the treebank

There are six functions in this algorithm, which will be discussed in the order in which

they process the sentence. To begin with, convert-to-numbers transforms words, tags

and nonterminals into enumerated types, which can be stored and processed more effi-

ciently. (From now on, everything in the preprocessor and parser will be based on numbers

instead of words and grammatical symbols.) drop-none then removes gapping informa-

tion. (If model three was implemented rather than model two, then a -G marker would

have to be added here instead of simply deleting gaps.) add-complement decides which

children are complements and adds a -C to their head nonterminal. add-npb finds the ba-

sic noun phrases and adds an extra level in the tree — i.e. (NP ...) goes to (NP (NPB ...)).

add-headword chooses a head child for each phrase and also transforms the tree so that

information about this head child is stored in the parent. Finally, output-for-collins


produces output in a format that is easy to parse.

Of these steps, adding complements and headwords are sufficiently complicated to war-

rant further discussion. The algorithms Collins uses are based on Magerman’s (1995) algorithm for identifying headwords. The idea of using an algorithm to automatically select

headwords is interesting and empirically it appears to work well; interested readers are

referred to Magerman (1995). The basic method is to take each sequence of sister nodes

in a tree, and determine which is the head, and which are the complements and adjuncts.

Collins decomposes this process into two separate stages: somewhat surprisingly, he begins

by identifying complements, and only after this identifies heads. The remaining nodes are

classified as adjuncts. Implementing the algorithm involves fairly complex tree manipula-

tion, and so detailed pseudocode is provided in Figure 4.3 to give a more precise description.

    (defun add-headword (tree)
      (if (terminal-p tree)                             ; are we down to words?
          (list (first tree) (second tree) (first tree)) ; a word is a headword
          (add-headword-internal (first tree)            ; do the work
                                 (mapcar #'add-headword (cdr tree))))) ; recurse

    (defun add-headword-internal (head children)
      (let* ((left-to-right (get-direction head))
             (priority-list (get-priority-list head))
             (search-children (if left-to-right children (reverse children)))
             (found (remove nil
                            (mapcar #'(lambda (item)
                                        (find item search-children :key #'first :test #'equal))
                                    priority-list))))
        (if found
            (append (cons head
                          (list (second (first found)) (third (first found))))
                    children)
            (append                     ; headword not found: assume the first/last child
             (cons head-for-output
                   (list (second (first search-children))
                         (third (first search-children))))
             children))))

Figure 4.3: Pseudocode to implement Magerman’s headword algorithm

We add a complement tag to any nonterminal matching any of several constraints relat-

ing to its NT and its parent’s NT. One constraint, for instance, is that “[the] nonterminal must


be: an NP, SBAR, or S[,] whose parent is an S”. As already mentioned, Collins derives com-

plement information before deriving headword information. It seems counter-intuitive to

derive information in this order. But again the algorithm seems to work empirically, perhaps

because both processes are so similar. Because the algorithm is so similar to the headword

algorithm, pseudocode will not be presented here.

I have implemented a preprocessor based on Collins’ description of his algorithms. How-

ever, as mentioned in Section 3.1.3, there are many areas in which Collins’ description of his

preprocessor is incomplete. For instance the kinds of arguments taken by prepositions are

not given (perhaps implying they take none). But it is clear they do take arguments, be-

cause the event file presents numerous examples of them with arguments. To circumvent

these problems, my preprocessor is also designed to reproduce the event file which Collins

distributes. Rather than present the minutiae of my preprocessor code here, Section C.4

presents it in full.

4.3 Implementation of the probability model

The core of Collins’ probability model is a function called genprob (for ‘generate probability’). It is our convention throughout this thesis to refer to functions in teletype; genprob, however, is referred to so often that it is easier to read in a roman typeface. This function takes an event and returns a probability computed by calculating relative

frequencies of events found in the WSJ at different levels of backoff, and smoothing/inter-

polating between these appropriately (see Section 3.1.6). The key data structure supporting

this function is therefore a table of these WSJ events associated with counts. Each event is di-

vided into nine sub-events (numerator, denominator and weighting, each at three different

levels of backoff). Each sub-event is associated with a count of how frequently it occurs.

How should we store the table associating sub-events with counts? We could store it as

an array, but this would be impossibly large and very sparsely populated. It is obviously

better to use a hash table to store these counts. In a hash table, we use the event to generate

a hash key, which operates as an index into the table.

The table itself can be implemented as a simple array. More sophisticated hashing algo-

rithms exist, usually based on a tree for the first few bits of the key followed by an array for

the rest, but they are unnecessarily complex here since their benefit is dynamic resizing of

the hash-table, and here the size of the hash tables does not change during parsing. Even

if events were to be added to the hash table during parsing, it is unlikely so many events

would be added as to justify a complex data structure. So my hash tables are simple arrays.

There are two different ways of organising the hash table data structure. We could sim-

ply have one big hash table, in which the key specifies not only the event’s description, but

also what kind of event it is (i.e. a unary numerator); or we could use a separate hash table

for each kind of event. The latter approach leads to slightly reduced efficiency because it

becomes impossible to keep the hash tables at the same density, but it makes debugging

easier since each entry can be verified. I adopted the latter approach.

It is normal when discussing a hash-table implementation to justify the collision resolu-

tion algorithm used. Since events are many bytes long, but array indices are only two bytes

long, it is inevitable that different events will map to the same array index. The normal

method for resolving such collisions is to store the full event along with the value (in this

case, the frequency) in the hash table and then either store different events in a linked list,

or use a separate hashing algorithm to look in a different part of the hash table. However,

neither approach was used here; instead we silently ignore collisions. This leads to counts

being incorrectly combined whenever a collision occurs.

Why is it expedient to ignore collisions? Storing the count takes perhaps four bytes,

but storing the key takes at least ten bytes, and so any collision detection would triple the

parser’s memory requirements. It would also significantly decrease the parser’s speed since

a ten byte comparison would need to be performed on every lookup. Furthermore, the

benefit gained by avoiding false collisions is small. Occasionally a collision will cause an

incorrect interpretation to be given a higher probability, or the correct interpretation to be

given a lower probability. These cases occur very rarely, perhaps just a few times per sen-

tence. Even when they do occur, the effect is almost always a tiny change in the probability

since most events have a very low count, and so will not affect the final parse of the sen-

tence. Finally, in the few situations where a significant effect is seen, the second-best parse

usually has only slightly lower precision and recall than the best parse and so the error does

not significantly affect accuracy.

The final implementation issue in hash-tables is how to transform the event into an array

index; this is known as key generation. Ideally this process should be at least partially

reversible so that surprising lookup results can be traced back to the event which caused

them, but at the same time performance is critical. My key generation function takes the

event’s components (e.g. head tag, head word etc), which are each represented by an integer

(as mentioned in Section 4.2) and computes their product, modulo the hash table size. This

function achieves a reasonable tradeoff between performance and verifiability. Collins’ own

implementation uses XOR on the event’s components, leading to increased efficiency but a

process that is harder to debug. If the parser’s efficiency becomes an issue, this would be an

easy way of increasing speed.
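
To make the key generation and collision policy concrete, the following C++ sketch shows a count table in this spirit. It is a simplified illustration rather than my actual code: the CountTable class and its methods are invented names, and offsetting each component by one (so that a zero-valued component does not zero the product) is an assumption of the sketch.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A flat, fixed-size array of counts indexed by the product of the event's
    // components modulo the table size.  Collisions are silently ignored, so
    // colliding events share a single count.
    class CountTable {
    public:
        explicit CountTable(std::size_t size) : counts_(size, 0) {}

        std::size_t key(const std::vector<int>& components) const {
            std::uint64_t k = 1;
            for (int c : components)
                // Offset by one so a zero component does not zero the product;
                // the real key function may differ in detail.
                k = (k * static_cast<std::uint64_t>(c + 1)) % counts_.size();
            return static_cast<std::size_t>(k);
        }
        void add(const std::vector<int>& e)                  { ++counts_[key(e)]; }
        std::uint32_t count(const std::vector<int>& e) const { return counts_[key(e)]; }

    private:
        std::vector<std::uint32_t> counts_;
    };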

A further hash-table was added alongside these nine. This tenth hash-table caches the final probability computed by previous calls to genprob. Recall that each hash-table just holds

the raw counts of numerators, denominators and weightings for sub-events; to compute a

probability, nine of these counts must be looked up, and a smoothing/interpolation formula


applied. Since over 99% of hash table lookups are performed more than once, it makes sense

to store the results of this computation in a separate hash table. This means only one lookup

is needed almost all the time. This single optimisation increases the parser’s performance

by almost two orders of magnitude.
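
The caching idea can be sketched as follows: genprob first consults the cache and only falls back to the nine raw-count lookups and the interpolation on a miss. The ProbCache class is an invented name, it uses std::unordered_map for brevity where the real tables are the fixed-size arrays described above, and the packing of an event into a single 64-bit key is assumed.

    #include <cstdint>
    #include <unordered_map>

    // Cache of final smoothed probabilities, keyed by a packed event key, so
    // that repeated requests for the same event avoid the nine sub-event
    // lookups and the interpolation arithmetic.
    class ProbCache {
    public:
        bool lookup(std::uint64_t event_key, double* prob) const {
            auto it = cache_.find(event_key);
            if (it == cache_.end()) return false;
            *prob = it->second;
            return true;
        }
        void store(std::uint64_t event_key, double prob) { cache_[event_key] = prob; }
    private:
        std::unordered_map<std::uint64_t, double> cache_;
    };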

4.4 Implementation of a POS tagger

Before parsing can begin, we must first perform part-of-speech (POS) tagging on the input

sentence. A POS tagger maps words to their parts-of-speech. For instance, walked might be

tagged VBD to say it is a verb in the past tense. For most words this is a trivial process since

they can only be interpreted as one particular part-of-speech. But a simple lookup table does

not suffice since many words are ambiguous, and some words were unseen during training.

To obtain higher accuracy than a lookup table can achieve, the context of previous words

and/or their tags is used by various different methods.

In this section, I discuss the POS tagger I developed on top of my reimplementation of

Collins’ probability model. I begin in Section 4.4.1 by discussing the relationship between

tagging and statistical parsing in general. In Section 4.4.2, I describe the algorithm used by

my tagger — a standard hidden Markov model — and in Section 4.4.3 I describe how the

algorithm was implemented. Finally, in Section 4.4.4, I present some results. The tagger I

developed is also documented in a separate publication (Lakeland and Knott, 2001).

4.4.1 The relationship between POS tagging and lexicalised statistical parsing

When parsing, a tagger is needed to initialise the parser from the sentence. This is be-

cause parsers do not have sufficiently large grammars to write specific grammar rules for

every word, and so write their grammar rules in terms of POS tags instead. Non-lexicalised

parsers then discard the sentence entirely and just parse the tag sequence, while lexicalised

parsers still use the words to guide the parser. Collins’ parser uses the words when statis-

tical counts are sufficiently high, but it is clear that the tags control the parse much more

than the words do (discarding the words entirely only slightly decreases parser accuracy, as we

discussed in Section 2.4.4). Because the parser is so dependent on the tags, it is important

the tags are correct.

The normal process of POS tagging involves resolving the ambiguity in the sentence

and selecting a particular tag for every word. However most parsers, including Collins’,

support ambiguity in the way the sentence is tagged. Being a statistical parser, Collins’

parser also allows us to assign a confidence to each tag for a word being tagged ambigu-

ously. Essentially, a normal tagger will produce one tag per word, but the optimal input for

a parser like Collins’ is a probability distribution of tags for each word. Somewhat surpris-


ingly, Collins himself does not use a tagger which provides such probability distributions.

In fact, it turned out to be quite hard to find exactly what Collins did for POS tagging since

it did not appear to be documented anywhere. But by reading the code it was found his

parser supported two different modes: in ‘oracle’ mode the parser assumes the input has

been perfectly tagged and consequently assumes no ambiguity, while in ‘evaluation’ mode

the parser assigns every word all the tags it was seen with in training at equal probabil-

ity. I therefore decided to build my own POS tagger to initialise the chart, which provides

probability distributions on POS tags.

There are several good reasons for developing a POS tagger of my own rather than using

an existing one. Firstly, we can ensure thereby that there is a smooth coupling between the

parser and the tagger; in particular, we can ensure that the tagger uses the same tagset

and tokenisation scheme as the parser. Secondly, in writing the skeletal structure of the

parser we have already implemented a lot of the code that a tagger needs, such as reading

sentences from a file, mapping words and nonterminals into enumerated types, hash-table

lookup, backoff and smoothing. Moreover, the tagger contains none of the computational

complexity of a parser and so is extremely simple to debug. Not only does this make the

tagger a useful practise-step towards developing the parser, it also significantly reduces the

amount of code in the parser which might contain bugs. Finally, if we implement a tagger

based on the existing probability model, it is very simple to make it return a probability

distribution for each tag, rather than simply the best tag or an undifferentiated set of possible

tags.2

4.4.2 Part of speech tagging using hidden Markov models

Consider a simple sentence such as The can can hold water. The job of the tagger is to deter-

mine that The is a determiner, the first use of can is a noun while the second is as a verb,

and so on. A simple mapping between words and their most common POS tag will usually

obtain the correct answer, but such an approach would get one of the uses of can wrong

in the example sentence. To obtain better results than a simple lookup, we use the context

of the word in order to better predict the tag of the current word. For instance, can is usually

a verb (99.5% of the time) but verbs almost never occur after determiners, especially at the

start of a sentence (only 5% of the time). Further, identical verbs almost never follow one

another. So through context we can determine when to use the conventional tag, and when

to make an exception.

2 The POS tagger was one of the components of the implementation which I completed, with my publication

describing it in 2001 (Lakeland and Knott, 2001). At the time a literature search did not find any POS taggers that

produced a distribution of tags — or even that produced a set of tags. However, I have since found an excellent

paper by Charniak, Carroll, Adcock, Cassandra, Gotoh, Katz, Litman, and McCann (1996) which describes such

a tagger. Charniak et al.’s tagger performs similarly to the one I describe.


Aside from the simple lookup, there are two classes of approaches to POS tagging: rule-

based and stochastic. There is some debate in the literature about which of these approaches

is more suitable. However, we have already addressed an identical question when deciding

to produce a statistical parser rather than a deterministic one, and so it makes sense to be

consistent. As Charniak et al. (1996) puts it: “we are simply more familiar with the statistical

tagging technology.”

Within stochastic tagging, there are a number of approaches but the most common is

to use a hidden Markov model (HMM). In this approach, the sentence is viewed as the

product of some model. Because we cannot see exactly what is going on we say that the

model is hidden, but we can observe its behaviour and based on these observations make

assumptions about the internal state of the model. Specifically, we assume that the internal

states correspond to the sequence of parts-of-speech and words already seen, and the model

has a probability of going from the current state to the next state based on the likelihood of

producing the next word given the current internal state. More formally, we say:

P(W | T) = Π_{i=1...n} P(w_i | w_1, ..., w_{i−1}, t_1, ..., t_i)

Since this model has far too many parameters to compute for sentences beyond a couple

words, we make a Markovian assumption that only a certain amount of context is necessary.

The amount of context retained is known as the order of the Markovian assumption and in

practice it is chosen according to the amount of available training data, because if we have a

lot of training data then we do not need to make broad assumptions. The important thing to

realise when reading this is that we are discarding all context words from the history and only using

the previous tags.

P(W | T) ≈ Π_{i=1...n} P(w_i | t_{i−2}, t_{i−1}, t_i)

Another important point to note is that we are predicting the probability of each tag in

isolation; we are not maximising over the sentence. This means that since an unknown word

at the start of a sentence is best interpreted as a proper noun, we will interpret it as such

even if looking at the later words implies that it is really a verb. This problem has been re-

solved in POS tagging by considering candidate paths rather than simply selecting the most

likely tag for the history. The code modifications necessary to perform an argmax over the sentence are small; essentially, a very simple chart is needed instead of just an array. An evaluation of

the tagger’s errors implies this enhancement will result in a significant improvement on the

tagger’s results, approximately halving the errors. We did not implement this improvement

because an analysis of the parser’s errors (see Section 4.8) implies the tagger’s accuracy has

virtually no bearing on the parser’s.

As with the equations in the parser, it is probably more useful to present these equa-

tions in terms of the events that must be counted rather than mathematically. This is pre-


sented in Table 4.1. Consider again the sentence The can can hold water. The first step will

    Backoff level    POS
    1                P(tag | word, prev_tag, prevprev_tag)
    2                P(tag | prev_tag, prevprev_tag)
    3                P(tag | word)

Table 4.1: Part of Speech event representation

be to take The, and compute for every tag P(tag, "The", "#STOP#", "#STOP#"). This means we will compute the counts of the tuples (DET, "The", "#STOP#", "#STOP#"), (NN, "The", "#STOP#", "#STOP#"), etc. Since a determiner typically starts a sentence (i.e. follows "#STOP#", "#STOP#"), and The is almost always a determiner, we will select determiner as the best tag by a large margin. Next we shift the window and consider can with the context P(tag, "can", "DET", "#STOP#"). Because it is following a determiner, we can

classify can as a noun instead of the more common verb.
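
As a self-contained illustration of how such a request might be serviced, the following C++ sketch looks up counts for the three backoff levels of Table 4.1 and combines the resulting relative frequencies. The kind codes, the map-based storage and the fixed interpolation weights are simplifications invented for the sketch; the real implementation uses the hash tables of Section 4.3 and the smoothing of Section 3.1.6.

    #include <map>
    #include <vector>

    using Key = std::vector<int>;
    // Small codes distinguishing the six tuple shapes (numerator and
    // denominator at each of the three backoff levels of Table 4.1).
    enum Kind { NUM1 = 1, DEN1, NUM2, DEN2, NUM3, DEN3 };

    double rel_freq(const std::map<Key, int>& counts, const Key& num, const Key& den) {
        auto d = counts.find(den);
        if (d == counts.end() || d->second == 0) return 0.0;
        auto n = counts.find(num);
        return n == counts.end() ? 0.0 : double(n->second) / d->second;
    }

    // Combine the three backoff levels of Table 4.1.  The fixed weights stand
    // in for the smoothing of Section 3.1.6.
    double P_pos(const std::map<Key, int>& counts,
                 int tag, int word, int prev, int prevprev) {
        double e1 = rel_freq(counts, {NUM1, tag, word, prev, prevprev},
                                     {DEN1, word, prev, prevprev});
        double e2 = rel_freq(counts, {NUM2, tag, prev, prevprev},
                                     {DEN2, prev, prevprev});
        double e3 = rel_freq(counts, {NUM3, tag, word},
                                     {DEN3, word});
        return 0.6 * e1 + 0.3 * e2 + 0.1 * e3;
    }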

4.4.3 Implementation Details

The work involved in implementing a basic HMM tagger is quite small, comparable to a

senior undergraduate assignment. As with the parser, the steps are: (i) preprocessing a

tagged corpus into a normalised structure and saving the events in this corpus into a simple

event file, and (ii) implementing a probability model which uses the events and writing

some control structure to read sentences and pass them one word at a time to the probability

model.

Since the tagger I implemented is designed to work with a parser based on the WSJ,

it makes much more sense to use the WSJ as a training corpus instead of a larger tagged

corpus such as the BNC. While using the BNC would enable the development of a more

accurate tagger, it would necessitate the use of a different tagset and tokenisation scheme

and the mapping between these would destroy any benefits gained by the extra accuracy.

Instead, we take the WSJ treebank and strip out all nonterminals to obtain a flat structure.

Again, we could obtain higher accuracy by further processing, such as proper name detec-

tion, but such transformations would again make the parser more complex. Converting this

tagged sequence into an event file is simply a matter of iterating over the sequence, convert-

ing words and tags to their enumerated types, counting events in a hash-table and saving

the hash-table to a file. This process is so close to the pseudocode for the tagger’s control

structure presented in Figure 4.4 that it will not be presented separately. The only difference is that instead of using P_pos to derive a probability, the hash-table values which it looks up are incremented.
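
A minimal sketch of this counting step is given below. It is illustrative only: std::map stands in for the real hash tables, the kind codes distinguishing the six tuple shapes are invented, and STOP is a hypothetical identifier for the boundary symbol.

    #include <cstddef>
    #include <map>
    #include <vector>

    using Key = std::vector<int>;
    enum Kind { NUM1 = 1, DEN1, NUM2, DEN2, NUM3, DEN3 };  // six tuple shapes
    const int STOP = 0;                                    // boundary symbol id

    // For every position, increment the numerator and denominator tuples
    // needed by the three backoff levels of Table 4.1.
    void count_tag_events(const std::vector<int>& words,
                          const std::vector<int>& tags,
                          std::map<Key, int>& counts) {
        for (std::size_t i = 0; i < words.size(); ++i) {
            int prev     = i >= 1 ? tags[i - 1] : STOP;
            int prevprev = i >= 2 ? tags[i - 2] : STOP;
            ++counts[{NUM1, tags[i], words[i], prev, prevprev}];
            ++counts[{DEN1, words[i], prev, prevprev}];
            ++counts[{NUM2, tags[i], prev, prevprev}];
            ++counts[{DEN2, prev, prevprev}];
            ++counts[{NUM3, tags[i], words[i]}];
            ++counts[{DEN3, words[i]}];
        }
    }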


Implementing the tagger’s probability model involves exactly the same steps as in the

parser. A request is made by the control structure to derive the probability of a tag event

using the parameters provided. The parameters are then converted to nine separate hash-

table keys using the standard modulus approach, looked up in the nine tagger hash-tables,

and the results are combined using the smoothing algorithm already presented in Section

3.1.6. Finally, the control structure is little more than a read-eval-print loop. Pseudocode is

presented in Figure 4.4.

    output[0] = output[1] = "#STOP#";
    for (i = 2; i < len + 2; i++) {
        prevprev_tag = output[i-2];
        prev_tag     = output[i-1];
        current_word = sentence->word_as_enum(i-2);
        maxp = 0;
        for (tag_nr = 0; tag_nr < num_tags; tag_nr++) {
            current_tag = possible_tags[tag_nr];
            prob = probs->P_pos(current_tag, current_word,
                                prev_tag, prevprev_tag);
            if (prob > maxp) {
                maxp = prob;
                maxt = current_tag;
            }
        }
        output[i] = maxt;
    }

Figure 4.4: The tagger’s control structure

4.4.4 Results

This tagger was never intended to be state-of-the-art, but it is still expected to perform well.

Specifically, it was intended to give the parser greater accuracy than simply selecting the

best tag, without giving the parser the complexity of selecting all plausible tags. We will

defer an analysis of the tagger’s effect on the parser’s efficiency until our analysis of the

parser’s results in Section 4.8. However it is worthwhile here at least contrasting the tagger

to others that are available.

The tagger is evaluated based on its accuracy on the testing section (Section 23) of the

WSJ. This method could be argued to be slightly optimistic since it means the training and the

testing data are very similar, but since the tagger’s job is to provide tags to the parser, and

the parser is going to be tested on the WSJ, this is a reasonable test. On this test the tagger


achieves an accuracy of 95.6%. This is above a basic tagger, but below the very best taggers

which can achieve 98% accuracy. Predictably, this figure is close to Charniak’s figure for his

basic model of 95.9%, which was generated using a very similar method.

As a baseline, it is useful to measure a POS tagger’s accuracy when it is given no context

— that is, how accurately can you predict the current tag given just the current word. It

turns out that such a simple model obtains about 88% accuracy when tagging the WSJ using

the WSJ tagset. The inclusion of several simple rules such as a tagging dictionary raises

this accuracy to 94%. Since the parser is not especially dependent on the tagger’s accuracy,

it would be reasonable to conclude that the baseline approach would have been perfectly

adequate. This is the conclusion Charniak comes to in Charniak et al. (1996), but it is worth

noting that the amount of extra work required to implement a simple HMM based tagger is

very small.

Normally the evaluation of a POS tagger concentrates only on whether or not its chosen

tag is correct; the rank of the correct tag is rarely presented. However, in this case the tagger’s

job is to pass a set of tags on to the parser and it is highly desirable for this set to include

the correct tag while preferably being as small as possible. Therefore Table 4.2 presents the

    Rank     Percentage
    1        93.1%
    2        3.9%
    3        0.9%
    4        0.4%
    5        0.3%
    6-9      0.7%
    10-19    0.4%
    20+      0.1%

Table 4.2: Actual position of the tag that should be in first position

relative position of the correct tag. Clearly the tagger gets the right answer almost all the

time, but what about the cases where the tagger gets it wrong? If the correct tag is given a

high probability then we may be able to detect the possibility of an error and compensate

for it using ambiguity. Figure 4.5 presents the probability associated with the correct tag

in the cases where the tagger has made an error. Clearly a significant number of the errors

have a probability over 0.2, and yet Table 4.2 tells us that very few incorrect tags are given

probabilities over 0.2. Based on these results, we can conclude that using just the tags returned by the tagger with a probability over 0.2 will almost always give us the correct tag.

In those few cases where it does not work the correct tag is usually assigned a probability of


near zero and so we just accept that the parser will make an error on this word.

Figure 4.5: Histogram of the probability assigned to the correct tag, in the cases where the tagger chooses a wrong tag as best (x-axis: probability estimate, 0.0 to 0.7; y-axis: frequency)
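
To make the thresholding concrete, the following C++ sketch filters a tag distribution before it is passed to the parser. It is an illustration only: the TagProb structure and select_tags function are invented, and falling back to the single best tag when nothing clears the cutoff is an assumption of the sketch rather than something stated above.

    #include <vector>

    struct TagProb { int tag; double prob; };

    // Keep only the tags whose probability exceeds the cutoff (0.2 above),
    // falling back to the single best tag if nothing clears it.
    std::vector<TagProb> select_tags(const std::vector<TagProb>& dist,
                                     double cutoff = 0.2) {
        std::vector<TagProb> kept;
        const TagProb* best = nullptr;
        for (const TagProb& tp : dist) {
            if (best == nullptr || tp.prob > best->prob) best = &tp;
            if (tp.prob > cutoff) kept.push_back(tp);
        }
        if (kept.empty() && best != nullptr) kept.push_back(*best);
        return kept;
    }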

The computational efficiency of the POS tagger was not important here since it is only

used in the initialisation phase of the parser, rather than inside any loops. This means the

difference between a fast and a slow tagger will be lost as noise when measuring the parser’s

performance. However, for the sake of completeness it is worth mentioning that the tagger’s

complexity is linear in respect to the length of the input, with the tagging of a single word

requiring approximately thirty hash-table lookups. The initialisation of the tagger takes

one minute. After initialisation, the tagger processes a little under five hundred words per

second, which is significantly slower than average.

As always, it is useful to compare my approach to Collins’. In this case, his approach is

not as effective as mine. Since the parser’s probability model is written using conditional

probability, this approach means that it will take the parser a long time to reject a poor

choice of POS tag. This is because the probability of any given parent is computed using

P(parent|tag)× P(tag), and it can be extremely high for a poor choice of tag since all tags

have by definition a probability of one in Collins’ model, and that parent may be the only

parent ever seen with this tag. The effect of this is that a locally likely constituent will be

inserted into the chart, and only rejected when it is found not to work well globally. By

contrast, my method will assign a very low probability to a poor choice of tag, and so even

though the conditional probability P(parent|tag) is high, the overall structure is given a low

probability and the parser can reject it much faster. It is hard to imagine situations where

my approach will increase overall parser accuracy since these local errors are easy to correct

at a global level, but my approach will allow the parser to reject incorrect interpretations

much faster.

4.5 Implementation of the chart

We now come to the parsing algorithm itself. In this section, we discuss implementation of

the chart, the core data structure used by the parser.

The goal of the chart is to store all the nodes for each span. When describing determin-

istic parsers, our description of the chart data structure was rather vague. This is because

we have been talking about systems with small grammars and so the chart data structure

is just an implementation detail where any design would work. However for a statistical

parser the grammar is so huge that the wrong choice of data structure will make parsing

impossibly slow. Given our emphasis on software engineering, it is important to describe

the chart data structure in detail.

The most natural way of implementing the chart would be as a three-dimensional ar-

ray, in which the first two dimensions specify the start and end of the span, and the third


dimension stores the actual edges. Unfortunately, we do not know how many edges will

be needed for any given span of the input string. There are two possible strategies in this

situation. Firstly, we could preallocate a third dimension of some fixed size. But the third

dimension must then be big enough to hold the maximum number of edges stored for any

span. Secondly, we could replace the third dimension with a linked list, in which there is

no need to preallocate memory. However, allocating memory during run-time is compu-

tationally extremely inefficient. Moreover, the indirection involved in traversing a linked

list is also a little inefficient. To get around both these problems, note that the control struc-

ture of the parsing algorithm means that edges with a given start and end position in the

input string are added consecutively. This means that we can store the chart as a huge

one-dimensional array of edges, with a two-dimensional index array of pointers indicating

where the set of edges associated with each span are stored. There is then no wasted space

in the chart, and we still have constant time access to any span.
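
The following C++ sketch illustrates this layout: a single preallocated pool of edges plus a per-span index recording where each span’s block starts and how many edges it holds. The class and member names are invented, and the sketch relies, as the text does, on all edges for a given span being added consecutively.

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Edge { /* parent, head word, probability, ... */ };

    // One large preallocated pool of edges, plus a per-span index giving the
    // offset and length of each span's block of edges within the pool.
    class Chart {
    public:
        Chart(int n_words, std::size_t max_edges)
            : n_(n_words + 1), index_(n_ * n_, {0, 0}) {
            pool_.reserve(max_edges);              // allocate once, up front
        }
        // Called when the parser starts filling a new span.
        void open_span(int start, int end) {
            index_[at(start, end)] = {pool_.size(), 0};
        }
        void add(int start, int end, const Edge& e) {
            pool_.push_back(e);
            index_[at(start, end)].second += 1;
        }
        const Edge* span_begin(int start, int end) const {
            return pool_.data() + index_[at(start, end)].first;
        }
        std::size_t span_size(int start, int end) const {
            return index_[at(start, end)].second;
        }
    private:
        std::size_t at(int start, int end) const {
            return static_cast<std::size_t>(start) * n_ + end;
        }
        std::size_t n_;
        std::vector<Edge> pool_;
        std::vector<std::pair<std::size_t, std::size_t>> index_;
    };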

There are four further optimisations made in the chart data structure. Firstly, recall that

there is an optimisation which requires consultation of a grammar to avoid generating pro-

ductions which never occur in the training corpus. The grammar lists all the nonterminals

that are allowed to combine with each other in dependency productions. To implement this

optimisation, we add a third dimension to the index array, to hold the parent nonterminal

associated with each edge. The system can then ask for only edges which are ‘grammatically

possible’.

A second optimisation comes from noting that the control structure of the combine function means that we always process one complete edge and one incomplete edge, so it

would be more efficient if we could loop over all complete edges and all incomplete edges

separately. It thus makes sense to have two separate charts, one for complete edges and one

for incomplete edges.

A third optimisation is to avoid adding certain generated nodes to the chart at all. The use of cutoffs

by Collins is discussed in Section 4.6.1; but we can also reject nodes that our probability

model says are equivalent to a higher probability node already in the chart. This is the

Viterbi algorithm which was discussed in Section 2.6.3 and it speeds up the code by two

orders of magnitude. This is a nice example of the adage that it is better to optimise the

algorithm than the code.
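
The equivalence check can be sketched as follows, assuming each edge can be reduced to a signature covering exactly the features the probability model distinguishes; the class name and the 64-bit signature are inventions of the sketch.

    #include <cstdint>
    #include <unordered_map>

    // Viterbi-style filter: for each equivalence signature, keep only the
    // highest-probability edge seen so far and discard the rest.
    class EquivalenceFilter {
    public:
        // Returns true if the edge should be kept (it beats the best edge with
        // the same signature seen so far); false if it can be discarded.
        bool keep(std::uint64_t signature, double log_prob) {
            auto it = best_.find(signature);
            if (it != best_.end() && it->second >= log_prob) return false;
            best_[signature] = log_prob;
            return true;
        }
    private:
        std::unordered_map<std::uint64_t, double> best_;
    };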

A final optimisation in the chart data structure is to store only the most probable edges

for any given span in the chart. While the Viterbi algorithm guarantees not to delete the very

best parse, this final optimisation is more of a heuristic, simply ensuring that it is unlikely

that the best parse ends up being deleted. This optimisation is implemented by treating

the list of edges stored for any given span as a priority queue, ordered by the probability

of edges. For any given span, we can define the best edge as the edge at the head of this


queue. The number of edges on the queue is determined by a global cutoff threshold, which

specifies how close to the best edge’s probability an edge’s probability needs to be in order

to remain in the chart. Edges whose probability is beyond this threshold are marked as

unusable, and are not considered in further productions. A further benefit of the priority

queue is that add_singles_stops is called with the best edges first, and can therefore

more quickly reject edges in later calls.

This completes the description of my ‘chart’ data structure. To contrast my approach

with the one Collins used: he represented the chart as a single huge array, but used a two-dimensional array of pointers into this array which provides the next node with the correct start and end points. This representation gets most of the benefits of the one just

described — for instance it is trivial to preallocate the array. The main disadvantages of his

representation compared to mine are that his did not include the complete/incomplete opti-

misation, or the ability to iterate only over a particular nonterminal. However it did make

his implementation significantly easier; my chart management code is around a thousand

lines of code.

4.6 Implementing add_singles_stops and beam search

The add_singles_stops function has the task of finding parents for each new edge and creating more new edges, one for each of the possible parents, which it adds to the chart. Its input comes from join_two_edges, which produces incomplete edges, and so its first task is to see whether these edges could be considered complete, using add_stop. All of the edges that can be considered complete are added to the ‘complete’ chart ready for the next combine loop, and they are then extended to new parents using add_singles. The output from add_singles is then added to the ‘incomplete’ chart, and also processed again recursively

in order to support edges with no siblings.

The difficulty in implementing add singles is that it includes three nested loops and is

itself called recursively by add singles stops about five times. While none of these loops

is dependent on the size of the input sentence (i.e. the function is O(1)), an unconstrained

implementation would result in approximately 2000^5 edges being created. Even if these

edges were discarded by the chart on creation, the time taken to create them would make it

impossible to parse a simple sentence. To resolve this, Collins only expands edges likely to

be part of the final parse. As already mentioned, Collins uses a constrained best-first search

known as a beam search for this process. The edges produced by add singles are not

added directly to the chart, but to a fixed-size data structure called the beam, which holds

a list of edges ordered by decreasing probability. A new edge generated by an invocation

of add singles is inserted into the beam at an appropriate point, to keep it ordered. If


its probability is lower than that of the last edge on the beam, it is not added to the beam

at all — which is what happens most of the time. The benefit of this is that instead of an

unmanageable number of nodes being created, perhaps only a few hundred are created (of

which optimisations in the chart will still discard all but a handful). I will discuss the beam

search algorithm in more detail in Section 4.6.1.

Searching generally involves creating new nodes for each child being expanded. As was

mentioned in Section 4.5, allocating memory at run-time is a computationally expensive op-

eration and is undesirable in a program where efficiency is critical. Preallocation of memory

to store nodes is clearly a preferable option. Since a beam always has exactly n nodes on

it, it seems intuitively obvious that beam search could be implemented with preallocated

memory — but it proves surprisingly difficult to do efficiently. To support preallocation,

my implementation of beam search uses skiplists (Pugh, 1989). I will discuss skiplists in

Section 4.6.2.

4.6.1 Beam Search

The search space traversed by the beam search for a given application of add singles stops

has as its root node the edge generated by join two edges . Children of this node are all

the possible productions featuring parents of the edge, as found by add singles . Each

child node itself has children, corresponding to the possible productions adding grandpar-

ent edges — and so on. The complexity of the search space is in its huge branching factor,

not in its depth.

To manage the search space we use a heuristic search algorithm called beam search.

Unlike exhaustive best-first search, this sacrifices the requirement of finding the best parse

in order to reduce the computation required. The search is initialised by putting the root

node at the head of the beam. We then iterate, taking the node at the head of the beam and

generating its children, and adding these into the beam so as to keep it ordered. The general

idea of a beam search is that, just like best-first search, we add all children generated at each

step to a priority queue. But since the priority queue is a fixed size in beam search, some

children fall off the end and are silently discarded. Code for a beam search is given in Figure

4.6.³

It is worth mentioning explicitly that beam search implements a sort of cutoff, whereby

any new edge whose probability is further than the width away from the best edge will be

automatically discarded.

³ Programmers not used to functional programming are reminded that lazy evaluation means the outermost take will prevent generate_children from generating unneeded children.


beamsearch :: Int [Node] -> [[(Node,Real)]]
beamsearch width nodes =
    [ nodes :
        beamsearch width
            (take width                 // only keep some children
                (sort snd               // get comparison on second field
                    (fold ++ []         // [[1,2],[3],[4]] -> [1,2,3,4]
                        (map generate_children nodes))))]

Figure 4.6: Code for a beam search

4.6.2 Skiplists for implementing beam search

From an implementation perspective, it would be desirable to use an array to implement beam search since the beam is a fixed size; however, insertion into a sorted array involves moving a lot of data. A linked list is another possibility, although it reduces most operations to

O(n). In the literature, there appears to be very little discussion on how to implement beam

search more efficiently. This is perhaps because most beam searches are implemented with

a beam of perhaps ten or twenty elements, for which the difference between a computation-

ally efficient implementation and a simple implementation is insignificant. However, in a

statistical parser the beam needs to be very large because the heuristics are not especially ac-

curate. Collins notes that his beam is ten thousand nodes long; at this point considerations

of efficiency certainly override considerations of simple coding.

In order to implement beam search efficiently there are two requirements. Firstly, it must be cheap to find the appropriate place in the beam to add a child node that has just been generated. Secondly, we wish to avoid allocating new memory to store the set

of child nodes generated at each iteration before they are placed on the beam. The search

generates a large number of children which are immediately thrown away, and if care is not

taken, this will waste a lot of time in calls to malloc and free .

To address both of these issues, I implemented the beam data structure as a skiplist.

Skiplists (Pugh, 1989) are a variant on linked lists in which a number of ‘next’ pointers are

kept on each node instead of just one. The number is a function of the length of the list: for

a list of length n, we keep lg n pointers. These extra pointers allow the algorithm to ‘skip’

along the list. An analogy often used is that a skiplist is like a highway, with the different

‘next’ pointers providing high-speed lanes along the list (see Figure 4.7). How much benefit

do we gain from skipping? Following Figure 4.7 we can see the process for finding the next

item is directly isomorphic to binary search, and code to do so is presented in Figure 4.8.

From this we know that access to any item in the list can be achieved in O(lg n), while access


[Skiplist diagram omitted.]
Figure 4.7: A simple skiplist showing the first sixteen items

to the best item is O(1). Insertion into a skiplist is similar to insertion into a linked list,

except that the ‘next’ pointers need to be tidied up, as is shown in Figure 4.9. One side-effect

of the pointer management is that insertion is no longer an O(1) operation. Each node has

on average lg(lg n) pointers, and these need to be reattached during insertion and removal,

making it an O(lg(lg n)) operation. However, even for huge values of n this is so close to

O(1) as to be indistinguishable. Throughout the rest of this section, O(lg(lg n)) functions

will be referred to as O(1). Insertion into an arbitrary location of the list is O(lg n + lg(lg n))

which can be simplified to O(lg n).

node * skiplist::find(node * n) {

node *cur, *next;

cur = next = head;

for (int step = logn; step >= 0; step--) {

do { // iterate until next < n.

cur = next;

next = cur->next[min(step,cur->depth)];

} while (next->priority > n->priority);

}

return cur;

}

Figure 4.8: Code to find the highest node in a skiplist with priority ≤ n

In most applications, the size of the skiplist is unknown at construction time. This com-

plicates the process for deciding how many ‘next’ pointers to use. To resolve this, Pugh

assigns the number of ‘next’ pointers as a random variable with a probability distribution

designed to give the correct number of pointers on average if the final list size was n. This

clever hack is not needed here since the beam size is known and does not change.
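For completeness, a sketch of Pugh's randomised level assignment is given below; as just noted, it is not needed in my implementation, where the list size is fixed. The function and parameter names are illustrative.

#include <cstdlib>

// Pugh's standard trick: each new node gets a random number of 'next'
// pointers by flipping a fair coin until it comes up tails or the maximum
// level is reached, giving the right distribution of levels on average.
int random_level(int max_level) {
    int level = 1;
    while ((std::rand() & 1) && level < max_level)
        ++level;
    return level;
}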


void skiplist::insert(node * n) {
    node * place = find(n);   // last node with priority >= n's priority
    int i, curdepth, depth;

    // fill up place->next: splice n in at every level place and n share
    for (depth = 0; depth < min(place->depth, n->depth); depth++) {
        n->next[depth] = place->next[depth];
        place->next[depth] = n;
    }

    // fill up n->next: n is deeper than place, so its higher-level pointers
    // must be found by walking forward along the list
    if (n->depth > place->depth) {
        node * cur;
        curdepth = place->depth + 1;   // we've filled this far
        cur = n->next[0];
        do {
            if (cur->depth > curdepth) {
                for (i = curdepth; i <= cur->depth; i++)
                    n->next[i] = cur;
                curdepth = cur->depth;
            } else {
                cur = cur->next[cur->depth];   // skip ahead along cur's highest lane
            }
        } while (curdepth < n->depth);
    }
}

Figure 4.9: Code for inserting a node into a skiplist


Doubly-linked skiplists

As a small but extremely useful extension to Pugh’s idea, I implemented doubly-linked

skiplists (analogous to doubly-linked lists). This gives O(1) access and insertion to both

the start and end of the list rather than just the start. As with conventional skiplists, doubly-

linked skiplists have O(lg n) insert and lookup. Because the first item can be popped in O(1),

the list can be iterated over in O(n). This iteration is the main operation in add singles stops .

Evaluation of doubly-linked skiplists for beam search

Having developed a suitable data structure, we apply it to beam search. The benefit of

having a pointer to the last element in the skiplist is in memory management. The definition

of a beam search means we always need at most n elements in the beam, so the back pointer

gives O(1) access to an unused element. The operation of the beam search is then to pop this

unused element, fill it with the new child, and push it back to the beam. Because most new

children turn out immediately to be below the threshold, no elements in the beam need to be

moved and so constant time performance is usually obtained. By allocating n + 1 nodes for

a beam of length n, we can provide add singles with an empty node in almost constant

time by simply returning the last node in the list. This node is filled with values of the child

being generated, including a priority. It is then reinserted, most commonly at the start or

end of the list which is almost constant time, but even an insertion into another location is

O(lg n).
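The following sketch illustrates the recycling scheme just described. For brevity a std::list stands in for the doubly-linked skiplist, so the ordered insert here is a linear scan rather than O(lg n); the class and member names are hypothetical, but the logic of handing out the spare tail node and keeping the beam ordered without any allocation is the same.

#include <iterator>
#include <limits>
#include <list>

struct BeamNode {
    double priority = -std::numeric_limits<double>::infinity();
    /* ... edge fields ... */
};

class Beam {
public:
    // width real slots plus one spare at the tail (assumes width >= 1)
    explicit Beam(int width) : nodes_(width + 1) {}

    // Hand the caller an empty node in O(1): the spare kept at the tail.
    BeamNode* spare() { return &nodes_.back(); }

    // After the caller fills the spare in (including its priority), keep it
    // only if it beats the current worst element; otherwise it simply stays
    // at the tail as the spare, which is the common case and costs nothing.
    void offer() {
        auto spare_it = std::prev(nodes_.end());      // the node just filled in
        auto worst_it = std::prev(spare_it);          // current worst real element
        if (spare_it->priority <= worst_it->priority)
            return;                                   // discarded: remains the spare
        // Find the ordered position (linear here; O(lg n) with the skiplist)
        // and splice the node in without allocating anything.
        auto pos = nodes_.begin();
        while (pos->priority >= spare_it->priority) ++pos;
        nodes_.splice(pos, nodes_, spare_it);
        // The evicted worst element is now at the tail and becomes the new spare.
    }
private:
    std::list<BeamNode> nodes_;   // ordered by decreasing priority, plus one spare
};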

For a more quantitative analysis, Figure 4.10 shows the time taken by the skiplist to pop

an element off the front of the beam and insert it with a random priority. Normally it would

be appropriate to show a number of different graphs, examining the time if elements are

popped off the back, or inserted with non-random priority. But this graph clearly demon-

strates that the timings are so close to being independent of beam size, and linear in the number of insertions, that there is no benefit in examining performance further. Similarly,

profiler results are as would be expected: three-quarters of the time is spent inside find ,

with approximately one-eighth of the time spent inside each of push and pop . When used in

beam search, these figures are expected to change so that push and pop take a greater pro-

portion of the time. This is because the heuristic evaluations are typically either high or low

rather than uniformly distributed, which will make execution of find significantly faster,

without affecting push or pop .

Another parser-related issue is how to interleave the calls to add stops with the calls to

add singles . One choice is to embed the call to add stop within the call to add singles .

This will mean that children are inserted onto the beam ahead of nodes at the first layer, so

that some probable grandchildren will be generated before all the children are generated.

Another choice is to place all generated children onto a second beam and then process them


[Surface plot omitted; axes: beam size, number of insertions, time.]
Figure 4.10: Time taken by the skiplist to insert random elements with different beam sizes

all at once using add stops . I took the latter choice mainly because it makes it easier for

the probability model to cache intermediate values.

Overall, doubly-linked skiplists have proven to be a novel and efficient method of imple-

menting beam search for large n. Contrasting this approach to the standard in the literature,

a heap-based approach should outperform the skiplists for small values of n, but be outper-

formed by the skiplist for large values of n. As with the chart, it is interesting to compare

my approach to Collins’. It turns out Collins does not actually implement classical beam

search, but instead uses an array of edges being expanded with a threshold — if an edge is

a certain amount worse than the best edge then it is discarded. This is significantly simpler

and somewhat more efficient than my approach. However, calling it beam search is stretch-

ing the definition. In order to more accurately simulate Collins’ results I also implemented

an array approach. From a wall-clock perspective the two approaches perform identically

— that is, inserting n items into the skiplist is sufficiently close to linear that the difference

between it and true linear is not a performance concern.

4.7 Some software engineering lessons learned

This concludes discussion of my reimplementation of Collins’ parsing system. However,

before presenting some results to summarise the performance of my system, it is useful to

summarise some of the software engineering principles which are important in the devel-

opment of a piece of software as complicated as a lexicalised statistical parser. Most of these

principles were learned the hard way, and anyone who wants to implement a similar sys-

tem of their own would do well to take them on board from the outset. This section doesn’t


address coding issues, but rather focuses on the software engineering processes used. It

largely concentrates on the mistakes made, and how to avoid them, but occasionally men-

tions something which worked well, especially if Collins did it differently.

There is a famous quote: “Plan to throw one away; you will anyhow.” (Brooks, 1982).

I did not have time to write the system again, but I will note here how I would implement the system better. This is not how I would improve on the system (see Section 8.2.1 for that) but

how the results of this chapter could have been achieved with less work.

4.7.1 Programming languages for statistical parsing

Initially my parser was implemented in Lisp, with the naive implementation taking only

slightly more lines of code than the pseudocode. It was far too slow; millions of incorrect

interpretations were created, rejected and garbage collected for every correct phrase found.

It would have been possible to preallocate structures to reduce the garbage collection but

this would have defeated the point of using Lisp. Since time spent preprocessing the corpus is irrelevant, the Lisp preprocessor from this implementation is still used.

The second incarnation was in the language Clean (Plasmeijer, 1998), a functional lan-

guage similar to Haskell. This language supports lazy evaluation and it was hoped this

could be used to avoid generating unlikely phrases. It appears this is possible and it would

be a very promising project, but while it took virtually no time to hack up prototypes in

Clean and writing efficient Clean is entirely possible, it took me longer to write good code

in Clean than it took to write good code in other languages. My conclusion from this is

that Clean is naturally suited to tasks where you understand what you have to do, and un-

suited to tasks where you have little idea how you are going to approach the problem. One

interesting component, completed before this implementation was abandoned, used the C in-

terface to Clean to access a SQL database. This approach turned out to be an excellent way

of performing tasks Clean was not well suited to (I/O and memoisation). Were I to reimple-

ment the parser again I would use Clean with the C interface to handle memoisation. This

is because I now know exactly how to design the parser and well designed Clean code is a

joy to read compared to well designed C code.

A number of languages were then considered for a third reimplementation now that

some properties of the problem were known. The language had to be capable of fast execution, and of permitting the rules to be broken in the core loops. At the same time the language had

to be relatively readable and modular since my goal after writing the parser was to make

extensions and so I did not want to be locked into Collins’ design. Java was rejected because

it provided no means of avoiding the garbage collection problems that plagued the Lisp

implementation. Specifically, it is easy to write simple Java code and it is possible to write

efficient Java code with my own memory management but it is certainly not easy to write


simple and efficient Java code. The same argument would apply to C#, Python, Perl, and

any similar language that abstracts away pointers.

The final program was written in C++. The object oriented approach was chosen over

ANSI C since it was more suited to reimplementing components. Having already written

the program twice I hoped to only have to rewrite components rather than the entire system.

4.7.2 Revision control

Anybody building a nontrivial program will use a source code control system such as CVS

or Subversion. We found that simply using version control is insufficient since, for instance,

improvements to the preprocessor would often break the parser since it depended on the

older format for the data files. What became necessary was to branch the code so that

the parser was developed with a stable version of the preprocessor while extensions to the

preprocessor were developed in a separate branch. Later, when the preprocessor and the

parser were relatively stable, the preprocessor could be switched to a new version and all

resulting incompatibilities fixed.

Another related step was the development of a build script. There are a large number of

steps involved in converting the treebank and other data into a format suitable for parsing.

It is relatively easy to perform these steps sequentially. However that means any change to

one of the earlier steps (such as a tweak to the tokeniser) requires every subsequent step to

be repeated. Since there is usually output from the previous version lying around, it was

often the case that output files from different versions of the code would be used at the same

time, leading to subtle errors. Developing a single build script which reruns every step downstream of a change removed this class of error.

Finally, version control applies to files rather than subroutines, but I often found that I

needed to write almost identical blocks of code but the differences were such that I could not

write a general function, perhaps because the differences could not be expressed as function

arguments in C++. Initially I simply wrote the same code twice but this invariably leads to

bugs being fixed in one version but not in another. My solution to this was to use source

code preprocessing so that our single ‘meta’ version generates multiple functions, each with

slightly different logic. I used the tool funnelweb (Williams, 1992) for this purpose. A nice

advantage of this over general functions is that the resulting code is much easier to read

than a highly generalised function full of if statements for its various options.

4.7.3 Efficiency and debuggability

Premature optimization is the root of all evil

– C. A. R. Hoare

Tony Hoare’s quote is frequently used to discourage optimisation before profiling. Through-


out this implementation, I found the opposite to be true. Every time I wrote code for cor-

rectness instead of efficiency I found the parser could not complete a single sentence. Even

when the parser still worked, the decreased efficiency made the parser harder to debug be-

cause it took longer to test other parts of the unoptimised parser than it would have taken

to optimise the current part. To take one random example, the grammar was initially im-

plemented using the STL set class since it saved me having to write my own set classes.

However, grammar lookups are performed in both join two edges and add singles

and so while the STL implementation took virtually no time to verify, it took two hours to

parse a sentence, which made it harder to concentrate on other parts of the parser than if

I had written and then debugged my own. Essentially, because I am pushing the limits of

what can be achieved with available technology, it is essential to optimise prematurely, even

though it is still undesirable to do so. It is because of this that throughout this chapter I have in-

terleaved implementation and optimisation techniques; perhaps some of the optimisations

are superfluous but delaying optimisation was not possible.

A closely related point is that the most efficient data structure is harder to debug. For

instance, my hash keys in Lisp are arbitrary precision, making it very easy to map keys back

to data values and detect bugs in key generation. However in C, and therefore in Collins’

implementation, we are constrained to thirty-two bits, which is likely to be more efficient

but cannot be mapped back so easily. Similarly, Collins uses array offsets to refer to edges

in the chart where we use pointers. Depending on compiler optimisations our code may be

slightly faster as a result, but tracking an edge through parsing is much easier in Collins’

system.

‘Magic numbers’ are another area in which bugs can easily creep into the system — for

instance, setting the maximum number of nonterminals to 100 might be correct at first, but

later adding -C complements could easily overflow this and lead to data corruption. I man-

aged to avoid many of the problems here by automatically generating the declarations of

constants from the input files, so any change to the input files will automatically appear in

the source code. Similarly, many functions in the probability model take a dozen or so pa-

rameters, and getting these in the wrong order will not cause any typecast errors since they are all integers; it will just generate invalid output. This problem was avoided by implementing basic datatypes as different classes so that an incorrect order does result in typecast

errors. Curiously, Collins uses magic numbers everywhere.

4.7.4 Debugging methodology and test suites

Debugging the parser turned out to be extremely difficult. It is not so hard to detect the

presence of a bug, but isolating where in the process this bug is introduced could take a

week. In a normal program a bug can be isolated by stepping through its operations on


simple input but with a statistical parser there are far too many operations to do this for even

the most trivial input. The best approach I found was to spend a lot of effort detecting bugs

as soon as possible after they are introduced. For instance, if a bug in the tokeniser leads to

a small number of events not being generated then it is critical to detect this problem during

the generation of the event file rather than during the execution of the parser.

In order to facilitate this, after testing every function I wrote an automated test suite that

rechecks functions every time the system is built. For example, the probability model can

be checked by comparing the counts it derives to those produced with grep . If a bug is

later introduced in the input to this function then it will likely cause some test-case to fail.

Similarly, the system is liberally scattered with assert statements that perform everything

from internal bounds checking to checking that the skiplist is in sorted order and still has n

elements. As a last resort, I also made extensive use of the memprotect kernel call to lock

any data that was not currently being edited (such as the hash tables). This allowed me to

catch a number of bugs where I had missed an assertion.
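As an illustration of this technique (not the parser's code), a table can be write-locked with the POSIX mprotect call whenever it is not being edited. The helper names below are hypothetical; mprotect operates on whole pages, so the buffer is page-aligned and rounded up to a whole number of pages.

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Allocate a page-aligned, page-rounded buffer for a table.
static void* alloc_table(size_t bytes, size_t* rounded_out) {
    size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    size_t rounded = ((bytes + page - 1) / page) * page;
    void* p = nullptr;
    if (posix_memalign(&p, page, rounded) != 0) return nullptr;
    std::memset(p, 0, rounded);
    *rounded_out = rounded;
    return p;
}

// While the table is not being edited, make it read-only: a stray write
// through a bad pointer now dies immediately with SIGSEGV instead of
// silently corrupting the counts.
static void lock_table(void* p, size_t rounded)   { mprotect(p, rounded, PROT_READ); }
static void unlock_table(void* p, size_t rounded) { mprotect(p, rounded, PROT_READ | PROT_WRITE); }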

A related technique that proved to be useful was to design with debugging in mind.

For instance, the first implementation of the chart did not include any functions to query

or summarise the chart because such functions were not needed by the parser. However, in

debugging other aspects of the parser I frequently found myself wondering if certain edges

had made it into the chart or if they had been culled before then. By implementing these

extra query and summary functions, it became much easier to debug other parts of the program.

A final comment is that I found high-level debugging to be much less useful than low-

level debugging. For instance, by examining the sentences the parser performs poorly on

it may be possible to infer it has a problem, perhaps one related to coordination. But this

approach turned out to be significantly more time-consuming than simply verifying every

function independently, mainly because the parser was too big for the bug to be located easily once the high-level approach had established its existence.

4.7.5 Naming of variables and parameters

Collins’ code is frequently hard to read, because functions are called with parameters whose

meanings are hard to remember. For instance, to access a left dependency event, we call the

function get dep prob with the fifth parameter set to 1, while to access a right depen-

dency event, we set this parameter to 0. The code would be easier to read if constants were

declared for ‘left’ and ‘right’, and used in invoking the function. There are many other ex-

amples in Collins’ code: start nonterminals, subcategorisation frames, nonterminal names

are all accessed using nonintuitive parameter names. Collins also has a mapping between

nonterminals/words and numbers which is inaccessible outside the program — the com-

puter doesn’t care, but the person using the debugger has very little idea which NT is meant


by 53. I did my best to store all magic numbers in explicit global constants. Later I put them

into environment variables to make it even easier for different programs to use the same

numbers.

A similar point is the explicit use of types. These aren’t supported particularly well by

current programming languages but the ability to have words stored in a ‘Word’ type that

is explicitly made incompatible with the ‘Tag’ type prevents a lot of errors such as calling the

left production with the arguments (lw,lt,lnt,P,H,w,t,delta,lc) instead of (lw,lt,lnt,w,t,P,H,delta,lc).

This is a very painless way of catching otherwise very hard-to-find bugs.
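A sketch of the kind of thin wrapper classes meant here: the names Word and Tag follow the discussion above, while the function and its argument list are purely illustrative. Because neither type converts implicitly from int or from the other, swapping two arguments becomes a compile-time error rather than a source of silently invalid probabilities.

// Thin wrappers around the integer codes used throughout the parser.
class Word {
public:
    explicit Word(int id) : id_(id) {}
    int id() const { return id_; }
private:
    int id_;
};

class Tag {
public:
    explicit Tag(int id) : id_(id) {}
    int id() const { return id_; }
private:
    int id_;
};

// Purely illustrative signature: calling this with the word and tag of the
// dependent swapped, or with two words where a word and tag are expected,
// no longer compiles.
double dependency_prob(Word head_word, Tag head_tag, Word dep_word, Tag dep_tag) {
    (void)head_word; (void)head_tag; (void)dep_word; (void)dep_tag;  // stub body
    return 0.0;
}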

One area in which Collins got it right and I got it wrong, was in referencing edges. I used

pointers, while Collins used array indices. Again, either is equally simple for the computer

but Collins’ indices make it easy for a person debugging the code to trace where every node

comes from; finding which operations were performed to build the best parse is trivial in his parser, but quite a chore in mine.

4.8 Results of the parser

We are finally at a point where we can evaluate the parser. All evaluation will be performed

using the ‘evalb’ tool that Collins produced to evaluate his own parser. We start with a

brief evaluation of Collins’ parser and then examine both the preprocessor and the parsing

algorithm using my system, evaluating after each step. At this point we have my complete system, and so we analyse its time complexity; finally we analyse the

errors made by my parser, starting with the differences between my parser and Collins’,

and then examining the errors made by both parsers. For all of the evaluations, we will use

Collins’ convention of ignoring sentences whose length is greater than forty.

4.8.1 A re-evaluation of Collins’ parser: precision and recall

Before we begin to analyse the performance of my system, I will present an analysis of

the performance of Collins’ system in Table 4.3. This table was generated using the parser

Collins released to accompany (Collins, 1997), parsing Section 23 of the WSJ. These results do include the effects of some changes I made in order to ease later modifications. These

modifications included the addition of tracing and debugging code, as well as hooks for

future extensions. The benefits of these extensions will be examined later, but one effect of

them is that the results presented here do not precisely match those published in (Collins,

1997). My goal throughout this chapter has been to produce a parser which can reproduce

this table.

Later Collins improved on his model, largely by correcting the generation of coordina-

tion and punctuation. Results from running the improved version are presented in Table 4.4.


Number of sentences = 2245

Number of Error sentences = 2

Number of Skip sentences = 0

Number of Valid sentences = 2243

Bracketing Recall = 85.18

Bracketing Precision = 85.05

Complete match = 24.83

Average crossing = 1.01

No crossing = 65.05

2 or less crossing = 85.38

Tagging accuracy = 96.50

Table 4.3: Results from Collins’ 1997 parser including my code hooks

(Predictably, these are almost identical to the performance given in (Collins, 1999, p. 190)).

Because my parser was already designed before this version was released, it was decided

not to modify my parser to match the newer version. Some initial work has been completed

Number of sentences = 2245

Number of Error sentences = 2

Number of Skip sentences = 0

Number of Valid sentences = 2243

Bracketing Recall = 88.52

Bracketing Precision = 88.68

Complete match = 36.07

Average crossing = 0.92

No crossing = 66.70

2 or less crossing = 87.12

Tagging accuracy = 96.74

Table 4.4: My evaluation of the parser in Collins’ thesis (Collins, 1999)

in copying improvements in the probability model to Collins’ newer version, but for now I

note that the goal was to build a parser which could later be modified and so it was not es-

pecially important that the best version of Collins’ parser be used (for example, Model 2 was chosen

rather than Model 3 as Model 2 is simpler to implement).


4.8.2 Evaluation of my preprocessor and parser: precision and recall

The preprocessor and the parser are two totally separate pieces of code. Because there were

a lot of decisions in the development of the preprocessor that Collins did not mention (see

Bikel (2004)), it is important to evaluate it independently of the parser. For instance there

is ambiguity in the headword rules and if these are executed incorrectly then this can be

expected to lower the parser’s accuracy but it does not mean there are any errors in the

parser. In order to test this I evaluated my system while using Collins’ event file; the results are shown in Table 4.5. It would also be desirable to evaluate Collins’ parser using my preprocessor, but in

doing so I was unable to obtain over 60% precision/recall so there is clearly still a bug in my

preprocessor. It is interesting that even with well over ninety percent of the events generated

correctly, the parser’s performance is abysmal.

Number of sentence = 2245

Number of Error sentence = 4

Number of Skip sentence = 0

Number of Valid sentence = 2241

Bracketing Recall = 84.91

Bracketing Precision = 85.30

Complete match = 24.59

Average crossing = 1.06

No crossing = 64.66

2 or less crossing = 85.10

Tagging accuracy = 96.44

Table 4.5: Results from my parser using Collins’ preprocessor

Since Table 4.5 is extremely similar to Table 4.3, we can conclude that the parser is work-

ing correctly. To verify this, we also evaluated my parser’s results while using Collins’

parser’s results as a gold standard in Table 4.6. Essentially this table shows that my parser

works almost identically to Collins’. Unfortunately, I was unable to precisely reproduce

Collins’ preprocessor output. Since the important part of a statistical parser is the parser, it

was decided to simply continue reusing Collins’ preprocessed events.

There are still a few errors highlighted by this table (a perfect reimplementation would

obtain a 100% exact match). However, an analysis of these errors showed they occur almost

exclusively in sentences containing awkward coordination and punctuation. This strongly

implies the discrepancies are caused by conditions that Collins treats as special cases. We

are not interested here in reproducing Collins’ parser to the level of detail of having to reim-

plement all special cases, and so we consider this implementation more than accurate enough


Number of sentences = 2245

Number of Error sentences = 6

Number of Skip sentences = 0

Number of Valid sentences = 2239

Bracketing Recall = 94.33

Bracketing Precision = 94.92

Complete match = 76.33

Average crossing = 0.57

No crossing = 84.77

2 or less crossing = 91.69

Tagging accuracy = 98.32

Table 4.6: Results from my parser using Collins’ output as a gold standard

as a basis for extensions.

One curious result that I discovered in the creation of the above tables was that precision

and recall drop as more of Section 23 is parsed. That is, the performance on the first quarter,

half, etc. of Section 23 is invariably higher than the final precision and recall figures. I

do not know if this is coincidental, a side effect of people developing the treebank being

inconsistent, or a side effect of Collins’ probability model being tuned based on results from

earlier parts of Section 23. Regardless, it is an interesting result which has not been reported

elsewhere. It is also a result worth remembering when debugging the parser since the parser

needs to be doing better than expected at first in order to end up at the expected value.

4.8.3 The complexity of Collins’ and my parsers

As well as showing that my parser is able to reproduce Collins’ results, it is important to

show it has the same time complexity. Efficiency is not a major concern in this project,

except inasmuch as is necessary for the parser to produce results fast enough for debugging

and evaluation. However if the parser has different complexity then it implies the system

will not scale.

Figure 4.11 shows a scatter-plot of time taken versus sentence length for my parser. It is

plausible that this figure is the O(n³) that Collins predicts, but it is hard to be sure without linear regression. Plotting the logarithm of the time against the logarithm of the sentence length, shown in Figure 4.12, produces a graph which is clearly linear, showing that the complexity is polynomial; had the graph been nonlinear, the complexity might have been exponential. Linear regression on the log values gives k = 3.5 with very little error (residual = 0.55, R² = 0.93), so we can conclude the polynomial is O(n^3.5). Performing the same calculation


[Scatter plot omitted; x-axis: sentence length, y-axis: parse time.]
Figure 4.11: Scatter-plot of time taken by my parser to parse sentences of different lengths


[Scatter plot omitted.]
Figure 4.12: Scatter-plot of log(time) versus log(sentence length) — the gradient is the parser’s complexity


on Collins’ parser gives a similar but slightly better result, k = 3.05 at a similar confidence

level. It is likely that Collins’ slightly better complexity is due to my reimplementation being

much more cautious in its dynamic programming, keeping edges which do not take part in

the final result in order to make it easier to debug any errors.
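For reference, the exponent k can be recovered by ordinary least squares on the logged data points. The sketch below uses a few invented (length, time) pairs purely for illustration; the real calculation was performed over the per-sentence timings plotted above.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Fit log(time) = k * log(length) + c by ordinary least squares; the slope k
// is the estimated exponent of the parser's polynomial complexity.
double complexity_exponent(const std::vector<std::pair<double,double>>& runs) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto& r : runs) {
        double x = std::log(r.first);    // log sentence length
        double y = std::log(r.second);   // log parse time
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double n = static_cast<double>(runs.size());
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

int main() {
    // Hypothetical (length, seconds) measurements, purely for illustration.
    std::vector<std::pair<double,double>> runs = {
        {10, 0.9}, {20, 10.2}, {30, 43.0}, {40, 118.0}
    };
    std::printf("estimated exponent k = %.2f\n", complexity_exponent(runs));
}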

4.8.4 Evaluation of my parser with my new POS tagger

In the evaluation of the POS tagger, I noted that its true benefit was not in its accuracy,

but in how it reduced the number of edges in the chart. Any errors in accuracy can be

trivially resolved by selecting the second-best tag, while making the chart smaller should

make virtually every loop in the parser shorter and so speed up parsing. Since Charniak has

published results showing the inclusion of multiple tags did not significantly increase his

parser’s accuracy (Charniak et al., 1996), I did not expect to see any improvement outside of

efficiency.

This hypothesis proved to be correct, with the number of edges in the chart dropping by

around fifteen percent, which led to a corresponding drop averaging thirty-five percent in

parsing time. The primary reason my parser is significantly slower than Collins’ is that it

keeps many more candidates and so this relatively large increase in efficiency would be ex-

pected to be much lower if the same technique was applied to a parser with less conservative

dynamic programming.

Parsing precision and recall were 84.98 and 84.91, a tiny and almost certainly coinciden-

tal increase over the parser without the tagger. Precision and recall can also be measured

against the old output of the parser without the tagger, which gives 98.48 and 98.39. This

result tells us that the inclusion of the tagger really does make very little difference, rather

than causing the parser to make a similar number of different errors.

4.8.5 An analysis of the errors in Collins’ parser

Collins’ parser may be the current state-of-the-art, but it still makes many errors; the results

just presented show that the parser makes some error in three out of every four sentences.

So, what are those errors, and what can we do to eliminate them?

Intuition would say that longer sentences lead to more errors. If this is so, then we should

concentrate on extending the parser’s beam and otherwise keeping alternatives available

for longer. However, as Figure 4.13 shows, there is very little correlation between sentence

length and parsing accuracy. More formally, we can state that the correlation coefficient

between sentence length and accuracy is somewhere between -0.22 and -0.14, so there is a

slight correlation, but not enough to justify further work (approximately three percent of the

variance in accuracy is caused by sentence length).


[Scatter plot omitted; x-axis: sentence length, y-axis: precision.]
Figure 4.13: Parsing accuracy versus sentence length.

Instead we have to look for other sources of errors. Eyeballing the output implies that

sentences containing rare words are frequently parsed incorrectly. See for instance Figure

4.14, which shows two versions of the sentence He’s a NOUN, one parsed correctly and one incorrectly.

[Two parse trees omitted: the correct parse of He’s a fool and the incorrect parse of He’s a bore.]
Figure 4.14: Two parse trees showing that changing ‘bore’ to ‘fool’ corrects the parse.

The principal difference between these two sentences is that in the correct parse the parser

had available accurate statistics for all the words used, while in the second parse it had to

resort to using the POS tag for the key word bore. Further anecdotal evidence for this hypoth-

esis is provided in Tables 4.7 and 4.8 which show a random sample of correctly parsed and

poorly parsed sentences respectively. In these tables, id is the reference number for the sen-

tence, and Rank is a simple metric for measuring the frequency of the words in the sentence (the rank of a word is its position in a list of words ordered by frequency, so the most frequent word has a rank of one). Finally, in Table 4.8, Cause is my interpretation of the most fundamental error in the parse. It is clear from these tables that rare words are a significant problem.

id    Rank   Sentence
1838  2657   When selling is so frenzied, prices fall steeply and fast.
704   12770  Revenue gained 6% to $2.55 billion from $2.4 billion.
105   4455   The $409 million bid is estimated by Mr. Simpson as representing 75% of the value of all Hooker real-estate holdings in the U.S.
391   21     Although Mr. Pierce expects that line of business to strengthen in the next year, he said Elcotel will also benefit from moving into other areas.
135   76     The Boston firm said stock-fund redemptions were running at less than one-third the level two years ago.
837   12770  In the previous quarter, the company earned $4.5 million, or 37 cents a share, on sales of $47.2 million.
1354  21455  But when something is inevitable, you learn to live with it,” he said.

Table 4.7: A selection of correctly parsed sentences

The advantage of this metric over raw frequency stems from Zipf’s law: extremely frequent words would tend to swamp all comparisons, and the difference between a word occurring once and ten times would be invisible. Note that this metric causes the exponentially decreasing

frequency of words to produce a linear sequence which is the same effect as computing the

logarithm of the word’s frequency. For sentences, the most natural approach would be to

define the sentence in terms of its highest ranked word but this turns out to be a poor choice:

non-headwords are not used in unary productions and are only used once in dependency

productions. So instead we look for the highest ranked headword.
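Concretely, the metric can be computed as in the following sketch (hypothetical names, not the evaluation code itself): words are ranked by corpus frequency, and a sentence is assigned the rank of its rarest headword.

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Build word ranks from corpus frequencies: the most frequent word gets
// rank 1, the next rank 2, and so on.
std::unordered_map<std::string,int>
build_ranks(const std::unordered_map<std::string,long>& freq) {
    std::vector<std::pair<std::string,long>> byfreq(freq.begin(), freq.end());
    std::sort(byfreq.begin(), byfreq.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::unordered_map<std::string,int> rank;
    for (size_t i = 0; i < byfreq.size(); ++i)
        rank[byfreq[i].first] = static_cast<int>(i) + 1;
    return rank;
}

// The rank assigned to a sentence is that of its rarest (highest-ranked)
// headword, as discussed above; unseen words get a rank past the end.
int sentence_rank(const std::vector<std::string>& headwords,
                  const std::unordered_map<std::string,int>& rank) {
    int worst = 1;
    for (const auto& w : headwords) {
        auto it = rank.find(w);
        int r = (it == rank.end()) ? static_cast<int>(rank.size()) + 1 : it->second;
        worst = std::max(worst, r);
    }
    return worst;
}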

In order to test this hypothesis more formally, we need to measure how the rank corre-

sponds to parse accuracy over the whole corpus instead of just a few sentences. A graph of

this is presented in Figure 4.15. It turns out that graphing this to highlight the trend is quite

difficult; there are more sentences with low-ranked headwords and so it is natural to see a

greater variation in accuracy. However, the peak on the left side grows disproportionately

faster than would be expected. This growth means that sentences with high-ranked words

are parsed correctly more often.

Since it is hard to see the trend, it is useful to analyse the data more formally by per-

forming linear regression on the data points. It may seem counter-intuitive to use linear

regression on data that is clearly nonlinear, but linear regression is still useful on nonlinear

data because it tells us how the mass of the data changes as we increase one variable, which

gives us the global trend we are looking for. Specifically the gradient is −120 (with an error

of up to 40) which proves that less frequent words lead to a worse parse. If the least frequent


Figure 4.15: Rank of the sentence’s least frequent head word versus parse

accuracy


id    Prec.  Rank   Sentence / Cause
94    26%    9044   Earlier the company announced it would sell its aging fleet of Boeing Co. 747s because of increasing maintenance costs.
                    Cause: because PP very poorly attached
1758  31%    13742  My colleagues and I fully realize we are not a court . . . etc.”
                    Cause: adverbial completely misinterpreted
1111  42%    12365  Call it the “we’re too broke to fight” defense.
                    Cause: to fight not attached inside the quote!
1767  43%    46203  Of course, Mr. Lantos doth protest that his subcommittee simply seek information for legislative change.
                    Cause: protest... interpreted as a NP
1550  47%    32234  Here’s what Ronald Reagan said after the 1987 crash: “The underlying economy remains sound.
                    Cause: Not interpreted as two phrases around the colon
1370  50%    20701  But that was all of three months ago.
                    Cause: Many errors, including adverbial

Table 4.8: A selection of poorly parsed sentences

headword in the sentence is one hundred words less frequent, we would expect the final precision/recall to be, on average, one percent worse. From this, we can finally conclude

that sentences containing rare words are parsed less accurately. Note that this result does

not contradict Klein and Manning’s results; for instance one hypothesis that supports both

results is that tags are usually sufficient but when they are not, words are necessary.

4.9 Summary

At this point it is reasonable to ask what has been achieved. In brief summary: this chapter

described a reimplementation of Collins’ parser. The reimplementation takes twice as much

code, obtains very slightly lower performance and is significantly slower. On the other hand,

the modular way in which the parser has been written makes it easier to change how the

system works.

The remainder of this thesis will describe one such change, which addresses the difficul-

ties with rare words just outlined in Section 4.8.5.


Chapter 5

Thesaurus-based word representation

The results of the previous chapter show Collins’ algorithm performs sub-optimally with

uncommon words. The reason for this can easily be seen by examining the backoff rule in

Equation 3.4. Put simply, if a word is uncommon it is discarded and the parser essentially

reverts to using a probabilistic context-free grammar. The significance of this in practice

is unknown, but since Zipf’s law says most words are uncommon and to date the parser

has only been tested in the training domain it has the potential to be very significant. I have

therefore decided to concentrate the second phase of my PhD on resolving this problem. My

hypothesis is that poor backoff is the largest remaining problem in statistical parser accuracy

on sentences in the same genre as the training corpus, and a significant reason why statistical

parsers do not perform well with unedited text or outside the domain of their training data.

To solve the sparse data problem, it is not possible simply ‘to build a bigger training cor-

pus’. Firstly, this would be extremely expensive; WSJ-style corpora need to be constructed

by human annotators. Secondly, Zipf’s law means even a slight improvement would require

a much larger corpus. Thirdly, it does not help with shifting domain. And finally it has little

academic interest. The main suggestion I want to pursue is that we should back off rare

words by grouping words into categories of semantically related words — in other words, by

adopting a level of backoff in between single words and parts-of-speech.

In this chapter and the following one, I consider the question of how to generate rep-

resentations of words which allow the semantic similarities between words to be made ex-

plicit, so that they can be grouped for the purposes of backoff. (The process of using these

word representations in a backoff scheme for the parser will be dealt with in Chapter 7.) The

present chapter motivates the ideas of thesaurus-based backoff, and gives a review of the

literature on automatic generation of word representations. In Section 5.1, I provide support

for the general idea of backing off by grouping semantically related words by giving some

concrete examples. In Section 5.2, I provide some criteria for kinds of measures of semantic

relatedness which will be useful for our purposes. In Section 5.3, I will survey the literature


about how to produce measures of semantic relatedness between words, and evaluate sev-

eral different proposals according to these criteria. I will argue that the best measure for our

purposes is one devised by Hinrich Schutze (1993).

5.1 An example of the benefits of grouping similar words

The normal method for representing words in a statistical parser is as a simple enumeration.

This might be alphabetical or in the order the words happened to occur in the training cor-

pus, so that for instance cat might be encoded as 4424 , while cataclysms is encoded as 4425 .

Using this method there is clearly no semantic correlation between words with a similar en-

coding and so the parser does not use the word encoding as anything more than part of the

hash-table key.

Since many words do not occur frequently in the training corpus, we do not have useful

counts for them. For instance, cataclysms only occurs once in the WSJ: ... such short-term cata-

clysms are survivable ... and yet forcing every use of cataclysms to this grammatical structure,

or even strongly favouring it, would be completely incorrect. Collins’ probability model,

and that of every other statistical parser, would replace cataclysms with its POS tag NNS (or

to be more technically correct, generate a new hash-table key which does not include the

word, but still includes the encoded POS tag). However, discarding the word discards a

significant amount of information useful to parsing. For instance, cat has the same POS tag

but, being both concrete¹ and animate, has significantly different usage.

An alternative encoding might still give cataclysm the value 4425 but give 4426 to

calamity , 4427 to convulsion , and so on. A more general word, such as disaster may

get an encoding of 442 . Such an encoding would mean that if the counts for cataclysms

were insufficient, it could be replaced by or grouped with those for disasters or calamity ,

etc. depending on the context. This new encoding provides more information than discard-

ing the word entirely while significantly increasing counts.
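A sketch of how such a code could drive backoff, using the illustrative codes and an invented count threshold from the example above: when a word's own counts are too low, digits are stripped off its code, climbing to progressively broader categories.

#include <string>
#include <unordered_map>

// Back a word's category code off to a broader category until the training
// counts for that code are deemed reliable.  Codes follow the example above:
// "4425" (cataclysm) generalises to "442" (disaster-like words), and so on.
std::string backoff_code(std::string code,
                         const std::unordered_map<std::string,long>& counts,
                         long min_count = 50) {   // illustrative threshold
    while (code.size() > 1) {
        auto it = counts.find(code);
        if (it != counts.end() && it->second >= min_count)
            break;                                // counts are reliable enough
        code.pop_back();                          // climb to the parent category
    }
    return code;
}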

It may seem that this example is somewhat contrived; after all, how much does mis-

understanding cataclysm affect real-world performance? However this view ignores many

things. Firstly, Zipf shows these rare words make up the bulk of the language, and secondly,

the WSJ is a terrible sample of real-world language usage; for instance, it only has seven

uses of the word cat, and kiss occurs exactly once: Unamused, residents burned Rand McNally

books and wore t-shirts that said: “Kiss my Atlas.” It would be hard to infer useful generalisa-

tions about the usage of kiss based on this example, but replacing it with VB wastes a lot of

usage information compared to replacing it with a simple category word.

¹ While the dictionary has the first definition of cataclysm as concrete, it appears to be used in the abstract more often.


5.2 Criteria for semantic relatedness measures

In this section, I consider what kinds of semantic representation of words we want to pro-

duce. Clearly, we want a representation from which we can read off the degree of semantic

relatedness of any two words in the system’s lexicon. But there are several different schemes

which would allow this, and not all of them are equally relevant for the purposes we have

in mind.

5.2.1 Attention to infrequently occurring words

For the most common words we already have excellent usage statistics, and so there is no

need to back off. Noting this, the area that better word representations can improve is with

less common words. That is, we are interested in good representations for the less-used

portions of the lexicon. However, to say we are trying to improve the representation of rare

words is simplistic, we do not have sufficient counts for accurate statistics on perhaps ninety

percent of the lexicon. Unfortunately, as will be seen throughout this chapter, concentrating

on less frequent words is directly contrary to almost all popular word representations, which

concentrate on producing extremely good representations for common words and ignore

rare words entirely. Since the use of the representation here is to deal with low word counts,

that makes most representations useless. Therefore our main criterion in deciding which

representation to use will be how well it represents rare words.

5.2.2 Multidimensional representations of word semantics

One problem with the simple example presented was that replacing a word by its category

only works for words with one clear category. Cataclysm , and many others, have at least

two categories. The approach also assumes that there is only one dimension of backoff

when for instance it would very likely be useful to back off between singular and plural

independently from semantic type. Another useful dimension, though not as important in

English, would be formality. Because of this, a vector-based (or n-ary) representation seems

potentially much more useful than a simple mapping between words and categories. This is

not a strict requirement, but given the choice of a vector-based representation and a simple

hierarchical representation, we would prefer the vector.

Another consideration is choosing a word representation which is appropriate for the

particular backoff technique we choose to implement in the parser. At the time when the

word representations were being developed, there were a few alternative techniques being

considered: keeping everything simple by discarding information in a fixed order; adding

a new level of backoff between words and tags much like the categories already discussed;

and replacing genprob with a neural network using the vector as input and the existing genprob function to generate training data. I wanted to be able to compare different techniques,

which means the word representation should ideally support all three approaches.

5.3 A survey of approaches for computing semantic similarity be-

tween words

5.3.1 Hand-generated thesauri: WordNet and Roget

One solution is simply to use a manually compiled thesaurus like Roget (Chapman, 1992).

Consider the word formation. If the counts for formation are insufficient then instead of sim-

plifying formation to noun and including many unrelated words, we can simplify it to only

include members of its superordinate category (constitution, setup, build-up, etc.), which are

likely to indicate the correct usage of formation much more accurately.

However, Roget is not especially careful at ensuring all members of a category are inter-

changeable. This is because the thesaurus was written for humans, and so it is reasonable to

assume that readers are not going to make poor substitutions. WordNet (Miller, 1995) pro-

vides a similar approach to Roget in that it is a hand-written thesaurus, but it was designed

for processing on a computer and so is more careful to ensure all members of a category are

interchangeable.

For our purposes, both Roget and WordNet can be immediately rejected because they

have no entries for rare words, which are precisely what the system is intended to handle (or, cu-

riously, for extremely common words like he). What this means is that we have to move

towards methods for learning word clusterings.

5.3.2 Unsupervised methods for thesaurus generation

Since there are no suitable thesauri available, it is necessary to generate one automatically.

The field of unsupervised thesaurus generation is currently a big topic in NLP, probably

larger than statistical parsing, as will be demonstrated by the large number of approaches

examined below. The review in this chapter is not exhaustive, but covers the main alterna-

tive approaches, with a focus on those potentially suitable for our backoff application.

The basic method behind all thesaurus generation techniques is to use statistics about the

words which occur in a window around a given word to derive information about which

words are similar to one another. This is very similar to how the previous chapter used

counts for determining how likely an event is. We are using bigram statistics (counting how

often two words occur together) to estimate the mutual information between two words,

and looking for words with high mutual information. For example the words cat and dog

should be similar because you expect them both to occur close to words like collar, food, pet,


vet, and so on. The techniques in this chapter could be described as representing words

by separating out their mutual information from their dissimilar information. For example

we would first encode the common information between dog and cat, and then encode

whatever is peculiar to cat. While the techniques being discussed do not use mutual infor-

mation directly, it frequently underlies them, and therefore it is useful to explain the concept now.
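For reference, the standard formulation of this quantity estimates the pointwise mutual information of two words from corpus counts (this is background, not a formula used directly by the techniques below):

\[ \mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \approx \log_2 \frac{N \cdot c(w_1, w_2)}{c(w_1)\,c(w_2)} \]

where c(·) are corpus counts and N is the total number of bigram observations; cat and dog come out as similar because they have high mutual information with the same context words (collar, food, pet, and so on).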

Information Theory is a branch of statistics which studies the amount of useful informa-

tion contained within an event. Its relevance here is that a word can be viewed as an event

and so the tools from information theory can be applied to help decide on the best format.

In particular, Information theory includes the concept of a bit of information, where each

bit can be either true or false. All information can be represented by a number of bits, and

so every word can be represented as a string of bits. Intuitively, this is obvious since we are

already representing words as either a sequence of letters or a number, both of which are

just a string of bits.

Imagine for instance an HMM trained on the WSJ sentences. Running with no input, the

model would generate a random sequence of words that almost appear to be from the WSJ.

But by providing just a small amount of information at each decision point we can gently

push the model into generating any sentence from the WSJ. This extra information provided

could be considered to be an extremely efficient encoding of the sentence, and the more the

sentence we want to generate differs from the WSJ, the more information we will need to

provide to the model. Viewed this way, it is easy to see that if the model was most likely to

produce cat next, it should require very little extra information to produce dog instead, but

a significant amount to produce essence. The relevance of this is that many of the techniques

use an extremely similar approach, measuring the amount of surprise at seeing the next

word; this amount of surprise is exactly the minimum number of bits needed to encode the event.
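As a concrete illustration of this equivalence (the probabilities below are made up), the number of bits needed to encode an outcome of probability p is -log2 p:

import math

def surprisal(p):
    # bits needed to encode an event that the model assigns probability p
    return -math.log2(p)

# if the model gives P(cat) = 0.25 and P(essence) = 0.0001 as the next word,
# cat costs surprisal(0.25) = 2.0 bits while essence costs about 13.3 bits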

5.3.3 Finch

Finch (1993) implemented the first successful approach to thesaurus generation. He pre-

sented an algorithm for automatically building dendrograms (i.e. hierarchical cluster dia-

grams) based on the similarity of usage in a corpus. Unlike the hand–generated approach,

Finch's method derives a representation for every word in the input lexicon. The general algorithm is to start with a two-dimensional array of bigram counts, such as the one in Table 5.1. Then each row

in the array is considered for similarity and all rows within a certain (hamming) distance are

combined. Since each row corresponds to a word, combining rows is equivalent to combin-

ing words to form a cluster. This method leads fairly easily and naturally to a dendrogram

(tree) representation, several subtrees of which are given in Figure 5.1. The algorithm is very


simple and is presented in Figure 5.2.

            bought   company    large   yesterday
computer       313      7825     1386         388
new           1174     19430     3386        7929
traded          63       849       66         500
at            1905     28881    10401        6508

Table 5.1: An example of bigram counts

An interesting property also present in Table 5.1 is that some words have high bigram

counts but low mutual information, because both of the words are very common. Take

for example at and company, one of the highest bigram counts in the table. If this value

was missing from the table and we were to infer it from the other values, we would note

that fifty-five percent of bigrams are in the company column and sixty-five percent in the at

row. Based on this we would estimate the ‘missing’ value at 34,700 ±190. The actual value

of twenty-nine thousand is significantly less than this, showing that seeing at reduces the

chance of seeing company. This shows the importance of scaling values, and a very similar

argument can be used to show the importance of using the hamming distance rather than

absolute distance.
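One simple way of making this kind of estimate, sketched below in Python, is the usual independence assumption: the expected count for a cell is its row total times its column total divided by the grand total (the exact estimator and error bound quoted above may have been computed slightly differently).

def expected_count(table, row, col):
    # table: a dict mapping (row_word, column_word) -> bigram count
    # Under independence, E[row, col] = row total * column total / grand total;
    # a large gap between this and the observed count signals high (or low)
    # mutual information between the two words.
    row_total = sum(c for (r, _), c in table.items() if r == row)
    col_total = sum(c for (_, cl), c in table.items() if cl == col)
    grand_total = sum(table.values())
    return row_total * col_total / grand_total

# e.g. compare expected_count(bigrams, "at", "company") with the observed cell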

Unfortunately, Finch's algorithm could not be used to solve our backoff task because it does not scale sufficiently. Ignoring time complexity, Finch's algorithm uses a

thousand-by-thousand matrix for a thousand word lexicon. Since counts in the cells have to

be added, four bytes per cell is an absolute minimum, leading to a four megabyte matrix in

Finch’s original paper. However the WSJ has a lexicon of fifty thousand words, leading to

nine gigabytes of required RAM to store co-occurrence statistics. At the time this approach

was investigated this was an impossibly large amount of memory, and no way of perform-

ing the task in two passes could be conceived. What is more, a dendrogram is not exactly

the form of word representation we want; as already discussed, we would prefer vectors

supporting multiple independent ways of clustering words. So Finch’s approach was not

pursued.

5.3.4 Brown et al.

Another early approach which showed considerable promise was that of Brown et al. (1992).

Their approach was to attempt to predict the next word based on history in exactly the

same way as a Markovian POS tagger works (such as the one described in Section 4.4).

Very briefly, we are predicting the current word wk given all previous words w1 . . . wk−1, and we

are making the standard Markov assumption that this can be approximated by an n-gram.


Figure 5.1: A figure from Finch's thesis showing the internal structure from several parts of the dendrogram [one subtree groups pronouns and contracted forms such as i, they, we, he, she, it's, i'm, anyone, something; another groups prepositions and particles such as to, of, in, on, at, for, with, from, up, down, over]


For every word in the corpus
    For every word near this word
        Increment the count of these words co-occurring
    Endfor
Endfor

For every row (= word) in the table
    For every row x in the table
        For every row y in the table
            If the hamming distance between x and y is small
                Add y to x's combine list
            Endif
        Endfor
        Create a new row z
        For every row in x's combine list
            Add its columns to z
            Delete its row/column from the table
        Endfor
        Add z as a new row/column to the table
        Also save to the dendrogram that z is
            the parent of x and its combine list
    Endfor
Endfor

Figure 5.2: Finch's dendrogram generation algorithm
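For concreteness, the following Python sketch re-renders the algorithm of Figure 5.2 (my own reading of it: the cluster naming, the use of an L1 distance in place of the 'hamming' distance on count vectors, and the stopping condition are all assumptions):

import numpy as np

def finch_style_clustering(counts, words, threshold):
    # counts: 2-D array of bigram counts, one row per word; words: the row labels.
    # Repeatedly merge every group of rows that lie within `threshold` of some row,
    # recording each merge as (parent, children) -- i.e. building a dendrogram.
    clusters = {w: counts[i].astype(float) for i, w in enumerate(words)}
    dendrogram = []
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for x in list(clusters):
            if x not in clusters:
                continue
            near = [y for y in clusters
                    if y != x and np.abs(clusters[x] - clusters[y]).sum() <= threshold]
            if near:
                parent = "+".join([x] + near)
                clusters[parent] = clusters.pop(x) + sum(clusters.pop(y) for y in near)
                dendrogram.append((parent, [x] + near))
                merged = True
                break   # the table has changed, so start the scan again
    return dendrogram

Even this toy version makes the scaling problem visible: both the count matrix and the repeated scans over it grow quadratically with the size of the lexicon.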


Brown et al. used a deleted-interpolation trigram model and a corpus of a third of a billion

words.

This model is only directly useful in generating nonsense text which appears superfi-

cially like English. However, Brown et al. note that similar words will have a similar prob-

ability model, and that if we create some classes then we can assign words to the classes so

that words with a similar probability model get placed in the same class. They start with one class per word, and then merge classes based on their similarity until only one class is left.

Results were good, as shown in Figure 5.3.

[one of the sample clusters: question, charge, statement, draft, case, memo, request, letter, plan]

Figure 5.3: Sample clusters from Brown et al.’s algorithm

Expanding an approach to handle a large vocabulary is always problematic. Figure 5.3

was based on a vocabulary of just one thousand words. To handle larger vocabularies,

Brown et al. first cluster the one thousand most common words, and then assign every other word to the category where it best fits. Again, results were good but this approach

was not used here as it does not provide any measure of difference between words, and the

hierarchical classification only applies to the most frequent words. Also, we would prefer to

derive a vector-based word representation rather than trees, as mentioned in Section 5.2.2.

5.3.5 Smrz and Rychly

Smrz and Rychly produced an approach similar to Finch's in that it forms dendrograms based

on hierarchical clustering. The key difference is that they demonstrated their approach

working on a lexicon of forty thousand words (Smrz and Rychly, 2002). From the perspec-

tive of this thesis, this is an extremely useful improvement – there are thirty-two thousand

lexical entries in the WSJ so there is a strong likelihood that this approach can be used di-

rectly.

From a technical perspective, the algorithm used is presented in Figure 5.4 (copied from

Smrz and Rychly (2002)). Contrasting this with Finch's algorithm, there are only minor differences. The

reason that Smrz and Rychly were able to process so much more data is a more careful ap-


proach to the data representation. Rather than simply implement a two-dimensional array

as Finch did, they used a full corpus processing tool supporting sparse arrays called CQP

(Christ, 1994).

function locateclust(id):
    Path ← ∅
    while clusters[id] not closed:
        Path ← Path ∪ {id}
        id ← clusters[id]
    foreach i ∈ Path:
        clusters[i] ← id
    return id

function hierarchy():
    foreach ⟨rank, id1, id2⟩ ∈ sortbgr:
        c1 ← locateclust(id1)
        c2 ← locateclust(id2)
        if c1 ≠ c2:
            clusters[c2] ← c1
            hierarchy[c1] ← [c1] ∪ {⟨c2, rank⟩}
            hierarchy[c2] ← [c2] ∪ {⟨c1, 0⟩}
    return hierarchy

Figure 5.4: Pseudocode of Smrz and Rychly's clustering algorithm

Since Smrz and Rychly’s results were representations of Czech words, they could not

be used directly for backing off the WSJ. Instead the algorithm had to be run again on an

English corpus. Rychly sent me their code, which was a Python wrapper around the corpus

toolkit CQP. CQP is a large toolkit that is useful for a number of areas in language analysis.

I eventually managed to get CQP working but was unable to get Smrz and Rychly’s code to

interface with it. It seems CQP had changed too much since the code was written. Given that

Smrz and Rychly’s method also only produces a dendrogram, rather than the vector-based

representations we are seeking, I eventually abandoned this method.

5.3.6 Lin

Lin has developed an automatic word clustering algorithm with some very impressive re-

sults (Lin, 1997). The distinguishing feature of Lin’s approach is that syntactic dependency

is used to resolve lexical ambiguity. For example, fence (sword fighting) and fence (selling


stolen goods) are different words. Different words with the same spelling are referred to as either homographs (unrelated meanings) or polysemes (related meanings). Lin's hypothesis

is that an automatically constructed thesaurus should have different entries for each sense,

much like manually written thesauri do.

Studies of human language use have shown that people can distinguish between word

senses using very little context (Choueka and Lusignan, 1985). The local context used by

Lin is defined in terms of the syntactic dependencies between the word and other words in

the same sentence. Lin uses an HPSG-inspired approach similar to Collins’ which looks at

the word’s subject, adjunct(s) and complement(s). For all of these he stores the result as a

triple containing the word, the relationship (sbj, adj, cmp), and the word it has this relation-

ship with. For example, in the sentence The boy chased a brown dog, the context stored about

boy is [boy, sbj, chase], and for dog it is [dog, adj, brown] and [dog, cmp, chase].

Like the word senses themselves, this dependency information is not present in the training corpus and must be automatically derived. Lin implemented a

broad–coverage parser for this purpose.2

With the output from the parser, Lin is finally ready to derive the thesaurus. The ap-

proach is to look at the raw triple list produced by the parser. Remembering that the result-

ing triples are of the form [word, relationship, object] = count, the similarity

between two words is defined as the number of identical triples, divided by the number of

triples that are not present for the other word. This gives the proportion of the time that the

two words are used in the same way. Below is the output from running the algorithm on the

word brief (taken from Lin (1998)):

brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04, . . .

brief (verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04, . . .

brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06, . . .
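A rough sketch of the similarity measure just described (my own formulation, on hypothetical triple sets; Lin's published measure additionally weights each triple by its information content):

def lin_style_similarity(triples_a, triples_b):
    # triples_a, triples_b: sets of (word, relation, other_word) contexts observed
    # for two words; the word slot itself is ignored when comparing contexts.
    ctx_a = {(rel, other) for _, rel, other in triples_a}
    ctx_b = {(rel, other) for _, rel, other in triples_b}
    shared = ctx_a & ctx_b
    unshared = (ctx_a | ctx_b) - shared
    # identical contexts divided by contexts found for only one of the words
    return len(shared) / len(unshared) if unshared else float("inf")

# e.g. lin_style_similarity(triples["brief"], triples["affidavit"]) should come out
# much higher than lin_style_similarity(triples["brief"], triples["banana"])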

In a ‘future–work’ section, Lin describes how his algorithm can be used to form a kind

of lexicalised dendrogram in which every word has its own dendrogram, of which it is the

head. However Lin's approach is still fundamentally a method for finding a word's nearest neighbours, and it is not obvious how to use it for backoff. If the set of words above (affidavit,

petition, memorandum, motion, . . . ) occurred frequently in the results then a token could be

2 Note something interesting: we are using a parser to get better word representations so as to get a better

parser. There is no paradox here; Lin’s parser only has to do a rough job of identifying word senses, rather than

delivering high precision and recall in every respect.


used to represent and count it. However affidavit will have a similar but subtly different

representation, which makes counting concepts impossible. Lin's word similarity metric is impressive; its resolution of homographs and polysemy means it avoids many of the mistakes made by other approaches. Ultimately, though, the need to redesign genprob from scratch just to use this information resulted in this approach being abandoned.

5.3.7 Elman/Miikkulainen/Liddle

Elman

The first application of neural networks to thesaurus generation was the Elman network, a modification of back-propagation to use a context layer (Elman, 1990). A neural network is an algorithm for learning arbitrary function mappings, such as that between a sequence of words and the next word in the sequence. Elman trained one of these networks on the task of predicting the next input word: each word in the sentence is presented to the network sequentially (words are given random representations). After training he was able to show that

the network’s internal representation (its hidden layer) contained semantic information, as is

shown in Figure 5.5. Elman produced this figure by providing different words on the input

and measuring the hamming distance between the different activations produced. These distances were then clustered.

Figure 5.5: Analysis of the weights in Elman's network, showing the linguistic knowledge which had been learned [a dendrogram in which Nouns and Verbs occupy separate branches; the Nouns divide into Animates (Humans such as boy, man, girl, woman and Animals such as dog, cat, mouse, dragon, lion) and Inanimates (food and Breakables such as plate, glass), while the Verbs divide into transitive (like, chase, smash, break) and intransitive (think, exist, sleep) groups]

The demonstration was a very significant result because it

shows that a supervised training algorithm is able to learn the unsupervised task of word

representation. One obvious problem with this approach is that either every new lexical item

requires an extra output node, or an impossibly large hidden layer is needed. Either way,


it limits the maximum lexicon size to the maximum number of learnable outputs (perhaps a

hundred).

While Elman’s network has several good properties, it is not appropriate for use here as

a basis of my word representations. There are two main reasons for this: the knowledge is

not encoded in the representation, and the representation requires a node per lexeme and so

cannot scale.

Miikkulainen

Miikkulainen solved a very similar task to Elman's, using a neural network to extract linguistic information from a sequence of presented words. The difference is that instead of analysing

the hidden layer, Miikkulainen had an extra input layer (the word representation). The idea

is that if the network can learn a better representation, then it can use this to learn an even

better one (Miikkulainen, 1993). Initially the system has no idea what the optimal represen-

tation is, so it gives every word a random representation. Next it is trained in the same way

as any other feed-forward network, feeding forward activation, comparing the output to

(its current representation of) the target, and backpropagating errors. Since one of the lay-

ers is the word mapping, we have implicitly updated our representation. Miikkulainen has

found an extremely elegant way of learning the representation using the same mechanism

as is usually used to predict the output. Miikkulainen was able to get excellent results using

his method, but closer investigation showed his approach had some serious problems. The

largest of these is that Miikkulainen’s approach does not use any form of recurrent network.

This means the system is only able to process sentences using a set of rigid template struc-

tures, such as Det NN VT Det NN (which would fit the sentence the dog ate a steak). This is

a regression from previous approaches such as Elman’s, which predict the next word in the

sentence and so can cope with any sentence structure.

This makes Miikkulainen’s network unusable in this project because we have hundreds

of different sentence templates. It should be noted that Miikkulainen has moved on since

then and done some very interesting work in using a neural-network for parsing rather

than sentence representation. This will be discussed in the future work section of this thesis

(Section 8.2.5).

Liddle

A solution was presented by Liddle who combined Miikkulainen’s extra input layer with

Elman’s network architecture (Liddle, 2002). The combined architecture is given in Figure

5.6.

Liddle's experiment was moderately successful; some output from his program, clustered as a dendrogram, is shown in Figure 5.7.


Figure 5.6: Liddle's network architecture [layers labelled Modified Input, Input, Context, Hidden, and Output, illustrated with the words Cat, Sat, On]


He was able to expand on both Miikkulainen's and Elman's lexicons. Furthermore, he was able to show that basic properties like the noun–verb distinction are quickly learned, but that given sufficient time his algorithm can learn quite subtle distinctions, such as the difference between meat and sandwich,

and the animate – inanimate ambiguity of chicken. While Liddle’s approach produces excel-

lent results and scales better than Miikkulainen's, I was unable to make it scale well enough to be used with the WSJ. A network with only ten nodes cannot hope to represent every word in the WSJ (since 2^10 = 1,024 is much less than 50,000), but networks with more than ten nodes

showed no signs of even beginning to train. I made investigations into seeding Liddle’s

network with the representation I produced, but these have not been successful to date.

5.3.8 Bengio

While Miikkulainen’s approach is a simple proof of concept, Bengio developed a neural

model that was intended to scale (Bengio, Ducharme, Vincent, and Jauvin, 2003; Bengio and Bengio, 2000). Rather than treating Bengio's approach as a thesaurus generation technique,

it is easier to treat it as a part-of-speech tagger in which a thesaurus is generated as a side

effect. With this mindset, consider a classic HMM-based tagger, perhaps using trigrams,

such as the one implemented in Section 4.4 of the previous chapter. When training this

tagger on the sentence the cat is walking in the bedroom we would store triples like: [the,cat,is], [cat,is,walking], [is,walking,in], [walking,in,the], [in,the,bedroom].

As discussed in the previous chapter, even with a hundred million word corpus we

would not have enough events for an accurate probability model — it is a basic corollary

of Zipf’s law. Wouldn’t it be wonderful if instead of just these training examples we store

{the,a} {cat,dog} {is,was} {walking,running} in {the,a} {bedroom,room,bathroom}? A sin-

gle training sentence has suddenly become a hundred. Bengio refers to this as turning the

curse of dimensionality against itself.

So how do we achieve this? Bengio believes the best approach is to replace the Markov

model with a neural-network. This allows the word representation to be distributed and

so allows learning of similar events to occur automatically through their similar distributed

representation. The approach can be summarised as follows (a toy sketch is given after the list):

1. Map each word into a feature vector.

2. Express the joint probability function (that a HMM simulates) in terms of these feature

vectors.

3. Learn simultaneously the feature vectors and the probability function.
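The following toy numpy sketch (purely illustrative; the dimensions and variable names are invented and the real model is considerably more elaborate) shows the shape of these three steps, with the training loop itself omitted:

import numpy as np

rng = np.random.default_rng(0)
V, d, h, context = 5000, 30, 60, 2            # vocabulary, feature, hidden sizes (made up)

C = rng.normal(scale=0.1, size=(V, d))        # step 1: one feature vector per word
W = rng.normal(scale=0.1, size=(h, context * d))
U = rng.normal(scale=0.1, size=(V, h))
b, c = np.zeros(h), np.zeros(V)

def next_word_probs(prev_ids):
    # step 2: the probability function, expressed over the feature vectors
    x = np.concatenate([C[i] for i in prev_ids])   # look up and join the feature vectors
    hidden = np.tanh(W @ x + b)
    scores = U @ hidden + c
    e = np.exp(scores - scores.max())
    return e / e.sum()

# step 3 (not shown): backpropagate -log P(observed next word) through U, W and C,
# so the feature vectors and the probability function are learned simultaneously
p = next_word_probs([12, 7])                  # a distribution over all V next words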

The key difference between this work and that of Elman, Miikkulainen, or Liddle is that

Bengio concentrates on an approach that scales, and instead of predicting the next word he

learns a statistical model.

Figure 5.7: Clusters of Liddle's output [a dendrogram over Liddle's lexicon, including nouns such as man, woman, boy, girl, dog, sheep, chicken, spoon, fork and verbs such as hit, ate, broke, moved, with word distance from 0.0 to 2.0 on the horizontal axis]

Unfortunately, despite developing an approach that is intended

to scale and using a large computer cluster, Bengio’s approach still does not quite scale high

enough to be useful here. Specifically, the largest corpus that has been successfully used is

around a million words (about the size of the WSJ). This is fine for deriving representations

for the common words but not useful for deriving the representation for rare words (which

is our concern here). Bengio mentions that they are attempting to expand to fifteen million

words but they have not yet managed to scale the algorithm sufficiently to achieve this

(Bengio, 2003). Even if they are to achieve this, my experiments implied that a corpus of

five hundred million words was not adequate, and one and a half billion words is about the

minimum. So in a few years I expect Bengio’s approach to outperform the one presented

here, but in the meantime we will need to look for another solution.

5.3.9 Honkela (Self Organising Maps)

Self organising maps (SOMs) seem like the obvious solution to the problem since they use

unsupervised learning. This is the logical approach because there is no obvious training data and no fixed requirement on the output format. Additionally their output is in vector

format, making them ideal for all the backoff methods being considered. However, the

correct design of the SOM is not obvious. The intended output is a vector which means only

one word is represented but all of the training methods are based on showing two words

co-occurring. To explain, a simple method of training a SOM would be that when two words

co-occur, they should be presented simultaneously to the SOM. The algorithm should then

generalise between the bigrams it is trained on, to produce an internal representation of

bigrams that can predict the probability of any two words co-occurring. However, we want

an encoding for words, not for bigrams. Another problem with using a SOM is that the

number of necessary hidden nodes is unknown. Since I couldn’t work out the representation

or how to train the network, this approach was also abandoned.

Of course, I am not the only person to investigate the use of a SOM in language pro-

cessing. For instance Finch experimented with a SOM in Chapter 8 of his thesis (1993),

using K-means to build the categories. Mayberry and Miikkulainen have also done some

work with respect to dependencies (Mayberry III and Miikkulainen, 1999). Probably the

most complete work in this area is by Honkela, whose thesis contains a number of different

applications for SOMs, including word representation (Honkela, 1997b).

The approach taken by Honkela is to provide the SOM with a wide window, rather than

simply the next word, although his earlier work used a much smaller window (Honkela,

Pulkki, and Kohonen, 1995). The wide window approach is remarkably similar to how

word bigrams are computed. Some of the parameters used are discussed in Honkela (1997a).

Results from this approach look very promising; a word map generated using this technique


has been copied from Honkela et al. (1995) and is presented in Figure 5.8, although it is worth

noting that this figure only shows the most common words and it is unknown how well the

algorithm performs on the less common words.

Overall, I decided not to pursue this approach simply because my goal was to use exist-

ing word representation technology more than to research new word representation meth-

ods. The vector representation I was trying to derive differed somewhat from Honkela's in that

I wanted a number of dimensions instead of two. It is obvious how to modify Honkela’s to

produce more dimensions but not obvious if the algorithm would continue to work so well.

Compounding this with the concern that my corpus is over a thousand times larger than the

one Honkela used, and my lexicon five times larger, the approach seemed too risky. Hav-

ing said that, Honkela’s results look better than mine in many ways and it would be a very

useful approach to try.

5.3.10 Joachims (Support Vector Machines)

Support Vector Machines (SVMs) (Vapnik, 1997) are a classification tool that has proven useful when the number of independent parameters is too high for a neural network. Their

input is multidimensional data that either has or doesn’t have some property. With this data

they build a classifier for deciding if new data does or does not have the property. They

work by finding the best hyperplane through the input data, so that all the data that has

a property is on one side and all the data that doesn’t is on the other side. Since few real

problems are neatly linearly separable like this, they first transform the data using a kernel

function into another (typically higher) dimensional space where the data is hopefully lin-

early separable. For any given input data, a number of different kernel functions may need

to be tried. SVMs also tolerate some degree of training error, allowing a limited number of training examples to be incorrectly classified.

SVMs are much better known in the field of information retrieval than thesaurus gener-

ation, but there have been some exploratory attempts at applying them to thesaurus gener-

ation, such as the work of Joachims (2001). In this work, Joachims discusses how an SVM

works and then examines a number of properties common in text processing. Given this, he

discusses the sort of problems in text processing for which an SVM is appropriate, and the

sort of problems for which they would be inappropriate.

Consider a set of training examples Sn = ((x1, y1), . . . , (xn, yn)), where each xi is a vector of feature values and each yi is either true or false. The task that the SVM solves is to select the

hyperplane with maximum Euclidean distance from the closest training example, subject to

the condition that at most ρ training examples are incorrectly classified.

Text classification appears to be an excellent application area for a SVM. It has very

high dimensionality since each word is generally considered a different dimension, and it

is highly redundant.

Figure 5.8: A sample of Honkela's word map [showing the most common words of the corpus, each annotated with its part of speech, e.g. about/prep, again/adv, he/pron, little/adj, two/num]

However, what property is the SVM supposed to predict? There is no

obvious binary classification going on. Joachims avoids this problem by marking by hand

some documents as being related or not related to corporate acquisitions. Given this concept

he is able to show that an SVM is easily able to list the words most useful in deciding if a doc-

ument is related to corporate acquisitions (‘assignment’ implies it is, while ‘college’ implies

it isn’t, and ‘lunchtime’ provides little useful information either way). Essentially he has a

system for producing excellent mappings between words and concepts, but no method of

automatically generating the concepts.
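To illustrate the kind of classifier involved, here is a present-day sketch using scikit-learn (a library that postdates the work being reviewed; the documents and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["firm agrees to acquire rival in all-cash deal",
        "college lunchtime lecture on medieval poetry",
        "board approves merger and acquisition plan",
        "students enjoy the lunchtime concert series"]
labels = [1, 0, 1, 0]                  # 1 = about corporate acquisitions

vectoriser = CountVectorizer()         # every word becomes one dimension
X = vectoriser.fit_transform(docs)
classifier = LinearSVC().fit(X, labels)

# The learned weight for each word indicates how strongly it signals the concept,
# which is essentially the information Joachims reports for his hand-marked data.
weights = dict(zip(vectoriser.get_feature_names_out(), classifier.coef_[0]))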

SVMs are an excellent technique with a lot of power. They can quickly learn quite com-

plex tasks. However they are not appropriate for solving the current task because it is a

‘clustering’ task rather than a ‘classification’ task. It is possible that future work will extend

their ability to cluster concepts.

5.3.11 Schutze

Another automatic thesaurus generation system was developed by Hinrich Schutze (1993).

His algorithm is quite similar to that of Finch, which was discussed in Section 5.3.3. Schutze’s

algorithm is of more interest here because it was designed to scale to a large lexicon.

One problem with modifying approaches to work on a large lexicon is that the less fre-

quent words result in very sparse matrices. Schutze used a clever trick to increase counts:

rather than counting bigrams over words, he tokenised the text into overlapping four-letter sequences (fourgrams) instead of words. For example, Baghdad forms the following fourgrams: Bagh, aghd, ghda, hdad. We can

then represent a word as a set of fourgrams — specifically, the smallest set needed to differ-

entiate this word from all other words. There are around half a million possible fourgrams,

which is far too big for clustering algorithms. However Schutze found only one hundred

thousand different fourgrams occurred in a large corpus and most of these were rare (less

than a thousand) or redundant, where an entry is redundant if the word containing it con-

tains another unique fourgram. After removing rare and redundant fourgrams, we are left

with only five thousand fourgrams, well within the limits of computability.
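A sketch of the fourgram step in Python (my own reading of the description above; the redundancy filter is only indicated in a comment):

from collections import Counter

def fourgrams(word):
    # 'Baghdad' -> ['Bagh', 'aghd', 'ghda', 'hdad']
    return [word[i:i + 4] for i in range(len(word) - 3)]

def frequent_fourgrams(tokens, min_count=1000):
    # count fourgram occurrences over a corpus of tokens and keep only the frequent
    # ones; Schutze additionally discards 'redundant' fourgrams, i.e. those whose
    # words are already identified by another unique fourgram (omitted here)
    counts = Counter(g for w in tokens for g in fourgrams(w))
    return {g for g, count in counts.items() if count >= min_count}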

Next Schutze had to derive a vector representation for fourgrams. As with Finch and

others he started with a collocation matrix (which is almost the same as Finch's bigram matrix, but is renamed here to avoid confusion with Schutze's fourgrams). One potential difference between Schutze's and Finch's matrices is in the columns. In Finch's approach

the columns correspond directly to the rows in the matrix while in Schutze’s approach the

columns, or feature words, may correspond to anything. In practice, they also usually cor-

respond to the rows. A window of two hundred fourgrams was used, significantly larger

than Finch’s. Next Schutze ran the principal component analysis (PCA) algorithm on this

matrix. PCA will be discussed in Section 6.3.3, but for now PCA can be approximated as


sorting the matrix based on the importance of each column. The less important columns are

then discarded, and the matrix can now be read one row at a time, with each row providing

the vector representation for that word. By varying the number of columns kept, the length

of the resulting vector can be adjusted so that for instance a two-dimensional map can be

created by keeping just two dimensions. For instance, Figure 5.9 (taken from Schutze (1992))

shows a map of words related to the target word supercomputing.

Figure 5.9: Two-dimensional version of Schutze's output [a map of words related to supercomputing, including supercomputer, Cray, minicomputer, workstation, IBM, Intel, Macintosh, software, and microprocessors]

Schutze describes two useful extensions of his original word-clustering work. In Schutze

(1995), the availability of cheaper memory and disk space enabled Schutze to work with

larger corpora and so to cluster words directly instead of fourgrams. In Schutze’s later work

(Schutze, 1998), he described a method of automatically deriving word sense information

for ambiguous words.

The most important point about Schutze’s results is that the approach has been demon-

strated to work on a large lexicon. Furthermore, the approach generates word vectors rather

than dendrograms. Because this approach was the only one which meets both of my criteria

from Section 5.2, I decided to base my approach on Schutze’s.

5.4 Summary

We began this chapter with an explanation of why it is important to look at word represen-

tation, and noted that it would be beneficial to represent words in a way which allows the

grouping of events involving ‘similar words’. We were not specific about how this group-

ing should be done, but we noted that a vector representation would enable more methods

of combining words than a simple dendrogram. We also noted that the benefits are in ob-

taining a better representation for rare words since the parser already has sufficient usage


information for common words.

Having decided to look at word representation, we surveyed existing approaches and

found a diverse range of techniques depending on the intended use of the resulting the-

saurus. Some approaches close to the field of information retrieval only represent words as

a special case; others are designed to give excellent representations for very common words,

but very few are designed to give a vector representation to every word in a large lexicon.

As well as being large, the field is dynamic; some approaches were developed years ago and

contain quirks to overcome limitations of slow hardware, while other approaches are clearly

very much in development.

For our purposes, the approach taken by Hinrich Schutze (1995) is the most appropriate.

It provides a vector representation and so we can defer the choice of how to use the word

representation until later. Additionally, it is well established that the approach works; our

aim is simply to apply this existing technique to the backoff task at hand.


Chapter 6

A derivation of word vectors

At the end of the last chapter we decided Schutze’s approach to word representation was the

most appropriate for extension to a larger lexicon, because it delivers a vector representation

of words, and it is well-established research that can be relied on to work in a variety of

situations. This chapter describes the implementation of Schutze’s method, the extensions

that were necessary to build a representation for every word in the WSJ, and it includes an

extensive section where different parameters are varied in order to obtain results that are

suitable for use in a statistical parser.

The general process Schutze followed was to take a corpus of text, find words within

a certain neighbourhood and save this information as a bigram matrix, take this bigram

matrix and reduce its dimensionality using PCA, and finally cluster the reduced matrix as

a dendrogram. In Sections 6.1 and 6.2, I will describe choosing a corpus and preprocessing

it to a suitable format. In Section 6.2.2, I survey some off-the-shelf tools which could be

useful in clustering words. The algorithm I implement for computing word representations

is identical to that described earlier in Section 5.3.11. However, the lexicon I need to compute

word representations for is considerably larger than that computed by Schutze, and this

requires some additional tricks for dealing with large matrices; these will be described in

Section 6.2.3. The process of deciding the best parameters is explained in Section 6.4 and the

final results are presented in Section 6.5.

6.1 Obtaining a training corpus: Tipster and Gutenberg

The WSJ corpus that has been used throughout this thesis cannot be used as a basis for

deriving word representations. The reason is simple: our goal is to replace rare words and

the only way to get an accurate representation for rare words is to have quite a few examples

of their usage in the dendrogram table. The only possible solution is to use a corpus which

is larger than the WSJ. (Recall that we do not need to hand-parse this corpus, since we


will be deriving word representations automatically from it.) So, which corpus should we

choose? The main criterion should be that it is large, and also that it features text in a register

similar to that of the WSJ. This is important — as always, we want the training corpus to

be as representative as possible of the test corpus. However, as mentioned in Section 2.8,

we are also interested in developing word representations which broaden the coverage of

the parser to domains other than the WSJ. With this in mind, the best corpus for developing

our word representations would be one that includes text in the WSJ style, but also contains

texts from other genres.

The Tipster corpus (Harman, 1992) seems ideal for our purposes, since it includes the

WSJ as one of its component corpora while being much bigger. For an indication of size, the

WSJ is approximately six megabytes while Tipster is approximately one gigabyte.

Even after replacing the WSJ with Tipster, it was found the counts were still insufficient

for about half the words in the WSJ, so an even larger corpus was sought. I considered

writing a web-crawler to generate a large corpus but this involved even more stripping

of markup language along with many problems of non-sentence-like text. So instead it

was decided to use Project Gutenberg (Hart, 2005) — a huge collection of public-domain

books. Every English book in the project produced from 1993 to 2004 was concatenated

into a single huge file. There are obvious problems with this corpus; some texts are in Old

English, others are incorrectly identified as English when they are Latin, and at least one

dictionary is included in the corpus, but none of these errors is significant since all we are

looking for is neighbourhood information. In total, Gutenberg is around triple the size of

Tipster, leading to a combined total of about two billion words, or over one thousand times

the size of the WSJ. In the remainder of the thesis, I will refer to the combined Tipster and

Gutenberg corpus as the T/G corpus.

The T/G corpus is designed to be similar to the WSJ, except much larger. For this reason,

all words not present in the WSJ are replaced by UNKNOWN WORD. This process was

also performed because the workstation does not have enough memory to compute bigram

counts for a lexicon much larger than forty thousand words. In order to visually ensure T/G

conforms to the same basic word frequency distribution as the WSJ, Figure 6.1 presents a plot

of the frequency of every word in the WSJ against that in T/G. This plot is approximately

linear at higher word frequencies, so we can conclude that T/G is approximately a larger

WSJ. At lower word frequencies the graph looks less linear, but at these frequencies the high

standard error of measurement in the WSJ frequencies makes the result little more than

noise. 1

1 It may appear from visual inspection that this graph is not going to intersect the origin, as of course it should. This appearance is a side effect of the log-log scale, which over-emphasises the difference between one and zero.


[log-log scatter plot: WSJ frequency (10 to 100,000) on the x-axis against T/G frequency (1 to 1e+09) on the y-axis]

Figure 6.1: Graph of the frequency of every word in the WSJ against that

word’s frequency in T/G

6.2 Preparing the corpus for clustering

6.2.1 Processing the corpus

Stripping markup

Tipster is not a raw text corpus. It is segmented using a markup language and this markup

must be stripped before anything useful can be done with co-occurrence, or else we end up

with computer co-occurring with .tt because it is frequently marked up as teletype text. Unfortu-

nately every section of Tipster is marked up differently so half a dozen different scripts had

to be written to strip markup from each section. Gutenberg was already in text format, so

only needed the preamble deleted.

Tokenisation

Next we find a problem that has not occurred in the WSJ — what is a word? In the WSJ

all words are space delimited, even full stops and the like, making tokenisation implicit.

However T/G includes many hyphenated words, full stops following words, and so on, and writing code to correctly tokenise these turns out to be surprisingly difficult. Another

to 4, the lexicon is hard to manage. The initial approach taken was to replace all numbers


by a generic number symbol. It would also be extremely advantageous to perform a similar

operation for proper nouns but publicly available software to perform this was not found

during a brief search. In the end the tokeniser is only two hundred and fifty lines of Perl, but

the code is quite fragile, with reordering of lines causing a large number of errors. It seems there

is no good way of writing a tokeniser. Of course, since the only use of the tokeniser here is

to improve bigram co-occurrence counts, it does not need to be anywhere near as good as a tokeniser

used in a tagger or similar.
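A drastically simplified sketch of such a tokeniser (in Python rather than the Perl actually used, and far cruder than even the fragile version described above):

import re

def tokenise(line):
    # collapse every number to a generic NUM token, then split off punctuation
    # only at the end of the line so that abbreviations like 'Mr.' survive
    line = re.sub(r"\d[\d,.]*", " NUM ", line)
    line = re.sub(r"([.,;:!?])\s*$", r" \1", line)
    return line.split()

# tokenise("Mr. Smith paid 3,000 dollars.")
#   -> ['Mr.', 'Smith', 'paid', 'NUM', 'dollars', '.']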

After finally getting the tokeniser to produce acceptable results, such as treating Mr. as

one word, it was found that the tokeniser’s idea of what constitutes a word differed signifi-

cantly from the WSJ’s. This is completely unacceptable since the whole point of the exercise

is not to get good representations for the words in T/G, but for the words in the WSJ. It

seems that sometimes the WSJ allows hyphenated words as lexemes and sometimes it does

not, with perhaps semantic or frequency information being the deciding factor, properties

that cannot be evaluated with regular expressions (the simple rules which the tokeniser is

built from). This problem was first tackled by deliberately decreasing the performance of

T/G’s tokeniser in order to considerably increase the similarity to the WSJ, and then the

WSJ was retokenised to match the tokens for T/G. This was performed automatically with

new parts-of-speech being generated when new tokens are created. Obviously rewriting the

WSJ in this way also significantly affects the probability model and so a reverse tokenisation

step has to be performed after parsing, before comparing to the gold standard.

This approach appears to work when we eyeball the results, but quantitative analysis

shows it does not work. Specifically, the retokenised WSJ obtains significantly lower preci-

sion and recall than the nontokenised version (84.5% rather than 86%). Since our evaluation

method is to measure the difference in the performance of the parser, this drop in perfor-

mance makes evaluation of word vectors impossible — it could be that any improvements are

simply counteracting losses due to tokenisation — and so the retokenisation of the WSJ was

abandoned. The best solution left was to use the simpler version of the tokeniser which does

not include number tokenisation and makes a number of mistakes, but will reproduce the

WSJ fairly accurately. I was able to evaluate the recall of this simplified tokeniser by reto-

kenising the WSJ and comparing it to the untokenised version, since any differences would

be an error. It was found that about 98% of the tokens did not change, a promisingly high

result. However, since any errors will be consistent we also ensure there are at least some

occurrences in T/G for every word in the WSJ by adding the WSJ directly to T/G, bypassing

the tokeniser.


6.2.2 Off-the-shelf tools for clustering: a brief survey

Schutze mentions that he did not write the PCA code used by his algorithm, but instead

used the Buckshot algorithm developed separately. So I attempted to find an off-the-shelf

implementation of bigram counting and PCA that would be suitable for such a large prob-

lem.

There are a number of statistical packages available that perform at least some of the

necessary tasks. Among others, the following packages were considered: ‘bow’ is a power-

ful natural language library that is integrated with the programming language rather than a

separate tool (McCallum, 1996), which makes shifting it to a particular task painless. ‘BSP’

is a bigram counting package with support for a number of different tests and extensions

in Perl (Banerjee and Pedersen, 2003). ‘R’ is a very powerful statistical package modelled

after the S language (R Development Core Team, 2004); it is very popular with statisticians.

Schutze suggested the use of Weka (Garner, 1995) or Autoclass (Cheeseman, Kelly, Self,

Stutz, Taylor, and Freeman, 1990), both general purpose clustering toolkits.

Bow

Bow, or its various components, most notably Arrow, form a large C program. It is primarily

intended for information retrieval and contains tools for inverse document frequency (IDF)

rather than simple bigram counting. However Bow is implemented as a powerful library

with associated programs, so the library can be reused without having to modify any code.

Unfortunately I found the library functions were not sufficiently powerful. It contains no

useful statistical tools and its ability to convert a corpus of words to a corpus of integers,

while useful, can be implemented easily without Bow.

BSP

BSP is a bigram tool developed by Banerjee and Pedersen (2003). It is very easy to use but it

is written in Perl and was unable to cope with the large corpus being used.

Weka and Autoclass

Weka (Garner, 1995) and AutoClass (Cheeseman et al., 1990) are two clustering suites, not

specific to language processing. They include a number of high level algorithms. They

have been used successfully by many different projects. Both suites were tested with my

training data but both performed too slowly and were unable to complete training on even

the simpler cases. It may be possible to refactor the training data so that Weka or Autoclass

could classify it, but it seems the interfaces they present are too high level for making this

easy.


R

The statistical toolkit R is very popular throughout statistics and is taught to undergradu-

ates as the standard way of performing statistical analysis on a computer. It has a very wide

array of functions built in, including literally dozens of clustering algorithms. Addition-

ally its use on research projects means it has been designed to go moderately fast and scale

relatively well. Finally, being open–source means portions can be replaced if they are per-

forming too slowly. R had already demonstrated that it was relatively fast on dendrogram

generation; by using R I was able to hierarchically cluster Liddle’s word vectors several

orders of magnitude faster than Liddle’s java implementation could.

Another benefit of R is that it provides direct access to basic algorithms rather than the higher-level programs in Weka, which means they could be reimplemented in C if they could

not scale sufficiently. I thus decided to use R for PCA.

6.2.3 Dealing with large matrices

The WSJ contains approximately fifty thousand distinct words. Even after the retokenisa-

tion just mentioned, there are still thirty-two thousand distinct words. As has already been

mentioned, PCA strongly prefers square matrices. However a thirty-two thousand by thirty-

two thousand matrix is unrealistic on the computer hardware available — it would require

four gigabytes of RAM per matrix, for a total of at least twelve gigabytes. This problem is

compounded by various remaining inefficiencies in R.

A simple solution would be to split the bigram matrix into manageable chunks. If the

thirty-two thousand words are split into four thousand word chunks then we could count

their co-occurrence with the four thousand most common words to obtain manageable

square matrices. Unfortunately this approach does not work: every run of PCA transforms

its input into a reduced space ideal for that data — so there would be no correlation between the vectors produced for different chunks.

This problem can be overcome by noting that the output of PCA is not the transformed

vectors, but a rotation matrix which when multiplied by the input matrix gives the transfor-

mation vectors. This is significant because this rotation matrix can be multiplied not only

by the input matrix, but by any other input matrix, transforming it into the optimal space

for the first data set. By multiplying the other twenty-eight thousand input matrices by this

transformation matrix we keep all data in the same space. Of course, this is not as good

as using all the words in the first place — not least because the transformation matrix im-

plicitly ends up optimised for placing frequent words well — but the method given works

and given a more powerful computer it would be worthwhile running this section of the

program again.

Another problem was noted when eyeballing the output data. It seemed that very large


counts were not being scaled correctly: similar but less frequent words were being clustered

apart. So the scaling normally built into PCA was removed and all scaling is done by a

separate Perl program. This also had the advantage that many parameters (logarithm, shift-

ing probability mass, RMS, centring on zero, and column normalisation) could be adjusted

easily.

6.3 An implementation of Schutze’s algorithm for word clustering

We have already presented an informal overview of how Schutze clusters words (or four-

grams), in Section 5.3.11. In this section, we will describe the algorithm in more detail. There

are three steps: firstly processing the corpus to build a table of bigram counts; secondly scal-

ing the counts in this table to normalise them; and thirdly running the PCA algorithm on

the normalised table.

6.3.1 Building a table of bigram counts

The first step is to take a corpus of text and transform it into a long sequence of words. Next

each word is mapped into a number, which will serve as the reference for this word into the

arrays. In the previous chapter the mapping between words and numbers was somewhat

arbitrary, and I used the order that the words happened to be seen by the preprocessor2.

This method proved to be a poor choice here because there were far too many words to

enumerate them all. Instead the words were sorted by frequency and then enumerated by frequency rank. This makes it much easier to vary the cutoff points.

Having obtained a sequence of numbers, we compute bigram counts by looking, for

each word, a certain number of words to the left or the right and incrementing the count of

this co-occurrence. Pseudocode is given in Figure 6.2.

for WordPos = 0 to CorpusLength
    for WindowPos = max(0, WordPos - WindowSize) to WordPos - 1
        count[corpus[WordPos], corpus[WindowPos]]++;
    endfor
endfor

Figure 6.2: Pseudocode to count all co-occurrences in the corpus

This pseudocode skips a number of implementation details, such as it being impossible to store the corpus in memory due to its size (or even to store it using mmap!). Because of this, the corpus is loaded from the file into a circular array. None of these details are particularly surprising, and would just complicate the figure if included here.

2 Incidentally, Collins used the same method, although his different preprocessor design meant words were seen at different times and so his mapping differs significantly.
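A compact Python version of Figure 6.2, reading the corpus as a stream so that it never needs to be held in memory (illustrative only; variable names and the file name are invented):

from collections import defaultdict, deque

def count_cooccurrences(token_stream, window_size=2):
    counts = defaultdict(int)             # (word, neighbour) -> co-occurrence count
    history = deque(maxlen=window_size)   # only the previous window_size words
    for word in token_stream:
        for neighbour in history:         # look window_size words to the left
            counts[word, neighbour] += 1
        history.append(word)
    return counts

# counts = count_cooccurrences(w for line in open("tg.txt") for w in line.split())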

6.3.2 Normalising the bigram table

The PCA algorithm described next assumes that every row in the matrix is a unit vector (that is, it has a mean of zero and a length of one). There are a number of methods by which we could normalise the bigram table in this way. The simplest would be to treat every row separately: find its mean, subtract this from every cell to give a mean of zero, and then divide every cell by the mean to give a radius of one. However, this method was

found not to work especially well because the co-occurrence counts for very frequent words

dominated the results. Because of this, several alternative methods were examined. These

will be discussed in Section 6.4.3 along with their effects on the results. However, they are

briefly summarised here, with an illustrative sketch following the list:

1. Add one to every single count, as an estimation of held-off probability mass.

2. Compute the logarithm of every count. This is meant to counteract Zipf’s law so that

instead of counts increasing exponentially, they will increase linearly.

3. Centre the data on zero by subtracting the mean of each row from every cell. This is

required for PCA, but may not be required for other clustering algorithms.

4. Whether row normalisation should divide by the RMS or simply by the mean. It is probably desirable to always use the RMS, but the first implementation of this code contained an error here, and this parameter allows the error to be reproduced so that previously published results can be recreated.

5. Control whether or not to normalise columns. Column normalisation is dividing every

column by the RMS of that column. It was implemented after it was noted that the

features which occur extremely frequently or extremely infrequently were controlling

the final output too much (because PCA attempts to reproduce every result, and these

results differ by more). After normalising columns, every feature can be expected to have equal weighting.
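The sketch below shows how these options could be applied to a matrix of counts (a Python paraphrase of what the separate Perl scaling program does, not its actual code):

import numpy as np

def scale_bigram_table(counts, add_one=True, take_log=True, centre=True,
                       use_rms=True, normalise_columns=False):
    m = counts.astype(float)
    if add_one:                              # option 1: a crude held-off probability mass
        m += 1.0
    if take_log:                             # option 2: counteract Zipf's law
        m = np.log(m)
    if centre:                               # option 3: give every row a mean of zero
        m -= m.mean(axis=1, keepdims=True)
    norm = (np.sqrt((m ** 2).mean(axis=1, keepdims=True)) if use_rms
            else np.abs(m).mean(axis=1, keepdims=True))          # option 4: RMS or mean
    m /= np.where(norm == 0, 1.0, norm)
    if normalise_columns:                    # option 5: give every feature equal weight
        col = np.sqrt((m ** 2).mean(axis=0, keepdims=True))
        m /= np.where(col == 0, 1.0, col)
    return m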

6.3.3 The PCA algorithm

The idea of the PCA algorithm is to transform an n by n matrix into another space, so that it is still n by n but the first dimension of the transformed space is the best discriminator between the rows (words), the second dimension is the second best, and so on. For instance, the first dimension may have positive values for noun-type words and negative values for verb-type words.


More formally, the algorithm first finds the vector through the n dimensional data with

the most variance, which is called rot1. Because this vector has the most variance with

respect to the input matrix, it encompasses a lot of the information that was present in the

whole matrix, but of course it cannot encompass it all. Perhaps the best way of visualising

this is to view the matrix as a large number of points in n dimensional space, and rot1 as the

hyperplane that best splits those points into two. If n was just three, then the points could

be viewed as points in a cube, and rot1 is then a simple plane through this cube. Next the

algorithm finds the vector perpendicular to rot1 that encompasses the most variance, calling

it rot2. This process is repeated until rotn. All rotational vectors are unit vectors, it is only

their direction that is of interest.

To determine which angle encompasses the most variance we compute a covariance

matrix. Next we compute the eigen decomposition of this matrix to produce a matrix of

eigenvectors and a diagonal matrix D composed of a list of eigenvalues3. In mathematical

notation, our square matrix A is decomposed into eigenvalues Λ1 . . . Λn and eigenvectors R such that

AR = RD

It is perhaps best to view D as simply a scaling factor representing the relative impor-

tance of each row in the eigenvector table. That is, the highest eigenvalue corresponds to

the eigenvector with the highest covariance, or the principal component in the matrix. Similarly, the second highest eigenvalue corresponds to the second most important component, and so on.

There are very many textbook explanations of principal component analysis available. One

that is specifically written for computer scientists is Smith (2002).

Returning to clustering words, we can multiply our bigram counts by R to transform

the counts in such a way that the first row of the output matrix corresponds to the best dis-

criminator for differentiating words. Reading down the column of this output matrix then

gives us an excellent vector representation for the word, where we can cut this vector at any

point up to its full length n and still get a good approximation of how the word differs from

other words. That is, words that have very similar usage will have very similar values for

their first components. I refer to the output of this transformation as the word’s position in

word space; two words that are nearby in word space will have similar (normalised) bigram

counts.

Another important result is that the matrix R can be used independently of the input

bigrams. This means that if we compute bigram counts for any new words, we can multiply

them by R to compute the new word’s position in word space. This position will not be

exact, in that if this word had been present in the original matrix A, its variations would have resulted in a very slightly different rotation matrix R′ being computed, but the position

should be extremely close. The relevance here is that it is impossible, given the current level

of computer power to compute R for a matrix A the size of the whole input lexicon. This is

because a single matrix A the size of the whole lexicon would have around 50,000² entries, or roughly two and a half billion. At eight bytes per entry this would take roughly twenty gigabytes of RAM. We would also need to compute R in RAM for another twenty gigabytes of memory, and it would be extremely hard to complete the process without storing the output matrix for a further twenty gigabytes. No machine available in my department has anything close to this much memory and even if such a machine were available, these values are optimal

cases and unlikely to be realised inside an interpreted language like R.

However, we can temporarily forget about A and instead work with a sample of A,

which I will refer to as A′. Performing PCA on A′ gives a rotation matrix R′. Multiplying A

by R′ will give AR′ which, because of the mathematical property outlined above, is a very

good approximation of the uncomputable AR.
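The following numpy sketch illustrates the idea under stated assumptions: the matrix sizes are placeholders, the thesis computed the decomposition in R, and the right singular vectors of the sample are used here to play the role of the rotation matrix R′.

    import numpy as np

    rng = np.random.default_rng(1)
    full_counts = rng.poisson(2.0, size=(12000, 400)).astype(float)  # all words x features (illustrative sizes)
    sample = full_counts[:4000]               # A': rows for a manageable sample of words

    # SVD of the sample; the rows of Vt are the right singular vectors, so
    # Vt.T plays the role of the rotation matrix R' described above.
    _, _, Vt = np.linalg.svd(sample, full_matrices=False)
    R_prime = Vt.T

    # Multiplying the full count matrix by R' projects every word, including
    # words the SVD never saw, into (approximately) the same word space.
    word_space = full_counts @ R_prime
    word_vectors = word_space[:, :50]         # truncate to the strongest 50 components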

Within the R environment there are a number of different ways of computing PCA. These

differ mainly in the iterative process used to compute the eigenvectors, but also in the nor-

malisation process applied to the input matrix. The simplest method available is singular

value decomposition (SVD). This does not perform any normalisation and so makes it eas-

ier for me to perform normalisation before passing the matrix to R.

Readers may have noticed a number of the algorithm’s parameters implicitly included

above. For instance, the window in the pseudocode looks to the left but looking to the right

gives subtly different results. Other parameters that can be varied are the window size, the

cutoff at which words are considered so rare they are best treated as unknown, the size of

the matrices used in PCA, the method for converting the bigram counts to unit vectors, and

several more. These, along with their effects on the results, will be discussed next.

6.4 Tuning the clustering process

We are now in a position to evaluate the word representations generated by the clustering

algorithm, and to decide on the best values for the various parameters which are defined. In

this section I will present results for various different parameter values, and discuss which

values are likely to be best.

6.4.1 Evaluation methodology

Evaluating a set of word representations is difficult on its own. Most self-contained studies

on word clustering use the measure of perplexity: effective clustering solutions reduce the

perplexity of a language model (see for example Goodman (2001)). However, since our word


vectors are intended to improve the performance of a parser, it is more appropriate in our

case to evaluate clustering solutions indirectly, by observing their effects on the precision

and recall of the parser. A more formal evaluation of word representations in these terms

is thus deferred until Section 7.2.6. In the present chapter, we will nonetheless provide an

informal evaluation of the results of different parameter combinations.

Our informal method is to look at the results generated: do words with similar meanings

receive similar representations? Since eyeballing the raw feature vectors generated is essen-

tially impossible, we pass these vectors to a hierarchical clustering algorithm, and generate

a dendrogram containing fifty randomly-chosen words to express the results. Word dendro-

grams are relatively easy to eyeball to get a rough impression of the quality of word vectors,

pending the more formal analysis in Chapter 7. For all evaluations described in this chap-

ter (unless otherwise stated), the input to the clustering algorithm is bigram counts based

on the T/G corpus using a window of two words to the left plus the current word and the

output is a dendrogram created for fifty randomly chosen words. We look two words to the

left because the main use of words is in right dependency events and so we need our word

representation to look from the perspective of what headwords the dependent words will

see to the left. The fifty randomly chosen words are the same for each dendrogram which

has the advantage that it is easier to compare outputs, but the disadvantage that it is easy

to tune the algorithms based on small samples. (More extensive testing was also performed

using larger samples, but these are too large to present in the thesis.)
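A sketch of this evaluation step is given below using scipy; the thesis produced its dendrograms with R's hclust, and the vocabulary and word vectors here are hypothetical placeholders.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(2)
    vocab = ["word%d" % i for i in range(4000)]          # hypothetical vocabulary
    word_vectors = rng.normal(size=(4000, 50))           # hypothetical word vectors

    # Use the same fifty random words every time, so successive dendrograms are comparable.
    idx = rng.choice(len(vocab), size=50, replace=False)

    # Complete-linkage clustering on Euclidean distances, matching the
    # "hclust (*, 'complete')" shown in the figures.
    Z = linkage(word_vectors[idx], method="complete", metric="euclidean")
    dendrogram(Z, labels=[vocab[i] for i in idx], orientation="left")
    plt.tight_layout()
    plt.savefig("word_dendrogram.png")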

6.4.2 Dimensions of the bigram matrix

The input to the PCA algorithm is a two-dimensional bigram matrix. The rows in this matrix

correspond to the words, while the columns correspond to features of words. Since the

features are themselves words and we are counting co-occurrence statistics, it would be reasonable to assume the matrix will be symmetric — the number of times a co-occurs with b should be the number of times b co-occurs with a.

However, there are a few complicating factors. Firstly, the PCA algorithm only works

on square matrices. For non-square matrices, it is conventional to convert the matrix to

square by wrapping data around. I found this transformation always led to extremely poor

results and very quickly discarded the use of non-square matrices. Secondly, the amount

of memory on the most powerful computer I had available limited the number of matrix

elements to approximately twenty thousand. Had I hand-coded the PCA algorithm instead

of using R, it is likely this limit could be increased significantly, to perhaps one-hundred

thousand.

Because of the limit to twenty thousand cells, a 4000-by-4000 matrix was used. Smaller

matrices are possible, and can be computed significantly faster. However, smaller matrices


lead to inferior results and so presumably larger matrices would lead to superior results,

were hardware available that could process such matrices.

6.4.3 Normalising bigram vectors

Before PCA is run on the bigram vectors, it is important to normalise them. This normalisa-

tion can be achieved in several different ways. In this section, I will consider several different

approaches in succession, culminating in the one I eventually use. The intermediate (and ul-

timately discarded) approaches are presented in more detail than in other parts of the thesis

for several reasons. Firstly, we are moving into unfamiliar territory and so the approaches

that do not work are of almost as much interest as the approaches that do. Secondly, in

order to perform useful qualitative analysis of the final dendrograms, it is illustrative to see

the ways in which earlier iterations of the approach produced inferior output.

Normalising word counts

The most natural way of normalising the bigram counts involves two stages. First, we gen-

erate centred vectors by subtracting the mean of every row from each cell. Second, we generate unit vectors from the centred counts by dividing each row by its root mean square (RMS). This process is illustrated in Table 6.1, which shows a matrix of bigrams generated

for a window of two words to the left, plus the current word. (In this table, rows are words

and columns are features.)
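The two stages can be sketched in numpy as follows; this is a toy reconstruction of the computation behind Table 6.1, not the original R code, and the row scale used is the √Σ(x²) column shown in that table.

    import numpy as np

    # Raw bigram rows for four words over four feature words (values from Table 6.1).
    raw = np.array([[313.0,  7825.0,  1386.0,  388.0],    # computer
                    [1174.0, 19430.0, 3386.0, 7930.0],    # new
                    [63.0,    849.0,    68.0,  500.0],    # traded
                    [1905.0, 28881.0, 10402.0, 6508.0]])  # at

    centred = raw - raw.mean(axis=1, keepdims=True)             # subtract each row's mean
    scale = np.sqrt((centred ** 2).sum(axis=1, keepdims=True))  # the column headed sqrt(sum x^2) in Table 6.1
    unit = centred / scale                                      # row-normalised vectors

    print(np.round(unit, 2))   # first row: roughly [-0.35, 0.86, -0.18, -0.34]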

In this section we are more concerned with the technique for deriving the values than

their meaning, but it is useful to recall their meaning because a good technique will make the

correct meaning more pronounced. A positive number shows a positive correlation between

two words, so when we see company, it is likely that we have seen computer somewhere

within the previous three words. A negative number shows a negative correlation, so when

we see bought it is unlikely that we have seen at within the previous three words. If the cell is

zero then the presence of one word does not affect the probability of seeing the other word,

so after seeing yesterday we should not be surprised by the presence or the absence of new

within the previous three words.

A dendrogram generated from the unit vectors in the final section of Table 6.1 is shown

in Figure 6.3.

This dendrogram has a number of positive features; it is starting to develop a sensible

hierarchical shape, connectives and numbers are both detected and are kept away from other

more common categories, and plurals are separated out quite well.

Figure 6.3: Word dendrogram with RMS scaling (cluster dendrogram of the fifty randomly chosen words; hclust, "complete" linkage, with cluster height on the axis).

Vectors of raw bigrams    bought   company    large   yesterday    mean
computer                     313      7825     1386         388    2478
new                         1174     19430     3386        7930    7980
traded                        63       849       68         500     370
at                          1905     28881    10402        6508   11924

Centred vectors           bought   company    large   yesterday   √Σ(x²)
computer                   -2165      5347    -1092       -2090    6232
new                        -6806     11450    -4594         -50   14,090
traded                      -307       479     -302         130      657
at                        -10019     16957    -1522       -5416   20,483

Unit vectors              bought   company    large   yesterday
computer                   -0.35      0.86    -0.18       -0.34
new                        -0.48      0.81    -0.33       -0.00
traded                     -0.46      0.72    -0.46        0.19
at                         -0.49      0.83    -0.07        0.26

Table 6.1: Bigram counts in the process of being normalised

Normalising feature counts

Looking again at Table 6.1, it should be noted that using RMS to normalise word counts

results in the relative weight being strongly decided by the feature with the highest counts.

For example, company occurs much more than any other feature in the table so, while the

positive correlation between company and computer is undeniably correct, it is perhaps un-

desirable for this correlation to dominate the row simply because company is a very common

word in the corpus. This property is undesirable because it means that words with low

counts are treated as virtually irrelevant and will all cluster together. What is needed is to normalise each feature column by that feature's overall frequency before normalising the word co-occurrence counts.

Two different methods were examined for normalising feature frequencies. The first

method was to use RMS on the features, in the same way as was just demonstrated on words.

The second method was to take the natural logarithm of each count before normalising the

words. Neither method can be especially well justified using theory: using RMS on the

features means we are simply looking at the relative surprise at seeing the word rather

than the mutual information, and computing the logarithm is justified by noting that word frequencies are highly skewed (Zipf's law), and so by taking the logarithm we can flatten this skew considerably, significantly reducing any bias caused by high-count

features. We also experimented with combining these techniques, so that columns were


normalised and then the logarithm was taken.

A dendrogram showing log scaling is shown in Figure 6.4.

Figure 6.4: Word dendrogram with log applied to all counts before processing (cluster dendrogram of the same fifty words; hclust, "complete" linkage).

This figure is much more promising. It is relatively easy to break the dendrogram into nouns (restrictions through to

models), verbs (through to plunge), and adjectives (through to severe), with proper names, prepositions and numbers all being nicely separated out.

Because the log scaling removes the emphasis from frequently occurring features, an-

other test was to apply both log and RMS normalisation to the columns. However, it was found that this gave no significant advantages and so RMS scaling of features was aban-

doned.

6.4.4 Choice of feature words

How should we pick our feature words? Should the four thousand most common words

be used as parameters, or a randomly chosen four thousand? I think the better answer

is the first four thousand. The criticism of this answer is that it is non-representative, but

while the first, say, five hundred words are very different from normal English in that they contain few nouns, the first four thousand include quite a few words of every type. More importantly, the advantage of using frequent words as features is that they occur more often

with infrequent words and so the relative counts are accurate. A dendrogram generated

using random feature words is slightly inferior to Figure 6.4, and so we will continue to use

the most common words.

6.4.5 Window size

Another question is how big the window should be before two words are no longer consid-

ered neighbours. With a small window, the algorithm generates good representations of a

word’s syntactic characteristics, while a larger window brings in semantically related words

and increases overall counts which is useful for rare words. The small window has already

been shown (Figure 6.4). A dendrogram with a window of fifty words was generated but,

because semantic relationships are harder to verify than syntactic ones, cannot be usefully

presented here. The dendrogram did show potential, with much better semantic relation-

ships in the clusters, but at the expense of syntactic relationships. For instance, promised,

promises are a category here where they would not be in the previous dendrograms. Other

categories (for example illegal, lawyers) also show good semantic relationships with limited

syntactic relationships. Overall, the significant loss of syntactic information means that it

cannot be used to aid the parser and so we must return to our very small window.

A similar question is in which direction the window should look. During experimentation I was unable to find either direction to be measurably better than the other, and chose to look backwards because this fits better with the right-branching structure of English.

6.4.6 Iterated clustering

The dendrograms that have been presented so far look quite good. However, if we examine

a greater range of the input, it is apparent that there is no coherent high level structure. In

particular, several small clusters that really should be next to each other are a long way apart.

For example, after the first large category of about two dozen words, some of the numbers


are represented. However, in the middle of the first category we find a mini cluster of larger

numbers. Either of these clusters looks good, but they really should be joined immediately

instead of joining first with words like Warsaw.

It would be very useful if we could assign meaningful semantic or syntactic labels to

nonterminal nodes high up in a dendrogram. The lack of high level structure is not espe-

cially important if a later system using the word representations can be trained to use the

low level data correctly, because there are enough counts in the local structure. However

the lack of high level structure is likely to make training a future system very hard, if not

impossible.

One new approach I attempted, to try and impose high-level structure on the data, was

iterative clustering. The approach used in iterating the training is similar to that used by

Miikkulainen, as discussed in Section 5.3.7. After generating the vectors in the manner

already discussed, the whole process of bigram counting is repeated. However this time

whenever a word is found to be within the window of a feature, not only is the bigram

between this feature and this word incremented, but also the bigrams between this feature

and every word that is ‘similar to’ this word. ‘Similar to’ is defined in terms of the Hamming distance between the current vector representations of the words.
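A sketch of one pass of this iterated counting is given below. The names are illustrative, the similarity test shown is a simple Euclidean threshold standing in for the Hamming-style comparison used in the thesis, and the window is taken to the left as in the main counting scheme.

    import numpy as np
    from collections import defaultdict

    def similar_words(word, word_vectors, vocab_index, threshold=0.05):
        """Words whose current vector lies within `threshold` of `word`'s vector
        (a stand-in for the Hamming-style test used in the thesis)."""
        v = word_vectors[vocab_index[word]]
        dists = np.linalg.norm(word_vectors - v, axis=1)
        return [w for w, i in vocab_index.items() if dists[i] <= threshold]

    def iterate_counts(corpus, features, word_vectors, vocab_index, window=2):
        """One pass of iterated counting: whenever a word appears within the
        window of a feature, credit the bigram for that word and for every
        word currently similar to it."""
        counts = defaultdict(float)
        feature_set = set(features)
        for sentence in corpus:
            for pos, word in enumerate(sentence):
                if word not in vocab_index:
                    continue
                for feature in sentence[max(0, pos - window):pos]:   # window to the left
                    if feature not in feature_set:
                        continue
                    for w in similar_words(word, word_vectors, vocab_index):
                        counts[(w, feature)] += 1.0
        return counts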

To give a better idea of the problems with local structure, a dendrogram produced from

iterating SVD four times is presented in Figure 6.5. Contrasting this with Figure 6.4 shows

some improvements as well as some regressions. The plural category is now complete, but

the -ing category has been split in two. Art, George have incorrectly slipped into the number

category, but illegal probably makes more sense with failing than it did with severe. Overall,

it is hard to make a definitive statement but perhaps the iterated result is slightly inferior.

6.4.7 Integrating POS tag representations

The iterative clustering has improved the global structure somewhat, but it has led to prob-

lems with the local structure. What is really needed is to impose on the dendrogram an

explicit hierarchy such as the part-of-speech of the words. In the next chapter (Section 7.6)

we also need a hierarchy of POS tags and so here we have ‘reused’ the hierarchy created in

the next chapter. To integrate the hierarchy with the existing bigram counts, the new data

is added as additional features associated with each word. A total of fifteen features from

tags was used, which in the next chapter will be seen to be too few to completely capture all

tag information. However, every tag feature we include here will be one fewer word feature

and experimentation showed that using too many tag features resulted in the dendrograms

over-emphasising POS information.

An alternative method for integrating the counts would be to tag the word corpus and

then use this tagged corpus for training since the correct tag would then precede the current

word in the bigram counts. The latter approach would almost certainly lead to better results since it would elegantly solve the problem of homographs causing ambiguity, but it was not undertaken because tagging a multi-gigabyte corpus takes too long with my tagger.

Figure 6.5: Dendrogram from iterating SVD four times (cluster dendrogram of the same fifty words; hclust, "complete" linkage).

A dendrogram with POS tags is shown in Figure 6.6.

Figure 6.6: Dendrogram where POS tags are used as extra features (cluster dendrogram of the same fifty words; hclust, "complete" linkage).

This dendrogram shows considerable promise, with the inclusion of tags strongly encouraging words with the same POS to

cluster. Not only that, but where we previously had two good sub-clusters that did not join,

such as two sets of proper nouns, we now have one larger cluster. Essentially, all errors in the previous figures have now been corrected, and the global structure is now relatively good. There

are still two faults in this figure: the global structure is still suboptimal, and polysemous words are categorised poorly.

Row normalisation        RMS
Feature normalisation    Log
Features used            First four thousand words
Number of iterations     One
Integrated POS tags      Yes
Window size              Twenty words
Window direction         Left only

Table 6.2: Parameters chosen for the generation of word vectors

6.4.8 Windows revisited

Previously we decided that smaller windows work better, and hypothesised this was be-

cause large windows overemphasise semantic information at the expense of syntactic infor-

mation. However in the last section we produced syntactic information through separate

features and so it is appropriate to revisit the question of window size. If the window size

can be enlarged then this should have two major benefits: it should increase our robust-

ness with low frequency words due to increasing their counts, and it should increase the

amount of semantic information. Preliminary investigations showed that the window size

can indeed be enlarged if POS information is included.

I experimented with many combinations of window size and window direction. In each

case, I examined the resulting dendrogram, and also calculated the nearest neighbours in

word space for a random selection of words. Interestingly, these two measures did not al-

ways coincide. A dendrogram produced for fifty words is obliged to create relations between

words, even if these words are not close in word space. However, since we are ultimately

interested in neighbours rather than dendrograms, the neighbours measure was preferred.

The best window scheme I found was twenty words to the left. A dendrogram for this

scheme is presented in Figure 6.7.

6.5 Results

The previous dendrograms were intended to give the reader an idea of why particular pa-

rameter values were chosen. In summary, we decided to adopt the values given in Table

6.2. A dendrogram produced using these parameters has already been presented in Figure

6.7. In the remainder of this chapter, we will look at the quality of the output generated by

this combination of parameters in some more detail. Rather than looking at dendrograms at this point, we will move to looking at nearest neighbours, because when we are backing off from individual words, we will effectively be looking at neighbours in word space.

Figure 6.7: Dendrogram using the final parameters (a window of twenty words and tag information); cluster dendrogram of the same fifty words (hclust, "complete" linkage).

6.5.1 Results for the first four thousand words

An alternative method of evaluating word vectors is to print their nearest neighbours in

Euclidean space. So for instance, the closest neighbour to ship is, naturally, ship itself, but the next few closest neighbours are train, road, foot, hour, spot, sea and boat. Most of these seem

quite good as alternative forms of transportation, although hour and spot are peculiar. For

comparison, a manually compiled inverse dictionary gives send, address, consign, dispatch,

forward, remit, route, transmit which are more like synonyms of send. The key difference

is the inverse dictionary concentrates strongly on synonyms, while the bigram approach

seems to identify words which occur in texts on the same topic. The nearest neighbours for

a small selection of the four thousand most common words are presented in Table 6.3. (Note:

we are selecting neighbours from the whole set of words, not just the first 4000.) There are

some interesting properties in this table. Firstly it is unsurprising that numbers are clustered

together, but it is nice to see that large numbers are clustered away from twenty-something

numbers, which are also clustered away from numbers including decimal points. It is also

nice to see that the months are clustered seasonally. It is also useful to note clear errors in the

table, such as set having a nearest neighbour of called since this might have been put down

to coincidence in the dendrogram. It is also useful to note that the further away neighbours

tend to be less related, so that while heads is a perfect match for heads, it is a poor match for

leaves or turns.
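A sketch of this nearest-neighbour listing is given below; the vocabulary and word vectors are toy stand-ins for the vectors produced earlier in this chapter.

    import numpy as np

    def nearest_neighbours(word, vocab, word_vectors, n=8):
        """The n words closest to `word` in Euclidean word space; the word
        itself comes out first, at distance zero."""
        i = vocab.index(word)
        dists = np.linalg.norm(word_vectors - word_vectors[i], axis=1)
        return [(vocab[j], float(dists[j])) for j in np.argsort(dists)[:n]]

    rng = np.random.default_rng(3)
    vocab = ["ship", "train", "road", "boat", "company"]   # toy vocabulary
    word_vectors = rng.normal(size=(len(vocab), 10))
    print(nearest_neighbours("ship", vocab, word_vectors, n=3))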

6.5.2 Results for the second four thousand words

Up to now, we have been producing dendrograms that were generated directly from the

output of SVD. We noted back in Section 6.3.3 that it should be possible to perform SVD

on a sample of the bigram matrix, store the rotation matrix, and then apply this rotation

matrix to the entire bigram matrix, effectively achieving an approximation of applying SVD

to the entire bigram matrix. However, that approximation should be near perfect for the first

four thousand words and significantly worse for later words. Table 6.4 presents the nearest

neighbours for these words. Since this looks very good, we can reasonably conclude that

applying the rotation matrix is working.


Moreover Nevertheless Hence Therefore However

popular powerful successful personal free

mortgages borrowers premiums dividends certificates

NASA SPAN AT&T Telecom BellSouth

29 27 26 23 28

By Of From With On

mill packing wash lighting diamond

7.5 8.5 9.5 10.5 0.2

heads faces stands leaves turns

formal immediate strict permanent temporary

set called made given changed

employee accounting filing insurance coverage

deficit inflation unemployment economist budget

associate responsible frank formal junior

December January November February September

facilities sites standards areas centers

100,000 30,000 50,000 20,000 200,000

well only now rather even

collapse crash crisis speculation uncertainty

continues remains suggests represents believes

Table 6.3: A sample of nearest-neighbour words from the first four thou-

sand words


editorial editor magazine bureau publication

prosecutors defendants indictments allegations attorneys

stocks traders investors declines Stocks

benefits employers standards improvements costs

returns numbers accounts years Others

different single particular simple useful

rose fell climbed jumped dropped

minority majority membership voting initiative

pact agreement timetable impasse moratorium

know tell let come go

lose keep happen break suffer

nine six seven eight five

book story reading picture writer

comments recommendations requests regulations reviews

holdings assets stockholders partnerships shareholders

slow steady strong weak rapid

chairman executive president vice director

minutes hours seconds feet yards

ought Would Will might shall

genetic biological clinical reproductive therapeutic

Table 6.4: A sample of nearest-neighbour words from the second four

thousand words


abounding teeming overflowing laboring imitating

thrusts twists mazes scars props

disapproves deplores displeases persuades errs

disillusionment disfavor rancor passivity savior

functional dynamic static analytical numerical

spores anthers contractions thermometers hybrids

halfhearted self-congratulatory unpolitical Influential earthbound

rumble crackle scamper graze hurl

newscast anchorman newsroom talk-show footage

scratched whistled tucked smelt plucked

cheerleading fifth-grade biking moonlighting scorecard

activism racism backlash homelessness environmentalist

grenades Witnesses gunmen commandos loudspeakers

les se deux des jour

profiled Located patterned latched Coupled

ham chocolate roast jam steak

crucially Collectively meaningfully Insofar Conceivably

Table 6.5: A sample of nearest-neighbour words from the last four thou-

sand words

6.5.3 Results for the last four thousand words

The first eight thousand words all occur extremely frequently in the T/G corpus. Even the

eight thousandth most common word (consume) has half a million co-occurrence counts with

the four thousand features. This compares to the word cataclysms which occurs exactly once

in the WSJ and only has fifteen thousand co-occurrence counts. Other words occurring once

fare even worse — Reykjavik has only five thousand co-occurrence counts.

Since the whole point of the word vectors was to generate quality results for rare words,

it would be very desirable for the least frequent words to cluster well. Table 6.5 presents

nearest neighbours for a selection of these least frequent words. There are still errors in this

table, but the results are surprisingly successful for such rare words. (This is probably due

to the fact that rare words are less polysemous than common words, which suggests that

finding a solution to the polysemy issue for common words would make a large difference

to the quality of word vectors.)


6.6 Summary

In this chapter, we have described a method for generating vector-based word represen-

tations using n-gram statistics, which generates similar vectors for semantically and syntactically similar words and which, on informal inspection, seems to capture similarities between words quite successfully. There are two main novelties in this approach: firstly the use of the inter-

nal matrix in singular value decomposition to support a larger lexicon than was previously

possible using this technique; and secondly, the inclusion of part-of-speech tags to encode

syntactic similarities. There are a number of further improvements which could be made

to the vector generation algorithm, in particular the use of a preprocessor to differentiate

between separate senses of polysemous words; see Section 8.2.2 for more discussion of this

issue.

In the next chapter, we will discuss how these word representations can be integrated

into my parser.


Chapter 7

Improving backoff using word

representations

At this point in the thesis, we have identified the need for improved backoff in statistical

parsing (in Chapter 3), and we have developed a vector representation of words (in Chap-

ter 6). In the current chapter, we will consider how this representation of words can be

of benefit in improving backoff in statistical parsing. But before beginning, it is worth re-

calling Klein and Manning’s (2001a) paper discussed in Section 2.4.4, which suggests that

representing words is not as important for the success of a parser as it might appear.

Klein and Manning’s result could be taken to mean that studying lexicalised probabil-

ity models is not worthwhile. However, there are still several good reasons for considering

word representations. Firstly, Klein and Manning's paper did show that words increased performance, just by less than expected. Secondly, maybe the reason words are not useful is simply

that their counts are too low. It might still be the case that grouping several words into a sin-

gle word-like category provides a useful level of representation for the parser; this is some-

thing which has yet to be determined empirically. Thirdly, the real benefit of words may be

in allowing a parser trained on the WSJ corpus to generalise to other domains of text. Klein

and Manning were looking at improvements to the parser on the same WSJ domain as it was

trained on, but it may be that deriving word representations from a big corpus including the

WSJ as well as other topics improves the parser’s performance on some of these other topics

too. Finally, I view the Neural Network technique described in this chapter as being much

more general than a word representation. For instance, it could be used to decide whether to

include word information or some other kind of syntactic information, depending on which

is more useful.

So, having concluded that there are good reasons for considering grouped word repre-

sentations in parsing, we now need to consider how we can use the vector-based word rep-

resentations we derived in Chapter 6 to improve backoff in a statistical parser. Essentially,


we will be considering modifications to Collins’ genprob function described in Section 4.3

— the function which takes an event representation as input and returns an estimate of its

probability.

There are two obvious approaches to modifying genprob using vector-based word rep-

resentations. Firstly, we could alter the function so that instead of computing the probability

of an event containing a word, it computes the probability of an event containing this word

or any semantically related words — that is, words whose vector representations are close in

vector space. Secondly, we could replace the whole genprob function with one more suited

to a distributed input, such as a neural network. The first approach will be considered in

Section 7.2, and the second will be considered in Section 7.3. As a preliminary to either

approach, however, it is important to ask how tolerant Collins’ parsing algorithm is to a

revised version of genprob. We begin in Section 7.1 by investigating this.

7.1 Feasibility study: Noise in backoff

If we are to modify the backoff algorithm, we can expect (and hope!) to get different re-

sults. Before making these modifications it is important to know how tolerant the system is

to errors in these results. For instance, when the output returned by genprob is completely

wrong it is likely to cause the parser to go down the wrong track and produce the wrong

parse, but it is also possible that the error will be isolated to the current constituent and so

have only a minor effect on the parser’s accuracy, especially if catastrophic errors are ex-

tremely rare. Similarly, our modifications may result in slight random shifts, and so it is

important to know how these will affect precision and recall. Presumably some of these

results will be better, and some will be worse. In order to determine genprob’s tolerance

to different results we can experiment by adding noise and measuring the parser’s perfor-

mance. If genprob is finely tuned so that even tiny changes in probabilities result in large

changes to the parser’s accuracy then we must be much more careful than if the probability

model is quite robust.

There are many different ways in which noise could be added to the parser. Noise could

be added to every probability, or only a certain proportion of probabilities. Also, the noise

could either be additive, so that a probability of say 0.70 gets transformed to 0.70 ± 0.01, or else multiplicative, to give [0.70/1.01, 0.70 × 1.01].

Naturally, it would be desirable to test with noise of the same type as will later be added,

but since that is not yet known we have to make an educated guess as to its properties. The

most obvious property is that we are transforming the probability for words and since every

probability derivation includes a word, we should be adding noise to every derivation. As

for the type of noise, it is less obvious if it should be additive or multiplicative. It seems


likely that very low probability events should stay as low probability, implying multiplica-

tive; at the same time what we are doing is essentially adding counts which should lead to

(scaled) additive noise. Since there is no clear answer, additive noise was arbitrarily chosen.

The effect of adding noise is measured by looking at changes to the final precision and

recall figures which it causes. This is because we hope making an error at one stage in the

parse derivation will, on average, get corrected at later stages. Our initial test results showed

a disastrous intolerance of any noise. Genprob was modified to add white noise of ±0.005

to every probability generated, and precision/recall dropped from 85% to just 15%! Further

investigation implied the noise was causing the beam to overflow with unlikely parses, and

changing the noise to only affect probabilities over 0.001 led to significant improvements.

Therefore, the problem is more with Collins’ implementation of beam search than with the

probability model.
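A sketch of the kind of perturbation used in this test is given below. It is a hypothetical wrapper around a probability estimator, not Collins' genprob itself, and it includes the tweak just described of passing very low probabilities through unchanged.

    import random

    NOISE = 0.005    # half-width of the additive white noise
    CUTOFF = 0.001   # probabilities at or below this are passed through untouched

    def noisy_genprob(genprob, event):
        """Wrap a probability estimator with additive white noise, clamped to [0, 1]."""
        p = genprob(event)
        if p <= CUTOFF:
            return p                 # keep very low (and zero) probabilities as they are
        p += random.uniform(-NOISE, NOISE)
        return min(max(p, 0.0), 1.0)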

After the tweak to leave very low probabilities unperturbed, a graph of noise against error was derived, presented in Figure 7.1. This shows that any integration of the new word representations must keep white noise to within 0.01 to have any chance of a reasonable precision/recall. In other words, any modifications to Collins' system are going to be hard. It seems the probability model is very fragile, with even slightly incorrect probabilities causing major drops in parser performance.

Figure 7.1: Graph of noise against parser accuracy (precision and recall, 0–100, plotted against the noise level on a log scale from 0.001 to 1).


7.2 Parsing by grouping nearest-neighbour words

The previous study tells us that only minimal changes to genprob are safe; anything else

is likely to destroy parser performance. Based on this, the first approach taken to integrate

the word vectors was simply to tweak genprob so that the counts for rare words are sup-

plemented with the counts from similar words (where ‘similar’ means ‘close in Euclidean

distance’). The motivation for this approach is that there are a large number of words, such

as lions, which Collins discards entirely but which we really should know something about.

Recall from the previous chapter that one of our measures of the quality of a vector-based

representation was to look at the nearest neighbour words for a set of test words. In our

final scheme, nearest neighbours seemed to be identifying genuinely similar words a fair

proportion of the time; see for example Table 6.3. We can reuse this result here by grouping

the neighbours of rare words to assist backoff.

7.2.1 Integrating neighbours in parsing

It is undesirable to make any significant changes to the parser. Such changes would risk

breaking the statistical correctness of the probability model, introducing bugs, or otherwise

affecting the parser’s performance in a way that is independent of the neighbours. Therefore

we wish to integrate neighbours a little way away from the core parsing code. One effective

method of doing this is to load the neighbour information into the parser as a simple map-

ping between a single word (w) and a set of words (W). Then, whenever an event occurs that

involves a word, we can generate a number of pseudo-events involving the members of W.

This technique has a number of advantages. Firstly, it is simple, requiring no changes to

the backoff or the smoothing algorithm. Since w is guaranteed to be a member of W, we are

guaranteed not to lose any counts. Slightly less obvious is that for backoff levels other than

the most detailed, the pseudo-events will have exactly the same properties as the old real

events and so while the numerators and denominators will change, the ratio between them

will not, and so the probability of any given event will not change. For example, doubling the size of the corpus will mean that any event matching a|b now occurs twice; every numerator and denominator doubles, and so no probability changes.

Extending this to incorporate smoothing also works provided that alpha is independent

of absolute counts. Collins defines alpha in terms of the cardinality of the set of events that

co-occur with b, and since doubling the corpus will not result in any novel events, the set’s

cardinality will remain unchanged.

The most logical way of integrating the new pseudo-events is to add a fourth level of

backoff, so that levels one, two, and three remain the same but the new level ‘two and a half’


contains pseudo-events. Curiously, we found this did not work. Modifying the smoothing

equation (Equation 3.3 on page 58) to take four inputs is easy, but we found the parser’s

performance dropped below 80%, even when the new fourth level is just a duplicate of

Collins’ third level (or a duplicate of his second level, for that matter). Since we are trying

to improve the parser’s performance, we cannot afford to have the performance drop before

any useful modifications have been made, and so we resorted to directly modifying the third

level of backoff.

Modifying the third level to load pseudo-events is quite simple to implement. The main

change involves the parser’s initialisation phase, when it creates a hash table of events from

the file of raw events in the WSJ. In the new algorithm, we read in the event file as be-

fore. However, before simply inserting the event into the hash-table, we consult the nearest-

neighbours information to transform the event, by replacing w with each of the members of

W. Since w is itself a member of W, this is guaranteed to include all the same hash-table entries as the old method, but has the same effect as if we had also seen this event with each

of the neighbours of w in the place of w. A few other minor tweaks to the parser were also

required to cope with the larger hash-tables that resulted.
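A sketch of this initialisation step is given below. The names are illustrative and the events are simplified to (word, context) pairs; the real parser stores much richer event structures. The reversed-neighbour mapping it consumes is described in the next section.

    from collections import defaultdict

    def load_events(events, reversed_neighbours):
        """Build the third-level event table from (word, context) pairs, adding a
        pseudo-event for every word that counts this word among its nearest
        neighbours.  Each word is in its own neighbour set, so all of the
        original counts are preserved."""
        table = defaultdict(int)
        for word, context in events:
            for substitute in reversed_neighbours.get(word, [word]):
                table[(substitute, context)] += 1
        return table

    # Toy usage: 'lions' inherits the event observed with its neighbour 'tigers'.
    reversed_neighbours = {"tigers": ["tigers", "lions"], "lions": ["lions"]}
    events = [("tigers", ("NP", "VB", "chase"))]
    print(load_events(events, reversed_neighbours))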

7.2.2 Reversing the neighbours

For any given rare word w, we need to create pseudo-events featuring w modelled on real

events where w could be substituted. That is, we must look for events containing neighbours

of w and substitute in w rather than looking for events containing w and substituting the

neighbours of w (which would lead to the wrong output).

Since it is easier to modify the neighbours file than it is to tweak the parser, we reverse

the neighbours by loading the mapping between words and their neighbours into a hash-

table backwards — that is, if a has a neighbour of b then we store in the hash-table that b is

a neighbour of a. We then iterate over this hash-table to produce a reversed neighbour file.

Loading the reversed neighbours into the parser now gives us the results we expect: that

when we see a word we can replace it with its (reverse) neighbours to generate the correct

pseudo-events.
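A sketch of the reversal itself, again with illustrative names:

    from collections import defaultdict

    def reverse_neighbours(neighbours):
        """Turn a word -> nearest-neighbours mapping into its reverse:
        reversed_map[x] lists every word that has x among its neighbours."""
        reversed_map = defaultdict(list)
        for word, neighbour_list in neighbours.items():
            for neighbour in neighbour_list:
                reversed_map[neighbour].append(word)
        return dict(reversed_map)

    # 'lions' lists 'tigers' as a neighbour, so an event seen with 'tigers' can
    # later be replayed as a pseudo-event for 'lions'.
    print(reverse_neighbours({"lions": ["lions", "tigers"]}))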

7.2.3 How to select a group of neighbours for a word

Having decided how to integrate neighbours, exactly what should be considered a neigh-

bour? The principal goal is to improve the probability estimate for events involving rare

words and so, especially in light of the results on the effect of noise, it would be safest to

leave the probabilities of events involving common words alone. We consider two methods

for determining the set of neighbours, which are discussed in the remainder of this section.


Using a Euclidean cutoff

One method is simply to define the neighbours of a word w to be all words whose vector

representations are within a threshold Euclidean distance of w. What should this threshold

be? It is useful to recall that we do not want to form groups around common words, as these

have high enough counts by themselves. If it happened to be that common words are more

isolated in vector space than rare ones, we could use this fact to help determine a Euclidean

cutoff which led to only rare words being grouped together.

In Figure 7.2 we plot a word’s frequency against the distance to its nearest neighbour.

Before analysing this figure it is worth mentioning that the frequency of words is derived

using their frequency in the T/G corpus and this is why we do not see the ordinary Zipf dis-

tribution. (It would be possible to generate this graph using frequencies from the WSJ, but

the resulting trend is less clear.)

Figure 7.2: Graph of the log of a word's frequency versus the distance to its nearest neighbour.

The trend we want to see is that there is a cutoff Euclidean distance at which rare words still have neighbours, but common words do not. Since the

scale between ‘rare’ and ‘common’ is continuous, this is always going to be impossible to

achieve completely, but in fact it appears impossible to achieve even approximately.

There is a clear trend that increasing the word’s frequency leads to a greater distance

to the nearest neighbour, which is a good start. There is also a clear cutoff that ‘common’

words almost always have no neighbours with a Euclidean distance less than 0.001. While

‘rare’ words do frequently have neighbours within this boundary it is absolutely not the


case that ‘rare’ words all fall within this boundary. It seems that while the graph has similar

properties to what we want, we must choose a slightly different technique.

While we cannot find a minimum threshold for common words that all rare words are

within, we can find a maximum threshold (of 0.004) that all rare words have a neighbour

within. Many common words also have neighbours within this threshold, but applying it

will significantly reduce the number of common-word neighbours we introduce. Empirical

analysis of the quality of the neighbours generated supports this approach: it seems almost

all neighbours within 0.002 are appropriate, most neighbours within 0.003 are appropriate,

as are over half of the neighbours within 0.004.

Using the N-best neighbours

The Euclidean approach generally works correctly, but still leads to some problems. As an

example of a borderline case at the threshold, is has a distance of 0.0039 to does, and so it is good that the system (just) decides to classify it as a neighbour. Similarly, aid has a nearest neighbour

of assistance, a perfect choice, but with a distance of 0.03998. However the closest neighbour

to subject is certain at a distance of 0.039. It is impossible to accept the neighbours for is

and aid without accepting the neighbour of subject. The only way to avoid such errors is to

improve the quality of the word clustering.

An alternative approach is to note that while distance is not always a good measure of

neighbour quality, it is true that the closer neighbours are generally better than neighbours

that are further away. Therefore, if we just accept the closest five neighbours of any word

then we can reject many unrelated words that have a close Euclidean distance merely by virtue of coming from a dense area of word-space.

Naturally, we must now include an explicit condition that a word’s count is only sup-

plemented with counts from its n best neighbours if it is a ‘rare’ word. Thus an arbitrary

threshold for rareness is required.
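Combining the two criteria gives a selection rule along the lines sketched below. The thresholds n_best and max_dist echo values discussed in the surrounding text, while rare_cutoff is a purely hypothetical placeholder for the rareness threshold, and the exact way the criteria are combined is an assumption rather than the thesis's precise rule.

    import numpy as np

    def select_neighbours(word, vocab, word_vectors, freq,
                          n_best=5, max_dist=0.004, rare_cutoff=100):
        """Neighbours used to supplement a word's counts.  Only rare words get
        neighbours; a candidate must be among the n_best closest words and
        within max_dist in word space."""
        if freq.get(word, 0) >= rare_cutoff:
            return [word]                         # common words keep only their own counts
        i = vocab.index(word)
        dists = np.linalg.norm(word_vectors - word_vectors[i], axis=1)
        closest = np.argsort(dists)[:n_best + 1]  # +1 because the word itself is closest
        return [vocab[j] for j in closest if j == i or dists[j] <= max_dist]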

7.2.4 Avoiding swamping counts

Despite all the precautions we have taken, the neighbours we have generated are not perfect

and using them to generate pseudo-events could harm the probability distribution of the

target word. This is true both for common words which have a very accurate probability

distribution already, and also for words with very common neighbours, since while they

need their counts increased we do not want to increase them so much as to replace the

meaning of the word entirely with that of its neighbour.

To mitigate this concern while still increasing the counts of rare words, we keep track of

the number of pseudo-events we have created for every word. Once this reaches a certain

threshold, we skip this word as a neighbour, and only increase counts for events with the word


itself. Some experimentation showed a threshold between one hundred and five hundred

makes a slight improvement to parsing accuracy, but the exact threshold within this range is

less important; apparently the main effect of the threshold is to prevent extremely common

neighbours of a word from swamping the counts of the word itself. To give an example, a

word that occurs once in the WSJ, such as eked, generates fifteen real events. A word that

generates one hundred events will occur at least ten times in the WSJ; an example would be

dive. Therefore the approach could be viewed as expanding the corpus to the point that the

parser’s old understanding of dive is comparable to its new understanding of eked.

This solution is much more effective than simply preventing common words having

neighbours, since it does allow slight tweaks to common words and more importantly, it

eliminates the need for a sharp distinction between rare and common words. Were we

able to incorporate neighbours as a separate level of backoff, this step would have been

unnecessary.

7.2.5 Summary

In conclusion, we take the neighbours generated in the previous chapter and build a file giv-

ing for each word the words which consider this word as one of their nearest neighbours.

Then we delete any neighbours that are not within a Euclidean distance of 0.005 or within

the closest five neighbours. When loading the neighbours into the parser, we keep count

of the number of pseudo-events we have generated for every word and stop once one hun-

dred pseudo-events have been created. During experimentation, we found these parameters

were quite robust, so even quite large changes in them would lead to only slight changes in

overall performance.

7.2.6 Results and discussion

It is easy to find examples where the modified parser performs significantly better than it

did before modifications. For example, consider the sentence:

But always, in years past, they have bucked the trend and have been able to pick

up a fifth vote to eke out a number of major victories in civil rights and liberties

cases.

When parsing this sentence, the unmodified parser will come across the word bucked, which

occurs exactly twice in the WSJ. Words which occur less than five times are replaced by

#UNKNOWN#, and are almost invariably nouns. Shortly afterwards, the parser will en-

counter eke. This occurs zero times in the WSJ (technically, it occurs once, in the testing section; however, recall that all training is performed with the testing section deleted). So, we will again replace the word with #UNKNOWN# and struggle to interpret the phrase sensibly. It is therefore unsurprising that

the parser parses this sentence incorrectly.

In the modified parser, we will not have to replace bucked or eke since their counts are

supplemented by their neighbours: yanked and outstrip, respectively. While these are not the

perfect neighbours, they give the parser enough knowledge to parse the sentence correctly.

Naturally, it is also easy to find counterexamples where the extra counts coupled with

a poor neighbour result in a slightly worse parse. It is necessary to measure the effects

of the changes at a corpus level. To measure the results, Section 23 of the treebank was

parsed using the modified parser. We evaluated the neighbours code on both our own code

(which reimplements Collins (1996)) and on Collins' own (1999) parser; the results of both

evaluations are presented below.

Modifications to Collins (1996)

The results of adding neighbours to our reimplementation of Collins (1996) are shown in

Table 7.1.

Criteria               Unmodified    One     Two     Three   Four    Five  (neighbours)
Bracketing Recall         85.18     85.07   85.08   85.07   84.92   84.91
Bracketing Precision      85.05     84.99   84.97   84.93   84.78   84.76
Complete match            24.83     24.70   24.83   24.74   24.57   24.57
Average crossing           1.01      1.01    1.01    1.02    1.03    1.03
No crossing               65.05     65.05   65.00   64.96   64.78   64.82
2 or less crossing        85.38     85.60   85.47   85.33   85.02   85.11
Tagging accuracy          96.50     96.56   96.53   96.52   96.53   96.55

Table 7.1: Performance of Collins' 1996 parser over Section 23 before and after integrating neighbour information

There are a few points that need to be made about this table. Firstly, the table includes a column for an unmodified version of Collins, and a column for only one neigh-

bour. Since the neighbour code considers every word its own nearest neighbour, these two

columns could be expected to be identical. They are not identical because the unmodified

version does not include any of the tweaks used to load neighbours — the largest of which was declaring all words as frequent (which, incidentally, means that Collins' code to measure the frequency of words in the WSJ is of virtually no benefit). In later columns we see only slight changes, all showing decreasing performance as more neighbours are incorporated.

None of these results would be significant on its own, but the probability of four independent results all following the same downward trend is much lower than the probability of any single result being lower. Surprisingly, it turns out that this is still not enough for significance, and therefore we must conclude that the effects on performance are inconclusive. About all we can conclude is that results did not change significantly, and if anything there was a slight loss in performance. However, it is important to ask whether this loss in performance was accompanied by an increase in generalisation, because if so then the loss is probably acceptable, but if not we should look at alternative methods of integrating words.

The area in which we expect to see the biggest improvement due to neighbours is when

parsing sentences containing rare words, because this is when generalising is most impor-

tant. In practice, the effects will be most clearly visible if the rare words are head words,

because Collins’ probabilities are only conditioned on head words. By sorting Section 23 of

the treebank by the frequency of the least frequent verb in the sentence we can create a sub-

corpus of arbitrary length of ‘sentences containing rare head words’. A sub-corpus of two

hundred sentences was created and results from parsing it using different parsers is given

in Table 7.2. There seems no support from this table for the hypothesis that the neighbours

Criteria Neighbours

1 3 5

Bracketing Recall 83.28 82.89 82.75

Bracketing Precision 82.98 82.74 82.65

Complete match 14.00 14.5 14.50

Average crossing 1.64 1.69 1.70

No crossing 57.50 57.50 57.50

2 or less crossing 81.00 80.00 79.50

Tagging accuracy 96.29 96.25 96.23

Table 7.2: Performance of Collins’ 1996 parser over a sub-corpus of two

hundred sentences containing rare verbs, before and after integrating tag

information

code leads to improved performance for rare head words.

Modifications to Collins 1999

One concern with the above results is that we have used Collins 1996 as a baseline, since

that was the parser we reproduced in Chapter 4. A valid question is whether the same

properties hold when applying the technique to Collins’ later work, since it obtains much

better performance (88% instead of 85%). Since Collins has now released the source code


of his 1999 parser, modifying it to create virtual events for neighbours is relatively easy. In

Table 7.3 we show the performance of Collins’ 1999 parser without modification, after being

tweaked to load neighbours.

Criteria               Unmodified    One     Two     Three   Four    Five  (neighbours)
Bracketing Recall         88.52     88.50   88.47   88.54   88.40   88.41
Bracketing Precision      88.68     88.72   88.66   88.74   88.61   88.61
Complete match            36.04     35.81   35.63   35.63   35.46   35.55
Average crossing           0.92      0.90    0.91    0.90    0.91    0.91
No crossing               66.68     66.99   66.99   67.13   66.99   67.08
2 or less crossing        87.13     87.44   87.17   87.12   87.08   87.17
Tagging accuracy          96.74     96.82   96.79   96.79   96.79   96.80

Table 7.3: Performance of Collins’ 1999 parser over Section 23 before and

after integrating neighbour information

While the neighbours extension to Collins (1996) showed a gradual decline in perfor-

mance, the extension to Collins (1999) seems to show no change; the fluctuations in scores

are well below the significance threshold, and are best attributed to noise. This is somewhat

more promising. Table 7.4 shows the performance of the neighbours extension on the rare-

verb corpus. In this table, we finally see some support for our hypothesis.

Criteria               1 neighbour   3 neighbours   5 neighbours   7 neighbours
Bracketing Recall          87.49         87.66          87.71          87.51
Bracketing Precision       87.78         87.80          87.92          87.84
Complete match             27.00         26.00          26.00          26.00
Average crossing            1.48          1.40           1.47           1.51
No crossing                60.00         59.50          60.00          59.00
2 or less crossing         83.00         81.50          82.00          82.00
Tagging accuracy           96.59         96.59          96.65          96.61

Table 7.4: Performance of Collins' 1999 parser over a sub-corpus of two hundred sentences containing rare verbs, before and after integrating neighbour information

Examining these differences using Dan Bikel's compare.pl program for significance gives a confidence of

80% that each result is significant, somewhat less than the 95% that is necessary. However,


again we are viewing results in isolation. If we ask instead if increasing the number of neigh-

bours tends to increase performance then we do get a statistically significant result. From

this we can conclude that the integration of neighbours does improve the performance of Collins' (1999) parser, though by less than we would have hoped.


7.3 Parsing using a neural network probability model

While the previous approach was well justified in that it made safe modifications to a fragile

model, it did not have much scope for significantly improving the performance of Collins’

system on test data from the same genre as the training data. Among the faults in Collins’

model is its inability to identify which of its parameters is causing it to lose high counts, and to smooth over that parameter specifically. Essentially, the model implements an indefeasible rule about

the order in which information is to be thrown away. But there is no reason to suppose that

this order will be the same for every construction in the grammar. For instance, sometimes

the identity of the head word of a construction might be more important than details about

its subcategorisation frame, while other times the subcategorisation frame might be more

important than the word.
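To make this concrete, a fixed backoff order amounts to something like the following sketch. It is purely illustrative: the context tuples, the counts dictionaries and the simple first-seen-context rule are my simplification, not Collins' actual interpolation formula.

def backoff_estimate(counts, totals, contexts, event):
    # Return the relative frequency of `event` under the most specific context
    # that has been observed, backing off through `contexts` in a fixed order
    # that is the same for every construction.
    for context in contexts:            # e.g. (word, tag, subcat), (tag, subcat), (tag,)
        if totals.get(context, 0) > 0:
            return counts.get((context, event), 0) / totals[context]
    return 0.0

# Toy usage: the same word-first order is applied regardless of construction.
contexts = [("sold", "VBD", "NP-C"), ("VBD", "NP-C"), ("VBD",)]
totals = {("VBD", "NP-C"): 120, ("VBD",): 800}
counts = {(("VBD", "NP-C"), "right-NP"): 30}
print(backoff_estimate(counts, totals, contexts, "right-NP"))   # 0.25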

While this is easy to see as a problem, it is not obvious how the problem can be resolved:

a hash-table capable of storing every permutation of every event would be inconceivably

large. However, hash-tables are not the only way of encoding large amounts of highly re-

dundant information. Neural networks have been used for this purpose by a large number

of people (see for example Stuart, Cha, and Tapper (2004)).

A neural network is an adaptive algorithm that attempts to mimic the functional mapping between some input and output by minimising the error between its approximation and the target (training) output. Neural networks are particularly suitable when the mapping is too complex for a specific algorithm to be designed, but nonetheless has a number of useful properties, the most important of which is that it is continuous.

The potential advantages of applying a neural network to genprob are significant. It

would allow an arbitrarily large input vector to be used, so researchers could experiment

with far more complex statistical models than have been considered to date. Neural networks are well known for interpolating well between data points, and so for degrading gracefully as we move away from the training points, such as on a novel combination of words.

Replacing genprob by a neural model involves several tasks. The first is that all of the

inputs must be mapped onto a continuous space, so that small changes in the input vector

only lead to small changes in the output vector. This will be discussed in Sections 7.5, 7.6 and

7.7. The second task is to decide on suitable training data and parameters for the network;

this will be discussed in Section 7.8.1. The training of the resulting networks is discussed in

Sections 7.9 and 7.10. Having assembled all of the trained networks, the method of integrating them into the parser is discussed, with associated results, in Section 7.11.

We begin in Section 7.4, however, by introducing the Cascade Correlation neural network

architecture.


7.4 Cascade Correlation

The Cascade Correlation architecture was chosen as the most suitable for all of the neural

networks. Cascade Correlation is a supervised, constructive neural network architecture,

meaning that it requires direct training data, and that it grows to match the complexity of

the problem. Because it uses supervised learning, Cascade is faster and easier to train than

networks relying on indirect feedback. Being a constructive neural network architecture,

Cascade requires fewer parameters than other algorithms. Within constructive neural net-

works, Cascade is the most popular due to its extremely fast training. A full description of the Cascade Correlation learning algorithm is too long to present here, but a simple overview is useful; for a full description, see Fahlman and Lebiere (1990).

Cascade Correlation can best be understood by contrasting it to Quickprop (which is itself just a more complex version of the basic Backprop algorithm). The algorithm starts

with no hidden units and attempts to learn a mapping directly from the input to the output

using the Quickprop algorithm. Since this is training the connections to the output, it is

known as the output training phase.

Learning a mapping directly from the input to the output will be impossible for all but

the simplest of problems and so Cascade will usually fail. Assuming it fails, Cascade creates n candidate hidden units, each connected to all the input units, and trains each candidate unit such that the unit is most active when the network's error is at its highest. This

is known as the candidate training phase. Essentially, this phase is learning feature detec-

tors for the situations that cause errors. The candidate training phase completes either when

the best candidate unit has reached a plateau (the training is said to stagnate), or too many

epochs have passed (called a timeout). Of these, stagnation is an indication that the network

is still learning, while a timeout is an indication that the problem is too hard.

Regardless of why the candidate training phase stops, the best candidate unit is then

added to the network as a new hidden unit, with the other candidates being discarded. The

new hidden unit is connected to every unit in the network, including the output units, and

finally the connections to the output units are trained. (This is the output training phase

again.) The process of training a single unit, adding it, and then training the outputs is

repeated until either the maximum number of units have been added, or a victory error

criterion is achieved.
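The quantity driving the candidate training phase is the correlation between a candidate's activation and the residual error at the outputs. A minimal numpy sketch of that score follows; the variable names and toy data are mine, not Fahlman and Lebiere's code.

import numpy as np

def candidate_score(candidate_activations, residual_errors):
    # Cascade Correlation candidate score: the summed magnitude of the
    # covariance between a candidate unit's activation (one value per
    # training pattern) and the residual error at each output unit.
    v = candidate_activations - candidate_activations.mean()
    e = residual_errors - residual_errors.mean(axis=0)
    return np.abs(v @ e).sum()

# The best of the candidate pool is the unit with the highest score; it is
# the one installed as the next hidden unit.
rng = np.random.default_rng(0)
errors = rng.normal(size=(100, 4))                 # 100 patterns, 4 outputs
pool = [rng.normal(size=100) for _ in range(8)]    # default pool of eight candidates
best = max(pool, key=lambda act: candidate_score(act, errors))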

7.5 Testing the vector representation of words

The development of a vector representation of words was the goal of Chapter 6. Here we

will examine if the representation that was developed is going to be suitable for input to a

neural network. The way we will do this is by training the neural network to perform rote


memorisation of word mappings and then test it on both perfect recall and on its ability to

generalise.

While the theoretically interesting parts of a statistical parser are undoubtedly its ability

to generalise with ungrammatical input and/or poor knowledge, a great deal of what the

parser does in practice is essentially looking up mechanical rules. Writing 1.000 next to

these rules and calling them probabilities does not change the fact that the parser is acting

as a glorified state transition machine. Because of this, the neural network must be able

to perform as a very accurate lookup table when required. This point cannot be stressed

strongly enough: from a research perspective, the interesting aspects of a statistical parser

are the fuzzy areas, but from a pragmatic perspective, parsing is almost always a simple

rule-based system.

This study is a simple test of whether a neural network can be trained to perform lookup

on the word vectors. That is: can it reliably map word vectors into words? A related and

harder test is whether, when presented with a novel word vector, it can map it to a good choice of word; this test is extremely similar to the word generalisation we are hoping to achieve.

There are a number of published results demonstrating that neural networks can be trained to perform both lookup and generalisation using the same network (see for example Garfield and Wermter (2003)), but it is important to perform this test with my word vectors because my vectors may well be harder to memorise than those in previously published work. It is also

important to perform the test using the same neural network architecture as is used in the

final parser implementation. My parser implementation uses Cascade Correlation, and so

Cascade was also used here.

7.5.1 Mapping words to words

Neural networks that map between one representation and another are an effective method

of determining if a neural network can extract salient features from the representations.

In Chapter 6 we developed a vector representation of words. Here we take the first fifty

elements from the vector representations of each word and use this as an approximation

of the word's vector. The choice of fifty cells is based on previous neural network studies, which show that networks have significant trouble with more than a few hundred units; fifty units per word leads to the most complex of the networks (dependency) having a little over two hundred input units.

One avenue of research that was not examined was reducing the size of the input vector and investigating whether the network could still learn. This was not pursued because Cascade has demonstrated in the past that it is extremely good at ignoring irrelevant information, and so smaller vectors are very unlikely to improve results here. However, decreasing the vector size would be an interesting avenue to explore


from a theoretical perspective, because it would provide some insight into the amount of

complexity in the English lexicon, or at least the complexity that was captured in the vector

representation.

Output from the network is one unit for every possible target word, and so four hundred

units for this first network. This is a large number of units, but this is balanced by their

simplicity.

Sigmoidal activation functions are used for all hidden units. They are the default unit

in Cascade, and there is no reason to believe they are inappropriate here. For the output

layer, an asymmetric sigmoidal activation function is used instead because values of zero

to one are more intuitive than values between −0.5 and 0.5. This is also in line with the

documentation. The default pool size of eight candidate units was used.

This brings us to the error measure. The error measure used was bits, which means there is an error of one bit if a unit is on when it should be off, or off when it should be on. This differs from the standard error measure, which Cascade refers to as index, in which the amount of error is a scaled measure of the difference between the actual output and the target. The difference between these two measures is significant: measuring error using the normal RMS approach would result in the network being encouraged to get outputs

extremely close with occasional large errors tolerated, while the bits measurement results

in no benefit to the network for choosing the right answer with a higher confidence, and a

significant penalty for each wrong answer. The maximum tolerated error was set to zero,

meaning the network is not allowed to incorrectly classify any words. It is believed this

combination of bits with no tolerance to error will result in a network that is fully trained

but not over-trained, although this was not extensively investigated.
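The distinction can be made concrete with a small sketch. This is my own approximation of the two measures rather than Cascade's exact code; in particular the index measure is stood in for by a mean squared error.

import numpy as np

def index_error(outputs, targets):
    # 'Index'-style error: a scaled measure of the difference between the
    # actual outputs and the targets (approximated here as mean squared error).
    return float(np.mean((outputs - targets) ** 2))

def bits_error(outputs, targets, threshold=0.5):
    # 'Bits'-style error: one bit of error whenever a unit is on when it
    # should be off, or off when it should be on.
    return int(np.sum((outputs > threshold) != (targets > threshold)))

# A confidently wrong unit and a marginally wrong unit each cost one bit,
# but they contribute very differently to the index measure.
outputs = np.array([0.95, 0.05, 0.55])
targets = np.array([1.00, 0.00, 0.00])
print(index_error(outputs, targets), bits_error(outputs, targets))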

7.5.2 Evaluation

Results from the initial network were successful with the network making zero errors af-

ter a little over a minute3 and using seven hundred and fifty epochs and six hidden units

on average. As a rough estimate of the complexity of the problem, ninety percent of the

words were correctly classified without any hidden units, implying that the input repre-

sentation was almost good enough to make the problem linearly separable. Generalisation

after learning just four hundred words was remarkably good, with a small sample being

presented in Table 7.5. Given the tiny training space, this result is quite surprising. In this

table, the left column being identical to the middle column is unsurprising since this is the

result we trained the system to produce. However there is a strong relationship between

the left column and the right column, which is very promising for the network's ability to generalise.

3 Times taken throughout this section are informal — based on a single run on a multi-user system under varying load. They are intended as ballpark figures.

Word       Nearest    Next nearest
the        the        this
is         is         does
as         as         through
has        has        was
will       will       would
market     market     value
But        But        and
share      share      income
shares     shares     securities
Inc.       Inc.       Corp.
prices     prices     markets
interest   interest   real
earlier    earlier    ago
buying     buying     selling

Table 7.5: Learned mapping of words to words from four hundred words

A much harder test of generalisation is evaluating it on words it has never seen. In this

test we derive the best mapping for the first four hundred words that the network was not

trained on. That is, words 401 to 800. Predictably, it performed much worse on this as is

shown by Table 7.6. While results in this table are not especially good, it is remarkable there

is any generalisation at all. This bodes very well for the later tests. In subsequent runs, I

used a lexicon of one, four and ten thousand words with equally good results.

In presenting these results we have not concentrated significantly on how hard it was for the network to learn the task, or on how well it learned the task, that is, on the ability of the neural networks to generalise. This omission is deliberate. We have not concentrated on ease of learning because we do not know how closely this task correlates with genprob; and in order to fully test the network's ability to learn the mapping we set the error tolerance to zero, which causes the network to significantly overfit the data and sacrifice generalisation for accuracy. Were we interested in testing generalisation, we would have tolerated some error and so stopped training before the network overfitted its training data.

Word            Nearest      Next nearest
production      industry     use
total           small        foreign
third-quarter   major        revenue
President       American     president
notes           bonds        interest
25              20           1989
London          West         Mr.
latest          financial    late
credit          debt         loss
earthquake      loss         past
almost          far          only

Table 7.6: Evaluation of network generalisation after learning from the first four hundred words

7.6 A vector representation of tags

While the goal in shifting to a neural network was principally to smooth more finely over

words, it is essential that all inputs to the neural network are vectors and so we need a

vector representation of POS tags. By far the easiest method would be a simple enumeration,

with one bit for every POS tag. There are fifty tags in the WSJ so a simple enumeration

would lead to a fifty bit vector. This would be within the limits of the number of nodes that

a network can process, but since a single dependency event uses five tags, two-hundred

and fifty nodes is probably too many for just the tags in the input. Apart from a simple

enumeration pushing the boundaries, it seems very wasteful that say NNPand NNPSshare

no information in common. Again, we have a problem of dimensionality reduction.

My first approach to generating a vector representation was to create one by hand. With

only fifty tags my intuition was that it would be faster to hand-encode a representation than

to write software to derive a representation automatically. Unfortunately, writing a vector

by hand proved even more difficult than examining the raw word vectors directly. Based

on this observation, the second approach was to create a dendrogram by hand. This proved

slightly more successful than writing the vectors directly, but the only way I had of encoding

this dendrogram as a word vector was as a binary classification tree. Such a representation

requires a bit in the representation for every single branch in the tree, so this approach ends up with the same number of nodes as the simple enumeration — we have not gained anything.

Because they have not resulted in reduced dimensionality, manual approaches were

abandoned and an attempt was made to recreate the process used in generating vectors


for words. Again we need a corpus of ‘words’ (in this case, the words are tags) and that

corpus must be much larger than the ‘lexicon’ (the set of all tags). While the lexicon for

words had a cardinality of around forty thousand, there are only fifty tags and so it seems

reasonable that a corpus of around one thousandth of the size of T/G will be adequate, that

is, one million words. We already have such a corpus: after the words are removed, the WSJ contains a little over a million sequential tags (after Section 23 is removed).

Running the code from Chapter 6 over the corpus of tags required a few minor tweaks,

such as using RMS scaling instead of logarithmic, but nothing significant. A total of twenty-

two dimensions were selected from the output of SVD because this is where the contribution

of each dimension starts to drop off rapidly. In Section 6.4.7 we used these tag vectors to

assist in giving the word vectors a global hierarchy. Then, we used vectors with a length of

just fifteen, but analysis of the standard deviations from SVD shows that fifteen is perhaps

a little too liberal. The fifteenth component has a contribution of 0.03 (out of a total of one),

but it is not until we get to the early twenties that the contributions fall away: the twenty-

first component has a contribution of 0.004, the twenty-second of 0.002 and the twenty-third

of 0.0000004.
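The pipeline being described, bigram counts over the tag sequence followed by SVD with a cut-off where the contributions drop away, can be sketched as follows. This is illustrative rather than the Chapter 6 code, and the square-root scaling is only a stand-in for the RMS scaling actually used.

import numpy as np

def tag_vectors(tag_sequence, keep=22):
    # Build a bigram co-occurrence matrix over a sequence of POS tags and
    # reduce it with SVD, keeping the leading `keep` dimensions.
    tags = sorted(set(tag_sequence))
    index = {t: i for i, t in enumerate(tags)}
    counts = np.zeros((len(tags), len(tags)))
    for left, right in zip(tag_sequence, tag_sequence[1:]):
        counts[index[left], index[right]] += 1
    scaled = np.sqrt(counts)                    # stand-in for RMS scaling
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return {t: u[index[t], :keep] * s[:keep] for t in tags}

vectors = tag_vectors("DT NN VBZ IN DT JJ NN . DT NN VBD .".split())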

As with the words, raw vectors are impossible to interpret and so a dendrogram is given

in Figure 7.3. To explain some of the relationships in this figure: SVD has found that full

stops are closely related to colons and the end of sentence marker. For nouns it found that

common nouns are similar to plural common nouns, and that proper nouns are similar to

plural proper nouns. It also found that foreign words are more similar to proper nouns

than to common nouns. Verbs are also well clustered, with modal verbs, past tense verbs

and third-person present tense verbs forming a cluster, and so on. Definitions for all of the

elements in this figure are given in Appendix A.

There are very few studies in the literature which can be usefully compared to this

dendrogram. An extremely large number of works, such as (Ushioda, 1996; Powers, 2001;

Schutze, 1998; Finch, 1993) produce dendrograms of words and note the POS structures that

are being formed as a side-effect, but I am not aware of any attempts to produce such a den-

drogram directly. This makes evaluation a little more complex since we have no reference

points, but overall the dendrogram looks quite good.

Returning to neural networks, the first test of the tags is to attempt to learn a mapping

between their vector representation and an enumeration of all tags, much as we did for

words in Section 7.5.1. The results from this test were remarkably successful. With only

fifty training patterns it is to be expected that the network would learn the mapping; what was surprising was that the network was able to generalise from only fifty training points. The

neighbours generated using this method are significantly better than those generated using

the Euclidean approach.

Figure 7.3: Dendrogram of tag representation

A more neural-network oriented approach for deriving tag vectors would have been to

implement a simple predicting neural network, much like a POS tagger that has no words,

and then use the hidden units for the representation. This approach was not taken prin-

cipally because I already had perfectly working code for generating vectors from words

and so it was easier to reuse the more complex but existing code than to build the simpler

approach from scratch.

7.7 A vector representation of nonterminals

As with tags and words, it is necessary to derive a vector representation for nonterminals.

Like tags, a raw enumeration would be possible but is perhaps undesirable since there are

close to one hundred nonterminals. Also, it would be good if, for example, NP and NPB

have a similar representation. One problem with deriving this representation is that the par-

ent of a word is its part-of-speech tag, which makes it necessary to treat all part-of-speech

tags as nonterminals. This is a problem because it doubles the number of nonterminals,

significantly increasing the complexity of building the tag representation. Another prob-

lem is that while the method for deriving tag representations was fairly obvious since tags

occur sequentially, the method for deriving nonterminal representations is not obvious. Re-

viewing the literature I was unable to find anybody who had attempted to derive a vector

representation for nonterminals, so this representation can be considered a somewhat naïve first step.

Ideally the vectors should have similar representations for nonterminals when they are

interchangeable. Furthermore, it is known that the bigram algorithm used for tags and words produces similar representations for events with similar bigram counts. Based on this, the approach is to derive an event corpus which lists all nonterminal uses, as defined by Collins' event concept. For example, Collins looks at unary events, where a nonterminal selects its parent, so all such events are extracted from the WSJ, simplified to include just the nonterminals, and stored in a table. There are six such tables: head → parent, parent → head, head → left (adjacent), head → right (adjacent), head → left (non-adjacent), and head → right (non-adjacent). This approach generated about five thousand events, and output of

acceptable quality, as shown in Figure 7.4.
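As an illustration of the first of those tables, head → parent counts could be collected along the following lines. The tree encoding and helper names are assumptions made for the sketch, not the thesis' actual event-extraction code.

from collections import Counter

def unary_events(trees):
    # Collect head -> parent co-occurrence counts from parse trees, where each
    # node is a (label, head_child_index, children) triple and POS tags are
    # treated as nonterminals.
    counts = Counter()
    def walk(node):
        label, head_index, children = node
        if children:
            head_label = children[head_index][0]
            counts[(head_label, label)] += 1    # the head nonterminal 'selects' its parent
            for child in children:
                walk(child)
    for tree in trees:
        walk(tree)
    return counts

# Toy usage with a single tiny tree, (S (NP) (VP)) headed by the VP.
print(unary_events([("S", 1, [("NP", 0, []), ("VP", 0, [])])]))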

However the importance of a good nonterminal representation means attempts were

made to improve the results. This was done by asking a linguist4 to hand encode something

approximating the desired output and use this as extra training data. Hand encoding a full

vector representation proved too difficult, but instead a simple classification of tags was

made, which is presented in Table 7.7. For nonterminals, a more complex classification was developed; this is given in Figure 7.5, which also includes the tag classifications just mentioned.

4 Thanks to my supervisor, Alistair Knott.

Figure 7.4: Dendrogram of nonterminals produced using only unsupervised training

Category   Members
Nounish    CD NNP DT NNPS IN NNS JJ POS NN
Verbish    , quotes CC VBD RB DOT VBN TO PRP VBP UH VB MD VBZ VBG EX
Other      FW JJR JJS LS PDT WDT RBR symbols RP WRB RBS WBR WP

Table 7.7: Hand encoded categories for POS tags

Figure 7.5: Hand encoded representation of nonterminals

Encoding this tree into the computer is relatively easy, especially by representing it as a directory tree, which means membership can be computed with built-in commands and the programming required is reduced to simple shell scripts. Once encoded, the data cannot be integrated by simply adding the information as extra features, because it would result in too many features. Instead, we add the information as an extra type of event, and so can use PCA to choose what information to keep. This proved to work very well at detecting and removing redundant information. The standard deviations showing the importance of different dimensions do not have a sudden drop like the tags do, making it hard to decide the number of dimensions to keep. The first five dimensions capture approximately forty percent of the variation, the first ten capture approximately fifty percent, the first fifteen capture almost sixty percent, and the first twenty capture about sixty-four percent. While this is clearly diminishing returns, it is not obvious where to draw the line, so fifteen dimensions were chosen, as fewer dimensions appear to produce a worse dendrogram. Since there is no similar work in the literature, it is impossible to compare this result to others. Overall, it appears the approach works quite well.
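The choice just described amounts to reading off the cumulative variance captured by the leading components. A minimal sketch of that calculation follows; the percentages quoted above are the actual measurements, and this code is only an illustration of the procedure.

import numpy as np

def cumulative_variance(singular_values):
    # Fraction of the total variance captured by the first k components,
    # for every k.
    var = np.asarray(singular_values, dtype=float) ** 2
    return np.cumsum(var) / var.sum()

def dimensions_for(singular_values, target):
    # Smallest number of leading dimensions whose cumulative variance
    # reaches the target fraction.
    return int(np.searchsorted(cumulative_variance(singular_values), target)) + 1

print(dimensions_for([3.0, 2.0, 1.5, 1.0, 0.5], 0.9))   # 3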

Returning to the now familiar neural network enumeration test, we are able to learn the

nonterminal mapping in just nine seconds (two thousand epochs), with the resulting net-

work containing sixteen hidden units. This test was as successful as the others: all nonter-

minals mapped correctly and showed some good generalisation such as the nearest neigh-

bour to a gapped sentence being a prepositional phrase, or that nonterminal complements

are more similar to other nonterminal complements than to nonterminal adjuncts.

In conclusion, results are better than would be expected for a first attempt at a vector

representation of nonterminals. Though they are not as good as the tag representations,

they are likely good enough for our purposes.

7.8 Neural network design

We have now shown that the quality of our input is high enough for processing by a neural

network. This still leaves us with two tasks before we can begin to replace genprob. The

first of these is deciding what to use for training data, and the second is deciding on the parameters we are going to use. These will be discussed below.

7.8.1 Training data

There are two obvious sources of training data for a neural network parser: the raw event

file or actual output from genprob. Using genprob would be easiest — it is a function that is

called many times and so simply logging its inputs and outputs will provide an unlimited

amount of training data. It would be preferable if we could use the event file directly since

we are trying to improve on genprob, which is very hard to do when genprob is the training

data being used.

However, it is not obvious how to use the event file for training. Logging from genprob

will lead to training instances in a format similar to event → probability, the exact format necessary to replace genprob. Using the raw event file would instead lead to training instances similar to rhs → lhs. Converting this into a probability would require something

like placing an enumeration of all possible lhs on the output and using a winner-take-all

learning strategy.

Even if we use the winner-take-all approach, there are still a number of serious prob-

lems. The number of outputs is infeasibly large in places; for instance, dependency events generate a word, tag, and nonterminal, for 50,000 × 50 × 100 possible outputs, several orders of magnitude higher than any network can manage. Restricting ourselves to events seen

during training would reduce the number of possible outputs to an acceptable number but would prevent the parser generating novel combinations. Even if we are able to somehow

represent the outputs (perhaps using three neural networks where each one generates a

component of the event, analogous to Collins generating the tag and nonterminal before the

word) we run into the problem that we are training with only a tiny number of positive in-

stances. Ordinary training algorithms are unable to learn under these circumstances and so

we would have to investigate nonstandard training techniques (for example, pseudo-events or training in parts).

Figure 7.6: Dendrogram of the representation of nonterminals

Since I was unable to design a network that used the event data for training, I was forced

to use the simpler logs from genprob. The main goal is thus to test the feasibility of imple-

menting a statistical backoff system using a neural network, rather than make an improve-

ment on the state-of-the-art. Even if this is less than ideal, it might nonetheless be that the

network is able to learn to interpolate better than the algorithm which provided its training

data.
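Collecting this training data is then just a matter of wrapping genprob and writing out each call. A sketch of such a wrapper follows; the function names and the one-line file format are mine, not Collins' code.

def make_logged_genprob(genprob, encode_inputs, log_path):
    # Wrap a genprob-like function so that every call is also written out as
    # one training instance: the encoded input vector followed by the
    # probability genprob returned. (The log file is left open for brevity.)
    log_file = open(log_path, "a")
    def logged_genprob(*args):
        probability = genprob(*args)
        vector = encode_inputs(*args)          # words/tags/nonterminals as vectors
        log_file.write(" ".join(f"{x:.6f}" for x in vector))
        log_file.write(f" {probability:.6f}\n")
        return probability
    return logged_genprob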

7.8.2 Neural network parameters

Even having decided to train from logged events, there are a number of details that need to

be worked out, and these will be discussed next.

Multiple networks for different genprob calls Recall that Collins’ genprob function is

called with different numbers of parameters for different types of event. It makes sense to

reproduce this by creating a different neural network for each event type. Thus we need

six separate networks, for tag events, unary events, prior events, subcat events, dependency

events and top events. The tag network corresponds to my POS tagger and is used here

as it was in the parser largely as a prototyping tool. (If Collins’ model three were to be

implemented, then an extra network would be needed to generate gapping information.)

Many zero outputs We have already seen that genprob is intolerant to noise. Another

major concern is that the distribution of genprob's outputs is not at all Gaussian. Almost 99% of the calls to genprob result in a generated probability of zero. Even amongst non-zero values, the output is

non-Gaussian, as is shown in Figure 7.7. This graph is best categorised as a bimodal Poisson

distribution, so that most of the time we should choose to produce a value of approximately

zero, while in about ten percent of the remaining instances we should produce a value of

approximately one.

Figure 7.7: Probability of different outputs from genprob, after outputs of zero are excluded (N = 25981, bandwidth = 0.05026)

Representation of input units The vector representation for tags, nonterminals and words has already been described. If any of these turn out to be insufficiently precise then they

can be supplemented with an enumeration; supplementing them would however have the

disadvantage of greatly increasing the number of units. The best input representation thus

needs to be determined empirically. For subcategorisation frames and the distance metric,

the representation used will be discussed where these are used.

Single vs multiple output nodes The genprob function only has one output, the proba-

bility. However, it is equally possible for the neural networks to have many outputs, with

each output corresponding to one possible value that the parser could be generating. For

instance, one value could correspond to generating a parent of SBAR, while another cor-

responds to S. In practice I found some networks favoured a single output, while others

favoured multiple outputs. These will be discussed in the tuning section (Section 7.9).

Activation function Hidden units use sigmoidal activation while output units use asym-

metric sigmoidal activation.

Number of hidden units A maximum of six hundred hidden units is permitted for each

network. It is more likely that this is too many (so permitting overfitting) than that it is too

few. Later we found this was much too many, and most networks use only a tiny fraction of the permitted six hundred.

Number of ‘candidate’ units Recall that a candidate unit is a potential hidden unit con-

sidered by Cascade. Due to concerns that with so many training epochs we could easily

have weights going to infinity, sixteen candidate units are permitted instead of the default

of eight.

Volume of training data Even having decided every single property of the training data,

one critical aspect that has not been addressed is, how much data should we use? Since

training data is derived from genprob, it is available in virtually limitless quantities. The

rule of thumb that ‘more is better’ does not apply to neural networks. Increasing the amount

of training data will effectively increase the amount by which the weights change every epoch, which can easily cause a network to fail at learning a task in much the same way as setting the learning constant too high will.

Unique training data Recall that training data is generated by sampling the genprob func-

tion. Since genprob is frequently called with the same parameters, this means the sample

has a significant number of duplicate entries. For instance, a random sample of one hun-

dred thousand calls resulted in just over six thousand unique calls. This may be a good

thing because it will encourage the neural network to assign more importance to data that


occurs frequently, which is why it has not been discussed up until now. It may also be a bad

thing, since it will both cause overfitting and will result in training on just a tiny sample of

the space at the optimal training size of around one-hundred thousand training instances.

Incremental training A technique for training a neural network with a large amount of

data is to partially train the network with a sample of this data first, to produce a rough

approximation of the weight space, and then incrementally train the resulting network with

larger samples of the data. This approach may be necessary for some of our larger networks.

Number of training epochs The maximum number of epochs for training connections to

and from each candidate hidden unit was set to five thousand. While many runs timed out at five thousand epochs before stagnating, it was decided that increasing the limit would be as likely to hinder the networks (by encouraging false progress) as to help them (by avoiding premature locking of the weights).

Validation data Cascade supports testing data to detect when overfitting is outweighing

the learning of the basic function. This is implemented by running the test set on the inter-

mediate network and outputting the current error index. This facility is used throughout

the results, with the same amount of test data as training data. It would possibly have been

better to use a consistent amount of testing data instead, but varying the amount makes it

easier to cope when the functions being simulated do not have so much data. The validation

approach is somewhat simpler than cross-validation, but again we are looking for successful

networks rather than proof of consistent repeatability.

Error measure The difference between measuring error as an index or in bits has already

been discussed for the enumeration networks in Section 7.5.1. There we decided ‘bits’ were

appropriate since we wish to get binary values correct and are concerned about overfitting.

Here however, ‘index’ is often appropriate since we are both dealing with real-valued data

and we often wish to get very high accuracy on common training instances. We will swap

between these measures for the different networks as appropriate.

Victory criterion The victory criterion is not set due to concerns about overfitting.

Evaluation of trained networks The error measure is not a perfect measure of the quality

of the network, so it is useful to consider other (more accurate and expensive) methods

of evaluating a trained network. A very useful visual aid is a scatterplot of the network’s

output against the output of genprob which it was trained to reproduce. This can help

identify where the network makes the mistakes it does. Another evaluation method is simply


to run the parser using the network as a replacement for genprob, and see what its effect on

precision and recall is. Obviously both procedures are very time-consuming; they cannot

be used as part of the process of training, but they are useful in helping to decide between

alternative network designs and training regimes.

Evaluation of reproducibility Normal practice in neural networks is to test that results

are reproducible by performing multiple trials. There are two reasons this is not performed

here. Firstly, we are not so much interested in reproducible results as we are in a trained

network, and secondly at several days to train each network it is impractical to wait the

weeks that multiple trials would require.

7.9 Training the tag network

As in the reimplementation of Collins’ parser, I began by considering the task of POS tag-

ging, since this task has all of the complexity of the other parsing tasks, but can be inde-

pendently evaluated. My tag network was therefore a prototype for my other networks;

accordingly, I will motivate its design and training regime in some detail, and present the

other networks more succinctly. In addition, I include discussion of techniques here that are

not necessary for the tag network, such as incremental training, because they can be most

accurately evaluated on the tag network.

Recall from Section 4.4 that the tagger is a simple HMM based POS tagger. It does not

include the argmax that is standard in any high-performance tagger (see for example Char-

niak, Hendrickson, Jacobson, and Perkowitz (1993)) because it was developed to fit with

the parser as closely as possible rather than as a high performance tagger. The probability

model uses the previous two tags, the current word and the current tag. The exact number

of input and output units this corresponds to in the neural network depends on a few de-

sign decisions. If tags have an enumeration of nonterminals included and a single output

node then this gives a total of four hundred and forty units. If the single output node is

replaced with multiple outputs then this eliminates the need for one of the input tags and so

reduces the network’s size to three hundred nodes. If tags do not include an enumeration

of nonterminals then these values drop to one hundred and sixteen and one hundred and

fifty nodes respectively. So, the choice of representation has a huge effect on the complexity

of the neural network.

Actually integrating the neural network into the parser is quite simple. The distribution

of Cascade already includes code for performing feed-forward operations, and so all that

is required is to encapsulate it into a class so as to hide its global variables from the rest

of the system. A little extra work is required for each operation in that the input is in the

parser's internal format rather than the vector format, but a simple lookup suffices for the transformation.

Figure 7.8: Plot of errors in the tag network against units used
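The integration described above can be pictured as a thin wrapper class. The names below are illustrative, not the actual class in the parser, and feed_forward stands in for Cascade's feed-forward code.

class NeuralGenprob:
    # Hides the neural network (and its global state) behind a genprob-like
    # interface: parser-internal ids are looked up in vector tables, the
    # vectors are concatenated, and the trained network is run forward.
    def __init__(self, feed_forward, vector_tables):
        self.feed_forward = feed_forward        # trained network's forward pass
        self.vector_tables = vector_tables      # e.g. {'word': {...}, 'tag': {...}}

    def probability(self, items):
        # `items` is a sequence of (symbol_type, internal_id) pairs in the
        # order expected by the network's input layer.
        inputs = []
        for symbol_type, internal_id in items:
            inputs.extend(self.vector_tables[symbol_type][internal_id])
        return self.feed_forward(inputs)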

Before considering alternative network designs and training regimes, we will first present

a single network all the way through to evaluation. This will make it much easier to contrast

alternatives later.

7.9.1 The initial tag network

The first network evaluated uses only tag vectors (no enumerated tags). It uses twenty

thousand non-unique training instances, corresponding to roughly twelve thousand unique

instances. Training was given a victory criterion of 0.10. A graph of the error reducing as

hidden nodes are added is included in Figure 7.8. This figure has a number of interesting

properties. Firstly, the error consistently decreases with time, and while it is approaching

an asymptote, it is not approaching it especially fast, which is what we would expect given that the problem is particularly hard. There are two conclusions we can make from this: firstly, it

means the problem surface is amenable to neural network learning, and secondly it appears

that the problem simply takes a long time to learn. This is, of course, concerning for our

other networks since they are up to twice the size, and somewhat more complex.

As can be seen from the graph, victory was not quite achieved: the final network, with six hundred hidden nodes, has an error index of 0.1054. Even partially training the


tag network is an extremely slow process. Just loading the training data file before learning

begins takes over a minute, and additional units are trained and added to the network in ap-

proximately linear time, so the six hundred node network took seventeen hours to build on

a two gigahertz AMD CPU. Curiously, using a one gigahertz IBM CPU instead resulted in

almost identical training times. Whether this is because of the IBM machine’s better mem-

ory bandwidth, the use of altivec, or other issues was not investigated since we are only

interested here in the final network.

Since the network was unable to reduce the error to zero, it is obvious that there are some

values which the network is still getting ‘wrong’. A single error index is not an especially

useful indicator of these, so in order to get a better visual picture, a scatterplot of the output

of the network and the values genprob produces is presented in Figure 7.9.

Figure 7.9: Scatter plot of output in the tag network using six hundred

hidden units against genprob’s output

This figure contains some interesting information — for instance the odd horizontal lines

between 0.2 and 0.45 (indicating that the network generates certain values for a relatively

large range of inputs) and the greater error at the extremities of the function. Certainly it

seems that the basic shape of the function has been learned correctly. However, recall that

there is a much more effective method of evaluating the network, which is simply to test

how well it works at replacing genprob. As described in Section 4.4, the genprob-based

POS tagger obtains 93% and as a baseline figure the tagger can obtain an accuracy of 89.4%

without the use of any historical context. The tag network just described obtains an accuracy


of 87.6%, lower than that which can be obtained by simply using each word’s most common

tag.

Clearly, we need to experiment with alternative architectures and training regimes. These

experiments are presented in Sections 7.9.2 and 7.9.3.

7.9.2 Network architecture

In this section we will select the input format, the number of hidden units, and the output

format.

Number of hidden units

The first trial was to vary the number of hidden units in the tag network. My hypothesis

was that there were too many hidden nodes in the initial network, leading to overfitting

problems, and that a network with fewer hidden nodes would perform better. To test this,

I used the same input and output representations as in the initial network, increased the

number of training instances from twenty to fifty thousand, and experimented with different

numbers of hidden nodes. In Table 7.8 I present the effect of adding hidden units on the tagger's accuracy, along with the network's error.

Tagger      Network error   Tagger accuracy
10 units    0.2269          0.9374
20 units    0.2182          0.9363
40 units    0.2025          0.9343
60 units    0.1910          0.9332
80 units    0.1816          0.9327
120 units   0.1644          0.9311
160 units   0.1489          0.9299
Genprob     N/A             0.9299

Table 7.8: Tagger accuracy as hidden units are added to the neural network

This result is very interesting. Firstly, I have shown that the minimally trained neural

network generalises better than genprob and so obtains a higher accuracy than its training

data. This is an extremely promising result since the parser will be attempting to perform

a similar generalisation. Another important result is that while the neural network was

continuing to improve, this was having no beneficial effect on the accuracy of the tagger.

It may seem counter-intuitive that we should stop training long before the error reaches a

plateau, but the reason this is correct can be seen by examining the error index on the held-out test set. For this data, the error drops rapidly to 0.25 after ten hidden units, and then slowly decreases for about one hundred hidden units before beginning to rise. The conclusion from this is clear: Cascade will begin to overfit after just ten hidden units, and after twenty

or so hidden units the benefit of further training is outweighed by the inflexibility caused

by over-fitting. A graph showing the error index of both the training data and the testing

data is presented in Figure 7.10.

Figure 7.10: Graph of the training set and the test set error as hidden nodes are added to the tag network

To see exactly what the network has learned after twenty units, a scatterplot showing the

output of the network when compared to genprob is presented in Figure 7.11. This figure

shows that the network has learned when to produce a one, and when to produce a zero,

but within these boundaries pays only lip-service to genprob. We again see the horizontal

lines in this figure, so they are likely a property of using Cascade rather than a quirk in the

initial scatterplot. Another interesting property to note in this figure is that the network has

trouble producing values very near to one. This is possibly caused by a hack in the Cascade

code for sigmoidal activation functions which says if the output is near one it should be

treated as one. (A similar hack rounds near-zero outputs to zero, but there is much more

training data in the near-zero case, which helps it maintain accuracy in this region.)

Since almost all outputs from genprob are zero, the network is able to get a relatively

low error rate just by getting these values correct. This is of concern for the parser since

occasional catastrophic errors are known to be problematic. If necessary, the initial network

shows that over-training can be used to eliminate these occasional large errors in exchange for very small errors and no generalisation.

Figure 7.11: Scatterplot of output from the tag network against genprob's output, using just twenty hidden nodes

It is hard to see through the cloud of central values in the scatterplot. To determine if the core shape of the function is correct, a density plot is presented in Figure 7.12. A density plot is identical to a scatterplot except instead of plotting individual points, a colour is displayed whose brightness is determined by how many points are nearby.5 This figure clearly shows that the core of the function has indeed been learned after just twenty nodes; there is a clear (if blurred) leading diagonal in the graph.

5 I was unable to find code to derive a density plot, and so I wrote it myself. My code views the data as a three-dimensional surface where the height is the log of the number of points in the scatterplot within a certain radius. More sophisticated approaches where points are assigned a weighting based on their distance would also be possible but have not been explored here.
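A sketch of the log-count surface described in the footnote is given below. It is my own reimplementation of the idea, not the original plotting code, and the grid size and radius are arbitrary defaults.

import numpy as np

def density_grid(x, y, radius=0.02, bins=100):
    # Brute-force density surface over the unit square: for each grid cell
    # centre, the height is the log of the number of scatterplot points
    # lying within `radius` of it.
    px = np.asarray(x, dtype=float)
    py = np.asarray(y, dtype=float)
    centres = (np.arange(bins) + 0.5) / bins
    grid = np.zeros((bins, bins))
    for i, cx in enumerate(centres):
        for j, cy in enumerate(centres):
            within = (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
            grid[j, i] = np.log1p(within.sum())
    return grid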

Figure 7.12: Density plot of output from the tag network against genprob's output, after twenty hidden nodes have been added

Error measure

We have been using the index method of assessing the network's error because it discourages many small errors. The reason these errors are concerning is that the initial test of noise in genprob (Section 7.1) showed even small amounts of noise are unacceptable. However the bits error method proved more effective than the index error method for the enumeration networks since it has much less of a problem with overfitting.

We evaluated the tagger network using the bits method and obtained only slightly lower performance than that derived using the index method. With ten hidden units, performance dropped from 93.7% to 93.4%, and for twenty hidden units, performance dropped from 93.6% to 93.2%. Since it is likely the bits method will cause more problems in the parser, and it leads to lower performance, we will not be using the bits method. However it is worth

noting that performance dropped much less than expected.

Representation of the input layer

So far we have not included an enumeration of tags on the input layer because the larger

network takes longer to both learn and run once trained. However, it is well worth including

if the final accuracy is higher. I therefore experimented with a network which has enumer-

ated tags in the input layer. This raises the number of input nodes from 94 to 324, with a

corresponding doubling in training time.

The error index for any given number of hidden nodes appears to slightly favour enu-

merated tags. Additionally, the enumerated network asymptotes to a significantly lower

total error (0.060 instead of 0.1). However, neither of these details is particularly important:

by any other metric, such as total number of nodes, number of weights, or training time, the

enumerated network has a higher error rate. Further, we have already shown that networks

with a large number of hidden nodes actually perform worse. The key test is how well it

tags, and the answer to this is (within statistical error) identically. Since the enumeration is

not helping, the idea was abandoned.

Representation of the output layer

The next test is whether a single output should be used, instead of the multiple outputs that

have been used until now. If a single output is used then the target tag will be placed on the

input layer, and the output will correspond to its probability (much as genprob takes the tag

as input and generates a probability). Because this will increase the number of training in-

stances by a factor of fifty without significantly speeding up each epoch, its effect on training

time is problematic. The reason this test is important is that the dependency network cannot

enumerate all possible outputs and so we need to investigate the feasibility of placing the

output on the input layer with a simpler test case.

There is one other major benefit of using a single output representation: the probability

no longer has to be generated in one step. Recall that the distribution of probabilities in

genprob is far from normal, with an extremely high number of zero outputs, quite a lot of

one outputs, and just a few values in between. Essentially, the network is initially trained

to produce zero, and then under which circumstances to produce one instead of zero, and

finally when to produce another output. This is a lot of steps for the network to learn, and it


makes sense to split training instead into two subtasks — deciding if the output is zero, and

if not deciding what the output should be. This technique of splitting complex tasks into

logical subtasks has been shown to be very effective in solving complex problems in neural

networks (see for example, Rueckl, Cave, and Kosslyn (1989)), although it is more normal to

combine the multiple networks in parallel rather than serially.
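In outline, the two-subtask arrangement looks like the following sketch, with one network acting as a zero/non-zero discriminator and a second estimating the non-zero values. Both networks are passed in as plain functions here, and the threshold is an assumption.

def split_probability(discriminator, nonzero_network, inputs, threshold=0.5):
    # Stage one decides whether the probability is (approximately) zero;
    # only if it says 'non-zero' is the second network consulted for a value.
    if discriminator(inputs) < threshold:
        return 0.0
    return nonzero_network(inputs)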

Testing the splitting hypothesis was done by training both sub-networks. The non-zero

network was able to learn quite accurately, as was the discriminator network. When the

discriminator is supposed to predict zero it does so two-thirds of the time, and when it

is supposed to predict non-zero it does so over 99% of the time. (Training patterns were

weighted to discourage false zeros since the non-zero network correctly produces zero about

95% of the time, as shown on the density plot.)

However, integrating the networks with the tagger was less successful. Recall that the

tagger’s accuracy with multiple outputs is around 93%. With a single output in the net-

work, and using a similar number of training instances, tagging accuracy dropped to just

5% (about twice as good as guessing randomly). Investigating why this is the case, a hack

was added to the code causing the probability estimate from the non-zero network to be re-

placed by an estimate from the multiple-output network. This gave an acceptable accuracy

of 92.7%, showing that all the problems are in training the non-zero network.

It is likely the plummeting accuracy is due to the significantly reduced training data. To

test this, the network was trained with half a million training instances (the limit of Cascade,

and yet corresponding to just ten percent of the training data with multiple outputs). This

test resulted in a final accuracy of 89%. Based on this result we conclude that using a single

output is undesirable.

Another potential benefit of using multiple outputs is that it becomes much easier to

ensure that all probability distributions sum to one. Curiously, I found the inclusion of this

normalisation resulted in a slight drop in performance (about half a percent) so, while it is

desirable from a theoretical perspective, it has been left disabled for performance reasons.

Vector length

The length of the vectors has been based on the output from SVD. When SVD sorts the

dimensions, it outputs not just the sorted dimensions, but the amount of information con-

tained in each dimension. Using this information, I have cut off the vectors at the point

where the information content of new dimensions drops significantly.

However, it could well be that this approach is either too conservative or too liberal. For

example, the tag vectors are likely to be extremely important in the tag network, so it may

well be that we need the less informative dimensions. However the words are probably

principally used as keys in a lookup table, and so we may well only need twenty or so


dimensions rather than fifty.

Due to time constraints, the effects of vector length on the tagger’s accuracy were not

investigated. Instead we simply rely on the output from SVD.

7.9.3 Training data

In this section we will decide on exactly how to choose the training data, how much to use,

and how to use the training data to train the network.

Amount of training data

The networks presented so far have been trained with fifty thousand training instances. In

Table 7.9 I compare the tagging accuracy as I vary the amount of training data.

Tagger    Network error   Units   Tagger accuracy   Training time
10k       0.1873          20      0.9295            12m
20k       0.2124          20      0.9347            17m
50k       0.2182          20      0.9360            1h
100k      0.2212          20      0.9396            2h10m
200k      0.2241          20      0.9353            8h58m
500k      0.2205          40      0.9359            10h15m
Genprob   N/A             N/A     0.9299            N/A

Table 7.9: Tagger accuracy as extra training data is provided to the neural network

This table shows that we can continue to see performance improvements in the tagger

up to about one hundred thousand training instances. Beyond this results are less obvious

since they depend more on the number of hidden units used, but they do not appear to be

getting better.

Incremental training

For the larger networks, such as dependency, it will not be feasible to train them on all of

their data right from the start. Therefore in Table 7.10 I present the same experiment as in

Table 7.9, but this time the training starts from the learned weights of the previous network

rather than from scratch. In neural network literature this is known as serial learning and is

generally an extremely ineffective approach. Since my data is much less contradictory than

is typical in neural networks, I am hoping for more success.


The publicly available implementation of Cascade does not support loading and sav-

ing of weights during training, but I implemented it using only slight modifications to the

source code. A side effect of how Cascade trains requires each of these networks to have

more hidden units than the previous row, which quickly leads to very large networks.

Tagger    Network error   Units   Tagger accuracy   Training time
10k       0.2011          10      0.9272            7m
20k       0.2130          20      0.9311            16m
50k       0.2150          30      0.9355            36m
100k      0.2116          40      0.9367            1h20m
200k      0.2283          50      0.9360            2h45m
500k      0.2108          60      0.9353            4h50m
Genprob   N/A             N/A     0.9299            N/A

Table 7.10: Tagger accuracy as extra training data is incrementally provided to the neural network

While these results are not especially interesting for the tagger, they are important for the parser.

It means that for the huge and very complex networks such as dependency, we can safely

train the network on a small number of training instances and then increase the training

data. Another interesting result is that ten neurons appear to be not quite enough to correct the errors from the previously trained network, implying that we have been overfitting the data.

An alternative interpretation of incremental training is to train a second network on the output from the first network, since the first network outperformed genprob. Roughly, the idea is that generalisation can be bootstrapped. Predictably, however, this approach did not work, with performance dropping to 89%.

Using unique training data

Training data has so far been generated by sampling the genprob function. Since the gen-

prob function is frequently called with the same parameters, this means the sample has a

significant number of duplicate entries. A sample of one-hundred thousand calls resulted in

just over six thousand unique calls. This may be a good thing because it will encourage the

network to assign more importance to data that occurs frequently, which is why it has not

been discussed up until now. It may also be a bad thing, since it will both cause overfitting

and will result in training on just a tiny sample of the space at the optimal training size of

around one-hundred thousand training instances.
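As an illustration of how the sample could be separated into the two kinds of data discussed here, the sketch below writes every logged call to a 'duplicate' pool and only first occurrences to a 'unique' pool; a training set can then be drawn from the two pools in any proportion. The file names and the naive linear duplicate check are hypothetical, not the actual implementation.

    /* Sketch: split a log of genprob calls into `unique' and `duplicate' pools. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_UNIQUE 200000

    int main(void) {
        static char *seen[MAX_UNIQUE];
        int n_seen = 0;
        char line[1024];
        FILE *in   = fopen("genprob.log", "r");
        FILE *uniq = fopen("train.unique", "w");
        FILE *dup  = fopen("train.duplicate", "w");
        if (!in || !uniq || !dup) return 1;

        while (fgets(line, sizeof line, in)) {
            int i, is_new = 1;
            fputs(line, dup);                  /* every call goes to the duplicate pool  */
            for (i = 0; i < n_seen; i++)       /* naive scan; a hash table would be used */
                if (strcmp(seen[i], line) == 0) { is_new = 0; break; }
            if (is_new && n_seen < MAX_UNIQUE) {
                seen[n_seen] = malloc(strlen(line) + 1);
                strcpy(seen[n_seen++], line);
                fputs(line, uniq);             /* first occurrence only */
            }
        }
        fclose(in); fclose(uniq); fclose(dup);
        return 0;
    }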

To determine which hypothesis is correct, we tested the network using only unique training data. Unfortunately, performance dropped from ninety percent to just thirty percent, and the absolute error in the network also jumped. Since the absolute error had increased, training was continued until the error reached a value similar to that of the duplicate-data network. This produced a more acceptable 86.5% accuracy after thirty hidden units. While this is much worse than the

network with duplicate data, it can be shown to make fewer gross errors and so we do not

want to abandon the idea of unique data quite yet.

Recall that the tagger's probability distribution means the network must produce zero in most instances, and in the instances where it produces a nonzero value it must usually produce one. However, in the unique data far fewer of the probabilities are zero (just 70%). It is therefore likely that the unique network was being less careful about these zero values than the network with duplicate data. This is confirmed by the scatterplot in Figure 7.13, which shows that the network does quite well over all areas of the input rather than concentrating on zero and one.

Figure 7.13: Scatter plot of output from the tag network against genprob’s

output, using unique training data

As noted previously, the duplicate network does well at zero and one but takes a long

time to accurately map the rest of the function. This scatterplot shows a quite different result

with the network performing much more evenly. Perhaps the best approach is a compromise

where half the data comes from duplicated events and half is unique. The results from this

mixed approach are presented in Table 7.11.

Training data  Units  Network error  Tagger accuracy
50k            10     0.2876         0.933
50k            20     0.2763         0.932
100k           10     0.2870         0.932
100k           20     0.2729         0.931
200k           10     0.2351         0.937
200k           20     0.2279         0.929
400k           10     0.2888         0.909
400k           20     0.2748         0.904

Table 7.11: Performance of different taggers using half unique and half duplicate training data

Clearly we have corrected the initial problem with using unique training data: our performance is close to that obtained with only duplicate data. But what have we gained? While the

tagger’s accuracy is slightly lower than we were able to achieve using duplicated training

data, this does not tell the whole story. A scatterplot of this semi-unique data in Figure 7.14

shows the network has much tighter control over all areas of the output rather than just

around zero and one. This approach is therefore the safest when we wish to avoid major discrepancies with genprob, and so we will mix unique training data into future networks.

Training with raw data

By far the most ambitious experiment was to eliminate genprob from the loop and train on

the raw event file. We have shown that the tag network can be trained from genprob's output, but can it also be trained directly from the raw data?

Every line in the event file can be viewed as an event occurring with a probability of one,

and all the alternative tags occurring with a probability of zero. This interpretation gives

exactly the same file format as the multiple outputs from genprob and so it is worthwhile at

least trying to train directly on the raw file. The benefits of successful training are obvious:

any weaknesses in the smoothing function are eliminated, and future extensions do not have

to be simulated in genprob before being converted to a neural network, which is particularly

beneficial if they cannot be simulated in genprob.
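The sketch below illustrates the reinterpretation just described: the observed tag receives a target of 1.0 and every alternative tag a target of 0.0. The line format shown (word, observed tag, then the candidate tags) is purely illustrative; it is not the actual event-file format.

    /* Sketch: turn a raw event into (input, target) training pairs. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[] = "loves VBZ VBZ NNS";     /* word, observed tag, candidate tags */
        char *word     = strtok(line, " ");
        char *observed = strtok(NULL, " ");
        char *cand;
        while ((cand = strtok(NULL, " ")) != NULL) {
            double target = (strcmp(cand, observed) == 0) ? 1.0 : 0.0;
            printf("input (%s, %s) -> target %.1f\n", word, cand, target);
        }
        return 0;
    }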

The obvious flaw in this approach is that we are training a network to learn a function

mapping (or, technically, a relation) that is impossible to learn. Since the probability model is

necessarily incomplete, a given input will sometimes lead to one output and at other times

will lead to another. We are hoping that the neural network will average between these

outputs, effectively producing a probability. Setting this concern aside, converting the event file into neural network training patterns was straightforward, requiring just a tweak to the code that adds

unary events to the hash-table.

Figure 7.14: Scatter plot of output from the tag network against genprob's output, using a mix of unique and duplicate training data

There are one million training events available in the training file (because every word

generates a new event). Of these, a quarter of a million are unique and the rest are duplicates. The only way, then, to include all unique events and still have a representative sample of duplicates is to go to the largest possible network (400k training patterns). Surprisingly,

this network was very fast to train; after three hours it had added ten hidden units and was

only about 0.01 away from the asymptote with an error of 0.346. Testing this network results

in a final accuracy of 94.8%. This is the best result achieved and so it is very positive that it

occurs with our most ambitious experiment.

7.9.4 Conclusion

The vector representations seem to be adequate for learning neural representations, which

means we should be able to train the other networks to replace genprob. The network’s

output was better than genprob in most of the tests, so it seems likely we will be able to

improve on genprob in the parser.

Apart from a feasibility study, the goal of this section was to determine good parameters

for training later networks. In this regard we found that multiple outputs outperformed

single outputs, and so we will favour them whenever possible, but that the inclusion of

enumerated inputs is unnecessary. The index error method seems the most appropriate, but


the bits method is almost as good.

For the training data we found that around one hundred thousand training instances was

optimal although this number can be varied quite significantly without a large change in

accuracy. Just twenty hidden units seems more than adequate, with a near optimal number

being easily derived from the local minima for the error index of the test data. We also

found that where necessary, we can save a lot of training time at only a slight error cost by

training incrementally. Finally, we found that using genprob to derive our training data is unnecessary, at least for this simple case, and that we can obtain better results by learning the mapping directly from the raw data.

7.10 Training the other networks

While the tag network is easy to evaluate in isolation, this is not the case for the other net-

works. So we will assume what worked best for the tag network will also work best for

them.

In the previous section we found that while the neural network's error continues to drop during training, this does not lead to better performance, and it is more desirable to stop training when the evaluation on the cross-validation data approaches its asymptote (around 20 hidden units). We also found that the more data used, the better the results,

and so the absolute maximum amount of training data is used in all cases. Because this

amount of training data would make initial learning impossible in several cases, we have in

those cases pre-trained the network on a smaller sample and then tuned it on the full set of

training instances.

There are five different networks to be trained, and their training will be discussed in-

dividually below. In each case we will compare the parser's performance against that obtained with genprob on just the first hundred sentences of Section 23 of the WSJ, for which genprob gives a precision of 85.2% and a recall of 85.6% (remember that precision and recall both drop slightly over the whole of Section 23).

7.10.1 Training the prior network

The role of the prior network is to predict whether the current edge is likely to be part of

the global parse, or if it is leading the parser up the garden path. It is perhaps one of the

least interesting since its output does not directly affect the parse, but it is important in that

any errors will make parsing virtually impossible, since a poor prior probability will result

in immediate discarding of the edge, and too many high prior probabilities will swamp the

beam and thereby cause parse failure. Because the network is so important, we need to be

more careful than we would normally be in training a network that has only three hundred


units.

Prior probabilities have a very different distribution to other probabilities, resembling a

gamma distribution. Because of this, the zero/non-zero distinction is inappropriate and the

multiple-output approach is best. In the multi-output approach we must select a single out-

put to generate a probability distribution over and, somewhat arbitrarily, the nonterminal

was chosen. This leads to a network with two hundred units before the inclusion of hidden

units.

Another complication with the probability distribution is that even in the few instances where it is non-zero, the probabilities are very close to zero. About 95% of the probabilities are so close to zero as to be indistinguishable from it, and well over 99% are less than 0.002. This level of

accuracy is likely to be hard to achieve in the neural network. One final complication is that

calls to the prior network are highly redundant; over a million calls produced just thirty-two

thousand unique calls.

Ignoring these concerns, building the network for training is simply a matter of sampling

genprob, just as it was with the tagger. In parsing the first fifteen sections of the WSJ, only

thirty thousand unique training instances were found and so these were all included in

training, and supplemented with one-hundred thousand duplicate entries to provide a little

extra training data and an equal amount of testing data.

Training of this network stagnated immediately, with the test data approaching an asymp-

tote after just five hidden units and reaching a minimum error after thirty hidden units.

When the parser was run with the network version of prior, it was apparent that the beam

was overflowing with too many low probability nodes, so the parser code was modified

to round approximately-zero probabilities down to zero. With this modification, the parser achieved 86.0% precision and recall, which is slightly better than the performance of Collins' genprob version of prior, although the difference is well below statistical significance. The network obtained sim-

ilar scores when trained from the raw data, rather than from genprob output.
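As an illustration of the modification just mentioned, the sketch below rounds near-zero prior probabilities down to zero before they reach the beam. The threshold value is hypothetical, not the one used in the parser.

    /* Sketch: discard approximately-zero priors so they cannot swamp the beam. */
    #include <stdio.h>

    static double clamp_prior(double p, double threshold) {
        return (p < threshold) ? 0.0 : p;
    }

    int main(void) {
        printf("%g\n", clamp_prior(3e-7, 1e-5));   /* prints 0     */
        printf("%g\n", clamp_prior(0.004, 1e-5));  /* prints 0.004 */
        return 0;
    }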

7.10.2 Training the top network

The top network is somewhat similar to the prior network. As with the prior network it is

simple, with just one word, one tag, and one nonterminal involved in each production. It

also has a similar probability distribution although slightly easier to manage as the highest

probability is 0.1 instead of 0.02.

The role of the top network is to determine when a parse is complete, since the WSJ uses sentence nonterminals both for complete sentences and for encapsulated sentences. This does not matter to the neural network, which is only concerned with the inputs, outputs,

and their mapping. In this case there is only one tag, one nonterminal and one word in-

volved in the network, for a total of two hundred units. Multiple outputs were used with


the nonterminal on the output, so the network has roughly the same number of inputs as

outputs.

As with the prior network, the first fifteen sections of the WSJ lead to just thirty thou-

sand unique calls to genprob, and so these were supplemented with one-hundred thousand

duplicate calls to make sixty-four thousand patterns for both training and testing.

When the parser was run using the network version of top, with ten hidden nodes, it

obtained precision of 85.4% and a recall of 85.8%. Using more hidden nodes resulted in

the same precision and recall, as did increasing the amount of training data. This result is

about the same as obtained by the hash-table approach; it is slightly lower but with only

one hundred sentences we cannot say whether the difference is statistically significant. I also experimented with training the top network directly from the event file; performance dropped slightly, to 83%. Since our purpose here is to find out what works, we will defer

more detailed analysis to Section 7.11.

7.10.3 Training the unary network

The unary network differs from the top network in three ways. Firstly, it is more compli-

cated; compared to the top network it takes an extra nonterminal (the parent to produce),

leading to about three hundred input units. Secondly, it has a wider range of inputs; the

range of words, tags and nonterminals is significantly higher than that of the prior and top

networks. (Previously we used the full set of unique training instances; for the unary net-

work we have to use a subset of the full set.) Finally, while the top network is only used

once per sentence in sorting the parses, the unary network is used about once per word, so

the quality of the unary network is much more important to the parser than either of the

previous two networks. In effect, this is the first ‘serious’ network we have trained.

For the first attempt, a total of one-hundred thousand training patterns were used, fifty

thousand from the unique subset and fifty thousand from the duplicate subset. An equiva-

lent number of both were used to form the testing patterns. However, watching training, it

became clear that there was insufficient training data, since evaluating on the test patterns gave a similar error to the training patterns rather than asymptoting. Even with large numbers of hidden units, Cascade learnt only the outline of the function, as shown in Figure 7.15. In this figure, the horizontal banding that has been noted before is much more pronounced.

Despite this, precision and recall of the parser using the unary network were quite good at

84.6% and 84.7% respectively (regardless of the number of hidden units tested). Given the

weakness in the graph, and the fact that cross-validation had not asymptoted, it makes more sense to

increase the amount of training data rather than to increase the number of hidden units. The

maximum increase is a factor of four, which both uses all the unique data at the current ratio

and also is just shy of Cascade’s limit (since the network file is just under C’s 2GB limit).


However, this increase did not change precision or recall significantly.

A separate property first noticed in this network is a large increase in parsing time. The

parser was previously slightly slower than one sentence per second but when using Cascade

for all unary evaluations the time per sentence increases to roughly five seconds. This is un-

fortunate, and will be particularly annoying when evaluating the dependency network, but

there is little that can be done about it. Another interesting property is that the network with

fewer hidden nodes actually performed slower. Since the feed-forward operation is slightly

more complex than linear on the number of hidden nodes, this result is somewhat counter-

intuitive but is perhaps caused by the increased ambiguity in the less precise network.

Training the unary network from raw events proved significantly harder; with 400k

training patterns it took eleven hours just to add ten hidden nodes. However, it did appear

to learn, with the error index coming out lower than it did with genprob’s data. Evaluat-

ing on the one hundred sentence corpus gives a precision of 85.1% and a recall of 85.0%,

showing that we have replaced genprob with a neural-network-based function that gives

approximately the same performance. Since this network is not derived from genprob, it is

interesting to graph it against genprob, particularly to see how tightly it follows genprob. This graph, presented in Figure 7.16, clearly shows that the neural network is reproducing genprob only in the most cursory manner, and yet we know it performs as

well as genprob. Either unary events are irrelevant or else we do not have to follow gen-

prob closely to get good results. If the latter, then the conclusion from the preliminary noise

experiment described in Section 7.1 was incorrect.

7.10.4 Training the subcategorisation network

The subcategorisation network determines the number of arguments each new phrase takes.

Since this is rarely obvious from the head alone, I had anticipated that it would be one of the hardest networks to train.

Subcategorisation events also introduce a new type of output: how should the set of all

subcategorisation frames be enumerated? The solution I decided to use was to enumerate frames by the count of each category, up to a maximum of six noun phrases, two sentences, two SBARs, one verb phrase, and one 'other'. Allowing for the zero counts, this gives 7 × 3 × 3 × 2 × 2 = 252 different outputs, making this the largest network trained so far.
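One way to realise such an enumeration is to pack the five counts into a single index with a mixed-radix encoding (bases 7, 3, 3, 2, 2), as in the sketch below. The ordering of the categories is an assumption made purely for illustration; the thesis does not specify the exact index order.

    /* Sketch: mixed-radix index over subcategorisation-frame counts. */
    #include <stdio.h>

    static int subcat_index(int np, int s, int sbar, int vp, int other) {
        return (((np * 3 + s) * 3 + sbar) * 2 + vp) * 2 + other;
    }

    int main(void) {
        printf("total outputs: %d\n", 7 * 3 * 3 * 2 * 2);           /* 252 */
        printf("index of {2 NPs, 1 S}: %d\n", subcat_index(2, 1, 0, 0, 0));
        printf("largest index: %d\n", subcat_index(6, 2, 2, 1, 1)); /* 251 */
        return 0;
    }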

Only twenty of these outputs were actually seen during training, but that just means

the neural network learns to always produce zero on some outputs which is extremely easy

to learn. An alternative approach would be to enumerate the outputs in the order they

were seen during training. This would give the same result and use significantly fewer

connections, but it would mean that any extension to the probability model would break

the enumeration.


Figure 7.15: Scatter plot of output from the unary network against gen-

prob’s output, using 100k training patterns and eighty hidden nodes

Figure 7.16: Scatter plot of output from the unary network against gen-

prob’s output, using the network trained directly on the raw event file


Training the subcategorisation network was not initially successful, with Cascade timing

out instead of stagnating. After a full day of training and having only added six hidden

units, it finally started stagnating. Even then the network’s training would best be described

as minimal, as shown in Figure 7.17.

Figure 7.17: Scatter plot of output from the subcat network, trained with ten thousand events and ten hidden units

Predictably, this network does not work well at all, resulting in parser precision of 51.7% and recall of 52.2%. Clearly we have to do something different. A first attempt (which proved overambitious) was to try training on the raw event file, but this caused precision and recall to drop even further (to 25%).

Recall from the tag network that where training is too hard, it is possible to train smaller

networks using serial learning. Instead of training directly on half a million patterns, train-

ing commenced with just 2k patterns, and then the weights from this network were used

to train a network with 8k patterns, then 32k, then 128k, and finally 512k as before. This approach resulted in stagnation rather than timing out, implying that training was working, and led to a final precision of 77.2% and a recall of 75.8%.

In summary, the subcategorisation network seems trainable, but is at the limit of what

can be trained using the representation and training schemes we have considered so far.

Exploring variants on these schemes is likely to improve performance, but further variants will not be considered in this thesis.

7.10.5 Training the dependency network

The dependency network predicts when two phrases should be combined. Since it is effec-

tively generating the new daughter phrase, it is impractical to enumerate possible outputs

and we must resort to the zero/non-zero approach.
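As a reminder of how the zero/non-zero scheme can be applied at query time, the sketch below first consults a 'zerotest' classifier and only evaluates the value network when a non-zero probability is predicted. Both network calls are hypothetical stubs.

    /* Sketch: combining a zerotest classifier with a value network. */
    #include <stdio.h>

    static double zerotest_net(const double *in) { (void)in; return 0.8; }  /* stub */
    static double value_net(const double *in)    { (void)in; return 0.03; } /* stub */

    static double dependency_prob(const double *in) {
        if (zerotest_net(in) < 0.5)   /* classifier predicts `zero' */
            return 0.0;
        return value_net(in);         /* otherwise estimate the probability */
    }

    int main(void) {
        double features[4] = {0.0, 0.0, 0.0, 0.0};
        printf("p = %g\n", dependency_prob(features));
        return 0;
    }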

This network is by far the most complex. Not only does it have the most parameters (as many as 566), it is also called at least as often as the unary network. Finally, the parameters it is called with show much more variation; the training file leads to half a million unique events, over twice the number of unique unary events. This means that we cannot use our earlier approach of including all unique events to span the function's range, supplemented with a sample of duplicate events to emphasise dense areas.

Despite this complexity, it was relatively simple to train the zerotest network. A total of

133k training patterns was used (sampled randomly, with no emphasis on unique patterns)

and training the network until the test set stopped improving took twelve hours and fifty

hidden units. Evaluating these partially trained networks shows that after five hidden units

the network makes the correct prediction about 89% of the time, and with forty-five units

this rises to about 94% of the time.

For the non-zero network, things are not quite so simple and we must plan the represen-

tation more carefully. One possible improvement is in the representation of subcategorisa-

tion frames. The two candidate representations are: a simple enumeration using twenty-one nodes, or representing the count of each nonterminal type using eighteen nodes. The decision about which is best depends on how much data is shared between different events and, since we do not know how much is shared, we tried both and found the first representation

was slightly better.

Apart from changing the representation, we could alternatively split the training process

into two subtasks, in the same way as Collins does. Collins first generates the dependent

nonterminal and tag (‘dep1’) and then the dependent word (‘dep2’). The problem with this

simplification is that neither dep1 nor dep2 can use enumerated output, and so it turned out to give slightly worse results.

Despite additional complexity caused by all the different parameters, training on the

network proceeded quite well. With forty thousand training patterns, the network error

approached an asymptote of about 0.2 after twenty hidden nodes. Using the network with

twenty hidden nodes, the parser was very slow. This is because the dependency network

is used as often as the unary network but is unable to benefit from the enumerated output, which reduces parsing time by an order of magnitude. Setting speed aside, precision and recall were 76.0% and 79.9% respectively. While this is a drop compared to genprob, it is quite

reasonable given the complexity of training the model, and given the less than perfect word

and nonterminal representations. It thus serves as a good proof-of-concept for the feasibility


of a neural network model of genprob.

7.11 Final evaluation

We have been able to replace all calls to genprob with neural networks, but the performance

of those networks has been mixed; some of the networks work as well as or better than

genprob, and some work less well. It therefore makes sense to retain genprob in the latter

cases. We thus need a way of allocating responsibility between genprob and the networks.

There are cases where we can be very confident that the value returned by genprob is

accurate, such as when the numerator and denominator at level one are high. In these cases,

there is no advantage in using the neural network and so, unless the neural network is a perfect replacement, we do not use it. By smoothing between the value returned by genprob and the neural network's estimate, we can minimise any weaknesses in the neural network. The equation I used to derive the confidence in genprob's value is:

confidence = min(1.0, denom/50) (7.1)

where denom is the denominator count at backoff level one.

This equation was derived by examining the cumulative distribution function of the derived distribution and adjusting the scaling constant until the hash table averaged a confidence of 0.5. It is possible that a more sophisticated method would lead to better results, although empirically I found very little variation.
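The sketch below shows one way the two estimates could be combined using the confidence from Equation 7.1. The linear interpolation is an assumption made for illustration; the text says only that the two values are smoothed together.

    /* Sketch: blend genprob's estimate with the network's estimate. */
    #include <stdio.h>

    static double confidence(double denom) {
        double c = denom / 50.0;        /* Equation 7.1 */
        return (c > 1.0) ? 1.0 : c;
    }

    static double smoothed(double p_genprob, double p_net, double denom) {
        double c = confidence(denom);
        return c * p_genprob + (1.0 - c) * p_net;   /* assumed linear blend */
    }

    int main(void) {
        printf("denom = 10:  %g\n", smoothed(0.40, 0.30, 10.0));   /* mostly the network */
        printf("denom = 200: %g\n", smoothed(0.40, 0.30, 200.0));  /* genprob only       */
        return 0;
    }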

With this smoothing in operation, evaluating the final system on the first half of Section

23 of the WSJ results in a precision of 80.02% and a recall of 80.40%. This contrasts with genprob, which obtained a final precision and recall of 85.1% and 85.2% on the same test. Clearly the neural-network-based parser is working, but not as well as Collins' parser.

It is also interesting to consider the system’s performance on the two-hundred ‘hard’ sen-

tences featuring rare headwords. For these sentences, the system's precision and recall are almost identical to its overall results, at 80.0% and 80.4%. This compares to Collins' 83.0% and 83.2%. While this result is still lower than Collins', it is interesting to note that performance has not dropped at all on the harder sentences, where Collins' system dropped by two percent. This lends some weight to the hypothesis that the neural network approach generalises better than Collins' parser

on rare words. If the networks could be improved to reach Collins’ base level performance,

there are prospects for exceeding Collins’ performance on sentences containing rare words.


Chapter 8

Conclusion

The work described in this thesis has involved the creation of three large programs and

many small programs. Tracing how the various components of these systems tie together is complicated, so to aid the reader in following the summary, a somewhat simplified data flow diagram for the entire thesis is presented in Figure 8.1. Relating the figure to the

thesis itself, the rightmost branch corresponds to Collins’ system as described in Chapter

4; the leftmost branch corresponds to the conversion of words to vectors as described in

Chapter 6; the bottom corresponds to the integration of the two principal components by

way of training neural networks as described in Chapter 7; finally the middle of the figure

corresponds to the reuse of the vector code from Chapter 6, with the reuse itself discussed

in Chapter 7.

In the remainder of this chapter, I summarise the work completed in the thesis, and list

some avenues for future work.

8.1 Summary

This thesis began by introducing the topic of statistical parsing, and surveying the current

state of the art. Within the field of statistical parsing I examined a number of different ap-

proaches and noted that the approach taken by Michael Collins obtains the best results.

While Collins’ approach performs well on the WSJ, it is not especially well suited to gener-

alising to other domains and I hypothesised that this problem was due to its backoff algo-

rithm, particularly how it backs off words. I reimplemented Collins’ parser, principally so I

had a parser I knew perfectly that would be suitable for later modifications.

Having settled on word backoff, I surveyed the field of word clustering and decided to

extend one of the oldest approaches, that of Hinrich Schutze, as the most suitable since it

could be scaled to a full lexicon. I reimplemented Schutze’s work, including my extensions

to support a lexicon far too large to process directly, and explored different parameters and

their effects on the quality of results generated.

Figure 8.1: Data flow diagram for the entire thesis (simplified)

Returning to the parser, I incorporated the word clustering first using a simple neigh-

bours approach and then using a much more ambitious neural network approach. The

neighbours approach is simple and assists the parser slightly when generalising, at a small cost in places where counts are already high. The net effect is approximately zero on the

WSJ, but it is likely the better generalisation would be helpful in other domains. Replac-

ing probabilistic backoff with a neural network is an approach that has significant potential.

If successful, it would provide innumerable benefits since it would change the process of

developing a language model from a complex process full of compromises to a matter of

simple exploration. It is also a large step towards fully automatic learning in the statistical

parser, which would eliminate our current dependence on the WSJ. While I was able to get

most of the intermediate results necessary, I was unable to get all of them. Until the other re-

sults can be generated, perhaps through better input or better training, the resulting neural

network will not perform as well as the current statistical system.

There are numerous areas in which this thesis incorporated new ideas, and in many ways

it was the interaction of all of these which makes it hard to say categorically which aspects of

the new system work well, and which need to be rethought. Work was largely divided into

three main areas: reimplementing Collins’ parser, generating word vectors, and integrating

the word vectors. These are discussed in turn below.

8.1.1 Implementing Collins’ 1997 parser

There is very little work in the literature describing the complexities of implementing a statistical parser; most papers concentrate instead on justifying differences in their probability models. Despite this, as Bikel (2004) and Klein and Manning (2003) have both shown, the im-

plementation details can easily make more difference to the parser’s performance than the

probability model. In this thesis, I have explained how the parser has been implemented in

enough detail that it can be used as the basis for building another parser or experimenting

with the effects of different implementations. The parser I built performs as well as Collins’

1997 parser.

The implementation itself has a modular design, so it is easy to read and modify. It also

includes several new ideas. Most significant is a new data structure for implementing beam

search that significantly outperforms heap-based approaches for extremely large beams.

This is useful both within statistical parsing, and in other areas of artificial intelligence.

The chart data structure also demonstrates how peculiarities of the problem being solved

can be taken advantage of to provide a much faster implementation. From a programming

perspective, the use of a macro language provides a method for reducing redundancy in the

same way as a function does, in situations where it is impractical to use a function. Finally,


locking data structures so they become read-only in order to detect pointer-related bugs is an

extension over the standard technique of simply locking the pages before and after arrays.

8.1.2 Word and nonterminal representations

While the field of unsupervised thesaurus generation is very extensive, virtually all of the

techniques concentrate on obtaining the best possible results over a small lexicon rather than

developing good representations for all words.

I have concentrated on one existing technique, that of using singular value decomposi-

tion on a matrix of bigram counts. I have demonstrated how to extend this technique for an

arbitrarily large lexicon, and my extension will continue to produce better results as comput-

ers become more powerful. Additionally, I have shown a number of ways that the technique

can be tuned in order to produce different representations, depending on intended use of

the representations.

This phase also included the development of vector representations for nonterminals

and tags, a task that I have not seen tackled anywhere in the literature. Despite this, the

results showed good generalisations between syntactic structures.

8.1.3 Experiments in using word vectors for backoff

An initial experiment showed that a nearest-neighbours approach to grouping words for backoff in parsing was basically as good as Collins' existing backoff scheme, with some small improvements over Collins' scheme for sentences featuring rare headwords. But the more inter-

esting experiment involved the use of neural networks to implement an entirely new backoff

technique.

While neural networks are frequently referred to as function approximators, and are well known for their excellent interpolation, the idea of using them instead of hash-tables in backoff is new. This idea has considerable merit: neural networks' distributed representations mean they require only a tiny fraction of the RAM that a comparable hash-table approach uses. Furthermore, they can

take an arbitrarily large number of inputs and so give the researcher much more flexibility

in designing the probability model. Finally, the problems of deciding how to discard data

are completely eliminated by the use of a neural network. The final neural network version

of Collins’ parser was quite successful; when replacing backoff in the POS tagger, the neu-

ral network performs slightly better than the hash-table based approach, and in the parser

it performs equally well in all but two of genprob’s sub-functions. There is considerable

promise that networks will be able to train directly from raw WSJ events. While it would

be desirable for the neural network to outperform hash-tables in the parser, the close result


justifies the technique used and implies that with tweaks to word representations, network

designs and/or training schemes, it should be possible to outperform genprob.

8.2 Further work

There are many further improvements which could be made to the system described in this

thesis; I outline the most interesting of these below.

8.2.1 Reimplementing Collins

The most obvious and easiest method of significantly improving results would be to replace

Collins’ 1996 parsing model with his later 1999 model. This change is likely to be relatively

simple since the models do not differ very much. However, there are a number of extensions

over Collins’ approach that are worth considering.

Parsing as search

Recall that the parsing algorithm is implemented as two loops. An outer loop iterates over

every possible span, while an inner loop expands edges to add parents and grandparents.

This inner loop uses AI search techniques to find the best edge quickly, but the larger outer

loop does not, leading to a lot of wasted time that is addressed in this optimisation and the

next one.
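The schematic below illustrates this control structure: an outer loop over every possible span, with an inner search within each span. The inner search is reduced to a stub here and the sentence length is arbitrary; this is a sketch of the structure described above, not the parser's actual code.

    /* Schematic: outer loop over spans, inner search within each span. */
    #include <stdio.h>

    #define N_WORDS 10

    static void search_edges_in_span(int start, int end) {
        /* stands in for the inner AI search that expands edges to add
         * parents and grandparents within this span */
        printf("searching span [%d, %d)\n", start, end);
    }

    int main(void) {
        for (int length = 1; length <= N_WORDS; length++)           /* all span lengths,  */
            for (int start = 0; start + length <= N_WORDS; start++) /* shortest first     */
                search_edges_in_span(start, start + length);
        return 0;
    }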

It would be natural to treat the entire parsing process as a single AI search. The reason

Collins used many smaller searches is not clear but may be related to his simplistic imple-

mentation of beam search. Using multiple searches also has the advantage of ensuring the

chart always has some nodes considered at each span, while a single search will concentrate

on spans that are generating many edges. So Collins' approach may be better able to work with a poor probability model.

Using a single search would require a complete redesign of the internal parsing algo-

rithm. It is possible that Klein and Manning’s (2002) search approach could be used instead,

but this has not been investigated.

Parsing in chunks

A different approach to the same problem is simply noting that the parser spends most of

its time searching for phrases which cannot possibly be part of the final parse. The reason it does this is that the parser does not recognise that certain parts of the input sentence must form a phrase (a chunk), and that there is therefore no point searching any span that crosses these chunks.
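The sketch below shows the kind of filter this suggests: a candidate span is skipped when it crosses a chunk bracket, i.e. the two spans overlap but neither contains the other. The indices and chunk data are illustrative.

    /* Sketch: reject spans that cross a chunk bracket. */
    #include <stdio.h>

    static int crosses(int s1, int e1, int s2, int e2) {
        return (s1 < s2 && s2 <= e1 && e1 < e2) ||
               (s2 < s1 && s1 <= e2 && e2 < e1);
    }

    int main(void) {
        int chunk_start = 2, chunk_end = 4;                     /* a chunk over words 2..4 */
        printf("%d\n", crosses(0, 3, chunk_start, chunk_end));  /* 1: crosses, skip it */
        printf("%d\n", crosses(2, 4, chunk_start, chunk_end));  /* 0: identical, keep  */
        printf("%d\n", crosses(0, 6, chunk_start, chunk_end));  /* 0: contains, keep   */
        return 0;
    }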


I investigated using a chunking parser such as that of Abney, Schapire, and Singer (1999) before the main parse, to rule out any search whose span crosses a chunk bracket. This approach was discarded because chunking parsers at the time I wrote the statistical parser did not obtain signif-

icantly higher precision than the current statistical parser (recall is unimportant). Since then

some advances in the field have been made; Abney has released a revised version of the

chunking parser that supports dependencies, and Kudo and Matsumoto (2001) have devel-

oped a parser obtaining almost 96% precision. If parsing time becomes more of a problem

then the integration of this could assist significantly.

Dynamic updates

Currently, all access to the probability model is static, which is not a problem if we are going to continue parsing the WSJ but is inappropriate if we wish to parse sentences from other

domains. The psycholinguistics literature has shown for years that people are primed to

prefer phrase structures they have heard recently (for example, Bencini, Bock, and Goldberg

(2002)); it would be useful if the parser could take advantage of priming.

Parsing text with errors

One of the first promises in statistical parsing was its better handling of erroneous text. De-

spite this, no work I am aware of has investigated the performance of a statistical parser on

erroneous text. This seems a shame since a statistical model provides a great deal more in-

formation than the coarse penalty model usually used (for example Min and Wilson (1998)).

8.2.2 Word vectors

The representation of words worked well, but the results from other people such as Dekang

Lin (1998) are better. While we cannot use Lin’s approach here, the success he had strongly

implies that we could also do better.

Integrate tagging properly

The integration of tags with the word vectors improved them significantly, and yet this

integration was actually very crude. There are numerous errors made by this integration,

such as homographs and polysemous words having their bigram counts merged together. This has the regrettable consequence that words which can act as both nouns and verbs are clustered with other such words, when it would be far better to create two separate lexicon entries in each case. It is often clear that Lin's results are better than mine because of this

omission.


Such a solution would be quite easy to implement by tagging the T/G corpus and creating new lexicon entries from each word combined with its POS tag. However, the tagger I developed would be inappropriate for this task since its tagset is slightly too coarse and it is

much too slow. It is very likely that a suitable tagger already exists.
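The sketch below shows the kind of word/POS pairing this would involve, so that noun and verb uses of a word receive separate vectors. The separator and function name are illustrative assumptions.

    /* Sketch: form separate lexicon entries for each word/POS pairing. */
    #include <stdio.h>

    static void tagged_token(char *out, size_t n, const char *word, const char *tag) {
        snprintf(out, n, "%s_%s", word, tag);   /* e.g. "run_NN" vs "run_VB" */
    }

    int main(void) {
        char buf[64];
        tagged_token(buf, sizeof buf, "run", "NN");
        printf("%s\n", buf);
        tagged_token(buf, sizeof buf, "run", "VB");
        printf("%s\n", buf);
        return 0;
    }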

Eliminating R

All of the word vectors are derived using the R statistical toolkit. This toolkit is excellent

for prototyping, presenting hundreds of different tools to the programmer. Furthermore, it scaled to relatively large amounts of data, a property that was not true of any of the other off-the-shelf systems investigated. However, it was found to be very inefficient when

compared to writing the same algorithm in C.

For instance, the maximum PCA that can be performed in R on a machine with two gigabytes of RAM is 4000 by 4000. However, some arithmetic shows that a much larger matrix

should be easily manipulated with this much ram. PCA requires four matrices to be stored

simultaneously, and assuming eight bytes per cell, we should be able to process a square

matrix with over ten thousand rows. Even if IEEE extended floating point is assumed, we

should still be able to have over six thousand rows. If we scrapped our use of R and rewrote

in C then it is reasonable to expect the much larger matrices to lead to significantly better

word representations.

8.2.3 Backoff

Training from the event file

All but two of my networks already train directly from the WSJ event file. If all networks

could be thus trained, there would be a number of benefits. Most obviously, we avoid hav-

ing genprob’s output as an approximate ceiling on performance. Moreover, if we can train

directly on the event file, we would be able to experiment with the effects on performance

of adding any number of interesting features to the events file. Collins’ probability model

was carefully tuned to the amount of available data in the Penn treebank and so with better

backoff we can expect to usefully develop a more complex probability model.

8.2.4 Using Maximum Entropy methods instead of a neural network

Neural networks were chosen as our learning algorithm because simple hash-tables were

inappropriate. However, neural networks have their share of problems, especially when

scaling to this much data and when quite accurate lookups are the norm, with interpola-

tion happening only rarely. One option that has come to prominence recently in NLP and

would almost certainly work significantly better is Maximum Entropy modelling (Curran,


2004). Taking the event file and mapping it into vectors using the same approach as was

used for neural networks should lead directly to a data file that is suitable for building a

MaxEnt model. At 200MB, the file size is probably a little too large for MaxEnt on any of the

workstations I have access to, but is within reach for current cluster machines.

8.2.5 Using a different parser

Collins’ parser was chosen at the start of this project because it was, and still is, the highest

performing statistical parser. However, it may well be that smoother backoff is more appro-

priate in other parsing models such as DOP or CCG. A more ambitious idea would be to

use a neural-network-based parser such as that of Lawrence, Giles, and Fong (1996). This parser obtained excellent performance for a parser using only POS tags, and its authors noted in their conclusion that words could not be used because the parser was neural-network based. However,

word vectors such as those developed in this thesis would work extremely well in such a

parser.

8.3 Concluding remarks

In conclusion: I have proposed that the principal weakness in current statistical parsers is

their use of a fragile backoff algorithm. By developing vector representations for all the

fields incorporated in a typical probability model, I have been able to train a neural network

to simulate the probability model. This approach is more extensible, both to new domains and to more complex probability models, as well as being more robust in situations of limited training data.


References

Abney, S., Schapire, R., and Singer, Y. (1999). Boosting applied to tagging and PP attachment.

In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro-

cessing and Very Large Corpora.

Allen, J. F. (1995). Natural Language Understanding (2nd. ed.). Redwood City, CA: Benjam-

in/Cummings.

Banerjee, S. and Pedersen, T. (2003). The Design, Implementation and Use of the Ngram

Statistics Package. In Proceedings of the fourth International Conference on Intelligent Text

Processing and Computational Linguistics, Mexico City, Mexico.

Bencini, G., Bock, K., and Goldberg, A. (2002). How abstract is grammar? Evidence from

structural priming in language production. In Proceedings of the 15th Annual CUNY Sen-

tence Processing Conference, Queen’s College, City University of New York, NY.

Bengio, S. and Bengio, Y. (2000). Taking on the Curse of Dimensionality in Joint Distributions

Using Neural Networks. IEEE-NN, 11(3), 550.

Bengio, Y. (2003). Personal correspondence.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A Neural Probabilistic Language

Model. Journal of Machine Learning Research, 3, 1137–1155.

Bies, A., Ferguson, M., Katz, K., and MacIntyre, R. (1995). Bracketing Guidelines for Treebank II Style Penn Treebank Project. Linguistic Data Consortium.

Bikel, D. (2004). Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), 479–

511.

Bikel, D. (2005). Web page.

Black, E., Jelinek, F., Lafferty, J., Magerman, D., Mercer, R., and Roukos, S. (1992). Towards

history-based grammars: using richer models for probabilistic parsing. In M. P. Marcus

(Ed.), Fifth DARPA Workshop on Speech and Natural Language, Arden Conference Center,

Harriman, New York.


Bod, R. (1996). Efficient Algorithms for Parsing the DOP Model? A Reply to Joshua Good-

man. Computational Linguistics Archives: cmp-lg/9605031.

Bod, R. and Scha, R. (1996). Data-Oriented Language Processing: An Overview. Technical

Report LP-96-13, University of Amsterdam.

Booth, T. L. and Thompson, R. A. (1973). Applying Probability Measures to Abstract Lan-

guages. IEEE Transactions on Computers, 22(5), 442–449.

Brooks, F. P. (1982). The Mythical Man-Month: Essays on Software Engineering. Reading, MA:

Addison-Wesley Publishing Company.

Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, S. A. D., and Lai, J. C. (1992). Class-based

n-gram models of natural language. Computational Linguistics, 18(4), 467–479.

Chapman, R. L. (Ed.) (1992). Roget’s International Thesaurus (5th ed.). HarperCollins.

Charniak, E., Carroll, G., Adcock, J., Cassandra, A., Gotoh, Y., Katz, J., Litman, M., and

McCann, J. (1996). Taggers for Parsers. Artificial Intelligence, 85(1–2), 45–57.

Charniak, E., Hendrickson, C., Jacobson, N., and Perkowitz, P. (1993). Equations for Part-

of-Speech Tagging. In Proceedings of the 11th National Conference on Artificial Intelligence,

Washington, DC, 784–789. AAAI Press.

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. (1990). AutoClass:

A Bayesian Classification System. In J. W. Shavlik and T. G. Dietterich (Eds.), Readings in

Machine Learning, 296–306. San Mateo, CA: Kaufmann.

Chen, S. F. and Goodman, J. (1996). An Empirical Study of Smoothing Techniques for Lan-

guage Modeling. In A. Joshi and M. Palmer (Eds.), Proceedings of the Thirty-Fourth Annual

Meeting of the Association for Computational Linguistics, San Francisco, 310–318. Association

for Computational Linguistics: Morgan Kaufmann Publishers.

Chen, S. F. and Rosenfeld, R. (2000). A Survey of Smoothing Techniques for ME Models.

IEEE Transactions on Speech and Audio Processing, 8(1), 37–55.

Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

Choueka, Y. and Lusignan, S. (1985). Disambiguation by Short Contexts. Computers and the

Humanities, 19(3), 147–157.

Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system.

In COMPLEX’94, Budapest.


Collins, M. (1996). A New Statistical Parser Based on Bigram Lexical Dependencies. In

Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz.

Collins, M. (1997). Three Generative, Lexicalized Models for Statistical Parsing. In P. R. Co-

hen and W. Wahlster (Eds.), Proceedings of the Thirty-Fifth Annual Meeting of the Association

for Computational Linguistics and Eighth Conference of the European Chapter of the Association

for Computational Linguistics, Somerset, New Jersey, 16–23. Association for Computational

Linguistics: Association for Computational Linguistics.

Collins, M. (1999). Head-driven statistical models for natural language parsing. Ph. D. thesis,

Computer Science Department, University of Pennsylvania.

Copestake, A. and Flickinger, D. (2000). An open-source grammar development environ-

ment and broad-coverage English grammar using HPSG. In Proceedings of LREC 2000,

Athens, Greece.

Curran, J. (2004). Maximum Entropy Models for Natural Language Processing. In Aus-

tralasian Language Technology Summer School, Sydney, Australia, 29.

Earley, J. (1970). An efficient context-free parsing algorithm. In K. Sparck-Jones, B. J. Grosz,

and B. L. Webber (Eds.), Readings in Natural Language Processing, 25–33. Los Altos: Morgan

Kaufmann Publishers.

Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211.

Fahlman, S. E. and Lebiere, C. (1990). The Cascade-Correlation Learning Architecture. In

D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems: Proceedings of the

1989 Conference, San Mateo, CA, 524–532. Morgan Kaufmann Publishers.

Finch, S. (1993). Finding structure in language. Ph. D. thesis, Edinburgh University.

Gale, W. A. and Sampson, G. (1995). Good-Turing Frequency Estimation without Tears.

Journal of Quantitative Linguistics, 2, 217–237.

Garfield, S. and Wermter, S. (2003). Recurrent Neural Learning for Classifying Spoken Ut-

terances. Neural Language Processing, 6(3), 31–36.

Garner, S. R. (1995). WEKA: The Waikato Environment for Knowledge Analysis. Technical

report, Computer Science Dept., Waikato University.

Ginzburg, J. and Sag, I. A. (Eds.) (2000). Interrogative investigations. Stanford: CSLI Publica-

tions.


Goodman, J. (1996). Efficient Algorithms for Parsing the DOP Model. In E. Brill and

K. Church (Eds.), Proceedings of the Conference on Empirical Methods in Natural Language

Processing, 143–152. Somerset, New Jersey: Association for Computational Linguistics.

Goodman, J. (1998). Parsing Inside-Out. Ph. D. thesis, Harvard University.

Goodman, J. (2001). A Bit of Progress in Language Modeling: Extended Version. Technical

Report MSR-TR-2001-72, Microsoft Research (MSR).

Haegeman, L. (1991). Introduction to Government and Binding Theory. Oxford: Blackwell.

Harman, D. (1992). The DARPA TIPSTER project. SIGIR Forum, 26(2), 26–28.

Hart, M. (2005). Project Gutenberg e-text archive. Web page: http://www.gutenberg.net/.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their

applications. Biometrika, 57(1), 97–109.

Honkela, T. (1997a). Comparisons of self-organized word category maps. In Proceedings

of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4–6, 298–303. Espoo,

Finland: Helsinki University of Technology, Neural Networks Research Centre.

Honkela, T. (1997b). Self-Organizing Maps in Natural Language Processing. Ph. D. thesis,

Helsinki University of Technology, Espoo, Finland. (Some citations give this as 1997,

while others as 1998).

Honkela, T., Pulkki, V., and Kohonen, T. (1995). Contextual Relations of Words in Grimm

Tales, Analyzed by Self-Organizing Map. In F. Fogelman-Soulie and P. Gallinari (Eds.),

Proceedings ICANN’95, International Conference on Artificial Neural Networks, Volume II,

Nanterre, France, 3–7. EC2.

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters

from sparse data. In E. S. Gelsema and L. N. Kanal (Eds.), Pattern Recognition in Practice,

381–397. Amsterdam : North Holland Publishing Co.

Joachims, T. (2001). A Statistical Learning Model of Text Classification with Support Vector

Machines. In W. Croft, D. J. Harper, D. H. Kraft, and J. Zobel (Eds.), Proceedings of SIGIR-

01, 24th ACM International Conference on Research and Development in Information Retrieval,

New Orleans, US, 128–136. ACM Press, New York, US.

Katz, S. M. (1987). Estimation of Probabilities from Sparse Data for the Language model

Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Pro-

cessing, 35(3), 400–401.


Klein, D. and Manning, C. D. (2001a). An O(n3) Agenda-Based Chart Parser for Arbitrary

Probabilistic Context-Free Grammars. Technical Report dbpubs/2001-16, Stanford Uni-

versity.

Klein, D. and Manning, C. D. (2001b). Parsing and Hypergraphs. In The Seventh International

Workshop on Parsing Technologies.

Klein, D. and Manning, C. D. (2002). A* Parsing: Fast Exact Viterbi Parse Selection. Technical

Report 2002-16, Natural Language Processing Group, Stanford University.

Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of the

41st Annual Meeting of the Association for Computational Linguistics.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological

Cybernetics, 43(1), 59–69.

Kudo, T. and Matsumoto, Y. (2001). Chunking with Support Vector Machines. In Proceedings

of North American Chapter of the ACL.

Lakeland, C. and Knott, A. (2001). POS Tagging in Statistical Parsing. In Proceedings of the

Australasian Language Technology Workshop, Sydney, Australia.

Lakeland, C. and Knott, A. (2004). Implementing a lexicalised statistical parser. In Proceed-

ings of the Australasian Language Technology Workshop, Sydney, Australia.

Lawrence, S., Giles, C. L., and Fong, S. (1996). Can Recurrent Neural Networks Learn Nat-

ural Language Grammars? In Proceedings of the IEEE International Conference on Neural

Networks, 1853–1858. Piscataway, NJ: IEEE Press.

Lee, L. (2004). "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. In Committee on the Fundamentals of Computer Science: Challenges and Opportunities, Computer Science and Telecommunications Board, National Research Council (Eds.), Computer Science: Reflections on the Field, Reflections from the Field, 111–118. The National Academies Press.

Li, W. (1992). Random Texts Exhibit Zipf’s Law-Like Word Frequency Distribution. IEEE

Transactions on Information Theory, 38, 1842–1845.

Liddle, M. (2002). Learning Lexical Relations from Natural Language Text. Technical report,

University of Otago.

Lin, D. (1997). Using Syntactic Dependency on Local Context to Resolve Word Sense Ambi-

guity. In P. R. Cohen and W. Wahlster (Eds.), Proceedings of the Thirty-Fifth Annual Meeting

of the Association for Computational Linguistics and Eighth Conference of the European Chapter


of the Association for Computational Linguistics, Somerset, New Jersey, 64–71. Association

for Computational Linguistics: Association for Computational Linguistics.

Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In COLING-ACL,

768–774.

Magerman, D. M. (1995). Statistical Decision-Tree Models for Parsing. In Proceedings of the

33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, 26–30

Jun 1995.

Magerman, D. M. (1996). Learning grammatical structure using statistical decision-trees. In

Grammatical Inference: Learning Syntax from Sentences, 3rd International Colloquium, ICGI-

96, Montpellier, France, September 25-27, 1996, Proceedings, Volume 1147 of Lecture Notes in

Artificial Intelligence, 1–21. Springer.

Manning, C. D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing.

MIT Press.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated cor-

pus of English: the Penn Treebank. Computational linguistics, 19, 313–330. Reprinted in

Susan Armstrong, ed., 1994, Using large corpora, Cambridge, MA: MIT Press, 273–290.

Mayberry III, M. R. and Miikkulainen, R. (1999). Combining Maps and Distributed Rep-

resentations for Shift-Reduce Parsing. In S. Wermter and R. Sun (Eds.), Hybrid Neural

Symbolic Integration. New York: Springer.

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval,

classification and clustering. Web page: http://www.cs.cmu.edu/mccallum/bow.

Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An Integrated Model of

Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press.

Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the

ACM, 38(11), 39–41.

Min, K. and Wilson, W. H. (1998). Integrated Control of Chart Items for Error Repair. In

COLING-ACL, Volume 2, 862–868. Morgan Kaufmann Publishers.

Ney, H., Mergel, D., Noll, A., and Paeseler, A. (1992). Data Driven Search Organization for

Continuous Speech Recognition. IEEE Transactions on Signal Processing, 40(2), 272.

Plasmeijer, M. J. (1998). CLEAN: a programming environment based on term graph rewrit-

ing. Theoretical Computer Science, 194(1–2), 246–255.


Pollard, C. and Sag, I. A. (1986). Head Driven Phrase Structure Grammar. Stanford, CA, USA:

Center for the Study of Language and Information.

Powers, D. (2001). Experiments in Unsupervised Learning of Natural Language. Interna-

tional Journal of Corpus Linguistics, 6(1), 8.

Pugh, W. (1989). Skip Lists: A Probabilistic Alternative to Balanced Trees. In WADS: 1st

Workshop on Algorithms and Data Structures.

R Development Core Team (2004). R: A language and environment for statistical computing.

Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

Rueckl, J. G., Cave, K. R., and Kosslyn, S. M. (1989). Why Are “What” and “Where” Pro-

cessed by Separate Cortical Visual Systems? A Computational Investigation. Journal of

Cognitive Neuroscience, 1(2), 171–186.

Scha, R. and Bod, R. (2003). Efficient Parsing of DOP with PCFG-reductions.

Schutze, H. (1992). Dimensions of meaning. In Proceedings of Supercomputing ’92, Minneapolis,

787–796.

Schutze, H. (1993). Word Space. In S. J. Hanson, J. D. Cowan, and C. L. Giles (Eds.), Advances

in Neural Information Processing Systems, Volume 5, 895–902. Morgan Kaufmann Publish-

ers, San Mateo, CA.

Schutze, H. (1995). Distributional Part-of-Speech Tagging.

Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1),

97–124.

Smith, L. I. (2002). A tutorial on Principal Components Analysis. Technical report, Univer-

sity of Otago, Dunedin, New Zealand.

Smrz, P. and Rychly, P. (2002). Finding Semantically Related Words in Large Corpora. In

TSD, 108–115. Revised 2002, originally published in 2001.

Stuart, I., Cha, S.-H., and Tapper, C. (2004). A Neural Network Classifier for Junk E-Mail. In

Proceedings of 6th DAS 2004, Florence, Italy, 442–450.

Ushioda, A. (1996). Hierarchical Clustering of Words. In COLING, 1159–1162. Expanded

version published as ‘Hierarchical Clustering of Words and Applications to NLP tasks’.

Vapnik, V. N. (1997). The Support Vector Method. Lecture Notes in Computer Science, 1327,

263–273.


Viterbi, A. J. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimum

Decoding Algorithm. IEEE Transactions on Information Theory, IT-13, 260–267.

Williams, R. (1992). FunnelWeb User’s Manual. ftp://ftp.adelaide.edu.au/pub/funnelweb,

University of Adelaide, Adelaide, South Australia, Australia.

Wu, J. and Zheng, F. (2000). On enhancing katz-smoothing based back-off language model.

In ICSLP-2000, Volume 1, 198–201.


Appendix A

Tags and Nonterminals used

Since all the statistical parsers use the tags and nonterminals defined by the Penn treebank, it

makes sense to define them here. A deep understanding of these tags is much less important

than it would be in a conventional parser, but it is still useful for following the examples

and understanding some of the mistakes. This appendix includes a brief description of every
nonterminal and every terminal.

A.1 Tags

There are forty-five tags used in the Penn treebank. Two extra tags (#STOP# and #UNKNOWN#)
are also used by my parser: #STOP# is used to terminate phrases and #UNKNOWN# is
necessary to keep the probability theory in step with the implementation. The other tags are
described below. They have been split into several tables for ease of reading; this grouping
is not explicit in the treebank.

The symbols are described in Table A.1; the nouns in Table A.2; the verbs in Table A.3;
the adjectives and adverbs in Table A.4; the pronouns in Table A.5; and all others are given in
Table A.6.

This section is based very heavily (almost word for word) on the web page:

http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html

A.2 Nonterminals

Note: this information comes from “Bracketing Guidelines for Treebank II Style Penn Treebank Project” (Bies, Ferguson, Katz, and MacIntyre, 1995).


Tag   Description               Examples
$     dollar                    $ -$ –$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
‘‘    opening quotation mark    ‘ “
’’    closing quotation mark    ’ ”
(     opening parenthesis       ( [ {
)     closing parenthesis       ) ] }
,     comma                     ,
--    dash                      --
.     sentence terminator       . ! ?
:     colon or ellipsis         : ; . . .
SYM   symbol                    % & ’ ” ”. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R

Table A.1: Tags related to symbols

Tag    Description                      Examples
FW     foreign word                     gemeinschaft hund ich jeux habeas Haementeria Herr K’ang-si vous lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis . . .
NN     noun, common, singular or mass   common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist . . .
NNP    noun, proper, singular           Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool . . .
NNPS   noun, proper, plural             Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha . . .
NNS    noun, common, plural             undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs . . .

Table A.2: POS tags used for nouns


Tag   Description                                    Examples
MD    modal auxiliary                                can cannot could couldn’t dare may might must need ought shall should shouldn’t will would
VB    verb, base form                                ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe bend benefit bevel beware bless boil bomb boost brace break bring broil brush build . . .
VBD   verb, past tense                               dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted belied figgered speculated wore appreciated contemplated . . .
VBG   verb, present participle or gerund             telegraphing stirring focusing angering judging stalling lactating hankerin’ alleging veering capping approaching traveling besieging encrypting interrupting erasing wincing . . .
VBN   verb, past participle                          multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired . . .
VBP   verb, present tense, not 3rd person singular   predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag . . .
VBZ   verb, present tense, 3rd person singular       bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads . . .

Table A.3: POS tags used for verbs


Tag   Description                     Examples
JJ    adjective or numerical, ordinal third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary . . .
JJR   adjective, comparative          bleaker braver breezier briefer brighter brisker broader bumper busier calmer cheaper choosier cleaner clearer closer colder commoner costlier cozier creamier crunchier cuter . . .
JJS   adjective, superlative          calmest cheapest choicest classiest cleanest clearest closest commonest corniest costliest crassest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest . . .
RB    adverb                          occasionally unabatingly maddeningly adventurously professedly stirringly prominently technologically magisterially predominately swiftly fiscally pitilessly
RBR   adverb, comparative             further gloomier grander graver greater grimmer harder harsher healthier heavier higher however larger later leaner lengthier less-perfectly lesser lonelier longer louder lower more . . .
RBS   adverb, superlative             best biggest bluntest earliest farthest first furthest hardest heartiest highest largest least less most nearest second tightest worst

Table A.4: POS tags used for adjectives and adverbs

Tag    Description              Examples
PRP    pronoun, personal        hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
PRP$   pronoun, possessive      her his mine my our ours their thy your
WP     WH-pronoun               that what whatever whatsoever which who whom whosoever
WP$    WH-pronoun, possessive   whose

Table A.5: POS tags used for pronouns


Tag   Description                                   Examples
CC    conjunction, coordinating                     & ’n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet
CD    numerical, cardinal                           mid-1890 nine-thirty forty-two one-tenth ten million 0.5 forty-seven 1987 twenty ’79 zero two 78-degrees ’60s .025 fifteen 271,124 dozen quintillion
DT    determiner                                    all an another any both each either every half many much nary neither no some such that the them these this those
EX    existential there                             there
IN    preposition or conjunction, subordinating     astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside
LS    list item marker                              A B C First One SP-44001 Second Third Three Two one six three two
PDT   pre-determiner                                all both half many quite such sure this
POS   genitive marker                               ’s
RP    particle                                      about across along apart around aside at away back before behind by down ever fast for forth from go high i.e. in into just later low more off on open out over per raising start that through under unto up upon whole with you
TO    “to” as a preposition or infinitive marker    to
UH    interjection                                  Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Oops amen huh howdy uh dammit whammo shucks heck anyways honey golly man baby hush sonuvabitch . . .
WDT   WH-determiner                                 that what whatever which whichever
WRB   WH-adverb                                     how however whence whenever where whereby whereever wherein whereof why

Table A.6: Other POS tags


Tag      Description and notes
ADJP     Adjective Phrase
CONJP    Conjunction Phrase
FRAG     Fragment
INTJ     Interjection. Corresponds approximately to the part-of-speech tag UH.
LST      List marker. Includes surrounding punctuation.
NAC      Not a Constituent; used to show the scope of certain prenominal modifiers within an NP.
NP       Noun Phrase
NX       Used within certain complex NPs to mark the head of the NP. Corresponds very roughly to N-bar level but used quite differently.
PP       Prepositional Phrase
PRN      Parenthetical. Asides to the main sentence, usually delimited by commas, brackets or dashes.
PRT      Particle. Category for words that should be tagged RP.
QP       Quantifier Phrase. Complex measure/amount phrase; used within NP.
RRC      Reduced Relative Clause. A relative clause that does not attach neatly to the rest of the sentence, e.g. yesterday in I read the books on the shelf yesterday quickly and the books on the shelf today slowly.
UCP      Unlike Coordinated Phrase. Similar to CC but for where the tags do not match, e.g. big/ADJP and/UCP growing/VP.
VP       Verb Phrase
WHADJP   Wh-adjective Phrase. Adjectival phrase containing a wh-adverb, as in how hot.
WHADVP   Wh-adverb Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing a wh-adverb such as how or why.
WHNP     Wh-noun Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing some wh-word, e.g. who, which book, whose daughter, none of which, or how many leopards.
WHPP     Wh-prepositional Phrase. Prepositional phrase containing a wh-noun phrase (such as of which or by whose authority) that either introduces a PP gap or is contained by a WHNP.
X        Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.

Table A.7: The main nonterminal categories


Appendix B

Code specifications for my parser

B.1 Data structures

Pseudocode for Collins’ algorithm has already been given in Figure 3.12. This results in
the data flow shown in Figure B.1, which is implemented using the class structure given in
Figure B.2. The classes are described briefly below; for the more complex classes, references
are given to the sections where they are described fully.

Figure B.1: Data flow diagram of the parser

Figure B.2: Class structure of the parser (the classes shown are Main, Globals, Node, Parser, Grammar, Prob, Chart, Beam, Punc, Sentence, Hash, Nodes, BeamArray, BeamList and BeamElement)


Arguments was written by Jared Davis. It gets runtime arguments such as the beam size
from the user, which saves recompilation when testing.

Beam performs the beam search. It is described in Section B.3.

Beam array is one implementation of the Beam class, also described in Section B.3.

Beam list is another implementation of the Beam class, also described in Section B.3.

Beam element stores elements in the beam and provides the operations performed on all

beam elements. It is also described in Section B.3.

Chart stores all edges that might be of later use. It is described in Section 4.5.

Convert is used to convert between words and strings, as well as to make words frequent,

drop semantic information from nonterminals, and so on. It is all very simple but

putting it in a separate class made the rest of the code easier to read.

Cutoff stores the minimum probability, below which all edges are immediately rejected. It

is separated from the chart because passing its test does not guarantee entry to the

chart, and it is used in a number of places inside the beam which do not otherwise

need the chart. However it could just as easily have been implemented inside the

chart.

Globals is a dummy class that eliminates the need for global variables. This worked out as

a good compromise between the convenience of globals and the increased maintain-

ability of code not containing globals. This also solved a classic C++ problem of global

constructors executing before their needed data had been loaded.

Grammar implements the grammar checking code. It uses simple lookup tables.

Hash implements a fast and simple hash table with integer keys and integer values. It is
used by the probability class and for grammar lookups. Even accounting for it having
two different input formats and two different storage mechanisms, it is still under two
hundred lines of code. A minimal sketch of this style of table is given at the end of this list.

Node is the data structure for phrases. It implements all of the operations on phrases, such
as join two edges, and provides an interface between the beam and the probability
model. It is described in Section B.2.

Nodes is a preallocated bag of nodes. This is used by the chart, and by some implementations
of the beam, to provide the simplicity of malloc/free without their overhead.


Parser implements the main control loop given in Figure 3.4. It is only four hundred lines

long and the most complicated part is in keeping the innermost loop both fast and

readable. It is described in Section 3.2.

Prob is the probability model. It is a set of functions for estimating the probability of differ-

ent structures. It is described in Section 4.3.

Punc encapsulates all punctuation so that the rest of the system doesn’t have to deal with

it. Collins’ design treats punctuation as second-class lexical items, almost throwing

them away. This class strips punctuation from the sentence but allows the parser to

ask where punctuation is present. This enabled me to experiment with not stripping

punctuation while keeping a lot of code the same.

Sentence is mainly used for reading sentences. It abstracts away the way the sentence is
read, which means the parser can also be used to parse HTML or to parse interactively.
The sentence class also allows its input to be either in words or already converted into
numbers1.

Tagger implements a couple of different tagging models. Separating out the tagger meant
that the effect of the tagger’s performance on parsing could be examined. The tagger is
described in Section 4.4.
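The Hash class itself is not reproduced in this thesis; the following is only a minimal sketch of an integer-to-integer table of the general kind described above. The class name IntHash, the linear-probing scheme and the fixed capacity are illustrative choices made for the sketch, not details of the real Hash class (which has two input formats and two storage mechanisms).

#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal open-addressing hash table with integer keys and integer values.
// Illustrative only: the real Hash class differs in probing, resizing and storage.
class IntHash {
public:
    explicit IntHash(std::size_t capacity = 1024)
        : keys_(capacity, int(EMPTY)), vals_(capacity, 0) {}

    void insert(int key, int value) {
        std::size_t i = slot(key);
        keys_[i] = key;
        vals_[i] = value;
    }

    // Returns true and fills 'value' if the key is present.
    bool lookup(int key, int &value) const {
        std::size_t i = slot(key);
        if (keys_[i] == EMPTY) return false;
        value = vals_[i];
        return true;
    }

private:
    enum { EMPTY = -1 };   // keys are assumed to be non-negative

    // Linear probing; the table is assumed never to fill up completely.
    std::size_t slot(int key) const {
        std::size_t i = static_cast<std::size_t>(key) % keys_.size();
        while (keys_[i] != EMPTY && keys_[i] != key)
            i = (i + 1) % keys_.size();
        return i;
    }

    std::vector<int> keys_, vals_;
};

int main() {
    IntHash h;
    h.insert(42, 7);
    int v = 0;
    if (h.lookup(42, v)) std::printf("42 -> %d\n", v);
    return 0;
}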

B.2 The node data structure

The main data structure is the Node. Its fields are given in Table B.1. This table highlights

an interesting weakness with C++, relating to C’s historical use for hardware programming.

In order to ease coding of hardware devices, C++ always allocates class members in exactly

the order they are defined. This means that the intuitively obvious declaration may result in

poor memory efficiency. When millions of such structures are being allocated, this can waste
hundreds of megabytes on memory alignment that is of no use to the program. To avoid
this wastage, all classes have to be declared with the largest members first, down to the
one-byte members. Doing otherwise was even causing memory corruption when optimisation
flags were used.
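As an illustration of the problem (this is a toy example rather than the real Node class, and the exact sizes are platform dependent), the two declarations below contain the same members; on a typical machine with 8-byte doubles the first layout is padded to around 40 bytes while the second packs into around 24.

#include <cstdio>

// Members declared in an "intuitive" order: padding is inserted after the
// small members so that the doubles stay 8-byte aligned.
struct PaddedNode {
    bool   terminal;   // 1 byte, then (typically) 7 bytes of padding
    double prob;       // 8 bytes
    short  begin;      // 2 bytes, then (typically) 6 bytes of padding
    double prior;      // 8 bytes
    bool   stop;       // 1 byte, then trailing padding
};

// The same members declared largest-first: the small members pack together.
struct OrderedNode {
    double prob;
    double prior;
    short  begin;
    bool   terminal;
    bool   stop;
};

int main() {
    std::printf("intuitive order: %zu bytes\n", sizeof(PaddedNode));  // typically 40
    std::printf("largest first:   %zu bytes\n", sizeof(OrderedNode)); // typically 24
    return 0;
}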

The class also contained a number of helper functions to simplify other classes and to

hide the very complex API presented by the probability model. These are detailed in Table

B.2.

1 Numbers are used by all internal functions as they are easier to manipulate than strings.


Variable   Type         Description
info       double       The inside probability of the phrase
prior      double       The outside probability of the phrase
prob       double       info + prior, precomputed for efficiency
children   Node list    An array of children
next       Node         The next node in the chart
prev       Node         The previous node in the chart
lc         Subcat       Arguments to the left still to be found
rc         Subcat       Arguments to the right still to be found
headtag    Tag          POS tag of the phrase’s head
headword   Word         The head word
headnt     Head         The nonterminal of the phrase’s head
parent     Parent       The phrase’s parent
begin      short        The word the phrase starts on
end        short        The word the phrase ends on
numkids    short        The number of children
used       node_state   Used for my own garbage collection
terminal   bool         True if the phrase is a POS tag
adj_l      bool         Is the head the leftmost element
adj_r      bool         Is the head the rightmost element
verb_l     bool         Is a verb between the head and the leftmost element
verb_r     bool         Is a verb between the head and the rightmost element
stop       bool         Is this phrase complete
hasverb    bool         True if this phrase contains any verbs

Table B.1: Data structure for phrases


Function         Description
make_silly       Corrupts the node to try and highlight bugs
clear            Wipes all data ready to reuse the node
collins_equal    True if the node is equal to another at Collins’ simplified shallow level
shallow_equal    True if the node is equal to another at a shallow level
equal            True if the node is equal to another at a deep level, used for debugging
print            Prints the node to the terminal for debugging
print_for_eval   Prints the node in the format used to test precision/recall
print_for_gml    Prints the node in GML for producing graphs, useful in testing
verify           Performs internal consistency checks for debugging
lc_info, etc.    Encapsulate the calls to the probability functions
join_follow      Joins two nodes together
join_cc          Deals with the special case of coordination

Table B.2: Member functions for phrases

All member variables are public. This is normally a very bad design decision because
it greatly decreases flexibility. However, in this case the parser was a reimplementation of
a design that was known to work, so there would be no design changes. Making variables
public slightly increased efficiency and code readability.

Nonterminals were also differentiated based on their use. For example, parent nonter-

minals were much more commonly used with other parents than with heads. The only time

parents and heads are interchanged is in add_singles, where a head is promoted to a parent.
To represent this in the code, the different types were implemented as tiny dummy
classes containing a single integer. This approach meant it was impossible to accidentally use
a head where a parent was expected, and it caught a number of bugs that would otherwise
have been almost impossible to catch.

Unfortunately it also ran into a weakness in the C++ language: it turns out to be impossible
to give a class the same privileges as a normal integer. I had hoped to write the
header so that if a DEBUG flag was set then the classes would be used, but if it was unset
then the nonterminals would be synonyms for integers, to increase speed and decrease memory
consumption. A fully OO language such as Smalltalk might have avoided this problem.
Because of this weakness in the language, swapping between the class representation and
the integer representation required a large amount of mechanical code changing.
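The following sketch shows the general idea; the names Head, Parent, promote and score are illustrative rather than the declarations used in the parser.

#include <cassert>

// Distinct wrapper types for head and parent nonterminals. Each holds a single
// integer, so the cost is minimal, but passing a Head where a Parent is
// expected becomes a compile-time error rather than a silent bug.
struct Head   { int value; explicit Head(int v)   : value(v) {} };
struct Parent { int value; explicit Parent(int v) : value(v) {} };

// The one sanctioned conversion: promoting a head to a parent, as happens
// when add_singles promotes a head.
inline Parent promote(const Head &h) { return Parent(h.value); }

// Dummy consumer standing in for the probability model, which in the real
// parser takes many more arguments.
double score(const Parent &p, const Head &h) {
    return (p.value == h.value) ? 1.0 : 0.5;
}

int main() {
    Head h(3);
    Parent p = promote(h);
    assert(score(p, h) > 0.0);
    // score(h, p);   // would not compile: Head and Parent are distinct types
    return 0;
}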


B.3 The beam data structure

Section 4.6.1 describes how the beam was implemented as a doubly linked skiplist in some

detail. It was noted at the end of that section that Collins’ implementation uses an array

and bears little resemblance to beam search as I understand it. This section describes how

the two interpretations of beam search were abstracted away in the parser so that I could

accurately simulate Collins’ model without having to give up my solution, because I con-

sider it more elegant. The other role of this section is to describe how add_singles_stops
interleaves two independent beams.

In perhaps the only use of true object orientation in the parser, the two interpretations of

beam search were implemented independently as two separate classes: Beam Array uses
an array-based representation with probability thresholding in exactly the same way as
Collins’, and Beam List uses a skiplist-based representation with length thresholding as

was described in Section 4.6.1. Both of these classes are subclasses of the abstract virtual

class Beam, which allows the entire parser to just use the beam without caring about how it

is implemented. The API provided by the Beam class is shown in Table B.3.

Function                                     Description
Beam(void)                                   Produce a beam of the default size
~Beam(void)                                  Destructor
void store_and_clear(void)                   Insert all elements onto the chart and erase the beam. Used after the last recursive call.
void clear(void)                             Erase any remaining elements
int length(void) const                       The number of elements on the beam
Beamdata *pop(void)                          Best item on the beam
void push(Beamdata *data)                    Insert an item
Beamdata *pop_back(void)                     Worst item
void push_back(Beamdata *data)               Discard this item
void process(const int depth, Beam *dest)    Perform the recursive add_singles_stops
void print(void)                             Print the beam (for debugging)

Table B.3: High level API for the beam

This virtual beam class provides all the functions that the rest of the parser needs, so
callers just instantiate beams without knowing whether they are going to get an array
implementation or a list implementation. It is even possible to instantiate one beam as an

array and the other as a skiplist. Some readers may have noticed that the elements on the

beam are of type Beamdata rather than of type Node. This distinction was to make the

beam manipulation code simpler. It enables the skiplist to include next and previous point-

ers in the actual elements instead of storing them separately. It also allows syntactic sugar
such as comparison operators (one node is less than another node if it has a lower priority)
to simplify the search routines.

Efficiency was a major concern in the design of this API. The single process method is
used instead of the more fashionable iterators. Similarly, there are a few functions here that
are strictly unnecessary, such as clear or push_back, but they assist efficiency; if, say, add_stop
knows an edge will not meet the threshold then it is a waste of time to perform the same
calculation twice. A related function is pop_back, which returns the worst item in the beam
so that it can be overwritten rather than allocating a new item.

There are two types of nodes used throughout the system: complete nodes and incomplete
nodes. While from an implementation perspective the only difference between these
nodes is the setting of a single boolean flag, they are conceptually very different because the
kinds of functions that apply to them are different. Within add_singles_stops, nodes of
both types are generated in rapid succession: incomplete nodes come in from initialise
or join two edges and are completed using add_stop, with the resulting complete nodes
then being processed by add_singles to generate more incomplete nodes. These are then
passed to add_stop for completion, and so on. The job of the beam throughout this process
is to keep track of the nodes: accepting nodes from the previous stage in the pipeline, inserting
likely candidates into the chart, discarding unlikely candidates, and continually passing
nodes on to the next stage in the pipeline.

It is entirely possible to implement add_singles_stops using just one beam, but it requires
the priority queue algorithm to be efficient, since nodes coming out of a function could
have higher priorities than nodes already in the beam. The list implementation does have
an efficient priority queue, but the array implementation does not, and rather than develop
a relatively efficient (i.e. heap-based) priority queue for the array implementation, I simply
implemented two beams. Under this model, one beam can be viewed as the add_singles
beam and the other as the add_stop beam. Each beam is always operating in one of two
modes: either accepting edges and inserting (or discarding) them, or processing and removing
edges. The bimodal operation makes it easy to implement the array processing
efficiently.
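To make the shape of this design concrete, the sketch below shows an abstract beam interface with a toy implementation and two beams swapping roles. The method names follow Table B.3, but ToyBeam, Cmp and the loop body are placeholders rather than the real Beam Array, Beam List or add_singles_stops code.

#include <algorithm>
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

struct Beamdata { double priority; };   // placeholder for the real beam element

// Abstract interface following Table B.3 (only the calls used below are shown).
class Beam {
public:
    virtual ~Beam() {}
    virtual void push(Beamdata *d) = 0;
    virtual Beamdata *pop() = 0;
    virtual int length() const = 0;
};

// Stand-in for the array and skiplist implementations; the real classes use a
// thresholded array or the doubly linked skiplist described in Section 4.6.1.
class ToyBeam : public Beam {
public:
    void push(Beamdata *d) { q_.push(d); }
    Beamdata *pop() { Beamdata *d = q_.top(); q_.pop(); return d; }
    int length() const { return static_cast<int>(q_.size()); }
private:
    struct Cmp {
        bool operator()(const Beamdata *a, const Beamdata *b) const {
            return a->priority < b->priority;   // highest priority comes out first
        }
    };
    std::priority_queue<Beamdata*, std::vector<Beamdata*>, Cmp> q_;
};

int main() {
    // Two beams alternating roles: one holds edges waiting to be processed,
    // the other collects the results, and the two swap after every pass.
    ToyBeam a, b;
    Beam *incomplete = &a, *complete = &b;
    Beamdata seed = {1.0};
    incomplete->push(&seed);
    for (int pass = 0; pass < 2; ++pass) {
        while (incomplete->length() > 0)
            complete->push(incomplete->pop());   // stands in for add_stop / add_singles
        std::swap(incomplete, complete);
    }
    std::printf("passes complete, %d edge(s) left\n", incomplete->length());
    return 0;
}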


Appendix C

Relevant source code

The complete source code for programs used in the thesis is over two megabytes, or around

thirty-five thousand lines. Printing such a large amount of source code would be impractical,
almost doubling the length of the thesis. At the same time, there are a number of places

where the description of the code given in the text is insufficient for understanding the

method, a problem I know very well from trying to implement the parser based on Collins’

thesis.

To help resolve this, the complete source code is available from:

http://cs.otago.ac.nz/postgrads/lakeland/phd_code.tgz

as well as on several backup mirrors:

http://cs.otago.ac.nz/staffpriv/alik/lakeland_code.tgz
http://go.org.nz/lakeland/phd_code.tgz

Some of the source code is also included here. I have included the following:

• the complete build system, since it is one of the easiest ways of understanding how everything fits together;

• the scripts used to drive R, since they are short and include a lot of information;

• a small sample of funnelweb code, since using funnelweb is a novel method of refactoring complex code, and readers may be more interested in the idea than the complete source;

• the source code for converting the Penn treebank into Collins’ format, since I do not believe the process has been adequately documented elsewhere;

• the bigram processing code, since this process is normally not mentioned in the literature and yet I found it made a significant difference to the final results.


C.1 Build script

The build system is implemented as a Unix makefile. This allows much easier access to shell
commands than more sophisticated build systems, so it is perhaps better to view the makefile
as a shell script with built-in dependency resolution.

NETS=tag.wgt
NETS+=prior.nz.wgt prior.zerotest.wgt
NETS+=dep.nz.wgt dep.zerotest.wgt
NETS+=subcat.nz.wgt subcat.zerotest.wgt
NETS+=unary.nz.wgt unary.zerotest.wgt
NETS+=top.nz.wgt top.zerotest.wgt

all: $(NETS)

# Approx 20 mins
words: wsj.tagged
nts: wsj.tagged
tags: wsj.tagged
wsj.raw: wsj.tagged
best-tags: wsj.tagged
dict: best-tags
load.mem: wsj.tagged

# Approx 30 mins
test.tagged:
	make -C $$CODEDIR/treebank test-tokenize.done
	cp -u $$CODEDIR/treebank/test-tokenize/test.tagged .

wsj.tagged:
	make -C $$CODEDIR/treebank
	cp -u $$CODEDIR/treebank/{wsj.tagged,words,nts,tags,wsj.raw,best-tags,dict,load.mem} .

readabletest.tagged: wsj.tagged
	make -C $$CODEDIR/treebank/test-tokenize test.tagged
	cp -u $$CODEDIR/treebank/test-tokenize/test.tagged readabletest.tagged

readable.tagged: wsj.tagged
	make -C $$CODEDIR/treebank/combined-tokenize readable.tagged
	cp -u $$CODEDIR/treebank/combined-tokenize/readable.tagged .

# Approx one hour
model2-short: model2-long
	cp -u $$CODEDIR/preproc/model2.02-21 model2-short

model2-long:
	make -C $$CODEDIR/preproc
	cp -u $$CODEDIR/preproc/model2 model2-long

# Approx one hour
fwords: htsizes.h
lc: htsizes.h
rc: htsizes.h
left: htsizes.h
right: htsizes.h
unary: htsizes.h

htsizes.h: load.mem readable.tagged readabletest.tagged model2-long model2-short
	make -C $$CODEDIR/events # Done twice to avoid a stupid bug (in make?)
	make -C $$CODEDIR/events # make doesn't realise it has created .data and aborts
	cp -u $$CODEDIR/events/{fwords,lc,rc,left,right,unary,htsizes.h} .

# Probably unneeded with my parser disabled
p_unary_1_n.boot: htsizes.h
	cp -u $$CODEDIR/events/p_*.boot .

# Approx 2 mins
p_pos_1_n.boot: tags.h
	cp -u $$CODEDIR/events/p_*.boot .

tags.h: load.mem
	make -C $$CODEDIR/events
	cp -u $$CODEDIR/events/{tags.h,tagsizes.h} .

# Approx 2 min
model2: load.mem
	make -C $$CODEDIR/events model2.retok
	cp -u $$CODEDIR/events/model2.retok model2

tags.vectors: wsj.tagged
	make -C $$CODEDIR/vectors tags.vectors

nts.vectors: tags.vectors model2
	make -C $$CODEDIR/vectors nts.vectors

rev-neighbours: words.vectors
	make -C $$CODEDIR/vectors/words rev-neighbours
	cp -u $$CODEDIR/vectors/words/rev-neighbours .

words.vectors: tags.vectors wsj.raw best-tags
	make -C $$CODEDIR/vectors words.vectors
	cp -u $$CODEDIR/vectors/words/words.vectors .

## Half way
# The code up to now is relatively stable
# And changes tend to need a full regeneration anyway
# Next we have to generate the NN training data and
# train the NNs.

# Approx 1 min
tokenised.lexicon: tokenised.nts
tokenised.nts: model2
	make -C $$CODEDIR/parser/mcollins empty_input grammars.done nts.done lexicon.done
	cp -u $$CODEDIR/parser/mcollins/grammars/tokenised.{nts,lexicon} .

# Approx 20 minutes
# nts.vectors not explicitly needed but the code loads it anyway
tag.net: tags.vectors nts.vectors words.vectors tokenised.nts tags nts words fwords lc left unary tags.h p_pos_1_n.boot p_unary_1_n.boot dict
	make -C $$CODEDIR/parser/mine empty_input tagger
	cp -u $$CODEDIR/parser/mine/{tag.header,tagger} .

# Approx 2 hours
dep.nz.net: prior.nz.net
unary.nz.net: prior.nz.net
subcat.nz.net: prior.nz.net
top.nz.net: prior.nz.net
dep.zerotest.net: prior.nz.net
unary.zerotest.net: prior.nz.net
subcat.zerotest.net: prior.nz.net
top.zerotest.net: prior.nz.net
prior.zerotest.net: prior.nz.net
prior.nz.net: tags.vectors nts.vectors words.vectors tokenised.nts tokenised.lexicon rev-neighbours
	make -C $$CODEDIR/parser/mcollins
	cp -u $$CODEDIR/parser/mcollins/{dep,prior,unary,subcat,top}.{nz,zerotest}.header .
	cp -u $$CODEDIR/parser/mcollins/*.net .

sec00.mcollins: prior.nz.net
	make -C $$CODEDIR/parser/mcollins sec00.testable
	cp -u $$CODEDIR/parser/mcollins/sec00.testable sec00.mcollins

parser-hash.results: sec00.mcollins
	make -C $$CODEDIR/eval/parser/ empty_input parser-hash.results
	cp -u $$CODEDIR/eval/parser/parser-hash.results .

tag.wgt: tag.net
	make -C $$CODEDIR/train-net empty_input tag.wgt
	cp -u $$CODEDIR/train-net/tag.{wgt,log} .

tagger-net.results: tag.wgt
	make -C $$CODEDIR/parser/mine tagger-net.results
	cp -u $$CODEDIR/parser/mine/tagger-net.results .

prior.zerotest.wgt: prior.nz.wgt
prior.nz.wgt: prior.nz.net
	make -C $$CODEDIR/train-net empty_input prior.{nz,zerotest}.wgt
	cp -u $$CODEDIR/train-net/prior.{nz,zerotest}.{wgt,log} .

dep.zerotest.wgt: dep.nz.wgt
dep.nz.wgt: dep.nz.net
	make -C $$CODEDIR/train-net empty_input dep.{nz,zerotest}.wgt
	cp -u $$CODEDIR/train-net/dep.{nz,zerotest}.{wgt,log} .

top.zerotest.wgt: top.nz.wgt
top.nz.wgt: top.nz.net
	make -C $$CODEDIR/train-net empty_input top.{nz,zerotest}.wgt
	cp -u $$CODEDIR/train-net/top.{nz,zerotest}.{wgt,log} .

subcat.zerotest.wgt: subcat.nz.wgt
subcat.nz.wgt: subcat.nz.net
	make -C $$CODEDIR/train-net empty_input subcat.{nz,zerotest}.wgt
	cp -u $$CODEDIR/train-net/subcat.{nz,zerotest}.{wgt,log} .

unary.zerotest.wgt: unary.nz.wgt
unary.nz.wgt: unary.nz.net
	make -C $$CODEDIR/train-net empty_input unary.{nz,zerotest}.wgt
	cp -u $$CODEDIR/train-net/unary.{nz,zerotest}.{wgt,log} .

clean:
	make -C treebank clean
	make -C preproc clean
	make -C events clean
	make -C tag clean
	make -C vectors clean
	make -C parser/mine clean
	make -C parser/mcollins clean
	make -C train-net clean
	rm -f input

Some points in this are worth commenting on. The preprocessing of the treebank was

principally intended to remove Section 23 so that it is impossible for any later parts of the

system to gain any knowledge from it, but at the same time it provides easy access to Section

23 where that is necessary (such as in evaluation). The preprocessing stage was also used

later when I experimented with tokenising the treebank (which leads to lower precision and

recall but better neighbours).

The ‘input’ directory is used to contain all of the system’s generated files. That way if

something breaks it is trivial to go back to older intermediate files just as it is trivial to go

back to old versions of the code using CVS.

This build script was only rarely perfectly up-to-date, but even so it was invaluable on

numerous occasions. For example, make clean has to be explicitly coded and so can easily
leave important files behind. However, checking out the project from CVS ensures a totally
clean build, and so failures due to missing files lead quickly and easily to source code that
I had neglected to add to CVS. Similarly, running two identical builds is much easier using
a single huge build script, and because this should in theory lead to identical intermediate
files, it provides a method of testing that the toolchain is performing consistently.
For example, if the locale is set to English then the sort command will not sort identically
in the shell and in C. This sort of bug is virtually impossible to track down since the output
will be identical for most test cases, but running MD5 over the intermediate data files finds
any discrepancies very quickly.

C.2 R scripts

The following scripts control the R system. Source code for each of the functions used here
is available inside R by typing the name of the function without arguments.


Bigram generation

library(mva)
# Number of dimensions to keep for each word vector, read from the environment.
wvs <- as.numeric(Sys.getenv("WORD_VECTOR_SIZE"))
# Bigram counts for the first four thousand words.
data <- read.table("bigram.4000")
# Principal components analysis of the bigram matrix.
pc <- prcomp(data, scale=FALSE, retx=FALSE)
# Save the rotation so that later words can be projected into the same space.
rot <- pc$rotation
save(rot, file="rotation")
# Project the bigram vectors and keep the first wvs dimensions.
data <- as.matrix(data)
wordslong <- data %*% pc$rot
words <- wordslong[,0:wvs]
write.table(words,file="output.1-4000")

Dendrogram generation

library(mva)

words <- read.table(file="words.700")

dist <- dist(words)

clust <- hclust(dist)

pdf(file="english.pdf",paper="special",width=200,height=20)

plot(clust,hang=-1)

PCA for words beyond the first four thousand

library(mva)

wvs <- as.numeric(Sys.getenv("WORD_VECTOR_SIZE"))

load("rotation")

rot <- as.matrix(rot)

data <- read.table("bigram.8000")

data <- as.matrix(data)

wordslong <- data %*% rot

words <- wordslong[,1:wvs]

write.table(words,file="output.4001-8000")

C.3 Funnelweb code

@$@<valid join@>@(@4@)@M==@{

@! @1 = follow/precede

@! @2 = parent (l for follow, r for precede)


@! @3 = child

@! @4 = right/left

bool Parser::valid_@1_join(const Node * l, const Node * r) const

{

Subcat new_subcat = @2->@3c;

new_subcat -= g->conversions->nt_type(@3->parent);

if (l->dontuseme || r->dontuseme) return false;

assert(g->gram->@4_gram(@2->parent, @2->headnt, @3->parent));

return g->gram->@3c_gram(@2->parent, @2->headnt, new_subcat);

}

@}

@$@<gram rules@>==@{@-

@<valid join@>@(cc@, l@, r@, right@)

@<valid join@>@(follow@, l@, r@, right@)

@<valid join@>@(precede@, r@, l@, left@)

@}

@$@<combine loop@>@(@7@)@M==@{

@! Arguments:

@! 1 = call = follow vs precede

@! 2 = 1 if parent is left

@! 3 = left or right (parent val)

@! 4 = left or right (child val)

@! 5 = cc node (NULL or a real node)

@! 6 = left end (typically split)

@! 7 = right start (typically split + 1)

{

const int left_end = @6;

const int right_start = @7;

int parent_as_int;

int child_as_int;

int numkids, child_nr;

....

}


@$@<combine@>==@{@-

void Parser::combine(const int from, const int to)

{

int start,split,span,end;

int len = to - from + 1;

Chart *left_chart, *right_chart; // for combine loop

....

@<combine loop@>@(@’’follow@’’ @,

@’’1@’’ @,

@’’left@’’ @,

@’’right@’’ @,

@’’NULL@’’ @,

@’’split @’’ @,

@’’split + 1@’’ @)

@<combine loop@>@(@’’precede@’’ @,

@’’0@’’ @,

@’’right@’’ @,

@’’left@’’ @,

NULL @,

@’’split @’’ @,

@’’split + 1@’’ @)

}

While this source code is obviously incomplete, it will hopefully give the reader the idea

of how funnelweb was used. Macro expansion is extremely similar to a function call, but

it can be put in places where a function call cannot, or where refactoring into a function

call would be more awkward than duplicating code. For example, having a function that

calls join_2_edges_follow or join_2_edges_precede based on how it is called would

require a conditional in a really hard-to-read part of the loop. At first, reading the funnelweb

code is awkward, but with practice it reads as easily as ordinary function calls.

C.4 Processing the treebank

This section presents a full implementation of Collins’ preprocessor. This is important be-

cause Collins does not define the preprocessor in anywhere near enough detail to reimple-

ment it, and any errors implementing it lead to a significant loss of accuracy. This concern


has been previously reported in Bikel (2004). While Dan Bikel does provide code, it is not

especially easy to follow and is quite tightly tied to his parser.

C.4.1 Transforming the corpus

; add−compliment

; Input i s a t r e e wi th s e m a n t i c i n f o r m a n t i o n

; Output i s a t r e e w i t h o u t s e m a n t i c but wi th headword i n f o r m a t i o n

; add−npb

; Input i s a t r e e i n c l u d i n g compl iment i n f o r m a t i o n but not s e m a n t i c .

; Output i s a t r e e wi th b a s e NPs changed t o NPBs

; add−headword

; Input i s a t r e e i n c l u d i n g compl iment and npb i n f o r m a t i o n but not s e m a n t i c .

; Output i s a t r e e wi th compl iment npb and headword i n f o r m a t i o n .

; ( nt ( t a g word ) ( nt ( t a g word ))) −>

; ( nt word t a g ( t a g word t a g ) ( nt word t a g ( t a g word t a g ) ) )

; add−sg

; Input i s t h e f i n a l t r e e ( wi th headword ) .

; Output i s t h e t r e e wi th S changed t o SG when t h e s e n t e n c e has

; no s u b j e c t ( i . e . a NONE)

( defconstant badset

’ ( ”ADV” ”VOC” ”BNF” ”DIR” ”EXT” ”LOC” ”MNR” ”TMP” ”CLR” ”PRP” ) )

( defvar head−match−ht ( make−hash−table : s i z e 2 5 : t e s t # ’ equal ) )

( defvar get−direct ion−ht ( make−hash−table : s i z e 2 5 : t e s t # ’ equal ) )

( defvar co l l ins−events ) ; o u t put o f e v e n t s f i l e

( defun mygethash ( i ht ) ( gethash i ht ) )

( defun getword ( x )

( l e t ( ( r es ( mygethash x words ) ) )

( i f re s re s ( format t ”Eep−WORD: ˜ a˜%” x ) ) ) )

( defun getnt ( x )

( l e t ( ( r es ( mygethash x nts ) ) )

( i f re s re s ( format t ”Eep−NT : ˜ a˜%” x ) ) ) )

; Th i s b r e a k s an i t em i n t o i t s s y n t a c t i c and s e m a n t i c p a r t s

( defun d e t a i l s ( item )


( gethash ( gethash item nts ) n t s−d e t a i l s ) )

; Th i s r e t u r n s an item ’ s key s y n t a c t i c p a r t a s a symbo l

( defun s im pl i f y ( item )

( i f ( not ( ge tnt item ) )

item ; For t a g s j u s t use t h e t a g

( f i r s t ( d e t a i l s item ) ) ) )

( defun tags ( item )

( r e s t ( d e t a i l s item ) ) )

( defun compliment ( item )

( gethash ( + ∗ compliment−diff ∗ ( gethash item nts ) ) nts− inverse ) )

( defun nocompliment ( item )

( l e t ( ( item−as−num ( getnt item ) ) )

( i f ( > item−as−num ∗ compliment−diff ∗ )

( gethash (− item−as−num ∗ compliment−diff ∗ ) nts− inverse )

item ) ) )

( defun is−verb ( nt )

( find ( s im pl i f y nt ) ’ ( ”VP” ) : t e s t # ’ equal ) )

; t e s t e d .

; c h a n g e s i t em t o item−A

; The non− terminal must be :

; ( 1 ) an NP SBAR or S whose p a r e n t i s an S ;

; ( 2 ) an NP SBAR S or VP whose p a r e n t i s a VP ; or

; ( 3 ) an S whose p a r e n t i s an SBAR .

; 2 . The non− terminal must not have one o f t h e f o l l o w i n g s e m a n t i c t a g s :

; ADV VOC BNF DIR EXT LOC MNR TMP CLR or PRP .

( defun make−compliment ( parent item )

( l e t ∗ ( ( simple−parent ( s i mp l i fy parent ) )

( simple−item ( s i mp l i fy item ) )

( compliment−item ( compliment simple−item ) ) )

( cond ( ( find simple−parent ’ ( ”PP” ”PP−A” ) : t e s t # ’ equal )

( i f ( find simple−item ’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ”PP” ”ADJP” ”ADVP” )

: t e s t # ’ equal )

compliment−item

simple−item ) )

( ( i n t e r s e c t i o n ( tags item ) badset : t e s t # ’ equal ) simple−item )


( ( find simple−parent ’ ( ”S” ”S−A” ”SG” ”SG−A” ) : t e s t # ’ equal )

( i f ( find simple−item ’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ) : t e s t # ’ equal )

compliment−item

simple−item ) )

( ( find simple−parent ’ ( ”VP” ”VP−A” ) : t e s t # ’ equal )

( i f ( find simple−item

’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ”VP” ) : t e s t # ’ equal )

compliment−item simple−item ) )

( ( find simple−parent ’ ( ”SBAR” ”SBAR−A” ) : t e s t # ’ equal )

( i f ( find simple−item ’ ( ”S” ”SG” ) : t e s t # ’ equal )

compliment−item simple−item ) )

( t simple−item ) ) ) )

( defun makequote ( x )

( format n i l ” \”˜ a\”” x ) )

( defun output− for−col l ins ( output depth t r e e )

( i f ( atom t r e e )

( format output ” ˜ a” ( makequote t r e e ) )

( progn

( format output ”˜%” )

( dotimes ( i depth ) ( format output ” ” ) )

( format output ” ( ˜ a ˜ a ˜ a”

( makequote ( f i r s t t r e e ) )

( makequote ( second t r e e ) )

( makequote ( thi rd t r e e ) ) )

( mapcar # ’ ( lambda ( x ) ( output− for−col l ins output ( 1 + depth ) x ) )

( cdddr t r e e ) )

( format output ” ) ” ) ) ) )

; t e s t e d

; c h a n g e s t h e p a r e n t o f a node t o −A i f t h e c h i l d has c e r t i a n f e a t u r e s

( defun add−compliment ( node )

( l e t ( ( parent ( s i mp l i fy ( f i r s t node ) ) ) )

( cons parent

( mapcar # ’ ( lambda ( c h i l d ) ( add−compliment−internal parent c h i l d ) )

( cdr node ) ) ) ) )

; t e s t e d

( defun add−compliment−internal ( parent c h i l d )

( i f ( atom c h i l d ) c h i l d

( cons


( make−compliment parent ( f i r s t c h i l d ) )

( mapcar # ’ ( lambda ( grandchild )

( add−compliment−internal ( f i r s t c h i l d ) grandchild ) )

( cdr c h i l d ) ) ) ) )

; Re turns a two i t em l i s t . F i r s t i t em i s t h e r e s u l t and s e c o n d i t em

; i s t r u e when a change has been made and n i l o t h e r w i s e .

; A lgo i r thm l o g i c :

; i f any c h i l d r e n changed

; th en a c h i l d c o n t a i n s NPB so we don ’ t change t h e c u r r e n t node

; Otherwi s e

; t h e c h i l d r e n can be d i s c a r d e d ( n o t h i n g was changed ) and we j u s t

; c o n s i d e r t h e c u r r e n t node

; i f non NP then ( X a b c ) −> (X a ’ b ’ c ’ )

; where a ’ means ( add−npb a )

; e l s i f baseNP then ( NP a b c ) −> (NP ( NPB a b c ) )

; e l s e ( NP a b c ) −> (NP a b c )

; t e s t e d

( defun add−npb ( node )

( i f ( atom ( second node ) ) ; t e r m i n a l

( l i s t node n i l )

( l e t ( ( ch i ldren ( mapcar # ’ add−npb ( cdr node ) ) ) )

( i f ( member t ch i ldren : key # ’ second ) ; s ome th ing was c o n v e r t e d

( l i s t ( cons ( f i r s t node ) ( mapcar # ’ f i r s t ch i ldren ) ) t )

( cond ( ( equal ( f i r s t node ) ’ ”NP” )

( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )

( ( equal ( f i r s t node ) ’ ”NP−A” )

( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )

( t ( l i s t node n i l ) ) ) ) ) ) )

; t h i s v e r s i o n r e p l a c e s ( NP . . . ) wi th ( NPB . . . )

; ( cond ( ( e q u a l ( f i r s t node ) ’ ”NP ” ) ( l i s t ( cons ’ ”NPB ” ( c d r node ) ) t ) )

; ( ( e q u a l ( f i r s t node ) ’ ”NP−A” ) ( l i s t ( cons ’ ”NPB ” ( c d r node ) ) t ) )

; t h i s v e r s i o n r e p l a c e s ( NP . . . ) (NP ( NPB . . . ) )

; ( cond ( ( e q u a l ( f i r s t node ) ’ ”NP”)

( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )

; ( ( e q u a l ( f i r s t node ) ’ ”NP−A”)

( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )

; t e s t e d


( defun add−headword ( node )

( i f ( atom ( second node ) )

( l i s t ( f i r s t node ) ( second node ) ( f i r s t node ) ) ; t a g word t a g

( l e t ( ( head ( nocompliment ( f i r s t node ) ) )

( ch i ldren ( mapcar # ’ add−headword ( cdr node ) ) ) )

( cond

( ( or ( equal head ’ ”NP” ) ( equal head ’ ”NPB” ) ( equal head ’ ”NX” ) )

( add−headword−np ( f i r s t node ) ch i ldren ) )

( ( equal head ’ ”CC” ) ( add−headword−cc ( f i r s t node ) ch i ldren ) )

( t ( add−headword−normal ( f i r s t node ) ch i ldren ) ) ) ) ) )

; Th i s adds headword i n f o r m a t i o n by s e l e c t i n g t h e head c h i l d .

; t e s t e d

( defun add−headword−normal ( head ch i ldren & opt iona l ( mustfind t ) ( nt−head n i l ) )

( l e t ∗ ( ( head−for−output ( i f nt−head nt−head head ) )

( basehead ( nocompliment head ) )

( l e f t− to− r ight ( get−direc t ion basehead ) )

( p r i o r i t y− l i s t ( g e t−p r i o r i t y− l i s t basehead ) )

( search−chi ldren ( i f l e f t− to− r ight ch i ldren ( reverse ch i ldren ) ) )

( found ( remove n i l

( mapcar # ’ ( lambda ( item )

( find item search−chi ldren : key # ’ f i r s t : t e s t # ’ equal ) )

p r i o r i t y− l i s t ) ) ) )

( cond

( found ; f ound t h e headword

( append

( cons head−for−output

( l i s t ( second ( f i r s t found ) ) ( th i rd ( f i r s t found ) ) ) )

ch i ldren ) )

( ( not mustfind ) n i l ) ; headword not found and not ne e ded

( t ( append ; headword not found assume t h e f i r s t / l a s t c h i l d

( cons head−for−output

( l i s t ( second ( f i r s t search−chi ldren ) )

( th i rd ( f i r s t search−chi ldren ) ) ) )

ch i ldren ) ) ) ) )

( defun add−headword−cc ( head ch i ldren )

( append

( cons head

( l i s t ( second ( f i r s t ch i ldren ) ) ( th i rd ( f i r s t ch i ldren ) ) ) )

ch i ldren ) )


; t e s t e d

( defun add−headword−np ( head ch i ldren )

( cond ( ( last−word−pos head ch i ldren ) )

( ( add−headword−normal ’ ”FAKE−1” ch i ldren n i l head ) )

( ( add−headword−normal ’ ”FAKE−2” ch i ldren n i l head ) )

( ( add−headword−normal ’ ”FAKE−3” ch i ldren n i l head ) )

( t ( append

( l i s t head

( second ( f i r s t ( l a s t ch i ldren ) ) )

( th i rd ( f i r s t ( l a s t ch i ldren ) ) ) )

ch i ldren ) ) ) )

; t e s t e d

( defun last−word−pos ( head ch i ldren )

( l e t ( ( l a s t c h i l d ( f i r s t ( l a s t ch i ldren ) ) ) )

(when ( equal ’ ”POS” ( th i rd l a s t c h i l d ) )

( append

( l i s t head ( second l a s t c h i l d ) ( th i rd l a s t c h i l d ) )

ch i ldren ) ) ) )

( defun get−direc t ion ( head )

( multiple−value−bind ( r e s u l t found ) ( gethash head get−direct ion−ht )

( progn ( when ( not found ) ( format t ”Oops : get−direc t ion ˜ a −> NULL˜%” head ) )

r e s u l t ) ) )

( defun g e t−p r i o r i t y− l i s t ( head )

( multiple−value−bind ( r e s u l t found ) ( gethash head head−match−ht )

( progn ( when ( not found )

( format t ”Oops : g e t−p r i o r i t y− l i s t ˜ a −> NULL˜%” head ) )

r e s u l t ) ) )

( s e t f

( gethash ’ ”ADJP” head−match−ht )

’ ( ”NNS” ”QP” ”NN” ”$” ”ADVP” ” J J ” ”VBN” ”VBG” ”ADJP”

” J JR ” ”NP” ”NPB” ” J J S ” ”DT” ”FW” ”RBR” ”RBS” ”SBAR” ”BR” )

( gethash ’ ”ADVP” head−match−ht )

’ ( ”RB” ”RBR” ”RBS” ”FW” ”ADVP” ”TO” ”CD”

” JJR ” ” J J ” ”IN” ”NP” ”NPB” ” J J S ” ”NN” )

( gethash ’ ”CONJP” head−match−ht ) ’ ( ”CC” ”RB” ”IN” )

( gethash ’ ”LST” head−match−ht ) ’ ( ”LS” ” ˆ ” )

( gethash ’ ”NAC” head−match−ht ) ’ ( ”NN” ”NNS” ”NNP” ”NNPS” ”NP” ”NPB” ”NAC”

”EX” ”$” ”CD” ”QP” ”PRP” ”VBG” ” J J ” ” J J S ” ” J JR ” ”ADJP” ”FW” )


( gethash ’ ”FAKE−1” head−match−ht ) ’ ( ”NN” ”NNP” ”NNPS” ”NNS” ”NX” ”POS” ” J JR ” )

( gethash ’ ”FAKE−2” head−match−ht ) ’ ( ”NP” ”NPB” )

( gethash ’ ”FAKE−3” head−match−ht ) ’ ( ”$” ”ADJP” ”PRN” ”CD” ” J J ” ” J J S ” ”RB” ”QP” )

( gethash ’ ”FRAG” head−match−ht ) n i l

( gethash ’ ”INTJ” head−match−ht ) n i l

( gethash ’ ”PRN” head−match−ht ) n i l

( gethash ’ ”UCP” head−match−ht ) n i l

( gethash ’ ”PP” head−match−ht ) ’ ( ”IN” ”TO” ”VBG” ”VBN” ”RP” ”FW” )

( gethash ’ ”PRT” head−match−ht ) ’ ( ”RP” )

( gethash ’ ”QP” head−match−ht )

’ ( ”$” ”IN” ”NNS” ”NN” ” J J ” ”RB” ”DT” ”CD” ”NCD” ”QP” ” J JR ” ” J J S ” )

( gethash ’ ”RRC” head−match−ht ) ’ ( ”VP” ”NP” ”NPB” ”ADVP” ”ADJP” ”PP” )

( gethash ’ ”S” head−match−ht )

’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )

( gethash ’ ”SG” head−match−ht )

’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )

( gethash ’ ”SBAR” head−match−ht )

’ ( ”WHNP” ”WHPP” ”WHADVP” ”WHADJP” ”IN” ”DT” ”S” ”SG” ”SQ” ”SINV” ”SBAR”

”FRAG” )

( gethash ’ ”SBARQ” head−match−ht ) ’ ( ”SQ” ”S” ”SG” ”SINV” ”SBARQ” ”FRAG” )

( gethash ’ ”SINV” head−match−ht )

’ ( ”VBZ” ”VBD” ”VBP” ”VB” ”MD” ”VP” ”S” ”SG” ”SINV” ”ADJP” ”NP” ”NPB” )

( gethash ’ ”SQ” head−match−ht ) ’ ( ”VBZ” ”VBD” ”VBP” ”VB” ”MD” ”VP” ”SQ” )

( gethash ’ ”TOP” head−match−ht ) ; C o l l i n s didn ’ t have a t o p c a t e g o r y h e r e ??

’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )

( gethash ’ ”VP” head−match−ht )

’ ( ”TO” ”VBD” ”VBN” ”MD” ”VBZ” ”VB” ”VBG” ”VBP” ”VP” ”ADJP” ”NN” ”NNS” )

( gethash ’ ”WHADJP” head−match−ht ) ’ ( ”CC” ”WRB” ” J J ” ”ADJP” )

( gethash ’ ”WHADVP” head−match−ht ) ’ ( ”CC” ”WRB” )

( gethash ’ ”WHNP” head−match−ht ) ’ ( ”WDT” ”WP” ”WP$” ”WHADJP” ”WHPP” ”WHNP” )

( gethash ’ ”WHPP” head−match−ht ) ’ ( ”IN” ”TO” ”FW” )

( gethash ’ ”X” head−match−ht ) n i l )

( s e t f ; t i s l e f t t o r i g h t n i l i s r i g h t t o l e f t

( gethash ’ ”ADJP” get−direct ion−ht ) t

( gethash ’ ”ADVP” get−direct ion−ht ) n i l

( gethash ’ ”CONJP” get−direct ion−ht ) n i l

( gethash ’ ”FRAG” get−direct ion−ht ) n i l

( gethash ’ ”INTJ” get−direct ion−ht ) t

( gethash ’ ”LST” get−direct ion−ht ) n i l

( gethash ’ ”NAC” get−direct ion−ht ) t

( gethash ’ ”FAKE−1” get−direct ion−ht ) n i l


( gethash ’ ”FAKE−2” get−direct ion−ht ) t

( gethash ’ ”FAKE−3” get−direct ion−ht ) n i l

( gethash ’ ”PP” get−direct ion−ht ) n i l

( gethash ’ ”PRN” get−direct ion−ht ) t

( gethash ’ ”PRT” get−direct ion−ht ) n i l

( gethash ’ ”QP” get−direct ion−ht ) t

( gethash ’ ”RRC” get−direct ion−ht ) n i l

( gethash ’ ”S” get−direct ion−ht ) t

( gethash ’ ”SG” get−direct ion−ht ) t

( gethash ’ ”SBAR” get−direct ion−ht ) t

( gethash ’ ”SBARQ” get−direct ion−ht ) t

( gethash ’ ”SINV” get−direct ion−ht ) t

( gethash ’ ”SQ” get−direct ion−ht ) t

( gethash ’ ”TOP” get−direct ion−ht ) t ; C o l l i n s doesn ’ t say

( gethash ’ ”UCP” get−direct ion−ht ) n i l

( gethash ’ ”VP” get−direct ion−ht ) t

( gethash ’ ”WHADJP” get−direct ion−ht ) t

( gethash ’ ”WHADVP” get−direct ion−ht ) n i l

( gethash ’ ”WHNP” get−direct ion−ht ) t

( gethash ’ ”WHPP” get−direct ion−ht ) n i l

( gethash ’ ”X” get−direct ion−ht ) n i l )

; True i f t h e t r e e has on ly ”−NONE−” l e a v e s

; t e s t e d

( defun has−only−none ( t r e e )

( i f ( atom t r e e ) ( e r r o r ”OOPS!˜% ” ) )

( i f ( atom ( second t r e e ) )

( equal ( f i r s t t r e e ) ”−NONE−” )

( every # ’ has−only−none ( cdr t r e e ) ) ) )

; I f t h e branch has no ”−NONE−” i t i s k e p t

; I f a S node has any d e c e n d e n t s o f ”−NONE−” i t i s changed t o SG .

;

; The r e t u r n i s o f t h e form ( r e s u l t r e t v a l )

; where r e s u l t i s t h e t r e e wi th c h a n g e s made

; and r e t v a l i s t when a −NONE− has been dropped but

; no S has been changed t o SG .

; t e s t e d

( defun drop−none ( t r e e )

( cond ( ( has−only−none t r e e ) ( l i s t n i l t ) )

( ( atom ( second t r e e ) ) ( l i s t t r e e n i l ) )

( t


( l e t ∗ ( ( newkids ( mapcar # ’ drop−none ( cdr t r e e ) ) )

( dropped ( second ( find t newkids : key # ’ second ) ) )

( nonnullkids ( remove n i l ( mapcar # ’ f i r s t newkids ) ) ) )

( i f ( and dropped ( equal ”S” ( f i r s t t r e e ) ) )

( l i s t ( cons ”SG” nonnullkids ) n i l )

( l i s t ( cons ( f i r s t t r e e ) nonnullkids ) dropped ) ) ) ) ) )

( defun process ( t r e e output )

( output− for−col l ins output 0

( add−headword

( f i r s t ( add−npb

( add−compliment

( f i r s t ( drop−none t r e e ) )

)

) )

)

) )

( defun doi t ( )

( with−open−file ( output ” wsj . c o l l i n s ” : d i r e c t i o n : output )

( with−open−file ( input ” w s j t r a i n . combined” : d i r e c t i o n : input )

( l e t ( ( s t a r t ( get−universal−time ) )

( f i l e− s i z e 5 0 0 0 0 ) )

( do ( ( t r e e ( read input ) ( read input n i l ’ eof ) )

( sentence 1 ( 1 + sentence ) ) )

( ( equal t r e e ’ eof ) ( format t ” ˜ c100% complete ˜% ” #\CR) )

( progn

(when ( zerop ( mod sentence 5 0 ) )

( l e t ( ( sec ( ∗ (− f i l e− s i z e sentence )

( / (− ( get−universal−time ) s t a r t ) sentence ) ) ) )

( format t ” ˜ c ˜ f% complete ˜ a remaining ”

#\CR

( / sentence ( / f i l e− s i z e 1 0 0 ) )

( cond

( ( > sec 3 6 0 0 )

( format n i l ” ˜ d hour ˜ : P” ( round ( + 0 . 5 ( / sec 3 6 0 0 ) ) ) ) )

( ( > sec 6 0 )

( format n i l ” ˜ d minute ˜ : P” ( round ( + 0 . 5 ( / sec 6 0 ) ) ) ) )

( t ( format n i l ” ˜ d second ˜ : P” ( round ( + 0 . 5 sec ) ) ) ) ) ) ) )

( process t r e e output ) ) ) ) ) ) )

; ( d o i t )


C.4.2 Deriving a grammar

Collins’ parser uses an explicit grammar to avoid generating edges that will inevitably have

a probability of zero. Code to derive this grammar is given below:

; ; F inds e v e r y p o s s i b l e p a r e n t o f e v e r y NT.

; ( s e t f ∗PRINT−PRETTY∗ n i l )

( defconstant nts−plus1 ( 1 + ∗num−nts ∗ ) )

( defun take ( n l ) ( i f ( zerop n ) n i l ( cons ( car l ) ( take ( 1− n ) ( cdr l ) ) ) ) )

( defun drop ( n l ) ( i f ( zerop n ) l ( drop ( 1− n ) ( cdr l ) ) ) )

( defvar l e f t−data ( make−hash−table : s i z e nts−plus1 : t e s t # ’ equal ) )

( defvar l e f t− f p

( open ” l e f t ” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )

( defvar right−fp

( open ” r i g h t ” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )

( defvar unary−fp

( open ”unary” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )

( defun p r o c e s s− l e f t ( l nt head−tag )

( d o l i s t ( item l )

( format l e f t− f p ” ˜ a ˜ a ˜ a˜%” nt head−tag item ) ) )

( defun process−r ight ( l nt head−tag )

( d o l i s t ( item l )

( format right−fp ” ˜ a ˜ a ˜ a˜%” nt head−tag item ) ) )

; F inds t h e c h i l d with t h e r i g h t word / t a g headword

( defun find−head ( headtag word tag ch i ldren & opt iona l ( l e f t n i l ) )

( i f ( null ch i ldren ) ( warn ( format n i l ”Find−head : Children n u l l ˜ a ˜ a” word tag ) ) )

( l e t ( ( c h i l d ( f i r s t ch i ldren ) ) )

( i f ( and ( equal word ( second c h i l d ) )

( equal tag ( th i rd c h i l d ) ) )

( l i s t l e f t headtag ( f i r s t c h i l d ) ( cdr ch i ldren ) )

( find−head headtag word tag ( cdr ch i ldren )

( append l e f t ( l i s t ( car ch i ldren ) ) ) ) ) ) )

; Adds e v e r y c h i l d / p a r e n t p a i r t o t h e hash t a b l e

( defun process ( parent ch i ldren )

(when ( consp ( f i r s t ch i ldren ) )


( progn

( l e t ( ( head ( find−head ( f i r s t parent ) ( second parent ) ( th i rd parent ) ch i ldren ) ) )

( progn

( format unary−fp ” ˜ a ˜ a˜%” ( thi rd head ) ( f i r s t parent ) )

( p r o c e s s− l e f t ( mapcar # ’ f i r s t ( f i r s t head ) )

( second head ) ( th i rd head ) )

( process−r ight ( mapcar # ’ f i r s t ( fourth head ) )

( second head ) ( th i rd head ) ) ) )

( mapcar # ’ ( lambda ( c h i l d ) ( process ( take 3 c h i l d ) ( drop 3 c h i l d ) ) )

ch i ldren ) ) ) )

; ( t r a c e p r o c e s s f ind−head )

; Saves t h e hash t a b l e a s t h e f i l e p a r e n t s

( defun output ( )

( c lose l e f t− f p )

( c lose right−fp )

( c lose unary−fp ) )

( with−open−file ( f i l e ” wsj . c o l l i n s ” : d i r e c t i o n : input )

( l e t ( ( s t a r t ( get−universal−time ) )

( f i l e− s i z e 5 0 0 0 0 ) )

( do ( ( t r e e ( read f i l e ) ( read f i l e n i l ’ eof ) )

( sentence 1 ( 1 + sentence ) ) )

( ( equal t r e e ’ eof )

( progn ( format t ” ˜ c100% complete ˜ % ” #\CR)

( output ) ) )

( progn

(when ( zerop ( mod sentence 5 0 ) )

( l e t ( ( sec ( ∗ (− f i l e− s i z e sentence )

( / (− ( get−universal−time ) s t a r t ) sentence ) ) ) )

( format t ” ˜ c ˜ f% complete , ˜ a remaining ”

#\CR

( / sentence ( / f i l e− s i z e 1 0 0 ) )

( cond

( ( > sec 3 6 0 0 )

( format n i l ” ˜ d hour ˜ : P” ( round ( + 0 . 5 ( / sec 3 6 0 0 ) ) ) ) )

( ( > sec 6 0 )

( format n i l ” ˜ d minute ˜ : P” ( round ( + 0 . 5 ( / sec 6 0 ) ) ) ) )

( t ( format n i l ” ˜ d second ˜ : P” ( round ( + 0 . 5 sec ) ) ) ) ) ) ) )

( process ( take 3 t r e e ) ( drop 3 t r e e ) ) ) ) ) )

254

C.5 Processing bigrams

C.5.1 Counting bigrams

#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <stdio.h>
#include "amalloc.h"

#define fudge 5               /* Add to every malloc to fit quirky cases in */
#define maxval 100000
#define maxWordSize 560
#define quote (char)0x22      /* " */
#define numLines 407836244
#define num_rows (numDendro - 1)
#define progress (numLines / 1000)
#define prog_dist (num_rows / 990)
#define max(a, b) ((a > b) ? a : b)
#define tag_offset numFeatures  /* counts[tag_offset..4001] is for tags */

double lowProb;
int neighbours_to_print = 100;  /* No longer #defined bec. tags have < 100 */
int num_cols = 0;               /* Number of dimensions after SVD */
int num_dist_cols = 0;          /* Number of words to consider close */
int numDendro = 0;              /* Read from command line */
int numTagFeatures = 0;         /* Number of features to reserve for tags */
int windowSize = 0;             /* Remember to add one to skip the center */
int numFeaturesTotal = 0;       /* Should be 4001 - always add 1 to desired */
int numFeatures = 0;            /* numFeaturesTotal - numTagFeatures */
int UNKNOWNWORD = -1;           /* No longer const, set by load_convert */
const int numWords = 200000;
const int DEBUG = 0;
int i = 0;
int j = 0;

typedef struct {
  double dist;
  int word;
} dist_entry;

void load_features(int *features);
int load_convert(char **word_strings, int numDendro);
void load_neighbours(double **neighbours);
void scale_neighbours(double **neighbours);
void seed_distance(dist_entry **dist);
void calc_distance(dist_entry **dist, double **neighbours);
double calc_a_dist(int x, int y, dist_entry **dist, double **neighbours);
void check(dist_entry *dist);   /* Check isort (and, partially, some others) is working */
void print_neighbours(dist_entry **neighbours, char **word_strings, int k);
void derive_counts(const int *features, const dist_entry **dist,
                   int *window, char **word_strings, int **counts,
                   int *seen, int windowSize, const char *corpus_name, int left);

/* Pseudocode
   initialise window[0..windowSize]
   i = windowSize - 1
   while corpus {
     if words[window[i]] {
       for j = 0; j < windowSize; j++
         if features[window[j]]
           counts[i][j]++
     }
     window[i] = $
     if ++i == windowSize, i = 0;
   }
*/

void load_features(int *features) {
  FILE *FREQ;
  int i, count, word;

  FREQ = fopen("words.freq", "r");
  assert(FREQ);
  for (i = 1; i < numFeatures; i++) {
    fscanf(FREQ, "%d %d", &count, &word);
    features[word] = i;
  }
  fclose(FREQ);
  fprintf(stderr, "Loaded frequency information\n");
}

int load_convert(char **word_strings, int numDendro) {
  FILE *CONVERT;
  int num = 0, numCorrect = 1, oldNum = 1;
  char *curWord;
  char *line;
  char *number;

  line = malloc(maxWordSize * 2);
  number = malloc(16);
  CONVERT = fopen("convert", "r");
  assert(CONVERT);
  fgets(line, maxWordSize, CONVERT);
  number = strtok(line, " ");
  num = atoi(number);
  oldNum = num - 1;
  numCorrect = 0;
  while (numCorrect < numDendro && !feof(CONVERT)) {
    curWord = strtok(NULL, "\n");
    // if (*curWord == '\\') {     // ugly hack, currently disabled
    //   memmove(curWord+2, curWord+1, (strlen(curWord)));
    //   curWord[1] = '\\';
    // }
    // if (*curWord == '|') {
    //   memmove(curWord+4, curWord+1, (strlen(curWord)));
    //   curWord[0] = 'P'; curWord[1] = 'I';
    //   curWord[2] = 'P'; curWord[3] = 'E';
    // }
    // if (*curWord == quote) {
    //   memmove(curWord+5, curWord+1, (strlen(curWord)));
    //   curWord[0] = 'Q'; curWord[1] = 'U';
    //   curWord[2] = 'O'; curWord[3] = 'T';
    //   curWord[4] = 'E';
    // }
    if (num != oldNum) {
      word_strings[num] = malloc(strlen(curWord) + 1);
      assert(word_strings[num]);
      strcpy(word_strings[num], curWord);
      if (0 == strcmp(curWord, "UNKNOWNWORD")) {
        UNKNOWNWORD = num;
      }
    }
    if (DEBUG) printf("%s\n", curWord);
    fgets(line, maxWordSize, CONVERT);
    oldNum = num;
    number = strtok(line, " ");
    num = atoi(number);
    numCorrect++;
  }
  return numCorrect;
}

void load_neighbours(double **neighbours) {
  FILE *fp;
  double val;
  char *word_str;
  int cur_row, cur_col;

  word_str = malloc(max(400, maxWordSize) * sizeof(char));
  fp = fopen("output", "r");   /* Format: word val1 ... val50 */
  fscanf(fp, "%s", word_str);
  for (cur_row = 1; !feof(fp) && (cur_row <= num_rows); cur_row++) {
    for (cur_col = 0; cur_col < num_cols; cur_col++) {
      fscanf(fp, "%lf", &val);
      neighbours[cur_row][cur_col] = val;
    }
    fscanf(fp, "%s", word_str);
  }
  free(word_str);
  scale_neighbours(neighbours);
  fprintf(stderr, "Loaded neighbour information\n");
}

void scale_neighbours(double **neighbours) {
  double sum;
  int cur_row, cur_col;

  for (cur_row = 1; cur_row <= num_rows; cur_row++) {
    sum = 0.0;
    for (cur_col = 0; cur_col < num_cols; cur_col++) {
      sum += fabs(neighbours[cur_row][cur_col]);
    }
    for (cur_col = 0; cur_col < num_cols; cur_col++) {
      neighbours[cur_row][cur_col] /= sum;
    }
  }
}

void seed_distance(dist_entry **dist) {
  int x, y;

  for (x = 1; x <= num_rows; x++) {
    for (y = 0; y < num_dist_cols; y++) {
      dist[x][y].dist = maxval;
      dist[x][y].word = -1;
    }
  }
}

/* Insert cur_dist/word into the sorted list dist */
void isort(dist_entry *dist, double cur_dist, int word) {
  int i = 0;

  for (i = 0; (i < num_dist_cols) && (dist[i].dist < cur_dist); i++) { };
  if (i < num_dist_cols) {
    memmove(&dist[i+1], &dist[i], sizeof(dist_entry) * (num_dist_cols - i - 1));
    dist[i].dist = cur_dist;
    dist[i].word = word;
  }
}

void calc_distance(dist_entry **dist, double **neighbours) {
  double d;
  float curPercent = 0.0;
  int x, y;
  int curLine = 0;

  seed_distance(dist);
  for (x = 1; x <= num_rows; x++, curLine++) {
    if (curLine >= prog_dist) {
      curLine = 0;
      fprintf(stderr, "\rSorting: %.1f%%", curPercent);
      fflush(stderr);
      curPercent += 0.1;
    }
    for (y = 1; y <= num_rows; y++) {
      d = calc_a_dist(x, y, dist, neighbours);
      isort(dist[x], d, y);
      check(dist[x]);
    }
  }
}

void check(dist_entry *dist) {
  int cur_w = dist[0].word;
  int i = 1;

  for (i = 1; (i < num_dist_cols) && (dist[i].word != -1); i++) {
    assert(dist[i].dist >= dist[i-1].dist);
    assert(cur_w != dist[i].word);
    cur_w = dist[i].word;
  }
}

/* Print the k closest neighbours for each word to the file "neighbours" */
void print_neighbours(dist_entry **neighbours, char **word_strings, int k) {
  int word;
  int i;
  FILE *fp = fopen("neighbours", "w");

  for (word = 1; word <= num_rows; word++) {
    for (i = 0; i < k; i++) {
      fprintf(fp, "%s(%f) ", word_strings[neighbours[word][i].word],
              neighbours[word][i].dist);
    }
    fprintf(fp, "\n");
  }
}

/* Returns the (Euclidean) distance between words x and y */
double calc_a_dist(int x, int y, dist_entry **dist, double **neighbours) {
  double d = 0.0;
  double sum = 0.0;
  int i;

  for (i = 0; i < num_cols; i++) {
    d = neighbours[x][i] - neighbours[y][i];
    d *= d;
    sum += d;
  }
  return sqrt(sum);
}

/* Words are the words to count, the rows
   features are the things to look for, the columns
   window is where the corpus passes through
   word_strings is used for debugging/output
   counts is where the results are stored
   seen is false the first time we see a word, to deal with unknown words
   windowSize is how far to look
   startpos is where to start the cursor -- to enable the same code to do both
   left and right
*/
void derive_counts(const int *features, const dist_entry **dist,
                   int *window, char **word_strings, int **counts,
                   int *seen, int windowSize, const char *corpus_name,
                   int left) {
  FILE *CORPUS = NULL;
  float curPercent = 0.0;
  int i = 0, j = 0, curLine = 0;
  int word = -1;
  int pos = windowSize - 1;
  int currentWord;
  int featureWord;
  int alsoDoUnknown;
#ifdef NEIGHBOURS
  int k = 0;
#endif

  /* i is the loc that we're going to put the word into */
  /* j is a loop counter for looking at the feature word */
  /* pos is the loc of the word we're looking for neighbours of */
  /* curLine is for debugging, current line in the corpus */
  /* currentWord is the word in position pos
     -- for neighbours it is a neighbour of this word */
  /* actualCurrentWord is only used in neighbours, it is the word in position pos
     -- i.e. same meaning as currentWord in the non-neighbour case */
  /* featureWord is the word in position j, it co-occurs with currentWord */
  /* alsoDoUnknown is true iff this is the first time we've seen the current word */

  CORPUS = fopen(corpus_name, "r");
  assert(CORPUS);
  if (left) {
    for (i = 0; i < windowSize; i++) {
      window[i] = 0;
    }
    i = 0;
  } else {  /* Unlike left windows, we first have to load the whole context */
    for (i = 0; i < windowSize; i++) {
      fscanf(CORPUS, "%d", &word);
      window[i] = word;
    }
    i = windowSize - 1;
  }

  while (!feof(CORPUS)) {
    if (DEBUG) {  /* May break for unknown */
      printf("Found %s\n", word_strings[window[pos]]);
    }
    if (0 == seen[window[pos]]) {
      seen[window[pos]] = 1;
      alsoDoUnknown = 1;
    } else {
      alsoDoUnknown = 0;
    }
    for (j = 0; j < windowSize; j++) {
      if ((j != pos) && (features[window[j]])) {
#ifdef NEIGHBOURS
        for (k = 0; k < num_dist_cols; k++) {
#endif
          /* Sorry about the next line...
             window[pos] is the current word
             window[j] is the feature word which is within the window of the
             current word
             features[window] is needed to map words into columns
             dist[word] gives the 20 (+/-) nearest words to the current word
             counts is the number of times this has occurred
          */
          /* The current input word */
#ifdef NEIGHBOURS
          currentWord = window[pos];
          currentWord = dist[currentWord][k].word;
#else
          currentWord = window[pos];
#endif
          /* Feature that occurred within the window with the current word */
          featureWord = features[window[j]];
          counts[currentWord][featureWord]++;
          if (alsoDoUnknown == 1) {
            currentWord = UNKNOWNWORD;
#ifdef NEIGHBOURS
            currentWord = dist[currentWord][k].word;
#endif
            /* Feature that occurred within the window with the current word */
            featureWord = features[window[j]];
            counts[currentWord][featureWord]++;
          }
#ifdef NEIGHBOURS
        }
#endif
      }
    }
    fscanf(CORPUS, "%d", &word);
    /* Get ready for the next word */
    window[i] = word;
    i++; pos++; curLine++;
    if (i == windowSize) {
      i = 0;
    }
    if (pos == windowSize) {
      pos = 0;
    }
    if (curLine >= progress) {
      curLine = 0;
      fprintf(stderr, "\r%.1f%%", curPercent);
      fflush(stderr);
      curPercent += 0.1;
    }
  }
  fclose(CORPUS);
}

int main(int argc, char *argv[]) {
  double **neighbours;
  dist_entry **dist;
  char **word_strings;
  int **counts;
  int *window;
  int *features;
  int *seen;

#ifdef NEIGHBOURS
  assert(argc == 6);
#else
  assert(argc == 4);
#endif

  /* Next lines will segfault if env not set right */
  numTagFeatures = atoi(getenv("TAG_VECTOR_SIZE"));
  num_cols = atoi(getenv("WORD_VECTOR_SIZE"));
  numDendro = atoi(argv[1]) + 1;     /* + because words[0] unused */
  numFeaturesTotal = atoi(argv[2]) + 1;
  numFeatures = numFeaturesTotal - numTagFeatures;
  windowSize = atoi(argv[3]);
#ifdef NEIGHBOURS
  num_cols = atoi(argv[4]);          /* Number of dimensions after SVD */
  num_dist_cols = atoi(argv[5]);     /* Number of words to consider neighbours */
#endif
  lowProb = 1.0 / numFeatures;

  word_strings = (char **) malloc(sizeof(char *) * numDendro + fudge);
  counts = (int **) amalloc(sizeof(int), NULL, 2, numDendro + fudge, numFeatures + fudge);
  window = (int *) malloc(sizeof(int) * windowSize);
  features = (int *) malloc(sizeof(int) * numWords);
  seen = (int *) malloc(sizeof(int) * numWords);
  neighbours = (double **)
    amalloc(sizeof(double), NULL, 2,
            num_rows + fudge, num_cols + fudge);
  dist = (dist_entry **)
    amalloc(sizeof(dist_entry), NULL, 2,
            num_rows + fudge, num_dist_cols + fudge);
  assert(word_strings);
  assert(window);
  assert(features);
  assert(counts);
  assert(counts[0]);
  assert(dist);
  assert(dist[0]);

  UNKNOWNWORD = numDendro;   /* Will change in load_convert */

  /*# Next, determine which word_strings are words and features
    # a words word is one that it is worth computing a feature vector for (a row)
    # a very words word is one that helps train the feature vector (a column)
  */
  load_features(features);
  numDendro = load_convert(word_strings, numDendro);

#ifdef NEIGHBOURS
  load_neighbours(neighbours);
  calc_distance(dist, neighbours);
  if (neighbours_to_print >= numDendro) {
    neighbours_to_print = numDendro - 1;
  }
  print_neighbours(dist, word_strings, neighbours_to_print);
#endif

  for (i = 0; i < numWords; i++) {
    seen[i] = 0;
  }

  /* load_tags(counts); -- this is done using paste */
  /*# Now load the ability to convert between word_strings and numbers
    # This is used to make pretty graphs
  */
#ifdef TIPSTER
  fprintf(stderr, "Part 1 of 5:\n");
#else
  fprintf(stderr, "Part 1 of 1:\n");
#endif
  derive_counts(features, dist, window, word_strings, counts,
                seen, windowSize, "corpus.aa", 1);
#ifdef MAORI
  /* Maori may prefer to look right instead of left -- currently disabled */
  /* derive_counts(features, dist, window, word_strings, counts,
                   seen, windowSize, "corpus.aa", 0);
  */
#endif
#ifdef TIPSTER
  fprintf(stderr, "Part 2 of 5:\n");
  derive_counts(features, dist, window, word_strings, counts,
                seen, windowSize, "corpus.ab", 1);
  fprintf(stderr, "Part 3 of 5:\n");
  derive_counts(features, dist, window, word_strings, counts,
                seen, windowSize, "corpus.ac", 1);
  fprintf(stderr, "Part 4 of 5:\n");
  derive_counts(features, dist, window, word_strings, counts,
                seen, windowSize, "corpus.ad", 1);
  fprintf(stderr, "Part 5 of 5:\n");
  derive_counts(features, dist, window, word_strings, counts,
                seen, windowSize, "corpus.ae", 1);
#endif

  /*# Print the body
  */
  for (i = 1; i <= numDendro; i++) {
    printf("%c%s%c ", quote, word_strings[i], quote);
    for (j = 1; j < numFeatures; j++) {
      printf("%d ", counts[i][j]);
    }
    printf("\n");
  }
  printf("\n");
  fprintf(stderr, "\n");
  return 0;
}

C.5.2 Scaling bigrams

/* Rewrite of scale.pl in C due to perl running out of ram
 * Since this is a rewrite, some perl coding conventions are used
 **
 * PROGRAM LOGIC:
 * Scales a bigram file so that all vectors are unit.
 * To reduce the dependence on frequent words we experiment with log and/or
 * scaling columns. The do_ variables control this.
 **
 * Each line has the format: "word" num num num ...
 **/

#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include "amalloc.h"

#define max_word_size 600     /* for strcpy, all words fewer bytes than this */
#define max_line_size 65536   /* for fgets, all lines fewer bytes than this */
#define fudge 5               /* Add to every malloc to fit quirky cases in */

#define print_it() \
  if (do_print) { \
    fprintf(stderr, "PRINTING THE MATRIX\n"); \
    for (cur_row = 0; cur_row < num_rows; cur_row++) { \
      fprintf(stderr, "%s ", word_strings[cur_row]); \
      for (cur_column = 0; \
           cur_column < num_cols; cur_column++) { \
        fprintf(stderr, "%f ", cells[cur_row][cur_column]); \
      } \
      fprintf(stderr, "\n"); \
    } \
  }

/* ALGORITHM:
 * arg 1 = bigram file name
 * arg 2 = number of rows
 * arg 3 = number of columns
 * arg 4 = do_add1      -- add one to every cell
 * arg 5 = do_log       -- log every cell
 * arg 6 = do_center    -- center rows on zero
 * arg 7 = do_sqrt      -- use RMS instead of linear
 * arg 8 = do_col_scale -- center columns on zero
 *
 * STEP 0: Read in the data
 * STEP 1: Add one to every cell (simple maximum-entropy countermeasure)
 * STEP 2: Log every cell (to counter zipf)
 * STEP 3: Center the rows on zero
 * STEP 4: Scale columns to 1 (to underemphasise frequent words)
 * STEP 5: Scale rows to 1 (normalise the vectors)
 *
 * I decided to ignore do_sqrt in do_col_scale deliberately. I forget why
 */

int main(int argc, char *argv[]) {
  // Input
  FILE *IN;
  char *input;
  char *inptr;      /* Pointer into input */
  char *spaceptr;   /* Pointer into input -- at loc of next token */
  // Arrays
  float **cells;          // 2D array of all cells
  char **word_strings;    // The words as strings
  // What parts of the code to enable
  int do_add1 = 0;        // Add one to every cell (required for log)
  int do_log = 0;         // Log every cell before processing
  int do_center = 1;      // Center on zero
  int do_sqrt = 1;        // RMS instead of simple addition
  int do_col_scale = 0;   // Equalise all columns
  int do_print = 0;       // Print debugging
  // Sizes of things
  double row_sum;         // Sum of cells in this row
  double col_sum;         // Sum of cells in this column
  int num_cols;           // Number of columns
  int num_rows;           // Number of rows
  int cur_column;         // Current column index being processed
  int cur_row;            // Current row index being processed

  assert(argc == 9);
  fprintf(stderr, "Reading bigrams from %s\n", argv[1]);
  num_rows = atoi(argv[2]);
  num_cols = atoi(argv[3]);
  do_add1 = atoi(argv[4]);
  do_log = atoi(argv[5]);
  do_center = atoi(argv[6]);
  do_sqrt = atoi(argv[7]);
  do_col_scale = atoi(argv[8]);

  /* Sanity */
  assert(0 < num_rows);
  assert(1000000 > num_rows);
  assert(0 < num_cols);
  assert(10000 > num_cols);

  input = (char *) malloc(max_line_size);
  cells = (float **) amalloc(sizeof(float), NULL, 2,
                             num_rows + fudge, num_cols + fudge);
  word_strings = (char **) amalloc(sizeof(char), NULL, 2,
                                   num_rows + fudge, max_word_size + fudge);
  IN = fopen(argv[1], "r");
  assert(IN);

  //
  // STEP 0: Read in the data
  //
  cur_row = 0;
  fprintf(stderr, "Reading data\n");
  for (cur_row = 0; cur_row < num_rows; cur_row++) {
    fgets(input, max_line_size, IN);
    assert(input);
    inptr = input;
    spaceptr = strchr(inptr, ' ');
    assert(spaceptr);
    *spaceptr = '\0';
    strcpy(word_strings[cur_row], inptr);
    inptr = spaceptr + 1;
    for (cur_column = 0; cur_column < num_cols; cur_column++) {
      cells[cur_row][cur_column] = atof(inptr);
      spaceptr = strchr(inptr, ' ');
      assert(spaceptr);
      inptr = spaceptr + 1;
    }
    assert(*inptr == '\n');
  }
  fclose(IN);
  print_it();

  //
  // STEP 1
  // Add one to each cell
  //
  if (do_add1) {
    fprintf(stderr, "Adding one to every cell\n");
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
      for (cur_column = 0;
           cur_column < num_cols; cur_column++) {
        cells[cur_row][cur_column]++;
      }
    }
  }

  //
  // STEP 2
  // Log each cell
  //
  if (do_log) {
    fprintf(stderr, "Computing logarithm for each cell\n");
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
      for (cur_column = 0;
           cur_column < num_cols; cur_column++) {
        cells[cur_row][cur_column] =
          log(cells[cur_row][cur_column]);
      }
    }
    print_it();
  }

  //
  // STEP 3: Center the values on zero
  //
  if (do_center) {
    fprintf(stderr, "Centering rows on zero\n");
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
      row_sum = 0;
      for (cur_column = 0; cur_column < num_cols; cur_column++) {
        row_sum += cells[cur_row][cur_column];
      }
      row_sum /= num_cols;
      for (cur_column = 0; cur_column < num_cols; cur_column++) {
        cells[cur_row][cur_column] -= row_sum;
      }
    }
    print_it();
  }

  //
  // STEP 4: Scale columns to 1 (to underemphasise frequent words)
  //
  if (do_col_scale) {
    fprintf(stderr, "Computing column counts\n");
    for (cur_column = 0; cur_column < num_cols; cur_column++) {
      col_sum = 0;
      for (cur_row = 0; cur_row < num_rows; cur_row++) {
        col_sum += cells[cur_row][cur_column];
      }
      col_sum /= num_rows;
      for (cur_row = 0; cur_row < num_rows; cur_row++) {
        cells[cur_row][cur_column] /= col_sum;
      }
    }
    print_it();
  }

  //
  // STEP 5: Scale rows to 1 (normalise the vectors)
  //
  fprintf(stderr, "Normalising the vectors");
  if (do_sqrt) {
    fprintf(stderr, " using sqrt (RMS)\n");
  } else {
    fprintf(stderr, " using linear\n");
  }
  for (cur_row = 0; cur_row < num_rows; cur_row++) {
    row_sum = 0;   /* reset the accumulator for this row */
    for (cur_column = 0; cur_column < num_cols; cur_column++) {
      if (do_sqrt) {
        row_sum += cells[cur_row][cur_column] *
                   cells[cur_row][cur_column];
      } else {
        row_sum += cells[cur_row][cur_column];
      }
    }
    if (do_sqrt) {
      row_sum = sqrt(row_sum);
    }
    for (cur_column = 0; cur_column < num_cols; cur_column++) {
      cells[cur_row][cur_column] /= row_sum;
    }
  }
  print_it();

  /* Finished */
  for (cur_row = 0; cur_row < num_rows; cur_row++) {
    printf("%s ", word_strings[cur_row]);
    for (cur_column = 0; cur_column < num_cols; cur_column++) {
      printf("%.6f ", cells[cur_row][cur_column]);
    }
    printf("\n");
  }
  return 0;
}


Glossary

Active edge: An edge in the chart that has not yet been completed because only some of

its constituents have been found.

Adjunct: A phrasal argument that is not syntactically required, for example yesterday in

John died yesterday.

Ambiguity: See ambiguous.

Ambiguous: A string of words that can be interpreted in two or more ways.

Arc: An edge in a chart or a graph.

Argument: A phrase associated with a head constituent, for example the ball in John kicked

the ball.

Backoff: Replacing a complex statistical lookup with a simpler lookup in order to increase

counts.
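
As a schematic illustration (not the exact scheme used in this thesis), a backed-off estimate replaces a sparse conditional estimate with a simpler one:

\[ \hat{P}(A \mid B, C) = \begin{cases} P(A \mid B, C) & \text{if } \mathrm{count}(B, C) \text{ is large enough} \\ P(A \mid C) & \text{otherwise.} \end{cases} \]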

Backpropagation: A simple neural network architecture in which errors are backpropa-

gated from the output layer back through the network.

Bag: A set in which duplicates are permitted.

Base-NP: A term invented by Collins that refers to a non-recursive noun phrase. Collins

hypothesised that noun phrases which include other noun phrases have differ-

ent usage rules. For example, The man with the big hat. is a noun phrase but not a

base noun phrase since it contains the constituent (base) noun phrase the man.

Beam: The data structure used to store the candidates in a beam search.

Beam search: A heuristic search in which the number of candidates being considered is

constrained.

Best edge: The edge with the highest probability.

Best first: A search strategy in which the most promising looking node is expanded next.


Bigram: The count of an event involving two items, for example two words co-occurring.

Bigram-statistics: Statistical analysis of bigrams.

Bit: The smallest complete unit of information. The term is used in information the-

ory to refer to the amount of information that is necessary to represent a binary

decision of equal probabilities (for example a coin toss). The term is also used in

Cascade neural networks to refer to a binary output unit.

Bits error: Along with index error, one of the victory criteria used by Cascade-correlation.

Measuring the error in bits means a large error is given the same penalty as a

small error, making the network better at generalising but less accurate.

BNC: The British National Corpus; a 100-million-word corpus of text, commonly used in

natural language processing.

Bottom-up: Starting with the words and building towards some high level structure, com-

pare top down.

Branch: For code: Produce two different versions based on the same initial (root) version,

most commonly an unstable version containing the new features and a stable

version that is known to work.

Candidate hidden unit: Units used by Cascade-correlation; the best one will be incorpo-

rated into the network as a hidden unit, and the rest will be discarded.

Candidate training phase: Part of Cascade Correlation’s learning, in which it trains the

candidate hidden units to be active when the network’s error is at its highest.

Cascade Correlation: A neural network architecture developed by Fahlman and Lebiere

(1990). Cascade is similar to backpropagation with the most visible difference

being much faster learning. It is described in Section 7.4.

Centered vector: A vector in which the sum (mean) of the elements is zero.

CFG: Context free grammar. A grammar that does not require any context outside

that explicitly included in the sequence of words.

Chart: A data structure that stores the set of all edges going from word a to word b.

Child: The sub phrase which will be expanded by the addition of a parent. For instance,

a verb-phrase child could be expanded to form a sentence.

Collocation matrix: A two-dimensional array of co-occurrence counts.


Complement: An argument to a phrase that is necessary for the phrase to be well formed.

For instance, John kissed is missing its complement.

Complete: A finished edge, that is, one that has all of its arguments and so is ready to be

used in other structures.

Conditional independence: Two events are conditionally independent of a third if we can

discard each event in working out the probability of the other. That is: a and b

are conditionally independent iff P(a, b|c) = P(a|c).P(b|c).

Conditional probability: The probability of an event occurring, given that another event is known to have occurred.
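
Formally, for events a and b with P(b) > 0, the standard definition is

\[ P(a \mid b) = \frac{P(a, b)}{P(b)}. \]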

Context: The linguistic information (usually words) surrounding the item being exam-

ined. For example, in a context of thievery, fence is likely to be interpreted differ-

ently.

Context-free grammar: A grammar made up of a finite number of rules. That is, every

nonterminal can always be expanded to other nonterminals or to a word, re-

gardless of the context the nonterminal occurs in.

Corpus: A body of text, particularly in terms of training data.

Cost: The distance between two nodes in a graph. For instance, a graph representing

the flight times between cities would represent the costs in terms of time.

Coverage: The proportion of a language that can be represented using a grammar. Wide-

coverage grammars require a large number of rules which tends to lead to over-

generalising.

Crossing brackets: A coarse measure of a parser's accuracy in which the parser is penalised every time its predicted bracketing crosses the correct answer. The bracketing will cross when the parser gets both the start and the end of a phrase wrong.

It is a useful metric because the WSJ’s variable structure sometimes makes pre-

cision and recall inaccurate while a crossed bracket is always an error.

Cutoff: An explicit boundary where any phrase outside the boundary (i.e. too low prob-

ability) is discarded. (See also cutoff threshold.).

Cutoff threshold: In searching, the probability at which we say a given interpretation does

not look promising enough and discard it rather than expanding it further (see

cutoff).


Decomposed: Choosing a particular path for transforming a nonterminal, such as decid-

ing a sentence should be decomposed into a noun phrase followed by a verb

phrase instead of any other interpretation.

Dendrogram: A tree hierarchy, mainly used here in relation to words.

Density plot: A close relative of a scatter-plot where instead of plotting individual points,

the whole graph is coloured, and the colour is brighter where there are more

points nearby. This approach is extremely effective at showing trends where

there are so many points that outliers would otherwise mask the trend.

Dependency: A phrase that is part of another phrase. For instance the ball in John kicked the

ball.

Distance metric: A measure of the distance between two phrases, used because explicitly giving

the intermediate phrases would be too much context and reduce all counts to

near zero.

Distributed: A representation in which the meaning is not captured by a single symbol but

by the combination of a number of nominally independent items. For instance,

the word vectors are distributed because the values of any dimension in the

vector can be changed to give a new meaning.

Earley parsing: A simple top-down parsing algorithm in which the parser can either shift

the current input word, or reduce it to a nonterminal. The effect of this operation

is somewhat similar to that of an LR parser.

Edge: In parsing literature: A phrase, either complete or with some parts still unex-

panded or; in graph literature, an arc.

Eigen decomposition: The decomposition of a square matrix into eigenvalues and eigen-

vectors such that the eigenvalues times the eigenvectors gives the original ma-

trix.

Eigenvalue: The amount a particular eigenvector is scaled when transformed.

Eigenvector: A vector that when transformed by a matrix retains its direction but perhaps

not length.
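
The two notions are related by the standard equation for a square matrix A, eigenvector v and eigenvalue \(\lambda\):

\[ A\mathbf{v} = \lambda\mathbf{v}. \]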

Event: The representation of a single transformation under a given grammar model.

For example NP → Det Noun. The event contains all the information that the

grammar model stipulates (lexicalised, distance, etc.) and nothing else.


Event file: A file containing every single event in the treebank. For a statistical parser, this

renders the treebank redundant.

Exact match: The hardest measure of a parser’s accuracy in which the parser is penalised

unless the predicted parse is identical to the gold standard. A single tagging

error (for example NNS instead of NNPS) or extra internal structure (NP (ADJP

(JJ big)) (NN man)) vs (NP (JJ big) (NN man)) will cause the parser to score zero

on the current sentence.

Expand: Replace a phrase by one interpretation of the phrase in the grammar. For in-

stance in a simple context-free grammar, a sentence could be expanded to a noun

phrase followed by a verb phrase. Later, the noun phrase could be expanded to

a determiner followed by a noun.

Expected: The average outcome (the mean, not the mode).

Feature: Properties of grammatical entities, such as count, gender or formality.

Feature vector: My term to refer to orthogonal representations, such as word-space.

Feature words: The words that bigram counts are computed between, rather than the

words we are trying to represent. For instance with may make a good feature

word, since certain classes of words co-occur with it.

Fourgram: An n-gram with four parameters.

Fragment: A small well-behaved subset of the language. The term is also used to refer to

a sequence of words that cannot be parsed into a phrase.

Fringe: The nodes in the search tree which are about to be expanded. Nodes on the

fringe are the current candidates for being in the parse.

Genprob: The name of Collins’ function to compute the probability of any event.

Grammar: A set of rules for deciding if a given sentence is well formed in the language.

Hash key: The result of mapping the event into a simple linear sequence, usually the array

reference.

HBG: History Based Grammar, the grammar formalism used by Black et al.’s parser.

Head constituent: The key sub-phrase in the phrase.

Head nonterminal: The nonterminal category of the head constituent.


Head production: The generation of a phrase’s head constituent. Along with sibling pro-

duction this allows the generation of all parse trees.

Headword: See lexical head.

Heuristic search: Any form of graph search (including parsing) in which the search is

guided towards the goal by evaluating how good each state is and expanding

promising states.

Hidden Markov model: a statistical machine in which not only transitions are probabilis-

tic but also output. HMMs are frequently used in speech recognition as well as

POS tagging. To take POS tagging as an example, observable events (words)

can be used to predict the internal state (the POS tag). They differ from Markov

models in that the internal state is not directly observable.

Hierarchical: A data representation like a tree, so that nodes have children, siblings and a

parent.

HMM: See hidden Markov model.

Homograph: Two different words with the same spelling. The term is a more extreme

version of polysemous in that a word is polysemous if it may be used in different

ways (for example telephone a friend vs answer the telephone) but it is a homograph

if the meanings are unrelated (for example fence the goods vs fence with a foil).

HPSG: Head driven Phrase Structure Grammar. Pollard and Sag’s formalism where

phrases contain linguistically useful information, most obviously the head word

of the phrase. Collins’ probability model takes advantage of HPSG in deciding

what information to discard.

Inactive edge: An edge that has been completed and so is ready for use by other edges

rather than being expanded itself.

Incomplete: An edge that has not yet been completed and so is still looking for neighbour-

ing words or phrases.

Independent: Two events are independent if one occurring does not affect the probability of the other occurring.

Index array: An array of elements referencing into another array. For instance a large array

of all words sorted alphabetically might have an index array with an element for

each possible starting value.


Index error: Along with bits error, one of the victory criteria used by Cascade-correlation.

When using index, getting an output significantly wrong is worse than getting

it just a little wrong. This encourages the network to accurately fit the training

data but makes overfitting easy.

Information Theory: A branch of statistics concerned with the amount of information that

is represented by a fact. Since events do not occur independently, the sequence

of events can usually be used to predict the next event to some extent. The

extent to which we cannot predict the next event is the number of bits necessary to encode its occurrence, although less efficient encodings will require more bits. It is very useful when we wish to eliminate or at least make explicit any redundancy

in the representation.

Inside: The part of the parse that has already been fully built.

Inside probability: The probability of a set of operations occuring, regardless of external

context.

Interpolation: Many functions cannot be represented explicitly but only indirectly through

some sample of input/output pairs. Given such a sample over a certain range,

interpolation is estimating what the output would be for a different input that

is still within the range of sampled inputs, such as midway between two input

values.

Iterative clustering: Performing the clustering multiple times where the input for a run

is the output from the previous run. Useful if the initial clustering only forms

‘hints’ of the clusters that are present, or to merge clusters.

Join: The combination of two phrases to produce a larger phrase.

Kernel function: A mapping between one multi-dimensional space and another (with po-

tentially a different number of dimensions) intended to make more explicit some

property of the data. For example, the classic ‘two spirals’ problem looks very

complicated in a Cartesian space but is linearly separable in a polar space.

Key generation: The process of combining several parameters to generate a hash-key. For

instance, the left grammar hash table is accessed by providing a head and a

parent; which are combined by the key generation algorithm to form a single

array index.

Lexical head: The head word of a phrase, for example kicked in John kicked the ball.


Markov model: A statistical machine in which the system is in an observable state and

will determine the next state probabilistically. Usually, a history of recent states

is encoded into the representation of the current state.

Markovian assumption: The assumption that only a certain amount of history is needed for predicting the next state. While the assumption is often incorrect, it is usually close

enough to correct for practical purposes.

Maximum Likelihood Estimate: The MLE for a parameter(s) is the value for that parame-

ter that maximises the probability of observing the values which have occurred.

For instance, if we observe a coin giving heads 60% of the time, then the MLE

for P(heads) is 0.6.
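
In symbols, the MLE is \(\hat{\theta} = \arg\max_{\theta} P(\mathrm{data} \mid \theta)\); for the coin example, observing, say, 60 heads in 100 tosses gives \(\hat{p}(\mathrm{heads}) = 60/100 = 0.6\).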

MLE: See Maximum Likelihood Estimate.

Mutual information: The amount of information shared between two events x and y, mea-

sured in bits. If x and y are independent then it is zero, if they are identical then

it is the amount of information x contains.
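
A standard formulation for two events (the pointwise case) is

\[ I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}, \]

which is zero when x and y are independent.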

N-gram: The count of an event involving n items, for example a word occurring with a

particular tag, nonterminal, etc.

Naive Bayes: A simple probabilistic classifier based on strong independence assumptions.

Neural network: An adaptive algorithm for approximating arbitrary functions.

NLP: Natural Language Processing; any research in computational linguistics.

NPB: A base (non-recursive) noun phrase.

Object: The part of the sentence that is acted on; the ball in John kicked the ball.

Optimistic: Any search heuristic in which the cost of reaching the goal is always under-

estimated. Optimistic heuristics are useful because the sum of the cost to the

current node plus the estimate to the goal is never more than the true cost to the

goal via the current node.

Order: The amount of context contained in the Markov model.

Output training phase: Part of Cascade Correlation’s learning, in which it trains the weights

to the output units.

Outside: The parts of the parse tree that have not been generated yet. Important because

some states could look good locally but be impossible to transform into a sen-

tence.


Outside probability: The probability that the current state will lead to a goal state.

Overfitting: Training a neural network until it very accurately reproduces the training

data, and as a result generalises poorly. It is generally best to stop training well

before the neural network reproduces the training data in order to maintain

smooth generalisations between training instances. Cascade is especially vul-

nerable to overfitting.

Parent: See parent nonterminal.

Parent constituent: The entire parent phrase.

Parent headword: The lexical head of the parent phrase. Due to the definition of head, this

will be the same as the lexical head of the head constituent.

Parent nonterminal: The nonterminal category of the parent constituent, i.e. the top non-

terminal in the tree.

PCA: See Principal component analysis.

PCFG: Probabilistic Context Free Grammar. Identical to normal context–free grammar

except rules have probabilities assigned to them.

Penn treebank: The treebank of fifty thousand hand-parsed sentences from the WSJ devel-

oped by the University of Pennsylvania; it is also known as the WSJ. This corpus

is used by all current statistical parsers for training data. The size (or frequently

lack of it) of this corpus determines most of the design decisions in building a

parser, and will continue to do so until better unsupervised learning methods

are developed.

Perplexity: The amount of information needed to convey the HMM of the language. Mea-

sured in bits, the perplexity refers to how much information is necessary to

convey the next word in a given language model. It can be informally viewed as

the average number of words which can possibly occur next in a word sequence.
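
One common formulation: if the model assigns a cross-entropy of H bits per word, then

\[ \mathrm{perplexity} = 2^{H}. \]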

Phrase: A branch of a parse tree. Rather than just the nonterminal label, such as NP, the

phrase includes everything in the parse over the span of words.

Polysemous: A word which can be used with different (but semantically related) mean-

ings, including different POS tags.

POS: Part-Of-Speech. The tag given to denote the role of a word, for example kick is a

verb.


Precision: A measure of a parser’s accuracy defined as the percentage of phrases found by

the parser that are considered correct. This is distinct from recall in that a parser

which claims no input ever has any phrases will have perfect precision but zero

recall.
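
In the usual constituent-based (PARSEVAL-style) calculation,

\[ \mathrm{precision} = \frac{\text{number of correct constituents proposed by the parser}}{\text{total number of constituents proposed by the parser}}. \]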

Primed: A word we would be unsurprised to see due to recent context.

Principal component: The most important information. More formally, if the data is repre-

sented by an n-dimensional space then the principal component is the direction (axis) through the space along which the data shows the greatest variance.

Principal component analysis: a method of analysing multivariate data in order to ex-

press their variation in a set of orthogonal components, sorted by the amount of

variance they express.

Probabilistic grammar: Any grammar formalism in which the rules are associated with a

probability that the given rule applies.

Production: The expansion of a grammar rule.

Pseudo-event: A generated event which did not actually occur in the corpus but which is

treated as if it had occurred. Pseudo-events are used to increase counts and to

compensate for events not seen during training.

Recall: A measure of a parser’s accuracy defined as the percentage of phrases present

in the input which were found by the parser. This is distinct from precision in

that a parser which claims every possible word sequence as a phrase will have

perfect recall but nearly zero precision.
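
In the same constituent-based terms,

\[ \mathrm{recall} = \frac{\text{number of correct constituents proposed by the parser}}{\text{number of constituents in the gold-standard parse}}. \]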

Reducing: In an Earley parser, noting that the top items in the stack exactly match a rule

in the grammar and replacing the items by the single item they match.

Regular expression: A formalism for representing simple languages. Regular expressions

are frequently used for complex pattern-matching or substitutions. While they

have only a fraction of the representational power that other grammars possess,

they can be used to solve a surprisingly large number of problems.

RMS: Root Mean Square; a standard statistical technique for combining a list of numbers to give a magnitude. Defined as

\[ x_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} X_i^2}. \]


Serial learning: Learning a series of items in a fixed order. In neural networks this usually

leads to an inability to reproduce early items from the series as their representa-

tions are overwritten with newer items.

Shifting: In an Earley parser, placing the current input word onto the stack (and removing

it from the input).

Sibling production: The generation of dependent siblings to the left and right of the head.

Singular value decomposition: A type of principal component analysis.

Skiplist: A data type similar to a linked list in which multiple next pointers allow faster

traversing.

Smoothing: The process of combining multiple probability estimates into a single esti-

mate.

SOM: Self Organising Map. A neural network architecture developed by Kohonen

(1982) that uses unsupervised learning to extract patterns in its training data.

Stagnate: The term that Cascade-correlation uses to describe when a candidate unit is no longer learning. At this point the

best candidate unit’s weights are locked and it is added to the network. Stagnat-

ing is generally a good indication that the network is still learning, as opposed

to when the network has a timeout.

Statistical parsing: Using a statistical grammar to find the best parse for a sentence, or the

possible parses sorted by probability.

Stop: Collins’ term to declare that a phrase has been completed and should now be

used as a constituent in larger phrases rather than being expanded itself. It is

computed using a special dependency production in which the abstract ‘stop’

phrase is attached to the left and the right.

Sub-event: A specific event that is counted directly in the corpus, rather than one which is

backed off to an approximated count.

Subcategorisation list: The bag of phrases that a phrase needs to have as complements in

order to be complete.

Subject: The key ‘actor’ in the sentence; John in John kicked the ball.

Supervised learning: A learning method used in neural networks where explicit training

data is provided. The training data is in the form of input/output pairs.


Surprise: An information-theory term referring to how unexpected an event is. Surprise

is measured in bits and if an event has a probability p of occurring then the

surprise of it occurring is computed using the formula: − log(p).
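
For example, taking base-2 logarithms, an event with probability p = 1/8 carries a surprise of \(-\log_2 \tfrac{1}{8} = 3\) bits.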

SVD: See singular value decomposition.

SVM: Support Vector Machine. An algorithm for classifying very high-dimensional

data.

TAG: Tree Adjoining Grammar. A lexicalised theory of syntax in which operations,

such as substitution, are applied to trees.

T/G: Tipster and Gutenberg. My corpus derived by concatenating the Tipster corpus

to Project Gutenberg.

Timeout: The term that Cascade-correlation uses to describe when it stops the training process because the network has taken too long to converge to a stable state and may be trapped in a (near)

infinite loop. This is generally an indication that the problem is too hard for

cascade.

Tokenisation: Breaking a sequence of letters into words.

TOP: Collins' highest-level nonterminal structure, introduced to make it easier to distinguish a top-level sentence from sentences that are embedded.

Top-down: Starting with a high-level structure such as sentence and attempting to expand

it into the low-level perceived events (typically words).

Tree: A data structure showing the nonterminal constituents in a particular parse of a

sentence, the output of a parser.

Treebank: An unordered set of (parse) trees.

Trigram: An n-gram in which an event consists of three terms.

Unary: The transformation of one item into another item. Most commonly used here to

refer to parent productions, where a nonterminal chooses a single parent.

Unigram: The simplest n-gram, an event containing exactly one term such as seeing a

particular word.

Unit vector: A vector centered on the origin with a length of one.

Victory error criterion: The point at which the neural network considers it has learned its

training data successfully.


Viterbi: The observation that we are typically only interested in the best interpretation

and can discard any search paths that we know cannot form the best interpreta-

tion (because another interpretation is locally more likely).

Well-formed sentences: A sentence for which the grammar will produce a valid parse; a

valid sentence.

Word space: Any representation of words in which words are ‘nearby’ when they are re-

lated, and ‘distant’ when they are not related.

WSJ: Wall Street Journal, See Penn treebank.

XBAR: A highly recursive grammar formalism. The name comes from a convention of

placing a bar over nonterminals to denote the end of recursion, with the X refer-

ring to the idea that the difference between nonterminals should be abstracted

away as a feature.

Zipf's law: An observation that word frequency follows a power-law distribution. That is, n times the number of words occurring n times is approximately constant.
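
In symbols, if N(n) is the number of word types that occur exactly n times, the statement above is \( n \cdot N(n) \approx \text{constant} \).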


Index

Abney et al. (1999), 210, 213

Allen (1995), 12, 213

Banerjee and Pedersen (2003), 131, 213

Bencini et al. (2002), 210, 213

Bengio and Bengio (2000), 119, 213

Bengio et al. (2003), 119, 213

Bengio (2003), 121, 213

Bies et al. (1995), 213, 221

Bikel (2004), 32, 96, 207, 213, 244

Bikel (2005), 54, 213

Black et al. (1992), 16, 17, 22, 27–29, 31, 46,

213, 279

Bod and Scha (1996), 5, 20, 27, 29–31, 46, 47,

214

Bod (1996), 48, 213

Booth and Thompson (1973), 16, 214

Brooks (1982), 90, 214

Brown et al. (1992), 110, 113, 214

Chapman (1992), 108, 214

Charniak et al. (1993), 182, 214

Charniak et al. (1996), 75, 76, 79, 100, 214

Cheeseman et al. (1990), 131, 214

Chen and Goodman (1996), 39, 214

Chen and Rosenfeld (2000), 39, 214

Chomsky (1965), 2, 214

Choueka and Luisgnan (1985), 115, 214

Christ (1994), 114, 214

Collins (1996), 47–49, 58, 161, 214

Collins (1997), 5, 7, 48, 51, 57, 58, 94, 215

Collins (1999), 5, 20, 29, 46, 48, 51, 58, 95, 161,

164, 215

Copestake and Flickinger (2000), 1, 215

Curran (2004), 211, 215

Earley (1970), 12, 215

Elman (1990), 116, 215

Fahlman and Lebiere (1990), 166, 215, 276

Finch (1993), 68, 109, 121, 171, 215

Gale and Sampson (1995), 37, 215

Garfield and Wermter (2003), 167, 215

Garner (1995), 131, 215

Ginzburg and Sag (2000), 1, 215

Goodman (1996), 48, 215

Goodman (1998), 5, 43, 216

Goodman (2001), 136, 216

Haegeman (1991), 8, 216

Harman (1992), 128, 216

Hart (2005), 128, 216

Hastings (1970), 48, 216

Honkela et al. (1995), 121, 122, 216

Honkela (1997a), 121, 216

Honkela (1997b), 121, 216

Jelinek and Mercer (1980), 39, 216

Joachims (2001), 122, 216

Katz (1987), 39, 216

Klein and Manning (2001a), 153, 216

Klein and Manning (2001b), 43, 217

Klein and Manning (2002), 44, 209, 217

Klein and Manning (2003), 27, 31, 32, 46, 47,

52, 207, 217


Kohonen (1982), 217, 285

Kudo and Matsumoto (2001), 210, 217

Lakeland and Knott (2001), 74, 75, 217

Lakeland and Knott (2004), 67, 217

Lawrence et al. (1996), 212, 217

Lee (2004), 3, 217

Liddle (2002), 117, 217

Lin (1997), 114, 217

Lin (1998), 115, 210, 218

Li (1992), 4, 217

Magerman (1995), 25, 58, 71, 218

Magerman (1996), 5, 218

Manning and Schutze (1999), 45, 218

Marcus et al. (1993), 17, 218

Mayberry III and Miikkulainen (1999), 121,

218

McCallum (1996), 131, 218

Miikkulainen (1993), 117, 218

Miller (1995), 108, 218

Min and Wilson (1998), 210, 218

Ney et al. (1992), 64, 218

Plasmeijer (1998), 90, 218

Pollard and Sag (1986), 9, 10, 218

Powers (2001), 171, 219

Pugh (1989), 84, 85, 219

Rueckl et al. (1989), 190, 219

Schutze (1992), 125, 219

Schutze (1993), 106, 124, 219

Schutze (1995), 125, 126, 219

Schutze (1998), 125, 171, 219

Scha and Bod (2003), 48, 219

Smith (2002), 135, 219

Smrz and Rychly (2002), 113, 114, 219

Stuart et al. (2004), 165, 219

Ushioda (1996), 171, 219

Vapnik (1997), 122, 219

Viterbi (1967), 45, 219

Williams (1992), 91, 220

Wu and Zheng (2000), 39, 220

R Development Core Team (2004), 131, 219

active edge, 12

adjunct, 31, 53, 70, 71, 115, 176

ambiguity, 3, 11, 12, 14, 15, 20, 27, 44, 45, 61,

62, 74, 75, 79, 96, 119, 125, 145

ambiguous, 11, 74

arc, 12

argument, 25, 26, 31, 53, 72, 199

backoff, 7, 33–35, 38, 39, 47, 50, 55–57, 61, 65,

72, 75, 105, 107, 108, 110, 115, 121,

153, 154, 156, 157, 160, 205, 207, 211,

212

backpropagation, 276

bag, 53, 55, 60, 229

base-NP, 52

beam, 64, 83

beam search, 64, 65, 69, 83–85, 88, 89, 155,

209, 229, 233, 275

best edge, 82

best first, 44

bigram, 35, 276

bigram-statistics, 108

bit, 52, 109, 168, 170, 281, 282

bits error, 168

BNC, 77

bottom-up, 11, 40, 42, 51

branch, 91

candidate hidden unit, 166, 285, 286

candidate training phase, 166

Cascade Correlation, 165–167, 276, 281, 285,

286

centered vector, 138

CFG, 8, 9, 16

chart, 12


child, 60

collocation matrix, 124

complement, 10, 53, 54, 70, 71, 115, 176

complete, 12, 60, 62

conditional independence, 56

conditional probability, 21

context, 75

context-free grammar, 2, 8

corpus, 4

cost, 44

coverage, 2

crossing brackets, 20

cutoff, 84

cutoff threshold, 83

decomposed, 2

dendrogram, 109, 137

density plot, 187, 190

dependency, 54

distance metric, 32, 52, 57, 180

distributed, 119

Earley parsing, 7

edge, 12

active, 12

best, 82

complete, 12, 60, 62

inactive, 12

incomplete, 12, 60, 62

eigen decomposition, 135

eigenvalue, 135

eigenvector, 135

event, 20

event file, 54

exact match, 19, 96

expand, 59

expected, 37

feature, 9

feature vector, 119

feature words, 124

fourgram, 124, 133

fragment, 2

fringe, 44, 64

genprob, 72

grammar, 1, 10, 11, 14, 16, 22, 26, 28, 29, 31–

33, 40, 43–45, 47, 61

hash key, 72, 281

HBG, 16, 27–29

head

constituent, 26

headword, 9, 26, 70

nonterminal, 26

production, 25

heuristic search, 40, 43, 44, 84

hidden Markov model, 74, 76

hierarchical, 7

HMM, 76, 77, 79, 109, 119, 182, 280, 283

homograph, 115

HPSG, 10, 11, 17, 22, 25–27, 29, 31, 40, 51, 52,

54, 58, 59, 62, 115

inactive edge, 12

incomplete, 12, 60, 62

independent, 34

index array, 82

index error, 168

Information Theory, 109

inside, 64

inside probability, 41

interpolation, 34

iterative clustering, 143

join, 59

kernel function, 122

key generation, 73


lexical head, 9, 22

Markov model, 45

markovian assumption, 76

Maximum Likelihood Estimate, 37

MLE, 37–39

mutual information, 108–110

n-gram, 34, 152

Naive Bayes, 28

neural network, 116, 117, 119, 122, 165, 167,

285

NLP, 1, 27, 34, 108

NPB, 52, 54, 62

object, 2

optimistic, 44

order, 76

output training phase, 166

outside, 64

outside probability, 41–44

overfitting, 180, 181, 185, 187, 192, 281

parent, 54, 59

parent constituent, 26

parent headword, 26

parent nonterminal, 26

PCA, 124, 127, 131–134, 136–138, 175, 211

PCFG, 16, 22, 25, 26, 28, 29, 31, 48, 49, 54

perplexity, 136

phrase, 8

polysemous, 115

POS, 11, 32, 36, 74–77, 79, 81, 100, 101, 106,

110, 143, 145, 146, 170, 171, 173, 175,

178, 182, 184, 208, 211, 212, 222–225,

231, 280

precision, 19, 284

primed, 4

principal component, 135

Principal component analysis, 124

probabilistic grammar, 4

production, 20

pseudo-event, 156

recall, 19, 284

reducing, 278

regular expression, 130

RMS, 133, 134, 138–142, 146, 168, 171

serial learning, 191, 201

shifting, 278

sibling production, 25

singular value decomposition, 136

skiplist, 84, 85

smoothing, 33, 75

SOM, 121

stagnate, 166

statistical parsing, 1

stop, 62

sub-event, 72

subcategorisation list, 10, 53, 54

subject, 2

supervised learning, 166

surprise, 109, 140

SVD, 136, 143, 144, 148, 171, 190, 191

SVM, 122, 124

T/G, 128–130, 137

TAG, 16

timeout, 166, 285

tokenisation, 129, 130

TOP, 52

top-down, 11, 276

tree, 7

treebank, 17, 29

trigram, 35, 113, 119

unary, 54


unigram, 34

unit vector, 134–136, 138

victory error criterion, 166, 276, 281

Viterbi, 45, 46, 48, 82

well-formed sentences, 2, 7

word space, 135, 146

WSJ, 17–20, 27, 33, 49, 50, 52, 54, 68, 70, 72,

77–79, 94, 105, 106, 109, 110, 113, 114,

119, 121, 127–130, 132, 151, 153, 157,

158, 160, 161, 170, 171, 173, 196–198,

205, 207, 210

XBAR, 16

Zipf’s law, 4, 22, 25, 56, 102, 105, 119, 134, 140
