Lexical Approaches to Backoff in
Statistical Parsing
Corrin Lakeland
a thesis submitted for the degree of
Doctor of Philosophy
at the University of Otago, Dunedin,
New Zealand.
31 May 2005
Abstract
This thesis is an investigation of methods for improving the accuracy of a sta-
tistical parser. A statistical parser uses a probabilistic grammar derived from a
training corpus of hand-parsed sentences. The grammar is represented as a set of
constructions — in a simple case these might be context-free rules. The probabil-
ity of each construction in the grammar is then estimated by counting its relative
frequency in the corpus.
A crucial problem when building a probabilistic grammar is to select an appro-
priate level of granularity for describing the constructions being learned. The
more constructions we include in our grammar, the more sophisticated a model
of the language we produce. However, if too many different constructions are
included, then our corpus is unlikely to contain reliable information about the
relative frequency of many constructions.
In existing statistical parsers two main approaches have been taken to choosing
an appropriate granularity. In a non-lexicalised parser constructions are speci-
fied as structures involving particular parts-of-speech, thereby abstracting over
individual words. Thus, in the training corpus two syntactic structures involving
the same parts-of-speech but different words would be treated as two instances
of the same event. In a lexicalised grammar the assumption is that the individ-
ual words in a sentence carry information about its syntactic analysis over and
above what is carried by its part-of-speech tags. Lexicalised grammars have the
potential to provide extremely detailed syntactic analyses; however, Zipf’s law
makes it hard for such grammars to be learned.
In this thesis, we propose a method for optimising the trade-off between infor-
mative and learnable constructions in statistical parsing. We implement a gram-
mar which works at a level of granularity in between single words and parts-
of-speech, by grouping words together using unsupervised clustering based on
bigram statistics. We begin by implementing a statistical parser to serve as the
basis for our experiments. The parser, based on that of Michael Collins (1999),
contains a number of new features of general interest. We then implement a
model of word clustering, which we believe is the first to deliver vector-based
word representations for an arbitrarily large lexicon. Finally, we describe a series
of experiments in which the statistical parser is trained using categories based on
these word representations.
Acknowledgements
It is difficult to overstate my gratitude to my Ph.D. supervisor, Dr. Alistair Knott.
While I always enjoyed working on my thesis, it was your drive that pushed me
to complete it. Especially throughout my writing period, you provided encour-
agement, sound advice, and an infinite supply of patience. I would never have
finished without you.
I would also like to thank my assistant supervisor, Dr. Richard O’Keefe. No
matter how difficult my problem, you always helped me out. To my surrogate
supervisor, Dr. Peter Andreae. You accepted me turning up in your office as
another student to look after, and still provided encouragement, guidance and
direction with good humour. Similarly, to the many people who helped me when
I was out of my depth: Simon McCallum, Dr. Anthony Robins, Dr. Marcus Frean,
Dr. Michael Albert, Dr. Karsten Worm and Dr. Joshua Goodman.
To my wife, Andrea, for always being there but never asking why it was taking
so long, or when I would start earning some money. . . I love you. And to my
parents, who put up with me saying “I’ll finish next year” more times than I care
to recount, while providing encouragement and support, quite literally at the
end by accepting the job of last-minute thesis editors with far more grace than
I would have.
To Nathan Rountree and Dr. Janet Rountree, for support and many shared meals
over the years. For showing that after spending a week in the debugger it is
great to just sit back, forget about my thesis, and enjoy a glass of wine. Also,
to Nathan, for the many hours we chatted about data mining and other areas of
computer science, and to Janet for being the only example I know of someone
who actually finished a Ph.D.
I am indebted to the many students who kept me company as I whiled away the
years: Peter Vlugter, Richard Mansfield, Andrew Webb, Samson De Jager, Mike
Liddle, Tom Eastman, Robin Sheat, Pont Lurcock, Chris Monteith, and Alicia
Monteith. Without the tea parties, innumerable games of croquet, cricket, xrisk,
and petanque I would have tired of the student life years ago. Of course, we
didn’t only play games: I’m sure the snack-box really did need a mysql database
backend, though it currently escapes me why we all sat down one day and wrote
a Markov-based sentence generator.
Contents
Contents vii
List of Tables xiii
List of Figures xv
1 Introduction 1
1.1 What is parsing for? 1
1.2 Statistical parsing and its problems 4
1.3 Main aims of the thesis 5
1.4 Overview of the thesis 5
2 Statistical parsing 7
2.1 Deterministic grammars 7
2.1.1 Lexical heads 9
2.1.2 HPSG and subcategorisation lists 10
2.2 Deterministic parsing algorithms 11
2.2.1 Chart parsing 12
2.2.2 Problems with deterministic parsing 14
2.3 Probabilistic grammars and corpus-based NLP 15
2.3.1 Building a corpus of hand-parsed sentences 16
2.3.2 Evaluating parser performance 19
2.4 Probabilistic grammar formalisms 20
2.4.1 Lexical semantics in probabilistic grammars 21
2.4.2 Black et al. 27
2.4.3 Exhaustive grammars: Bod and Scha’s approach 29
2.4.4 Klein and Manning’s statistical parser 31
2.5 Backoff, interpolation and smoothing 32
2.5.1 Backoff and interpolation 33
2.5.2 Smoothing 37
2.5.3 Combined interpolation and smoothing techniques 38
2.6 Probabilistic parsing algorithms 40
2.6.1 Inside and outside probabilities 41
2.6.2 Parsing as state space navigation 43
2.6.3 Viterbi optimisation 45
2.7 Three statistical parsers 46
2.7.1 Klein and Manning’s statistical parser 46
2.7.2 Bod’s statistical parser 47
2.7.3 Collins’ statistical parser 48
2.8 Summary and future direction 49
3 A description of Collins’ parser 51
3.1 Collins’ grammar formalism and probability model 51
3.1.1 New nonterminal categories: NPB and TOP 52
3.1.2 A distance metric: adjacency and verbs 52
3.1.3 Preprocessing the Penn treebank 53
3.1.4 Collins’ event representation 54
3.1.5 Backoff and interpolation 55
3.1.6 Smoothing 57
3.2 Collins’ parsing algorithm 58
3.2.1 Dependency productions, and the use of a reference grammar 59
3.2.2 Unary productions 62
3.2.3 Search strategy in Collins’ parsing algorithm 64
3.2.4 Summary of Collins’ parsing algorithm 65
4 A reimplementation of Collins’ parser 67
4.1 The complexity of Collins’ parsing algorithm 67
4.2 Implementation of the treebank preprocessor 70
4.3 Implementation of the probability model 72
4.4 Implementation of a POS tagger 74
4.4.1 The relationship between POS tagging and lexicalised statistical parsing 74
4.4.2 Part of speech tagging using hidden Markov models 75
4.4.3 Implementation Details 77
4.4.4 Results 78
4.5 Implementation of the chart 81
4.6 Implementing add singles stops and beam search 83
4.6.1 Beam Search 84
4.6.2 Skiplists for implementing beam search 85
4.7 Some software engineering lessons learned 89
4.7.1 Programming languages for statistical parsing 90
4.7.2 Revision control 91
4.7.3 Efficiency and debuggability 91
4.7.4 Debugging methodology and test suites 92
4.7.5 Naming of variables and parameters 93
4.8 Results of the parser 94
4.8.1 A re-evaluation of Collins’ parser: precision and recall 94
4.8.2 Evaluation of my preprocessor and parser: precision and recall 96
4.8.3 The complexity of Collins’ and my parsers 97
4.8.4 Evaluation of my parser with my new POS tagger 100
4.8.5 An analysis of the errors in Collins’ parser 100
4.9 Summary 104
5 Thesaurus-based word representation 105
5.1 An example of the benefits of grouping similar words 106
5.2 Criteria for semantic relatedness measures 107
5.2.1 Attention to infrequently occurring words 107
5.2.2 Multidimensional representations of word semantics 107
5.3 A survey of approaches for computing semantic similarity between words 108
5.3.1 Hand-generated thesauri: WordNet and Roget 108
5.3.2 Unsupervised methods for thesaurus generation 108
5.3.3 Finch 109
5.3.4 Brown et al. 110
5.3.5 Smrz and Rychly 113
5.3.6 Lin 114
5.3.7 Elman/Miikkulainen/Liddle 116
5.3.8 Bengio 119
5.3.9 Honkela (Self Organising Maps) 121
5.3.10 Joachims (Support Vector Machines) 122
5.3.11 Schutze 124
5.4 Summary 125
6 A derivation of word vectors 127
6.1 Obtaining a training corpus: Tipster and Gutenberg 127
6.2 Preparing the corpus for clustering 129
6.2.1 Processing the corpus 129
6.2.2 Off-the-shelf tools for clustering: a brief survey 131
6.2.3 Dealing with large matrices 132
6.3 An implementation of Schutze’s algorithm for word clustering 133
6.3.1 Building a table of bigram counts 133
6.3.2 Normalising the bigram table 134
6.3.3 The PCA algorithm 134
6.4 Tuning the clustering process 136
6.4.1 Evaluation methodology 136
6.4.2 Dimensions of the bigram matrix 137
6.4.3 Normalising bigram vectors 138
6.4.4 Choice of feature words 142
6.4.5 Window size 142
6.4.6 Iterated clustering 142
6.4.7 Integrating POS tag representations 143
6.4.8 Windows revisited 146
6.5 Results 146
6.5.1 Results for the first four thousand words 148
6.5.2 Results for the second four thousand words 148
6.5.3 Results for the last four thousand words 151
6.6 Summary 152
7 Improving backoff using word representations 153
7.1 Feasibility study: Noise in backoff 154
7.2 Parsing by grouping nearest-neighbour words 156
7.2.1 Integrating neighbours in parsing 156
7.2.2 Reversing the neighbours 157
7.2.3 How to select a group of neighbours for a word 157
7.2.4 Avoiding swamping counts 159
7.2.5 Summary 160
7.2.6 Results and discussion 160
7.3 Parsing using a neural network probability model 165
7.4 Cascade Correlation 166
7.5 Testing the vector representation of words 166
7.5.1 Mapping words to words 167
7.5.2 Evaluation 168
7.6 A vector representation of tags 170
7.7 A vector representation of nonterminals 173
7.8 Neural network design 176
7.8.1 Training data 176
7.8.2 Neural network parameters 178
7.9 Training the tag network 182
7.9.1 The initial tag network 183
7.9.2 Network architecture 185
7.9.3 Training data 191
7.9.4 Conclusion 195
7.10 Training the other networks 196
7.10.1 Training the prior network 196
7.10.2 Training the top network 197
7.10.3 Training the unary network 198
7.10.4 Training the subcategorisation network 199
7.10.5 Training the dependency network 202
7.11 Final evaluation 203
8 Conclusion 205
8.1 Summary 205
8.1.1 Implementing Collins’ 1997 parser 207
8.1.2 Word and nonterminal representations 208
8.1.3 Experiments in using word vectors for backoff 208
8.2 Further work 209
8.2.1 Reimplementing Collins 209
8.2.2 Word vectors 210
8.2.3 Backoff 211
8.2.4 Using Maximum Entropy methods instead of a neural network 211
8.2.5 Using a different parser 212
8.3 Concluding remarks 212
References 213
A Tags and Nonterminals used 221
A.1 Tags 221
A.2 Nonterminals 221
B Code specifications for my parser 227
B.1 Data structures 227
B.2 The node data structure 230
B.3 The beam data structure 233
C Relevant source code 235
C.1 Build script 236
C.2 R scripts 240
C.3 Funnelweb code 241
C.4 Processing the treebank 243
C.4.1 Transforming the corpus 244
C.4.2 Deriving a grammar 253
C.5 Processing bigrams 255
C.5.1 Counting bigrams 255
C.5.2 Scaling bigrams 267
Glossary 275
Index 288
List of Tables
2.1 Context-free rules needed to parse The cat sat on the big brown mat. 8
2.2 A context-sensitive rule using features for number agreement. Number is a variable that denotes singular and plural. 9
2.3 Some example (simple) HPSG rules 11
2.4 Frequencies of rules in the simple corpus 22
2.5 Probabilities of rules in the simple corpus 23
2.6 Klein and Manning’s parsing accuracy and grammar size for different model complexities 32
3.1 Collins’ unary and subcat events 56
3.2 Collins’ dependency events 56
3.3 Collins’ TOP events 57
3.4 Collins’ Prior events 57
4.1 Part of Speech event representation 77
4.2 Actual position of the tag that should be in first position 79
4.3 Results from Collins’ 1997 parser including my code hooks 95
4.4 My evaluation of the parser in Collins’ thesis (Collins, 1999) 95
4.5 Results from my parser using Collins’ preprocessor 96
4.6 Results from my parser using Collins’ output as a gold standard 97
4.7 A selection of correctly parsed sentences 102
4.8 A selection of poorly parsed sentences 104
5.1 An example of bigram counts 110
6.1 Bigram counts in the process of being normalised 140
6.2 Parameters chosen for the generation of word vectors 146
6.3 A sample of nearest-neighbour words from the first four thousand words 149
6.4 A sample of nearest-neighbour words from the second four thousand words 150
6.5 A sample of nearest-neighbour words from the last four thousand words 151
7.1 Performance of Collins’ 1996 parser over Section 23 before and after integrating neighbour information 161
7.2 Performance of Collins’ 1996 parser over a sub-corpus of two hundred sentences containing rare verbs, before and after integrating tag information 162
7.3 Performance of Collins’ 1999 parser over Section 23 before and after integrating neighbour information 163
7.4 Performance of Collins’ 1999 parser over a sub-corpus of two hundred sentences containing rare verbs, before and after integrating tag information 163
7.5 Learned mapping of words to words from four hundred words 169
7.6 Evaluation of network generalisation after learning from the first four hundred words 170
7.7 Hand encoded categories for POS tags 175
7.8 Tagger accuracy as hidden units are added to the neural network 185
7.9 Tagger accuracy as extra training data is provided to the neural network 191
7.10 Tagger accuracy as extra training data is incrementally provided to the neural network 192
7.11 Performance of different taggers using half unique and half duplicate training data 194
A.1 Tags related to symbols 222
A.2 POS tags used for nouns 222
A.3 POS tags used for verbs 223
A.4 POS tags used for adjectives 224
A.5 POS tags used for pronouns 224
A.6 Other POS tags 225
A.7 The main nonterminal categories 226
B.1 Data structure for phrases 231
B.2 Member functions for phrases 232
B.3 High level API for the beam 233
List of Figures
1.1 An example parse tree 2
2.1 Example parse tree 8
2.2 A chart parser after parsing the large can. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases. 13
2.3 A chart parser after parsing the large can can hold. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases. 14
2.4 Some analyses which a wide-coverage grammar should include 15
2.5 A dubious parse by a deterministic grammar 15
2.6 Example of Probabilistic CFG rules 16
2.7 Example sentence from the Penn treebank 18
2.8 Two phrases showing that WSJ phrases contain little attachment information 18
2.9 A simple corpus of hand-parsed sentences 21
2.10 Parse trees with associated prior and conditional probabilities for John saw Mary 23
2.11 Syntactically valid but unlikely parse of “The man saw the dog with the telescope.” 24
2.12 Likely parse of “The man saw the dog with the telescope.” 24
2.13 Head and sibling productions 25
2.14 A simple lexicalised parse tree 26
2.15 Sample representation of “with a list” in the HBG model, taken from Black, Jelinek, Lafferty, Magerman, Mercer, and Roukos (1992) 28
2.16 Sample DOP grammar for a tiny corpus 30
2.17 Partial parse showing the different areas examined by the inside and the outside probabilities 42
2.18 Two alternate interpretations of saw the girl with the telescope, showing the effect of the Viterbi optimisation 46
2.19 Using DOP to parse Mary likes 47
2.20 Idealised pseudocode for Bod’s statistical parser 48
3.1 Conversion from a WSJ style tree to head driven 54
3.2 Collins’ representation of a left production event 55
3.3 Collins’ smoothing function as implemented 58
3.4 Simplest possible chart parser pseudocode 59
3.5 Pseudocode for combine 59
3.6 Simple example showing how an (incomplete) VP can have a (complete) NP-C added as a right sibling. 60
3.7 Pseudocode for joining two edges (dependency events). 61
3.8 Simple example showing the steps in building an NP constituent the man. 62
3.9 Pseudocode for add singles. The previous parent is demoted to a head, and new parents are generated. 63
3.10 Pseudocode for add stop 63
3.11 Pseudocode for add singles stops 64
3.12 Collins’ parsing algorithm 66
4.1 Simplified data flow diagram for my implementation of Collins’ parser 69
4.2 Actual high-level code for preprocessing the treebank 70
4.3 Pseudocode to implement Magerman’s headword algorithm 71
4.4 The tagger’s control structure 78
4.5 Histogram of the probability assigned to the correct tag, in the cases where the tagger chooses a wrong tag as best 80
4.6 Code for a beam search 85
4.7 A simple skiplist showing the first sixteen items 86
4.8 Code to find the highest node in a skiplist with priority ≤ n 86
4.9 Code for inserting a node into a skiplist 87
4.10 Time taken by the skiplist to insert random elements with different beam sizes 89
4.11 Scatter-plot of time taken by my parser to parse sentences of different lengths 98
4.12 Scatter-plot of log(time) versus log(sentence length) — the gradient is the parser’s complexity 99
4.13 Parsing accuracy versus sentence length. 101
4.14 Two parse trees showing that changing ‘bore’ to ‘fool’ corrects the parse. 101
4.15 Rank of the sentence’s least frequent head word versus parse accuracy 103
5.1 A figure from Finch’s thesis showing the internal structure from several parts of the dendrogram 111
5.2 Finch’s dendrogram generation algorithm 112
5.3 Sample clusters from Brown, deSouza, Mercer, Pietra, and Lai’s algorithm 113
5.4 Pseudocode of Smrz and Rychly’s clustering algorithm 114
5.5 Analysis of the weights in Elman’s network, showing the linguistic knowledge which had been learned 116
5.6 Liddle’s network architecture 118
5.7 Clusters of Liddle’s output 120
5.8 A sample of Honkela’s word map 123
5.9 Two-dimensional version of Schutze’s output 125
6.1 Graph of the frequency of every word in the WSJ against that word’s frequency in T/G 129
6.2 Pseudocode to count all co-occurrences in the corpus 133
6.3 Word dendrogram with RMS scaling 139
6.4 Word dendrogram with log applied to all counts before processing 141
6.5 Dendrogram from iterating SVD four times 144
6.6 Dendrogram where POS tags are used as extra features 145
6.7 Dendrogram using the final parameters (a window of twenty words and tag information). 147
7.1 Graph of noise against parser accuracy 155
7.2 Graph of the log of a word’s frequency versus the distance to its nearest neighbour 158
7.3 Dendrogram of tag representation 172
7.4 Dendrogram of nonterminals produced using only unsupervised training 174
7.5 Hand encoded representation of nonterminals 175
7.6 Dendrogram of the representation of nonterminals 177
7.7 Probability of different outputs from genprob, after outputs of zero are excluded 179
7.8 Plot of errors in the tag network against units used 183
7.9 Scatter plot of output in the tag network using six hundred hidden units against genprob’s output 184
7.10 Graph of the training set and the test set error as hidden nodes are added to the tag network 186
7.11 Scatterplot of output from the tag network against genprob’s output, using just twenty hidden nodes 187
7.12 Density plot of output from the tag network against genprob’s output, after twenty hidden nodes have been added 188
7.13 Scatter plot of output from the tag network against genprob’s output, using unique training data 193
7.14 Scatter plot of output from the tag network against genprob’s output, using a mix of unique and duplicate training data 195
7.15 Scatter plot of output from the unary network against genprob’s output, using 100k training patterns and eighty hidden nodes 200
7.16 Scatter plot of output from the unary network against genprob’s output, using the network trained directly on the raw event file 200
7.17 Scatter plot of output from the subcat network, trained with ten thousand events with ten hidden units 201
8.1 Data flow diagram for the entire thesis (simplified) 206
B.1 Data flow diagram of the parser 227
B.2 Class structure of the parser 228
Chapter 1
Introduction
Computational linguistics has undergone something of a revolution in the last fifteen years.
Until recently, computational linguists were primarily grammar writers, building systems to
process natural language by formulating grammatical rules themselves and implementing
these rules in specialised high level programming languages. This kind of work still goes
on; see for example Copestake and Flickinger (2000); Ginzburg and Sag (2000). However,
there is a new dominant paradigm in natural language processing, which involves the ap-
plication of statistical techniques to learn appropriate rules from large corpora of examples.
The emphasis has moved from directly implementing systems with knowledge of language
to implementing systems which can acquire such knowledge, by various supervised and
unsupervised learning techniques.
This thesis is about statistical natural language processing (statistical NLP). It focuses on
one statistical NLP technique in particular, namely statistical parsing — i.e. finding the most
probable syntactic analysis of an input sentence. I will begin by motivating and introducing
statistical parsing in Sections 1.1 and 1.2. In Section 1.3, I outline the goals of the thesis, and
in Section 1.4 I give an overview of the thesis chapter by chapter.
1.1 What is parsing for?
Parsing is such a well-established topic in NLP that it is easy to think of it as an end in its own
right. However, a parse tree is not intrinsically useful; we only need it as a means to other
ends. It is worth thinking about what these ends are, because they give us useful guidelines
about the kind of parsers we want.
To begin at the beginning, we need a definition of parsing. Parsing is the determina-
tion of the syntactic structure of a sentence. What is a syntactic structure? A useful way of
answering this question is to make reference to a grammar: the abstract mechanism which
is able to generate every sentence in the language under investigation (and only these sen-
tences). The syntactic structure of a sentence is a description of how this grammar gener-
ated this particular sentence. The generative process is a recursive one, and thus the syn-
tactic structure of a sentence is a hierarchical, tree-like structure. We assume that a set of
context-free rules can be used to generate all and only the well-formed sentences in a lan-
guage (Chomsky, 1965). A context-free grammar rule looks something like S→NP, VP. This
means that a sentence (S) can be decomposed into a noun phrase (NP) followed by a verb
phrase (VP). Similarly, the rule VP → Verb, NP means a verb phrase can consist of a verb
followed by another noun phrase. An example of a noun phrase is “The cat”, an example of
a verb is “chased”, and an example of another noun phrase is “a rat”. Sometimes the first
noun phrase is referred to as the subject, while the second is referred to as the object. The
rules involved in this example can be expressed together by drawing a tree, as shown in
Figure 1.1.¹ In this tree the parent node is decomposed into the children nodes.
Figure 1.1: An example parse tree, rendered here in bracketed form:
(S (NP (Det The) (Noun cat)) (VP (Verb chased) (NP (Det a) (Noun rat))))
To go from
a grammar to a parse tree, all the computer has to do is try every combination of grammar
rules and see if they match the input sentence. This can be made moderately efficient with
very little effort, as will be discussed in Section 2.2.
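To make the brute-force idea concrete, the following minimal sketch (my own illustration, not code from this thesis) expands a toy grammar top-down and tries every way of splitting the input words among the children of each rule; the grammar, the lexical entries and the lower-case sentence are assumptions made purely for the example.

    GRAMMAR = {
        "S": [["NP", "VP"]],
        "NP": [["Det", "Noun"]],
        "VP": [["Verb", "NP"]],
        "Det": [["the"], ["a"]],
        "Noun": [["cat"], ["rat"]],
        "Verb": [["chased"]],
    }

    def parses(symbol, words):
        """Yield every tree rooted in `symbol` that covers exactly `words`."""
        if symbol not in GRAMMAR:              # a terminal: must match one word
            if list(words) == [symbol]:
                yield symbol
            return
        for rhs in GRAMMAR[symbol]:
            for children in expand(rhs, words):
                yield (symbol, children)

    def expand(rhs, words):
        """Yield every way of dividing `words` among the symbols in `rhs`."""
        if not rhs:
            if not words:
                yield []
            return
        for i in range(len(words) + 1):
            for first in parses(rhs[0], words[:i]):
                for rest in expand(rhs[1:], words[i:]):
                    yield [first] + rest

    # Capitalisation is ignored for simplicity.
    for tree in parses("S", "the cat chased a rat".split()):
        print(tree)

Trying every split in this way is exponential in the worst case, which is exactly the inefficiency the chart parsers of Section 2.2 are designed to avoid.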
So, what is the point in computing a structure such as this one? As I have already said,
parsing is not an end in itself. The reason for deriving a sentence structure is that it
is required for the process of sentence interpretation; the syntactic structure of a sentence
is essentially a set of instructions for how to derive the meaning of the sentence from the
meanings of its individual words. Once we can compute sentence meanings, the list of
applications is very large — for instance, natural language interfaces, machine translation
systems, intelligent web browsers.
An important point about most of these applications is that a parser is useless
unless it has good coverage: it has to be able to deal with a fairly high proportion of
the user’s input sentences if it is to be a useful tool. Most early work in com-
putational linguistics concentrated on building grammars and parsers for small fragments
of natural language, which did not lend themselves well to practical applications. It is
only quite recently that the creation of wide coverage grammars has been seen as a realistic
goal.
¹ This thesis will not include descriptions of grammatical terms in the text, but such terms are defined in
Appendix A on page 221.
Wide coverage grammars and the problem of ambiguity
The grammars discussed above are the kind that have been used in computational linguis-
tics from the introduction of computer processing of language in the 1950s to the early 1990s.
The problem with such grammars does not surface until one attempts to write a full gram-
mar for a real language. Real language usage includes a huge number of infrequently oc-
curring syntactic constructions, and grammar builders have to include these constructions
if they want their grammar to be able to parse most sentences. However, these constructions
very quickly lead to many incorrect parses for a sentence. Consider the following example
(taken from Lee (2004)):
(1.1) At last, a computer that understands you like your mother. (1985 McDonnell-Douglas ad)
There are at least three different but valid interpretations of this sentence:
1. There is a computer which understands you as well as your mother understands you.
2. There is a computer which understands that you like your mother.
3. There is a computer which understands you as well as it understands your mother.
The problem here is that the input is ambiguous, but humans are so good at resolving
ambiguity that they hardly notice. In practice the difficulty lies in less obvious forms of
ambiguity, which will be discussed in Section 2.2.2; for now it is sufficient to state that
ambiguity is the problem. It seems impossible to develop a grammar that covers a
large chunk of the English language without also producing many incorrect interpretations
for valid sentences.
To address this problem, linguists have applied various strategies. One ap-
proach is to add semantics to grammars and use a database of domain knowledge to try to
infer which parses of a sentence are semantically implausible. For instance, the message in
the third interpretation of Example 1.1 above is semantically rather anomalous. However,
this solution demands very good symbolic knowledge-bases and a good theorem prover,
as well as a good theory of compositional semantics; it is essentially ‘AI-complete’. An-
other approach is to use discourse context to help resolve the sentence’s ambiguity. For
example, if Example 1.1 appears in a context where it has just been mentioned that you like
your mother, this would lend support to reading 2. However, context can never provide a
complete source of information about how to correctly read a sentence.
1.2 Statistical parsing and its problems
Another approach to disambiguation is to annotate the grammar rules with frequency infor-
mation. To return to Example 1.1, it might be the case that the third parse given above uses
rules which are significantly less frequently applied than the other two parses. In this case,
we can reason that this parse is less likely to be the correct one. This frequency information
could be obtained, with some degree of accuracy, by simply asking the linguists writing the
rules to estimate how often each rule is used. Just as a parse tree records which grammatical
rules were applied together, the frequencies of these rules can be combined in such a prob-
abilistic grammar to give an estimate of the likelihood of any given interpretation. The
advantage of this method is that the parser would favour obscure grammatical structures
over no interpretation, and favour common structures over obscure structures. This should
lead to the parser picking the same interpretation as a human.
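As a minimal illustration of combining rule frequencies (my own sketch, with made-up probabilities rather than figures from any corpus used in this thesis), the likelihood of an interpretation can be estimated by multiplying the probabilities of the rules used at each node of its parse tree:

    # Assumed rule probabilities, purely for illustration.
    RULE_PROB = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("Det", "Noun")): 0.7,
        ("NP", ("Noun",)): 0.3,
        ("VP", ("Verb", "NP")): 0.6,
        ("VP", ("Verb",)): 0.4,
    }

    def tree_probability(tree):
        """Multiply together the probability of the rule used at each internal node."""
        label, children = tree
        if isinstance(children, str):          # a preterminal dominating a word
            return 1.0                         # lexical probabilities omitted here
        rhs = tuple(child[0] for child in children)
        p = RULE_PROB[(label, rhs)]
        for child in children:
            p *= tree_probability(child)
        return p

    tree = ("S", [("NP", [("Det", "the"), ("Noun", "cat")]),
                  ("VP", [("Verb", "chased"),
                          ("NP", [("Det", "a"), ("Noun", "rat")])])])
    print(tree_probability(tree))              # 1.0 * 0.7 * 0.6 * 0.7 = 0.294

Two competing parses of the same sentence can then be ranked by comparing these products, which is the sense in which the parser favours common structures over obscure ones.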
However, the approach still has a number of problems. Firstly, all humans, including
linguists, are notoriously bad at estimating probabilities; if this were not so, casinos
would be far less successful. It soon became necessary to derive the probabilities from their
actual frequency of occurrence in a corpus. Secondly, the way probabilities are combined is
normally by multiplying them, just as everywhere else in statistics. But this presupposes
that the two grammatical constructions being combined are independent. However, this is
almost never the case: when we see certain words or phrases we are primed for certain
other phrases, simply because they frequently occur together. The grammatical rules need
to be modified so that their probabilities are dependent on surrounding context. Unfortu-
nately, the second problem compounds the first. It means the corpus from which frequencies
are derived has to be big enough not only to include every type of grammatical structure
multiple times, but to include those structures in every possible
context as well. Of course, corpora containing this much information are simply
not available. The problem of gathering an adequate corpus is compounded by Zipf’s law,
which states that “the nth most frequent word occurs roughly 1/n times the frequency of
the most frequent word” (Li, 1992). This means we have excellent counts for a few words,
but that assembling a corpus with adequate counts for rare words is essentially impossible.
The challenge for a statistical parser is to simplify its grammar rules, and its demands on
context, enough to make maximum use of the data it has available.
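To see how quickly Zipf’s law starves a model of data, the following back-of-the-envelope sketch (with assumed corpus and vocabulary sizes, not measurements from this thesis) estimates how many word types receive fewer than five occurrences when word frequencies follow the 1/n law quoted above:

    N_TOKENS = 1_000_000          # assumed corpus size in words
    N_TYPES = 50_000              # assumed vocabulary size

    # Under Zipf's law the nth most frequent word has frequency proportional to 1/n,
    # so counts are N_TOKENS/(n * H) where H normalises the distribution.
    harmonic = sum(1.0 / rank for rank in range(1, N_TYPES + 1))

    def expected_count(rank):
        return N_TOKENS / (rank * harmonic)

    rare = sum(1 for rank in range(1, N_TYPES + 1) if expected_count(rank) < 5)
    print(f"{rare} of {N_TYPES} word types are expected fewer than 5 times")

With these (assumed) numbers, well over half of the vocabulary is expected to occur fewer than five times, so any statistic conditioned on one of those words rests on almost no evidence.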
Some statistical parsers do not simplify the grammar rules at all, and instead rely on very
good estimation functions. Others are fully configurable in how much context information
they use. It is also interesting to compare some of the earlier statistical parsers to the latest
statistical parsers. The earlier models used less context because they had a smaller corpus
to train with. Overall, it cannot be denied that this trade-off is one of the most important
issues in statistical parsing.
1.3 Main aims of the thesis
There are two main aims. The first is to develop a statistical parser which is easily con-
figurable, and whose code is readable, to serve as the basis for new experiments. Many
existing statistical parsers are not distributed (for example, Bod and Scha (1996), Goodman
(1998), Magerman (1996)), and those which are distributed are often heavily optimised at
the expense of readability and modifiability (for example, Collins (1999)).
The second (and primary) aim of the thesis is to find some way of extending a statistical
parser to address the critical problem of statistical parsing just mentioned — the trade-off
between the frequency of grammatical structures in a training corpus, and their linguistic
usefulness in a parser. The main idea I pursue involves generalising between words so
that high-frequency words will share some counts with low frequency words, and later
extending this idea to generalising between any related events.
1.4 Overview of the thesis
Chapter 2 reviews previous work in statistical parsing. I begin by considering pars-
ing in general and what it is for, and briefly introduce the ‘classical’ non-statistical paradigm
of grammars and parsing. After this I introduce the statistical parsing paradigm, and
summarise what I take to be the key challenges for current work in this field. Hav-
ing summarised the non-statistical approaches, I then describe their statistical counter-
parts. Here I introduce the system I will focus on in my thesis, Collins’ 1997 statistical
parser, and situate it in the context of other statistical parsers.
Chapter 3 describes Collins’ statistical parser in detail. This chapter serves two purposes:
firstly, it provides the background necessary to understand my reimplementation of
Collins in Chapter 4, but equally importantly it provides a description of the operation
of Collins’ parser which in places extends that given by Collins himself.
Chapter 4 describes my reimplementation of Collins’ parser. All aspects of the system are
examined, including data structures, algorithms and any major design decisions and
their impact. An analysis of the performance of Collins’ parser and its reimplementa-
tion is given. The conclusion drawn is that Collins’ parser has a number of problems
that can be reduced by a different algorithm for backoff of rare events.
Chapter 5 discusses the problem of word representation, from the perspective of statistical
parsing. It begins by examining the properties of a good word representation from the
perspective of statistical parsing. Considering these properties, a number of different
techniques are examined and I conclude that the most suitable for my purposes is one
developed by Hinrich Schutze.
Chapter 6 describes my implementation of Schutze’s approach. It begins by explaining how
I modified Schutze’s approach to support a large lexicon. However, the majority of
this chapter summarises the work in adjusting the algorithm, its parameters, and its
input data to produce word vectors that are useful in a statistical parser. Specifically,
I found that previous research concentrated on developing good representations for com-
mon words, whereas our concern here is with developing a good representation for rare
words.
Chapter 7 starts with the word representations derived in Chapter 6 and looks at how they
can be integrated into the statistical parser. As a first approach, they are integrated by
creating a new intermediate level of backoff, where words are grouped into clusters
with similar syntax and semantics. Due to the limitations in this approach, a much
more ambitious approach is proposed, involving the use of a large neural network to
compute event probabilities. The chapter develops an input representation suitable
for training a neural network, demonstrates the feasibility of the approach by training
a neural network based tagger and then works through the development of neural
networks for several different types of probability calculations. At the end a complete
system is presented that uses a neural network hybrid for the probability model, and
the performance of this new hybrid model is analysed.
Chapter 8 summarises the results that have been found and discusses what has been learned
for possible directions for further research.
Chapter 2
Statistical parsing
This chapter is a survey of current work in statistical parsing, presented in a textbook style.
In Sections 2.3 to 2.6 I examine the design and implementation issues present in building
a statistical parser. This covers classical statistical techniques such as backoff, as well as
describing the standard algorithms, such as Earley parsing, in their simplest form. I then describe some in-
fluential statistical parsers, concluding with an overview of the parser developed by Michael
Collins (1997) which forms the basis for my own implementation.
2.1 Deterministic grammars
Determining the meaning of a natural language sentence is commonly assumed to involve
determining its grammatical structure. In this section, we discuss the traditional concep-
tion of grammars within computational linguistics, in order to motivate the presentation of
statistical grammars, and to introduce some core linguistic concepts.
A grammar is in essence a set of rules, which operate together to specify the space of
well-formed sentences in a language. A well-formed sentence is one which a native speaker
of the language accepts as being part of the language. To take some clear examples, Sen-
tence 2.1 is a well-formed English sentence, whereas Sentence 2.2 is ill-formed.
(2.1) The cat sat on the big brown mat.
(2.2) *¹ Mat brown big the on sat cat the.
Grammars typically operate on the assumption that the well-formedness of a sentence is
defined recursively in terms of the well-formedness of its constituent parts. The assumption
is generally that sentences are hierarchical entities, which can be described using trees. The
leaf nodes of these trees are the individual words in the sentence, and the non-terminal
nodes are sequences of adjacent words which ‘group together’ particularly closely, known
as phrases. An example of a syntactic tree is given in Figure 2.1.
¹ By linguistics convention, * denotes a grammatically ill-formed sentence and ? denotes a dubious sentence.
Figure 2.1: Example parse tree, rendered here in bracketed form:
(S (NP (Det The) (N' (Noun cat))) (VP (VP (Verb sat)) (PP (Prep on) (NP (Det the) (N' (Adj big) (N' (Adj brown) (N' (Noun mat))))))))
There are several criteria for these groupings — for instance, in Figure 2.1, the phrase the
big brown mat can be replaced by a single word it and still be well formed. Some common
nonterminals are noun phrases, verb phrases, prepositional phrases and sentences, although
the exact terms vary depending on the formalism being used. These will be abbreviated to
NP, VP, PP, S in future and a comprehensive list is given in Table A.7. For a comprehen-
sive introduction to syntactic analysis which motivates these nonterminals, see for example
Haegeman (1991).
The simplest grammars are collections of context-free grammar (CFG) rules. A context-
free rule is a rule that specifies how a complex phrase decomposes into simpler phrases. For
example, to describe the structure of Example 2.1 above, we would use the set of rules given
in Table 2.1.
S→ NP VP Noun→ cat, mat, . . .
NP→ Det N’ Det→ the, . . .
N’→ Adj N’ Adj→ big, brown, . . .
N’→ Noun Prep→ on, . . .
VP→ VP PP Verb→ sat, . . .
VP→ Verb
PP→ Prep NP
Table 2.1: Context-free rules needed to parse The cat sat on the big brown
mat.
Context sensitive grammar: CFG with features
In modern grammar formalisms, phrases are not atomic entities, but are parametrised using
features. For instance, our grammar needs to be able to distinguish between singular and
plural noun phrases and verb phrases, so as to prevent number mismatches such as the
following:
(2.3) * The dogs is happy.
To enforce agreement, we can say that the phrases NP and VP are each annotated with a
feature called number, which can take the alternative values singular or plural. As well as
number features, NPs and VPs need to be annotated with several other agreement features,
such as person and gender. We can then write CFG rules which use variable binding to
enforce number agreement, as in Table 2.2.
S→ NP(Number) VP(Number) Noun→ cat, mat, . . .
NP→ Det N’ Det→ the, . . .
N’→ Adj N’ Adj→ big, brown, . . .
N’→ Noun Prep→ on, . . .
VP→ VP PP Verb→ sat, . . .
VP→ Verb
PP→ Prep NP
Table 2.2: A context-sensitive rule using features for number agreement.
Number is a variable that denotes singular and plural.
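A minimal sketch of the idea behind Table 2.2 (my own illustration, not machinery from any particular grammar formalism): the S rule only applies when the NP and VP bind the same value of the Number variable, which is what rules out Example 2.3.

    def apply_s_rule(np, vp):
        """S -> NP(Number) VP(Number): both daughters must agree on 'number'."""
        np_cat, np_feats = np
        vp_cat, vp_feats = vp
        if np_cat == "NP" and vp_cat == "VP" and np_feats["number"] == vp_feats["number"]:
            return ("S", {"number": np_feats["number"]})
        return None                            # number mismatch: no S is built

    # "The dogs are happy" versus "* The dogs is happy"
    print(apply_s_rule(("NP", {"number": "plural"}), ("VP", {"number": "plural"})))    # ('S', ...)
    print(apply_s_rule(("NP", {"number": "plural"}), ("VP", {"number": "singular"})))  # None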
2.1.1 Lexical heads
Naturally, the agreement features specified on an NP or VP phrase have to come from some-
where. The assumption is that they come from a lexical item within the phrase. Another key
concept for modern grammar formalisms is the idea that every phrase has a lexical head or
headword. The lexical head of a phrase is the word within the phrase from which its key
syntactic features (such as number) are inherited. Intuitively, the head of a phrase is the
word which contributes the core of its syntactic and semantic characteristics. For instance,
the semantics of the noun phrase the big brown mat will be taken to be primarily a function
of the semantics of its lexical head mat. Decomposing the phrase one level at a time, we
should say more properly that the head of the NP is its N’ child, the head of the N’ child is
its N’ child, and the head of this N’ child is the noun mat. For an in-depth discussion of the
role of heads in modern deterministic grammars, see Pollard and Sag (1986). The notion of
lexical heads is a notion which will be of crucial importance in our discussion of statistical
grammars.
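The recursive descent just described can be pictured with a small sketch (my own illustration; the head-finding rules actually used in the parser are Magerman’s, discussed in Chapter 4). Each nonterminal names which child categories may act as its head, and the headword of a phrase is found by following head children down to a word:

    # Assumed head-child preferences for the toy grammar of Figure 2.1.
    HEAD_CHILD = {"S": ["VP"], "NP": ["N'"], "N'": ["N'", "Noun"],
                  "VP": ["VP", "Verb"], "PP": ["Prep"]}

    def lexical_head(tree):
        """Return the headword of a phrase by recursing into its head child."""
        label, children = tree
        if isinstance(children, str):          # a preterminal dominating a word
            return children
        for wanted in HEAD_CHILD[label]:
            for child in children:
                if child[0] == wanted:
                    return lexical_head(child)

    np = ("NP", [("Det", "the"),
                 ("N'", [("Adj", "big"),
                         ("N'", [("Adj", "brown"),
                                 ("N'", [("Noun", "mat")])])])])
    print(lexical_head(np))                    # mat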
2.1.2 HPSG and subcategorisation lists
Agreement features are just one kind of grammatical feature used to annotate phrases. An-
other important kind of feature is used to distinguish between verbs which select for dif-
ferent complements. For example, the verbs chase and hiccup differ in that chase requires a
direct object to be specified, while hiccup does not:
(2.4) John chased the dog.
(2.5) ? John chased.
And vice versa:
(2.6) John hiccuped.
(2.7) ? John hiccuped the dog.
To represent different verb types, we could simply use different atomic phrase types
(e.g. ‘trans verb’ and ‘intrans verb’). However, there are many properties of verbs which do
not depend on their pattern of complements, such as agreement features. Consequently, it
makes sense to specify a verb’s complements as a feature.
This feature is termed the verb’s subcategorisation list: it is basically an ordered se-
quence of the complements which a verb must accept in order to produce a complete VP.
For example, the subcategorisation (subcat) list of the verb chase would be the list [NP] ,
and the subcategorisation list of hiccup would be the empty list [] ; the subcategorisation list
of a complex verb like put or introduce would be the list [NP, PP] . Subcategorisation lists
considerably reduce the number of rules needed within a grammar. Instead of needing one
rule for each verb type, we can now have a single rule with a recursive structure. This is one
of the key innovations in a grammatical formalism known as Head-Driven Phrase Structure
Grammar (HPSG, Pollard and Sag (1986)). HPSG also has the concept of lexical heads as
just described. Given these two concepts it would be reasonable to write HPSG–like rules
such as those in Table 2.3, although real HPSG rules tend to be extremely complex.
S→ NP, V([])
NP→ Det, N
sat→ V(subcat[PP])
on→ P(subcat[NP])
the→ Det
cat→ N
mat→ N
P(Tail)→ P([Head|Tail]), Head
V(Tail)→ V([Head|Tail]), Head
Table 2.3: Some example (simple) HPSG rules
In these rules, lexical items carry with them a subcat list specifying the complements
which they require. For instance, the verb sat has a subcat list containing a PP. Rather than
a rule explicitly allowing a VP to be made up of a verb like sat and a PP, there is a general
recursive rule (the last line) that splits the verb’s subcat list into a head and a tail; the head
appears to the right of the verb, and the tail is the new subcat list of the parent node. Note
that a similar treatment has been used for PPs: the lexical item on has a subcat list specifying
that its complement is an NP. The two recursive rules are almost identical; in HPSG, these
rules are replaced with a general rule of the form X(Tail)→ X([Head|Tail]), Head, which
works for all cases where a word’s complements appear to its right.
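A minimal sketch of how the recursive rule consumes a subcat list (my own illustration, not real HPSG machinery): each application of X(Tail) → X([Head|Tail]), Head checks the next complement off the list, and a verb whose list has been emptied heads a complete VP.

    def combine(head, complement_cat):
        """Apply X(Tail) -> X([Head|Tail]), Head: the complement's category
        must match the first item on the head's subcat list."""
        cat, subcat = head
        if subcat and subcat[0] == complement_cat:
            return (cat, subcat[1:])           # one complement has been found
        return None                            # the rule does not apply

    sat = ("V", ["PP"])                        # sat -> V(subcat[PP])
    put = ("V", ["NP", "PP"])                  # put -> V(subcat[NP, PP])

    print(combine(sat, "PP"))                  # ('V', [])  -- a complete VP
    print(combine(combine(put, "NP"), "PP"))   # ('V', [])  -- put NP PP
    print(combine(sat, "NP"))                  # None       -- sat does not take an NP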
We will not be making any further reference to HPSG. But subcategorisation lists are
another grammatical notion which will be extensively used in our discussion of statistical
parsing in Section 2.3.
2.2 Deterministic parsing algorithms
Having decided on a grammar formalism to use, the next stage is to process sentences using
this formalism. Firstly we need to categorise the input, a process known as part-of-speech
(POS) tagging, and then we need to combine the categories into larger phrases using a
parsing algorithm.
A parsing algorithm takes a grammar and an input sentence, and produces a parse tree
such as in Figure 2.1. The algorithm can be either goal driven (top-down) where it starts
with a special nonterminal in the grammar (i.e. “S”) and searches all the trees the grammar
can produce to find the words, or data driven (bottom-up) where it starts with the words
and searches all the phrases that can be produced. Either way, the parser maps between
grammar rules and sentences. When more than one mapping exists, the parser is typically
designed to return all possible mappings. When there are multiple mappings, the parse is
ambiguous, and the problem of selecting the correct parse is known as resolving ambiguity.
These multiple mappings are extremely common in natural language. For example, in
The man saw the girl with a telescope, who has the telescope? Additionally, many words such
as fly can act either as a noun or a verb. Because the human brain is extremely good at
resolving ambiguity, it is rarely noticed, although it is possible to derive exceptions such as
when reading the sentence: A list of lecturers broken down by age and sex will be posted in the
lobby. Deterministic grammars do not have this capacity for resolving ambiguity and simply
pass the problem on to the next stage in the process.
2.2.1 Chart parsing
A standard mechanism for generating parse trees is chart parsing. A chart parser works
by considering every subsequence of words in the input sentence, finding every possible
phrase in each of these subsequences. (Because every sequence is considered only once,
a chart parser avoids the expensive backtracking involved in more straightforward search
algorithms.) For any real grammar the time spent analysing unimportant subsequences is
much less than the time spent backtracking. Additionally, a chart parser is more useful for
parsing naturally occurring text because it can ignore the errors to find the largest possible
‘chunks’.
The core of a chart parser is the chart data structure. This is a multidimensional array,
indexed by the start and end of each phrase. (A phrase is often referred to as an arc in chart
terminology.) Each element in the array is called an edge. Edges can be either complete or
incomplete. An incomplete edge (also called an active edge) is one that corresponds to a
phrase not all of which has yet been found. For example, after seeing the, a chart parser will
use the rule NP → Det, N to generate an incomplete edge for this word labeled with the
category NP, with the Det marked as already found and the N on a list of phrases still to be
found. By contrast, complete edges (also known as inactive edges) are phrases the parser
has found. This is illustrated in Figure 2.2 which shows a chart after parsing the large can.
Incomplete edges are at the top of the chart and use the symbol ◦ to denote the division
between the parsed section and the section the parser is looking for.
The chart parsing algorithm, called the Earley algorithm (Earley, 1970), is sufficiently
complex to warrant explanation. Parsing is performed using two functions: extend com-
bines an incomplete edge with a complete edge to form a new edge which may now be
complete; and parse takes newly completed edges and finds grammar rules that they can
start, generating incomplete edges. This is all better understood by way of example. Figure
2.2 shows the parser’s state after the large can. Now consider the next input word can. This
is first tagged as possibly a noun or a verb, with both alternatives entered into the chart as
complete edges. Next the parser looks for rules that can be started with the newly com-
pleted edges as well as incomplete edges that can be extended with the newly completed
edges. The whole process is repeated for hold and the result is given in Figure 2.3. Read-
ers interested in a gentler introduction to chart parsing are referred to Allen (1995) or most
introductory artificial intelligence texts.
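The interplay between completing edges and proposing or extending incomplete ones can also be sketched in code. The following toy Python chart parser is an illustration under assumed grammar rules and POS tags, not the algorithm as formally specified; the two branches of its main loop correspond roughly to the rule-proposing and extend steps just described.

    # Toy chart parser sketch. An edge is (start, end, category, to_find); it is
    # complete when to_find is empty. The grammar and the POS tags are hard-coded
    # for the running example "the large can can hold".
    GRAMMAR = {
        'NP': [('Det', 'Adj', 'N'), ('Det', 'N')],
        'VP': [('Aux', 'VP'), ('V', 'NP'), ('V',)],
        'S':  [('NP', 'VP')],
    }

    def chart_parse(tagged_words):
        chart, agenda = [], []
        for i, (word, tags) in enumerate(tagged_words):
            for tag in tags:                       # ambiguous words yield several edges
                agenda.append((i, i + 1, tag, ()))
        while agenda:
            edge = agenda.pop()
            if edge in chart:
                continue
            chart.append(edge)
            start, end, cat, to_find = edge
            if not to_find:
                # complete edge: propose rules it can start, and extend waiting edges
                for parent, rhss in GRAMMAR.items():
                    for rhs in rhss:
                        if rhs[0] == cat:
                            agenda.append((start, end, parent, rhs[1:]))
                for (s, e, c, rest) in chart:
                    if rest and e == start and rest[0] == cat:
                        agenda.append((s, end, c, rest[1:]))
            else:
                # incomplete edge: extend it with complete edges already in the chart
                for (s, e, c, rest) in chart:
                    if not rest and s == end and c == to_find[0]:
                        agenda.append((start, e, cat, to_find[1:]))
        return chart

    sentence = [('the', ['Det']), ('large', ['Adj']), ('can', ['Aux', 'V', 'N']),
                ('can', ['Aux', 'V', 'N']), ('hold', ['V', 'N'])]
    edges = chart_parse(sentence)
    print([e for e in edges if e[2] == 'S' and not e[3]])   # complete S edges

Run on the large can can hold, the chart ends up containing complete S edges over more than one span, reflecting the ambiguity of can that a deterministic parser must simply pass on.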
Figure 2.2: A chart parser after parsing the large can. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.
Figure 2.3: A chart parser after parsing the large can can hold. The top part of the diagram shows the partial phrases while the bottom part shows the completed phrases.
2.2.2 Problems with deterministic parsing
Until recently, most deterministic grammars were fairly small, in terms of both numbers of
lexical items and grammatical rules. They were frequently designed to deal very well with
a limited set of grammatical rules. However, such parsers tended to fail miserably when
tested on real corpora because they do not have broad coverage.
It might be thought that the solution to the coverage problem is to augment the number
of rules and words in the grammar. This does indeed result in better coverage; however a
new problem now arises — as mentioned in Section 1.1 — a proliferation of ambiguity.
To illustrate this problem, begin by assuming we need our wide-coverage grammar to
deal with the following perfectly reasonable sentences:
(2.8) Here is the equation for a Laplace transform
(2.9) Mustang Sally walked into the bar.
(2.10) Who volunteers to mark the assignment? Me.
We therefore need rules to generate the analyses shown in Figure 2.4.
However, if the grammar contains these rules — all of them fairly unusual — then if we
try and parse an ordinary sentence like John saw Mary, we end up with an entirely spurious
parse, as shown in Figure 2.5.
NP → Det N, with N → PN N (a Laplace transform)    PN → N PN (Mustang Sally)    S → NP (Me)
Figure 2.4: Some analyses which a wide-coverage grammar should include
[S [NP [PN [N [PN John] [N saw]] [PN Mary]]]]
Figure 2.5: A dubious parse by a deterministic grammar
In the spurious parse shown in Figure 2.5, ‘John saw Mary’ is interpreted as a single-
word answer to a question, referring to a strange character called ‘John saw Mary’. John saw
is an N, just like Laplace transform, and John saw Mary is then analogous to Mustang Sally. Any
wide-coverage grammar is bound to contain massive ambiguity.
Some grammarians have dismissed this argument by saying disambiguation is a post-
processing phase and not part of parsing, but this dismissal is weak: the problem of am-
biguity is so severe that assuming an oracle for disambiguation is basically giving up on
doing syntax altogether. The deterministic grammar has not found the correct parse of the
sentence, it has just rejected the obviously wrong parses. Sophisticated linguistic parsers
may yield fascinating results about the structure of the language, but they are useless for
parsing real unrestricted natural language input.
2.3 Probabilistic grammars and corpus-based NLP
Probabilistic grammars are a method of overcoming the ambiguity inherent in deterministic
grammars. Instead of enumerating all grammar rules as if they were equal, rules are an-
notated with their relative frequency. This means the parser can state not only that a rule
matches, but can also sort alternative parses by their likelihood of being correct. This in-built
mechanism for ambiguity resolution is a huge advantage over deterministic grammars. Ad-
ditionally, the improved measurement of ambiguity in probabilistic grammars means that
probabilistic grammars can be much, much bigger than deterministic ones. For instance, a
probabilistic grammar is allowed to have a very large number of very specific rules describ-
ing how an NP can be formed. This will often result in a large number of spurious parses of
a given sentence, but this does not matter, provided we have chosen suitable probabilities
for the rules in the grammar, because we can be confident the spurious parses will have low
probabilities.
The first probabilistic grammars were written by hand in a similar way to deterministic
grammars. The only difference was that every grammar rule had an associated probabil-
ity. These probabilities were derived by educated guesses on the part of the grammar en-
coders. The grammars were called probabilistic context free grammars or PCFG (Booth and
Thompson, 1973) and had rules of the form given in Figure 2.6. (The probabilities of a
VP constituent being rewritten as VT NP and as V0 are 0.8 and 0.2 respectively; in this
case, these two alternatives exhaust the possible ways a VP can be expanded, because they
sum to 1.) Analogous extensions to CFG, TAG, XBAR and many other formalisms were also
developed. Conceptually, parsing with a probabilistic grammar uses exactly the same process as with
deterministic grammars. The only difference is that since every rewrite rule has an associ-
ated probability, the probability of the derivation can be obtained by multiplying together
the probability of all of the rules used.
VP→ VT NP (0.8)
VP→ V0 (0.2)
Figure 2.6: Example of Probabilistic CFG rules
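As a small illustration of this multiplication, the sketch below scores a derivation under the two VP rules of Figure 2.6 plus some invented lexical probabilities (the lexical numbers are assumptions made purely for the example).

    # Sketch: the probability of a derivation is the product of the probabilities
    # of the rules it uses. The lexical rule probabilities are invented.
    rule_prob = {
        ('VP', ('VT', 'NP')):   0.8,
        ('VP', ('V0',)):        0.2,
        ('VT', ('chased',)):    0.1,
        ('NP', ('the', 'dog')): 0.05,
    }

    def derivation_prob(rules_used):
        p = 1.0
        for rule in rules_used:
            p *= rule_prob[rule]
        return p

    # P(VP -> VT NP, VT -> chased, NP -> the dog) = 0.8 * 0.1 * 0.05 = 0.004
    print(derivation_prob([('VP', ('VT', 'NP')),
                           ('VT', ('chased',)),
                           ('NP', ('the', 'dog'))]))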
2.3.1 Building a corpus of hand-parsed sentences
Adding probability information to a grammar by hand-annotating every grammar rule with
a likelihood is prone to many errors, as human annotators are poor at guessing probabilities.
In addition, building a probabilistic grammar by hand does not take full advantage of the
fact that probabilistic grammars can be extremely large. Rather than guessing probability
information it is more natural to derive the probabilities from a corpus of annotated text.
Probabilistic grammars can be annotated with context information and semantic require-
ments in the same way as deterministic grammars.
Perhaps the most interesting example of an early probabilistic grammar was History
Based Grammar (HBG) by Black et al. (1992). This was the first project to bridge the gap
between manually guessed probabilities and automatic probability generation because as
part of developing the parser, Black et al. developed a large corpus of parsed text from
which they extracted probabilities. HBG uses a feature based grammar with twenty-one
features, where the features are enumerated sets with an average of eight different values.
For example, ‘past’ is a valid value for the ‘tense-aspect’ feature.
While this approach leads to a fairly good statistical parser, it requires a huge amount of
effort to build. More work is needed to build the corpus than to build the parser. The corpus
needs to be large enough to have statistically significant frequencies. An obvious solution
to this is to make generating the corpus an entirely separate project which can then be used
by a statistical parser. This is the goal of the Penn treebank project (Marcus, Santorini, and
Marcinkiewicz, 1993). The Penn treebank was the second large corpus developed (the first
was the Lancaster corpus used by Black et al.). The Penn treebank is essentially the only
treebank available for building a parser in English and so any deficiencies in it will lead to
deficiencies common to all statistical parsers. It is therefore particularly important to discuss
its peculiarities.
Building a corpus such as the Penn Treebank is a huge undertaking. The corpus must
be large in order to obtain accurate statistics, and while statistical methods are robust to
errors in their training data, they are much less robust to errors in the testing data. This is
a major problem if the corpus is to be used to compare the accuracy of different parsers.
Additionally, the people who built the corpus did not know what the best grammatical for-
malism would be for a statistical parser so they had to make the corpus independent of
the grammatical formalism being used. Figure 2.7 gives an example tree from the treebank
demonstrating the minimal syntactic information present. There are many good reasons for
this, one of them being not knowing which formalism would be best, as was just mentioned.
However it also leads to a number of problems as there is very little information that every
formalism agrees on. For example, GB and HPSG both contain a lot of attachment informa-
tion but the information they contain is quite different and so the corpus contains none of
this information.
Another side effect of this representation problem is that trees are
very flat, as demonstrated in Figure 2.8. While this does not prevent a parser obtaining good
results when compared to the test corpus, it may mean that such a parser is still not good
enough for many uses. The section on evaluation (Section 2.3.2) examines this point.
Because of the huge amount of work required to build an accurate corpus, the corpus
is relatively small. It contains a total of fifty thousand sentences, or around a million
words. Compared to the corpora used in unsupervised learning, which typically run to
millions of sentences, this is really tiny. Furthermore, it is based solely on a number of years’
worth of articles from the Wall Street Journal (the WSJ). This restriction to the domain of
carefully edited discussions of financial data means a good statistical parser is only going
to perform well on other Wall Street Journal style sentences. No solutions to this have been
advanced in the literature, perhaps because everybody is still concerned with getting a good
statistical parser. However, it is a big problem which will need to be addressed at some stage.
Because the Penn treebank is so closely tied to the Wall Street Journal, I will usually refer to
it as the WSJ.
Figure 2.7: Example sentence from the Penn treebank
Figure 2.8: Two phrases showing that WSJ phrases contain little attachment information
2.3.2 Evaluating parser performance
Once you decide to take real natural language corpora seriously, you need to have in place
some proper quantitative measures for evaluating the performance of a parser on a given
corpus. The goal is to minimise the overall error, which differs significantly from previous
work in linguistics where the aim was to correctly parse a few very complex sentences.
The most obvious metric for determining a parser’s performance is the percentage of
sentences it gets completely correct, called the exact match. This method does not work
very well in practice because the current generation of parsers get all but the very simplest
sentences slightly wrong, and so maximising this metric becomes a task of getting simple
structures right at the expense of complex structures, hardly a laudable goal.
An alternative method is to measure the percentage of phrases the parser gets right.
This is a good metric in that a high score implies a better parser but it is surprisingly diffi-
cult to formalise how incorrect phrases are scored. For example, what if the parser finds a
large phrase from the treebank but not the component phrases, or if it finds the component
phrases but incorrectly labels the parent? The most common method used is to split the
parser’s accuracy into precision and recall. Precision is the percentage of phrases found by
the parser which are in the ‘correct’ analysis of a sentence, while recall is the percentage of
phrases in the ‘correct’ analysis of the sentence which are found by the parser. Perhaps the
best way of illustrating the difference is with an extreme example: a parser that labels every-
thing as a phrase has perfect recall but terrible precision while a parser that labels nothing
as a phrase has perfect precision but terrible recall. For some tasks one metric is more useful
than the other, but in general they are weighted equally. Most current parsers score about
the same in both metrics.
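If both the parser's output and the treebank analysis for a sentence are represented as sets of labelled constituents (label, start, end), precision and recall reduce to a set intersection, as in this sketch (an illustration only, not the scoring software actually used in published evaluations).

    # Sketch: labelled constituent precision and recall for one sentence.
    def precision_recall(parser_spans, gold_spans):
        parser_spans, gold_spans = set(parser_spans), set(gold_spans)
        correct = parser_spans & gold_spans
        precision = len(correct) / len(parser_spans) if parser_spans else 0.0
        recall = len(correct) / len(gold_spans) if gold_spans else 0.0
        return precision, recall

    gold   = {('NP', 0, 2), ('VP', 2, 7), ('NP', 3, 5), ('PP', 5, 7), ('S', 0, 7)}
    parsed = {('NP', 0, 2), ('VP', 2, 7), ('NP', 3, 7), ('PP', 5, 7), ('S', 0, 7)}
    print(precision_recall(parsed, gold))    # (0.8, 0.8): one misattached constituent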
The precision/recall metric is not without problems. One problem is that the scores are
too high given the low overall standard of parsers. The best parsers at the moment score
around 85% precision and recall, but this means that a parse of a fifteen word sentence with
around 20 constituents will probably have around three errors in it. Another problem is
that a higher average score does not necessarily mean an empirically better parser. Because
average scores are relatively high, a single sentence parsed very badly will significantly de-
crease results. Obtaining a high score becomes a problem of tweaking the parser to correctly
handle the strange cases in the treebank such as sentences with unusual punctuation or not
ending in a full stop. The difference between 75% and 80% on ordinary sentences is quickly
lost if the parser performs poorly on these edge cases. One excellent way of avoiding this
problem, which I have not seen mentioned in the literature so far, would be to use median
precision/recall scores instead of mean scores.
If we are trying to obtain a precision/recall above 85%, it is sensible to ask how much
higher it is possible to go. After all, 85% sounds quite accurate. The answer is not yet known
but appears to be slightly above 90% since this is the accuracy that can be obtained by hand-
picking between a selection of automatically derived parses. Additionally, the ability of
parsers to generalise to genres of text different from those contained in the training data is
expected to be quite low. So, while we may have 85% when testing in the same genre, it will
be some time before this is achieved in different genres.
One final problem with the high precision/recall scores is that it forces parsers to tightly
conform to the WSJ representation. This is a problem because the representation does not
include enough information to be useful for many tasks. A third metric for evaluating parser
accuracy is the number of crossing brackets found when comparing the parser’s output to
the correct parse. This metric has the advantage that over- or under-specific constituents
do not get penalised, and somewhat alleviates the problem of tight conformity to the WSJ
annotations.
Another aspect of a parser’s performance that is hardly mentioned in the literature is
parsing speed. Many cited uses of parsers, such as automatic translation and summarisa-
tion, require the parser to parse around twenty words per second, yet one of the best parsers
requires an hour to parse twenty words (Bod and Scha, 1996). Even very fast parsers such
as Collins (1999) are unable to parse very large sentences in a reasonable length of time.
This area may become more interesting in the future but in this thesis I will be focusing on
precision and recall rather than parsing time.
2.4 Probabilistic grammar formalisms
The first step in generating a probabilistic grammar is to count all the events that occur in the
training corpus. An event can be understood as some component of a parse tree. If we are
trying to learn a context-free grammar, the most obvious events to count are productions;
i.e. individual applications of context-free rules. For instance, assume we are building a
probabilistic context-free grammar to disambiguate a sentence with spurious ambiguity,
such as the sentence John saw Mary discussed in Section 2.2.2. If we take a mini-corpus of
sentences such as those in Figure 2.9, we can estimate the probabilities of rules by counting
the number of occurrences of each rule. The frequencies are shown in Table 2.4. (Note that
an unusual rule like PN→ N, PN is relatively rare.)
To estimate the probability of a complete parse tree from this frequency information,
we need to break the parse tree in question into its constituent events.
Figure 2.9: A simple corpus of hand-parsed sentences
Any tree can be thought of as a set of productions, but crucially, these productions are not fully independent
of one another; the children of the highest production in the tree determine what the parents
are for the next productions down, and so on recursively. What we need, therefore, is the
conditional probability of each production given the occurrence of its parent. To estimate
this we simply count the number of times the rule is applied in the corpus, and divide by
the number of times the rule’s parent occurs. The conditional probabilities derived by this
method are given in Table 2.5.2 (We also need the prior probability of the node at the root of
the parse tree being the root of a tree, which for our corpus is 1 for S, and 0 for every other
phrase.) Given that we are working with a context-free grammar, in which the way a node is
expanded does not depend on the context in which it appears, the probability of a complete
parse tree is then simply the product of all of these probabilities. For the two interpretations
of John saw Mary, these probabilities are given in Figure 2.10.
These are sufficient to strongly prefer the left-hand parse over the spurious parse on the
right.
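The counting procedure behind Tables 2.4 and 2.5 can be sketched as follows; representing trees as nested tuples and the function names used here are assumptions of this illustration rather than anything prescribed by the text.

    # Sketch: estimate P(rule | parent) by relative frequency over a small
    # hand-parsed corpus. A tree is (label, child, child, ...); leaves are strings.
    from collections import Counter

    def productions(tree):
        label, *children = tree
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        yield (label, kids)
        for c in children:
            if not isinstance(c, str):
                yield from productions(c)

    def estimate(corpus):
        rule_counts, parent_counts = Counter(), Counter()
        for tree in corpus:
            for parent, kids in productions(tree):
                rule_counts[(parent, kids)] += 1
                parent_counts[parent] += 1
        return {rule: n / parent_counts[rule[0]] for rule, n in rule_counts.items()}

    corpus = [('S', ('NP', ('PN', 'Mary')),
                    ('VP', ('VT', 'saw'), ('NP', ('PN', 'John'))))]
    for rule, p in estimate(corpus).items():
        print(rule, round(p, 2))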
2.4.1 Lexical semantics in probabilistic grammars
While simple probabilistic context-free grammars such as that just described are very help-
ful in giving preferences for commonly used productions, there are many cases where a
sentence has alternative readings which both involve common productions. For instance,
consider the sentence “The man saw the dog with the telescope”, for which two alternative
2Henceforth, we will often leave it implicit that probabilities are being estimated from relative frequencies,
where this is obvious from context.
21
Rule            Frequency in corpus
S → NP, VP      3
S → NP          1
NP → PN         6
NP → Det, N     1
VP → VT, NP     3
N → PN, N       1
N → mustang     1
N → saw         1
VT → saw        1
Det → a         1
PN → Mary       2
PN → John       1
PN → N, PN      1
Table 2.4: Frequencies of rules in the simple corpus
parses are given in Figures 2.11 and 2.12.
Both of these syntactic structures are very common and while a PCFG would be able to
select one over the other, it would be expected to give both an approximately equal weight-
ing. One possible solution to this problem is to include information about the lexical items
in the sentence in the phrases involved in its analysis. Intuitively, we expect events of seeing
to frequently involve telescopes, while we expect dogs infrequently to have telescopes. The
HPSG notion of a lexical head is useful in spelling out this intuition. We expect a VP headed
by the verb saw to be quite frequently modified by a PP involving the word telescope in a
representative corpus, while we expect an NP headed by dog only rarely to be modified by
a PP involving the word telescope in such a corpus.
How can we modify our grammar to include the appropriate lexical information? A use-
ful solution, also originally proposed by Black et al. (1992), basically involves a huge increase
in the number of phrases in the grammar. Instead of simply having a phrase NP, we need
one phrase for each possible headword of an NP: that is, NP-headed-by-dog, NP-headed-
by-telescope, and so on. At this point, unfortunately, we are faced with a data sparseness
problem: we are unlikely to find sufficient counts for individual productions, even with a
very big corpus. The problem is partly due to Zipf’s law; most words in the language only
occur very infrequently, so most grammatical categories, when tagged with an open-classed
headword, will be fairly rare. The problem is compounded by the fact that many grammars
allow a node to take several children.
Rule            Frequency in corpus     Estimated P(Rule|Parent)
S → NP, VP      3                       3/4 = .75
S → NP          1                       1/4 = .25
NP → PN         6                       6/7 = .86
NP → Det, N     1                       1/7 = .14
VP → VT, NP     3                       1
N → PN, N       1                       1/3 = .33
N → mustang     1                       1/3 = .33
N → saw         1                       1/3 = .33
VT → saw        1                       1
Det → a         1                       1
PN → Mary       2                       2/4 = .5
PN → John       1                       1/4 = .25
PN → N, PN      1                       1/4 = .25
Table 2.5: Probabilities of rules in the simple corpus
Figure 2.10: Parse trees with associated prior and conditional probabilities for John saw Mary
Figure 2.11: Syntactically valid but unlikely parse of “The man saw the dog with the telescope.”
Figure 2.12: Likely parse of “The man saw the dog with the telescope.”
If each child is already rare, then the combination of
n such children will be exponentially so. With low counts, we cannot be confident in the
probabilities we derive, as we discuss in detail in Section 2.5.
The problem of Zipf's law and the problem of multiple children need to be addressed in
different ways. Very few solutions have been proposed for the former problem; in fact this
thesis will focus largely on the problems caused by Zipf’s law. The latter problem can be
addressed by finding a way of splitting a parse tree into events that are smaller than single
context-free rule applications. The rest of this section will discuss how this can be done.
One idea, originally proposed by Magerman (1995), is to break each single rule appli-
cation into several components: a head production which takes a phrase and generates its
head constituent, and a set of sibling productions which take a phrase and its head con-
stituent, and generate the remaining child constituents, either to the left or the right of the
head. The occurrence of a parent node decomposing into a set of children is now represented
using the kinds of events shown in Figure 2.13.
Parent → Head        Parent → Head . . . Right sibling        Parent → Left sibling . . . Head
Figure 2.13: Head and sibling productions
The conditional probabilities we are interested in are the probability of a head constituent
given its parent (for a head production) and the probability of a sibling constituent given its
parent and its head (for a sibling production). These probabilities can be estimated from
relative frequencies of events, as described in Section 2.4. The events this time are not pro-
ductions of context-free rules but partial descriptions of such productions:
P(Head | Parent) = Count(Head, Parent) / Count(Parent)    (2.1)

P(Left | Head, Parent) = Count(Left, Head, Parent) / Count(Head, Parent)    (2.2)

P(Right | Head, Parent) = Count(Right, Head, Parent) / Count(Head, Parent)    (2.3)
The notation here needs some explanation. Taking Equation 2.1 as an example, if you
know the parent and you are trying to derive the probability for a given head, you can
estimate the probability by counting the number of times that head occurs as the head of
that parent, and dividing by the total number of times that parent occurs.
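In implementation terms these estimates are just ratios of two count tables. The sketch below shows the bookkeeping for Equations 2.1 and 2.2; the counts are invented, standing in for figures that would in practice be collected by walking a treebank.

    # Sketch: relative-frequency estimates for head and sibling productions
    # (Equations 2.1 and 2.2), with invented counts.
    from collections import Counter

    parent_counts = Counter()    # how often each parent category occurs
    head_counts = Counter()      # (head, parent) pairs
    left_counts = Counter()      # (left sibling, head, parent) triples

    def p_head(head, parent):
        return head_counts[(head, parent)] / parent_counts[parent]

    def p_left(left, head, parent):
        return left_counts[(left, head, parent)] / head_counts[(head, parent)]

    parent_counts['S'] = 400                    # S occurred 400 times
    head_counts[('VP', 'S')] = 350              # 350 of those had a VP head
    left_counts[('NP', 'VP', 'S')] = 300        # 300 of those had an NP left sibling
    print(p_head('VP', 'S'), p_left('NP', 'VP', 'S'))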
If we move to a lexicalised grammar, the data sparseness problem due to Zipf’s law
is now reduced; head productions only involve one lexical item, and sibling productions
only involve two. A concern when shifting to an HPSG-like approach is that the parser will lose
dependency information. In a PCFG approach, you list all the arguments when you list a
grammar rule, but in the HPSG approach, you add the arguments one at a time. How do
you ensure only the right number of arguments are assigned? For instance, consider the
fragment kicked the ball the table: in a PCFG this can be quickly rejected as a VP because kicked
only takes one argument, while the naive HPSG-like approach already described would accept
each of ball and table, and therefore both, as arguments to kicked. Magerman's solution to this is to split
arguments into two classes, adjacent arguments and non-adjacent arguments.
Of course, the above equations do not yet actually refer to words. To introduce words
into these equations, we should first introduce some new terminology. Consider a con-
stituent C1, which decomposes into a head constituent Chead and a left sibling constituent
Csib, as shown in Figure 2.14.
S[chased] = C1, with head constituent Chead = VP[chased] (chased a mouse) and left sibling Csib = NP[cat] (The cat)
Figure 2.14: A simple lexicalised parse tree
When we refer to the ‘head’ of C1, we could be referring to the entire constituent Chead,
or to the label of Chead (i.e. VP), or to the head word of Chead (i.e. chased). Similarly, when we
refer to the ‘parent’ of Chead, we could be referring to the whole constituent C1, or to the label
of C1 (i.e. S), or to the headword of C1 (i.e. chased). To disambiguate, we will say that Chead
is the head constituent of C1, VP is its head nonterminal label, and chased is its headword,
while C1 is the parent constituent of Chead, S is its parent nonterminal, and chased is its
parent headword. (Since there is some redundancy in recording the headword of a head
constituent and its parent constituent, we do not in fact need to record this latter piece of
information.) We abbreviate ‘head constituent’ as H, ‘head nonterminal’ as HNT, ‘headword’
as Hw, ‘parent constituent’ as P, ‘parent nonterminal’ as PNT. We likewise abbreviate ‘left
sibling constituent’ as L, ‘left sibling nonterminal’ as LNT, and ‘left sibling headword’ as Lw,
and similarly for right siblings. To estimate the probability of lexicalised productions, we
can now use the modified equations given below.
P(HNT | PNT, Hw) = Count(HNT, PNT, Hw) / Count(PNT, Hw)    (2.4)

P(LNT, Lw | HNT, Hw, PNT) = Count(LNT, Lw, HNT, Hw, PNT) / Count(HNT, Hw, PNT)    (2.5)

P(RNT, Rw | HNT, Hw, PNT) = Count(RNT, Rw, HNT, Hw, PNT) / Count(HNT, Hw, PNT)    (2.6)
These equations can be applied to the ambiguous sentence we started with, The man saw
the dog with the telescope. Recall that we are looking for a way of preferring the parse in
Figure 2.12 over that in Figure 2.11. Informally, we need to find that the probability of a PP
headed by telescope is more likely to occur as the right sibling of a VP headed by saw than as
the right sibling of an NP headed by dog. More formally, substituting the actual heads and
parents into Equation 2.6 leads to the calculations given in Equations 2.7 and 2.8.
P(PP, telescope | VP, saw, VP) = Count(PP, telescope, VP, saw, VP) / Count(VP, saw, VP)    (2.7)

P(PP, telescope | NP, dog, NP) = Count(PP, telescope, NP, dog, NP) / Count(NP, dog, NP)    (2.8)
It is now reasonable to expect the correct parse to have a higher probability.3
Having discussed the theory behind probabilistic grammars it is now possible to exam-
ine how this has been used by a few real probabilistic parsers. We will discuss Black et al.
(1992), Bod and Scha (1996) and Klein and Manning (2003).
2.4.2 Black et al.
Black et al.’s history based grammar (HBG) differs significantly from the generative gram-
mars that are now standard. The first major difference is that it imposes a fixed derivation
order which essentially requires the leftmost derivation to be expanded. This requirement
on the derivation order means that every grammar rule can only depend on things derived
before (that is, to the left of it). For instance, in parsing the aggressive potato, the internal frame
will store a representation for the aggressive and attempt to coerce it with a representation of
potato. This differs from other formalisms such as HPSG where potato will be derived first
since it is the head of the phrase, and then aggressive will be coerced into the phrase headed
by potato.
The grammar rules themselves are simple context-sensitive rules, which Black et al.
refer to as context-free rules with features. Like HPSG, each state is represented as a frame
containing syntactic and semantic information. The grammar rules therefore show how
frequently a given frame combines with other frames. An example showing what the grammar
stores is given in Figure 2.15.
3We have taken a couple of liberties in this example. Firstly, we assume a corpus with relatively frequent
mentions of the concepts see, dog or telescope. The WSJ is certainly not such a corpus! Secondly, we are assuming
that the head of a PP is a noun, to allow for telescope to be the head of a PP. In most NLP work, the head of a
PP is the preposition. However, there is a huge debate in linguistics about what the ‘correct’ headwords are
for different constituents. Here our sole concern is to give a simple example of how lexical items can help in
ambiguity resolution. The idea of multiple ‘headwords’ is explored more fully by Bod and Scha, discussed here
in Section 2.4.3.
R: P1, Syn: PP, Sem: With-Data, H1: list, H2: with
R: NBAR4, Syn: NP, Sem: Data, H1: list, H2: a
R: N1, Syn: N, Sem: Data, H1: list, H2: *
(covering the words with a list)
Figure 2.15: Sample representation of “with a list” in the HBG model, taken from Black et al. (1992)
There are two significant advantages in having only one derivation. Firstly, CPU cycles
are spent deriving each structure only once, much the same benefit as
using a chart parser. Secondly, writing the probability model is significantly simpler than for
PCFG. The probability for every production is based on the probability for each data field,
so for instance list includes the information that it is a noun and has a semantic role of data.
In matching, the probability for these two parameters is derived separately and combined
using a form of Naive Bayes.
In terms of exactly what to store in each event, Black et al. experimented with several dif-
ferent models but eventually chose p(Syn, Sem, R, H1, H2 | Syn_p, Sem_p, R_p, I_pc, H1_p, H2_p). This
equation means that a mix of syntax and semantics is used at every stage. One point that
was particularly clever was the use of two headwords. It is generally accepted that the head-
word of a prepositional phrase is the preposition rather than the PP’s complement. Ignoring
the linguistic justification and choosing the complement instead will lead to problems when
the parent of the PP does not match the preposition. However, the initial motivation I gave
for statistical parsers was the sentence the boy saw the girl in the park with a telescope. In this
sentence we noticed that attaching the PP with a telescope to saw makes sense because saw
hopefully co-occurs frequently with telescope. It is unreasonable to assume saw will co-occur
with with more frequently than girl will co-occur with with. The addition of an auxiliary
headword solves this problem nicely.
As already mentioned in Section 2.3.1, Black et al.'s system was the first to make use of a
treebank. Black et al. used a five thousand sentence treebank taken from technical manuals
called the Lancaster treebank which was created by IBM specifically for this project. The use
of a treebank was a huge departure from previous work and marks the real birth of statistical
parsing; all the other parsers examined here also use a treebank. However, Black et al. did
not use a treebank in the same way as later parsers. The grammar rules were still derived
by hand, whereas in later systems they are automatically extracted from the treebank.
Overall, HBG performed very well. Since we are discussing the grammar rather than
the parser I will not give detailed performance figures; but roughly speaking Black et al.
were able to achieve performance equivalent to a parser based on a PCFG grammar which
was trained on a corpus twice the size, and achieved thirty percent fewer errors than a hand-
encoded PCFG.
2.4.3 Exhaustive grammars: Bod and Scha’s approach
Bod and Scha developed a parser for Scha’s data-orientated parsing (DOP) (Bod and Scha,
1996). There are two major additions to the literature in Bod and Scha’s grammatical for-
malism. The first is the idea that the training treebank is not just used to train the grammar,
it is the grammar. Black et al. also used a treebank, but Bod and Scha were the first to derive
the grammatical rules from the treebank instead of just extracting frequency counts. The
second extension was moving away from the idea that a linguist can tell what is important
in a parse tree and instead leaving this job to the parser.
Simple context-free grammar rules do not contain any defeasible information while more
complex formalisms such as HPSG frequently do. Because probabilistic grammars include
information that does not have to be satisfied, it is natural to view probabilistic grammars
as a very fine grained form of defeasible reasoning. For example the first probabilistic gram-
mars only differed by including low-probability rules for structures a deterministic gram-
mar would probably reject. Next, Black et al. included a number of features, all of which are
defeasible; and more recently Collins includes around a dozen parameters (Collins, 1999)
that are better described as guiding the search space than being defeasible features because
they are so easily contradicted. Despite publishing earlier, in 1996, Bod and Scha took this trend to its
logical conclusion in data-orientated parsing (DOP, or Tree-DOP) (Bod and Scha, 1996). In
this formalism the grammar is replaced with the entire training corpus.
Simply using the corpus as a grammar would make it impossible to parse novel sen-
tences. Instead, Bod and Scha store every possible subtree of every tree in the corpus.
With this representation it is possible to find the number of trees in the grammar that match
a given parse structure and the number that cannot be matched. In this way every struc-
ture in the corpus lends weight to similar interpretations. For example, Figure 2.16 shows a
complete grammar formed from a corpus of three sentences.
Figure 2.16: Sample DOP grammar for a tiny corpus
An obvious complication with this grammatical representation is how to store it. For
example, the Penn treebank contains fifty thousand sentences with an average of twenty-
five words and forty phrases per sentence. Given that a tree of n nodes yields on the order of
2^n subtrees, this gives around 10^7 subtrees per sentence, or over 10^12 subtrees in the training
corpus. Even if hard disk sizes continue to increase at the same rate, it will take years before
a typical research workstation has this sort of capacity. Bod and Scha have not come up with
an efficient solution to this yet; their current approach is simply to throw away
random grammar rules until the grammar comes down to a manageable size. This topic,
and ways of resolving it, will be discussed in Section 2.7.2.
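The combinatorial explosion can be made concrete with a short sketch that enumerates the fragments of a single tree; the nested-tuple representation and the enumeration strategy are assumptions of this illustration, not Bod and Scha's implementation.

    # Sketch: enumerate DOP-style fragments (subtrees) of a parse tree. At each
    # node in a fragment we either cut a child off, leaving a bare nonterminal at
    # the frontier, or expand it further; the number of fragments grows
    # exponentially with the size of the tree. Trees are nested tuples.
    def fragments_rooted_at(tree):
        if isinstance(tree, str):
            return [tree]                            # a word can only appear as itself
        label, *children = tree
        child_options = []
        for child in children:
            if isinstance(child, str):
                opts = [child]                       # lexical child: keep the word
            else:
                opts = [(child[0],)]                 # cut: keep only the child's label
                opts += fragments_rooted_at(child)   # or expand the child further
            child_options.append(opts)
        combos = [[]]
        for opts in child_options:                   # cartesian product over children
            combos = [c + [o] for c in combos for o in opts]
        return [(label, *c) for c in combos]

    def all_fragments(tree):
        if isinstance(tree, str):
            return []
        frags = fragments_rooted_at(tree)
        for child in tree[1:]:
            frags += all_fragments(child)
        return frags

    t = ('S', ('NP', 'John'), ('VP', ('V', 'likes'), ('NP', 'Mary')))
    print(len(all_fragments(t)))                     # 17 fragments for this five-node tree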
2.4.4 Klein and Manning’s statistical parser
A recent parser developed by Klein and Manning (2003) is particularly interesting because,
despite forgoing a lexicalised grammar, it achieves extremely high accuracy. Initial exper-
iments with PCFGs by e.g. Black et al. had shown that a lexicalised probabilistic grammar
leads to significantly improved parses (as was discussed in the last section). Before Klein
and Manning's results were released, statistical parsing had become synonymous with lex-
icalised statistical parsing. However, people building statistical parsers invariably included
more information than just lexical items. For example, distinctions between verb adjuncts
and verb arguments, information about the semantic role of PPs, and subcategorisation lists
were commonplace. What Klein and Manning asked is: How well does a statistical parser
perform without lexical information, but with this extra information? The answer is, almost
as well as it would if it also had the lexical information. This result is extremely significant;
apart from the implications in linguistics, almost all the computational complexities in a sta-
tistical parser are side-effects of the lexicalised grammar. So it is worth examining Klein and
Manning’s grammar to see how the result was achieved.
Klein and Manning’s approach was to implement a basic PCFG and then to add informa-
tion to this grammar, measuring the improvements in the parser’s accuracy. The first step is
to include the parent nonterminal in attachments, so that for example when attaching the to
cat, the parent of NP is considered part of the grammar rule. This was motivated by noting
that a subject noun phrase is nine times more likely than an object noun phrase to expand as
just a pronoun. The second step is to shift to an approach which has been described in this
thesis as HPSG-like, where the head is generated first and the left and right siblings are then
generated. Since Klein and Manning’s approach is significantly different to HPSG, they in-
stead use the term ‘Markovize’ to describe the transformation. This step was motivated by
concerns that the grammar strongly disfavoured syntactic structures it had not seen during
training. One particularly nice observation of Klein and Manning’s was that these two steps
can be generalised by talking about the amount of vertical context (parents, grandparents,
etc.), and horizontal context (number of siblings) to use at once, so that a grammar formal-
ism could be described simply by saying v = 2, h = 1 to mean that the parent is used, but no
siblings are used. Under this formalism, Klein and Manning note that Collins’ parser could
be approximately represented by v = 2, h = 1. Klein and Manning investigated performance
with a number of parameters as shown in Table 2.6.
Horizontal Markov Order
Vertical Order          h = 0    h = 1    h ≤ 2    h = 2    h = ∞
v = 1, No annotation    72.27    72.5     73.46    72.96    72.62
                        854      3119     3863     6207     9657
v ≤ 2, Some parents     74.75    77.42    77.77    77.50    76.91
                        2285     6564     7619     11398    14247
v = 2, Parents          74.68    77.42    77.81    77.50    76.81
                        2984     7312     8367     12132    14666
v = 3, All GParents     76.74    79.18    79.74    79.07    78.72
                        7797     15740    16994    22886    22002
Table 2.6: Klein and Manning's parsing accuracy and grammar size for
different model complexities
By noting that the first cell in this table is 72%, we can see that Klein and Manning were
able to improve parsing accuracy by almost ten percent. However, 79% is still far from
state-of-the-art. Klein and Manning then investigated a very large number of improvements
which were applied cumulatively. As a random example, giving percentage signs their own
tag instead of sharing the symbol tag leads to four percent fewer errors. Klein and Man-
ning used a total of fifteen such improvements to get a cumulative performance improve-
ment of another ten percent, to 87%. The four most significant new features were TAG-PA
(providing the grammar with the parent’s POS tag), SPLIT-IN (passing the IN tag to its
parent), DOMINATES-VERB (set to true if the child includes a verb), and RIGHT-REC-NP
(for NPs with a recursive NP on the right.) RIGHT-REC-NP is a simple distance met-
ric — it is designed to discourage attachments that are excessively large. Further details
of these metrics and the others used can be found in Klein and Manning’s source code at
http://nlp.stanford.edu/downloads/lex-parser.shtml.
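The first of these steps, including the parent nonterminal in each label (v = 2 in Klein and Manning's terms), amounts to a simple relabelling of treebank trees before grammar rules are read off. The sketch below shows one way this might be done; the nested-tuple tree representation is an assumption of the illustration, not Klein and Manning's code.

    # Sketch: vertical markovization (v = 2) as parent annotation. Each nonterminal
    # is relabelled as 'label^parent' before rules are read off, so a subject NP
    # (NP^S) is kept distinct from an object NP (NP^VP).
    def annotate(tree, parent=None):
        if isinstance(tree, str):                  # words are left unchanged
            return tree
        label, *children = tree
        new_label = f'{label}^{parent}' if parent else label
        return (new_label, *[annotate(child, label) for child in children])

    t = ('S', ('NP', ('PRP', 'he')), ('VP', ('VBD', 'ran')))
    print(annotate(t))
    # ('S', ('NP^S', ('PRP^NP', 'he')), ('VP^S', ('VBD^VP', 'ran')))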
A similar approach was taken by Bikel (2004), which contains a more detailed analysis of
the factors underlying the success of Collins’ parser. Essentially, the conclusion is the same:
the crucial factors are not word representations, but clever preprocessing.
2.5 Backoff, interpolation and smoothing
Before we discuss how these probabilistic grammars can be used in parsing, it is important
to discuss some practicalities concerning how the probabilities are derived. Ideally they
would be produced by counting events in the training corpus as has already been described.
However, the training corpus is finite, and it may not contain exactly the events we need in
order to compute the probabilities for a new sentence being parsed. What if the event did
not occur frequently enough for counting its occurrences to be an accurate sample, or didn’t
occur at all? This problem is particularly difficult when the event representation is complex
because this will decrease counts, as mentioned in the previous section.
To illustrate the problem, consider parsing our example sentence The man saw the dog with
a telescope. As discussed in Section 2.4.1, we have to decide whether to attach the PP with a
telescope to the NP the dog or the VP saw the dog. If we are using a lexicalised probabilistic
grammar, we can estimate the probabilities of these two events by the occurrence counts
in the corpus for PPs headed by telescope attaching to NPs headed by dog or VPs headed
by saw, as shown in Equations 2.7 and 2.8. If we are using the WSJ as our training corpus
then unfortunately there are no such events; in fact there are only a few adjacency events for
telescopes (Hubble, space, was, instrument and the). Since telescope was never associated with
saw or dog in the corpus, the estimated probability of either attachment is zero. The problem
is that the corpus is not large enough to contain useful counts for every type of event. Even
if the previous example had occurred in the corpus, the single occurrence would not have
been enough to give an accurate probability.
There are two separate kinds of solution to this problem. One approach is to simplify
the events being looked up until they are general enough for there to be sufficient counts of
similar events in the corpus. At this point the relative frequencies in the corpus become an
accurate estimate of the correct probability of the event. This process is called backoff. A
second approach is to look for ways of determining how much we can trust the counts of any
given event in our training corpus. (If we trust them completely then it will be impossible
to generate any novel structures since they will all have a probability of zero.) In general,
the fewer instances of an event there are, the less we will trust the estimate derived from the
corpus. We need to work out how we can derive an estimate of the true frequency from the
counted frequency. This process is called smoothing.4
2.5.1 Backoff and interpolation
Let’s return to our example sentence The man saw the dog with the telescope. As just noted, we
need to look for PPs headed by telescope attaching to NPs headed by dog or VPs headed by
saw, and the problem is there are none in the corpus. Of course, if we were working with a
grammar that did not contain any lexical heads, then there would be no problem; there are
thousands of cases where PPs attach to both NPs and VPs and so the probability estimate
is very good. Thus we can back off by deciding to throw away lexical heads altogether,
and thereby derive a reliable estimate from the corpus. On the other hand, the backed-
off grammar will not be as sensitive in choosing the correct parse, precisely because it has
thrown away useful information about how to do this. The problem here is how far to back
off. Solving this problem is one of the most important issues in statistical parsing.
4The terms ‘smoothing’ and ‘backoff’ are used in different ways by different writers. For instance, Niesler
uses the term discounting where I use the term smoothing. The definitions I have just given will be used in the
remainder of this text, except when I refer to the titles of existing techniques such as ‘Katz smoothing’ (which in
my terms is actually a combination of smoothing and backoff techniques).
A first step towards a solution is to represent events at several different levels of granu-
larity. For instance, we might use lexical information when counts are high and then discard
it for a less accurate model when counts are low. Having decided the different levels, the
next step is to combine their probability estimates into a single probability. This process is
known as interpolation.
n-gram models
In order to describe backoff and interpolation formally, it is useful to think about a simple
example domain. While we have so far been thinking about sentences as hierarchical syn-
tactic structures, it is quite common in statistical NLP to think about a sentence simply as a
sequence of words. In this situation, the context of a given word in a sentence is simply the
sequence of words which precedes it. Formally, we write the probability of the
ith word in the sentence as shown in Equation 2.9. In this equation, and elsewhere in the
thesis, the notation w_i refers to the i'th word in the sentence, and w_1^{i-1} refers to the sequence
of words from the first word to the (i-1)'th word.

P(w_i | w_1^{i-1})    (2.9)
To estimate the probability of a word appearing after a given sequence of words, we can
then simply count how many times the word appeared after this sequence divided by the
number of times the sequence itself occurred. This equation is not very useful in practice
because it makes it impossible to derive probabilities for words appearing in novel contexts.
To resolve this, we approximate the context by the last few words. This is known as an
n-gram approximation, where n is the number of words making up the context. Using this
idea, an n-gram approximation of this equation is given in Equation 2.10.

P(w_i | w_1^{i-1}) ≈ P(w_i | w_{i-n+1}^{i-1})    (2.10)
Essentially, we are assuming that the probability of seeing wi is independent of the words
seen much earlier in the sentence. Careful independence assumptions always underlie the
process of backoff: when we decide to throw away some piece of knowledge about an event
to increase counts, we are always assuming that this component is independent of the aspect
we are interested in.
By varying n we can sacrifice counts for greater discriminating power. So, for instance,
a unigram model would give the probability of the current word, regardless of context; a
bigram model takes the previous word into account to predict the likelihood of the current
word, while a trigram model takes into account the two previous words.
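In code, a maximum-likelihood n-gram estimate is simply a ratio of two counts, as in the sketch below; the toy corpus, whitespace tokenisation and '<s>' padding are assumptions made purely for the example.

    # Sketch: relative-frequency (maximum likelihood) n-gram estimation.
    from collections import Counter

    def train_ngrams(sentences, n):
        ngrams, contexts = Counter(), Counter()
        for sent in sentences:
            words = ['<s>'] * (n - 1) + sent.split()
            for i in range(n - 1, len(words)):
                context = tuple(words[i - n + 1:i])
                ngrams[context + (words[i],)] += 1
                contexts[context] += 1

        def prob(word, context):
            context = tuple(context)
            if contexts[context] == 0:
                return 0.0
            return ngrams[context + (word,)] / contexts[context]
        return prob

    bigram = train_ngrams(['the dog barked', 'the dog slept', 'the cat slept'], 2)
    print(bigram('dog', ['the']))    # 2/3: 'the' was followed by 'dog' twice out of three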
When using an n-gram model, we sometimes need to estimate the probability of a com-
mon event, and we sometimes need to estimate the probability of a rare event. In the former
case, we would like to derive our estimate by using a fairly large value of n, because there
will be high counts of the relevant events in the corpus. In the latter case, we would prefer
n to be small, since data will be sparser. To allow both situations, it makes sense to be able
to operate at several different levels of backoff, as mentioned in the previous section, by
counting events with several different values of n. A first step could be to adjust the model
complexity based on the counts, using something like Equation 2.11.
P(w_i) = P(w_i | w_{i-n+1}^{i-1})   for high counts
         P(w_i | w_{i-n+2}^{i-1})   otherwise          (2.11)
This equation has the advantage that it is extremely simple, but it has several problems.
Firstly the use of two subequations like this makes it extremely difficult to ensure the prob-
ability distribution sums to one, and secondly it seems wasteful to discount the probability
estimate produced using the small to medium number of counts and only use the backed-off
model.
Interpolation using n-grams
To address both these problems, Equation 2.11 is never used in practice. Instead we inter-
polate between these models as shown in Equation 2.12.
P(w_i) = λ P(w_i | w_{i-n+1}^{i-1}) + (1 − λ) P(w_i | w_{i-n+2}^{i-1})    (2.12)
In this equation, λ is a value between zero and one which determines how much weight-
ing to give to the more complex model; when the counts are high, λ will be near one. The
equation for actually computing λ is not presented here since there is no single standard
equation. One key idea is that this approximation can be applied recursively, so the n − 1
model can be simplified to an n − 2 model, and so on. While in theory we could reduce the
complexity of the model very slowly using this method, in practice the number of probabil-
ity estimates is tightly constrained. Parsing time and memory usage are directly dependent
on the number of backoff levels because each level of backoff requires storing a set of events
covering the whole training corpus. So going from one level of backoff to two will almost
double parsing time and memory requirements.
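Equation 2.12, applied recursively down to a unigram model, might be implemented roughly as follows; p_ml and count are assumed to come from an n-gram model such as the one sketched earlier, and the formula used for λ is a simple placeholder since, as noted above, there is no single standard choice.

    # Sketch: recursive interpolation (Equation 2.12), backing off by dropping the
    # most distant context word at each level. p_ml(word, context) returns the
    # maximum-likelihood estimate; count(context) returns how often the context
    # was seen in training. Both are assumed to be supplied by the caller.
    def interpolated(word, context, p_ml, count):
        if not context:
            return p_ml(word, ())                        # base case: unigram estimate
        lam = count(context) / (count(context) + 5.0)    # more counts -> trust the model more
        backed_off = interpolated(word, context[1:], p_ml, count)
        return lam * p_ml(word, context) + (1 - lam) * backed_off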
Deciding which terms to treat as independent
In the previous equations, the n-grams were used to compute the probability of a word. In
this case, it is obvious that close words are a better predictor of the likelihood of the current
word than distant words, and so the simplified models simply discard the distant words. Let
us now return to the scenario where we are estimating the probability of whole grammatical
constructions. In this scenario, the order in which terms should be discarded is much less
obvious. For instance, should we discard the word first, or the POS tag? This problem will
be discussed in Chapter 7.
Another closely related point is that in parsing we frequently work with conditional
probability statements with more than one term on the left. Recall for instance that in Equa-
tion 2.6 we are computing the probability of seeing a constituent with a given nonterminal
category and a given headword in a particular syntactic context. (Thus in our PP attachment
example we are computing the probability of seeing a PP headed by the word telescope in
various contexts.) By specifying that we are computing both terms at once, we are assuming
that these terms are dependent on each other. That is, we cannot reduce the probability to a
simple product:
P(RNT, Rw | HNT, Hw, PNT) ≠ P(RNT | HNT, Hw, PNT) × P(Rw | HNT, Hw, PNT)    (2.13)
This may seem obvious — the word and the nonterminal are clearly related — but en-
forcing their dependence causes problems. For instance, if the word was never seen with
this nonterminal in the training corpus then the probability of the pair in any attachment
would be undefined. What we would like to do is break the dependency assumption when
the counts are too low, much as we discarded extra information from the more complex
models. The approach taken starts by noting the basic Equation 2.14 from statistics.

P(a, b) = P(a | b) P(b)    (2.14)
This equation says we can split any dependency event into a conditional probability.
An important point for later equations is that we can have extra terms after the b without
affecting the equation at all. Using Equation 2.14, we can replace the incorrect Equation 2.13
with a corrected version, given in Equation 2.15.

P(RNT, Rw | HNT, Hw, PNT) = P(RNT | Rw, HNT, Hw, PNT) × P(Rw | HNT, Hw, PNT)    (2.15)

Initially, this has not gained us anything: Equation 2.15 still involves the calculation of
Count(Rw, HNT, Hw, PNT), and it was the lack of reliable counts for this term that led us down
this path in the first place. However, the simplification techniques just discussed can now
be applied to each of these terms separately. Because of this, we can assume independence
as necessary to increase counts.
2.5.2 Smoothing
In any operations research it is quite normal to distinguish between a model and what is
expected from the population. For example, if a die is rolled twice and both times score a six
then a model of the die will say it always rolls six, yet it is entirely possible the die is fair and
the double six is a coincidence. At the same time we cannot conclude the die is fair because it
may be weighted, though by throwing the die more often we can gain increased confidence
in the model. Similarly in probabilistic grammars, just because an event has never occurred
before does not make it impossible. The probability estimate must balance the distribution
of language it has seen with an acceptance that it has only seen a small and probably biased
subset of possible sentences.
Maximum Likelihood estimation
The principle of maximum likelihood estimation (MLE) can be summarised by the state-
ment: Find the parameters that make the observed data most likely. If we treat the corpus
as our source of probabilities then we can derive a probability model from it directly. However, if we instead treat the corpus as a random sample of the true language, then we should choose the parameters that maximise the probability both of seeing our corpus and of seeing the input sentence.
In practice, exact MLE is intractable for complex domains, and so most practical techniques involve approximations. These usually come down to the same steps as deriving a relative-frequency probability, with an additional tweak to provide some counts for events that were not seen during training. There are a few approaches to approximating MLE; in this thesis I will introduce two.
Add-one smoothing
A very simple way to account for the finite size of the corpus is to add one to every count. An event that occurred zero times in training is then treated as if it occurred once. This approach works surprisingly well, and is included in many other
approximations of MLE.
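As a minimal sketch (not taken from any parser discussed here), add-one smoothing over a table of event counts looks as follows, where vocab_size stands for the number of distinct events the model could in principle produce:

    def add_one_prob(count, total, vocab_size):
        # Add-one (Laplace) estimate of P(event): one pseudo-count is added to
        # every possible event, so the denominator grows by vocab_size.
        return (count + 1) / (total + vocab_size)

    # An event seen 0 times in 1000 observations, with 5000 possible events,
    # still receives a small non-zero probability: 1 / 6000.
    p_unseen = add_one_prob(0, 1000, 5000)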
Good-Turing estimate
Good-Turing is a technique that addresses the problem of low-frequency events by estimating the frequency of events that were never seen during training (Gale and Sampson, 1995). To alleviate this problem, the Good-Turing estimate says we should replace the count of how often an event actually occurred in training with an estimate of how often it would be expected to occur. More precisely, it says:
r∗ = (r + 1) × E(n_{r+1}) / E(n_r) (2.16)
In this equation, r is the number of times an event occurred during training, and r∗ is our re-estimate of r. n_k is the number of distinct events that occurred exactly k times in the training corpus, and the function E smooths n_r, since actual values of n_r are subject to a lot of noise, especially for large r. Since this equation adjusts the number of times an event is counted as occurring, it also changes the probability assigned to that event, which is of course our intention.
So what does this equation mean? The r + 1 term adds one to the frequency of every event, in a manner analogous to add-one smoothing. The total number of unseen events is E(n_0), which we approximate by E(n_1), and then further approximate by n_1. Since we have given each of these n_1 unseen events a frequency of one, we must subtract n_1 events elsewhere to keep the probability model summing to one, and since we do not know where to subtract the events from, we distribute the subtractions across all events. Since n_{r+1}/n_r will tend to one for high r, this term concentrates the subtractions on infrequent events.
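The reestimation in Equation 2.16 can be sketched as follows, assuming for simplicity that the smoothing function E is the identity; a real implementation would smooth the n_r values as Gale and Sampson describe. The function and variable names are illustrative only.

    from collections import Counter

    def good_turing_adjust(event_counts):
        # event_counts maps an event to the number of times it was seen (r).
        # Returns r* = (r + 1) * n_{r+1} / n_r for each event, leaving r
        # unchanged where n_{r+1} is zero (with E taken to be the identity).
        n = Counter(event_counts.values())   # n[r] = number of events seen r times
        adjusted = {}
        for event, r in event_counts.items():
            if n[r + 1] > 0:
                adjusted[event] = (r + 1) * n[r + 1] / n[r]
            else:
                adjusted[event] = r
        return adjusted

    # The count mass freed by this adjustment (roughly n[1] counts in total) is
    # what the model can redistribute to events never seen in training.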
2.5.3 Combined interpolation and smoothing techniques
We are now in a position to cover two backoff techniques. The techniques that will be ex-
amined are perhaps the most popular in the literature, Jelinek-Mercer smoothing and Katz
smoothing.
Jelinek-Mercer Smoothing
Jelinek and Mercer have developed a model that uses a Maximum Likelihood estimate in-
stead of the Good-Turing estimate just discussed. Jelinek’s equation is presented in Equation
2.17. As with the equations in Section 2.5.2, this equation is presented as an estimation of
the probability of a word in terms of n-grams.
p_interp(w_i | w_{i-n+1}^{i-1}) ≜ λ_{w_{i-n+1}^{i-1}} P_ML(w_i | w_{i-n+1}^{i-1}) + (1 − λ_{w_{i-n+1}^{i-1}}) p_interp(w_i | w_{i-n+2}^{i-1}) (2.17)
The key innovation in this equation is the use of recursion to combine backoff and
smoothing. The probability of a target word given a context of n − 1 preceding words is first computed using some approximation to MLE (in the above equation, P_ML(w_i | w_{i-n+1}^{i-1})). This probability is then interpolated with a recursive call to the same equation with the context shortened by one word.
Essentially, this equation is identical to Equation 2.12 except that the maximum likelihood estimate is used in place of a simple count lookup at each level, and that Jelinek gives a specific function for computing the interpolating term λ. The actual algorithm
used to compute λ is quite complex and precise details will not be presented here (see Jelinek
and Mercer (1980)). Like Good-Turing they look for events that occurred the same number
of times as the event being estimated, and they bucket neighbouring counts together to cope
with insufficient training data.
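The recursion in Equation 2.17 can be sketched as below. This is a simplified illustration: λ is passed in as an arbitrary function of the context rather than being computed by Jelinek and Mercer's bucketing scheme, and P_ML is an ordinary relative-frequency estimate; the names counts, context_counts and unigram are hypothetical.

    def p_interp(word, context, counts, context_counts, lam, unigram):
        # Jelinek-Mercer style interpolation of an n-gram estimate with the
        # recursively smoothed (n-1)-gram estimate, as in Equation 2.17.
        # context is a tuple of preceding words; lam(context) returns the weight.
        if not context:
            return unigram(word)                       # recursion bottoms out
        denom = context_counts.get(context, 0)
        p_ml = counts.get(context + (word,), 0) / denom if denom else 0.0
        l = lam(context)
        return l * p_ml + (1 - l) * p_interp(word, context[1:], counts,
                                             context_counts, lam, unigram)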
Katz smoothing
While Jelinek-Mercer builds on MLE, Katz smoothing is a direct extension of the Good-
Turing estimate to interpolate between different n-grams. It was developed and described
by Katz (1987), although the equations in this thesis are based on a slightly modified form given
by Wu and Zheng (2000). Equation 2.18 shows how a probability is computed using Katz
smoothing, for an n-gram model.
P_katz(w_i | w_{i-(n-1)}^{i-1}) ≜ λ(w_i, w_{i-(n-1)}^{i-1}) P_GT(w_i | w_{i-(n-1)}^{i-1}) + (1 − λ(w_i, w_{i-(n-1)}^{i-1})) P_katz(w_i | w_{i-(n-2)}^{i-1}) (2.18)
The similarity between this equation and Equation 2.17 is obvious. Apart from the use of
a different lambda function, the only difference is the use of PGT, the Good-Turing estimate,
instead of PML. Katz’s lambda term is significantly simpler than Jelinek’s; it is assumed to be
one (no interpolation) if there are any counts. A much more complete introduction to Katz
smoothing can be found in a number of references, such as Chen and Rosenfeld (2000).
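A sketch of this scheme is given below, with λ taken to be one whenever the full n-gram has been seen and zero otherwise. The helpers p_gt (a Good-Turing discounted conditional estimate) and alpha (the back-off weight that, in a full implementation, redistributes exactly the discounted mass) are assumed rather than defined here.

    def p_katz(word, context, counts, p_gt, alpha):
        # Katz back-off in the spirit of Equation 2.18: use the Good-Turing
        # discounted estimate when the full n-gram was seen (lambda = 1), and
        # otherwise back off to the shorter context, scaled by alpha(context)
        # so that the distribution still sums to one.
        if not context:
            return p_gt(word, context)
        if counts.get(context + (word,), 0) > 0:
            return p_gt(word, context)
        return alpha(context) * p_katz(word, context[1:], counts, p_gt, alpha)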
Comparison of backoff techniques
Having covered two of the standard methods for combining interpolation and smoothing,
it seems reasonable to compare them and choose the best. Curiously, this is a step most
people in the literature skip, simply stating for instance that they are using Jelinek-Mercer
smoothing, Katz smoothing, Witten-Bell smoothing, or Wu’s enhanced version of Katz, etc.
The best comparison of the different methods is provided by Chen and Goodman (1996).
They found that both Jelinek-Mercer and Katz smoothing perform well, with Katz smooth-
ing performing slightly better on more complex models with more training data. They also
demonstrate several other smoothing algorithms which perform better; these will not be
covered here since we are discussing well-established algorithms.
2.6 Probabilistic parsing algorithms
The parsing algorithm most commonly used in statistical parsers is heavily based on the
standard chart parsing algorithm already discussed in Section 2.2. There are a number of
extensions needed to this algorithm in order to support statistical parsing. The most ob-
vious question is how a probabilistic grammar can be incorporated into a chart parser, given that every edge in the chart must be assigned a probability. For one thing, a chart parser
is (almost always) a bottom-up parsing algorithm, while the statistical grammars we have
been working with use a top-down probability model. In fact there is no contradiction here,
but it is worth spelling out how the parser and the probability model work together. Recall
that in chart parsing we begin by creating an edge for each word in the input string, then we
derive all combinations of edges recursively, computing all possible parses of all substrings
of the input string, with the set of parses of the full input string being computed last. In
a statistical grammar with a top-down probability model, we compute the probabilities of
child nodes given parent nodes — for example the probability that a certain parent node
expands as a subtree with a certain node as head child and certain other nodes as left and
right sisters of this head. If a top-down probability model is used in a chart parser, this
means that the edges in the chart are all generated bottom-up, in the usual chart-parser way,
but that the probability of each of these edges is computed by beginning at its root node.
Some extensions, which are needed to adapt a regular chart parsing algorithm to a prob-
abilistic grammar, are specific to the particular grammar formalism we are using. (For in-
stance, if we adopt an HPSG-based formalism, we need to modify the operations for cre-
ating and combining edges; see Section 3.2 for a discussion of these extensions.) However,
the main extensions which are needed result from the fact that statistical grammars derived
from large corpora are typically far too big to permit the complete set of possible parses to
be derived for each constituent. Chart parsing has a complexity of O(n³m²), where n is the sentence length and m is the number of rules in the grammar; if there are very many grammar rules then exhaustive parsing becomes impractical for ordinary-sized input sentences. The solution is to move
away from computing every possible parse of every possible substring of the input. Clearly
what we want to do in practice is to throw away some of the edges with low probability.
This operation can be interpreted using the metaphor of heuristic search in classical AI, or
using the related metaphor of efficient graph search from computer science, or using the
slightly different metaphor of Markov modelling from probability theory. In practice, sta-
tistical parsers typically use algorithms ‘inspired by’ these metaphors, rather than precise
implementations, but it is worth presenting the underlying theory clearly before discussing
approximations in real implementations. Before we discuss these approaches, however, we
first need to say a little more about the notion of ‘the probability of an edge’, because this
term can actually be understood in two separate ways.
2.6.1 Inside and outside probabilities
When we say that the probability of a constituent is p, what does this mean exactly? One
thing it could mean is: the probability that the substring spanned by the constituent has this
parse is p. This is known as the inside probability of a constituent. There are a couple of
things to note here. Firstly, this probability is completely independent of probabilities of
structures elsewhere in the sentence; it is a statement about the substring in question, and
nothing else. Secondly, p is the probability of this interpretation of the substring as opposed
to another interpretation. The probabilities of all possible interpretations of the substring
will sum to 1.
There is a problem in using inside probabilities by themselves as a guide for deciding
which edges to focus on in order to reduce the complexity of parsing. Consider a substring
which splits a sentence in an unnatural way; e.g. the boy saw. This substring is unlikely to
feature as a constituent in the final parse. But it might nonetheless have some possible parses;
for instance, it might have a single parse as an NP (analogous to the band saw). In fact, the
inside probability for this parse will actually be very high; since we are just looking for
parses of this substring, and there is only one such parse, its inside probability will actually
be 1. Conversely, if a substring cuts an input string into a more likely constituent then it will
probably have several possible parses; for instance saw the girl can be analysed in several
ways. The inside probabilities of each of these parses must still sum to 1, and each will thus
be correspondingly lower. This is unfortunate; it has the effect that the probability of the best
parse of a good candidate substring will probably be lower than that of a bad one. Naturally,
this problem will resolve itself when we try to combine these edges with other edges. the
boy saw parsed as an NP will combine very badly with other edges in the chart, and result
in edges with very low probabilities, whereas the parses of saw the girl will combine more
successfully. However, during parsing, the spuriously high probability of constituents like
the boy saw, which are locally likely but globally implausible, could easily lead us to wasting
time. This is especially true if we are implementing a heuristic which leads us to concentrate
on the locally most likely edges.
Fortunately, there is another way of looking at the probability of a given parse of a sub-
string: the probability of this parse appearing as part of the final parse of the sentence. This
is known as the outside probability of the parse. This alternative interpretation nicely re-
solves the problem of saw the girl being globally unlikely, but it has the inverse problem. Say
S→ NP, VP is the most common rule application found in the training corpus. If we favour
edges with high outside probabilities, we will tend to interpret every edge as S→ NP, VP,
regardless of what material it actually contains. As with inside probabilities, this problem
is resolved at the end of parsing, but will easily lead to us wasting time trying to generate
constituents which are locally unlikely but fit well on a global scale.
In practice, when we are parsing, we want to use both the inside and the outside prob-
abilities of the edges we generate to help us focus on the most likely edges. If we have
a bottom-up parser, we can compute accurate estimates of the inside probabilities of the
edges we generate, but we have to estimate outside probabilities more crudely, because we
do not yet know the global structure of the input string. If we have a top-down parser, we
begin by computing hypotheses about the global structure of the input string, and so we can
compute accurate estimates of the outside probabilities of edges, but we have to use cruder
estimates of inside probabilities, because we have not yet generated their internal structure.
The cruder estimates in each case can simply be derived by looking at the relative frequency
of suitable events in the whole training corpus. For instance, if we have a bottom-up chart
parser, we can estimate the outside probability of an edge whose parent node is P simply by
counting the number of Ps in the whole corpus and dividing by the total number of nodes
in the corpus.
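This crude estimate amounts to nothing more than a relative frequency over node labels, as in the following sketch (the counts would be gathered from the training corpus; the names are illustrative):

    from collections import Counter

    def prior_outside_estimate(label, node_label_counts):
        # Crude outside estimate for an edge whose parent node is `label`: the
        # relative frequency of that label among all nodes in the training corpus.
        total = sum(node_label_counts.values())
        return node_label_counts[label] / total if total else 0.0

    # e.g. prior_outside_estimate("NP", Counter({"NP": 900, "VP": 600, "PP": 300}))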
The difference between inside and outside probabilities is illustrated for saw the girl in
Figure 2.17, a partial parse in which the different areas examined by the inside and the outside probabilities are marked. In this figure, the smaller triangle spans the constituent saw the girl, and it has the analysis VP(. . .). The inside probability of this constituent is the probability of VP(. . .)
being the appropriate interpretation of these three words (the area with dark shading). The
outside probability of the constituent is the probability of VP(. . .) appearing in this position
in a parse of the rest of the sentence (the area with light shading).
Using this terminology it is much easier to follow what happens to the probabilities
of edges during chart parsing. Since we start with single words, each edge will have an
inside probability of one, and an outside probability estimated simply by counting the rel-
ative frequency of the word in the whole training corpus. As the words are expanded into
constituents, their inside probabilities will always decrease while their outside probabili-
ties will typically increase. Inside probabilities are strictly decreasing: the probability of a combined edge is the product of the probabilities of the two edges which form it and the probability of this combination, and since all probabilities are bounded by one, this must be less than or equal
to the lowest of the three terms. However, the outside probability is not accumulated with
each edge; it must be recomputed after each combination and may be larger or smaller than
its children’s outside probability. Since combined edges span more of the sentence, leaving
less unknown, they tend to have a higher outside probability. Of course, the logic in this
paragraph could easily be reversed if the parser started with the root node of the tree. In
that case the outside probabilities would be accumulated while inside probabilities would
have to be recomputed at each step.
2.6.2 Parsing as state space navigation
This thesis largely ignores the theoretical foundations of computer language processing; they have simply been taken for granted. However, there are some optimisations to the
parsing process that are best explained by examining the foundations of parsing and so
these will be briefly discussed here. We have already discussed the idea that a grammar can
be viewed as a set of phrase-structure rules. These rules can also be viewed as transforming
from one state into another. Formally, we can define the parsing problem as a four-tuple consisting of a start state S, an input string i, a set of transformation rules t, and at least one goal state x. Given this representation, parsing is the process of starting at the start state and applying transformations from t until a goal state x is reached.
Parsing with hypergraphs
Having formally defined parsing, it is relatively simple to represent it as a walk in a graph: the application of one of the transformation rules outlined above (the grammar rules) corresponds to following an edge. Further, probabilistic parsing can be represented by putting a cost on the arcs of the graph. Under this formalism, a node that can be reached corresponds to a substring that can be parsed, and the shortest path corresponds to the best parse.
The advantage of looking at parsing this way is that there are a large number of algo-
rithms for graph processing that can then be applied to parsing. For instance, applying
Dijkstra’s algorithm leads to a probabilistic chart parser. This approach has been examined
extensively by Goodman (1998) and also by Klein and Manning (2001b, and other papers).
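Under this view the best parse is simply the cheapest path, with arc costs set to negative log probabilities so that multiplying probabilities corresponds to adding costs. The sketch below runs Dijkstra's algorithm over such a graph; it is a generic illustration rather than any of the cited parsers, and real parsers operate over hypergraphs of edges rather than a plain state graph.

    import heapq, itertools
    from math import log

    def best_parse_cost(start, goals, arcs):
        # arcs maps a state to a list of (next_state, probability) pairs.
        # Returns the lowest total cost (the negative log of the highest
        # probability derivation) of reaching any goal state, or None.
        tie = itertools.count()          # tie-breaker so states need not be comparable
        queue = [(0.0, next(tie), start)]
        best = {start: 0.0}
        while queue:
            cost, _, state = heapq.heappop(queue)
            if state in goals:
                return cost
            if cost > best.get(state, float("inf")):
                continue                 # stale queue entry
            for next_state, prob in arcs.get(state, []):
                new_cost = cost - log(prob)   # costs add where probabilities multiply
                if new_cost < best.get(next_state, float("inf")):
                    best[next_state] = new_cost
                    heapq.heappush(queue, (new_cost, next(tie), next_state))
        return None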
Parsing and A∗ search
Just as with graphs, it is relatively simple to define the problem of finding the most probable
parse (or, a good parse) of an input string as heuristic search, in the classical AI sense of the
word. Search involves systematically generating a set of possible states of a domain, starting
from an initial state and attempting to produce a goal state. Each state can be expanded, to
produce new states. In a simple top-down parser, each state is a partially-built parse tree;
each state can be expanded by choosing a node in this tree and applying all possible rules
in the grammar to grow this node.5 In a chart parser, states of the search space are not
stored explicitly, but the set of states on the fringe of the search space (i.e. the set of nodes
in the search space which have yet to be expanded) can be viewed as the set of possible
combinations of edges which span the complete input string. Expanding a state on the fringe
can still be quite clearly modelled as creating all possible edges that span a new substring,
using the existing edges in the chart.
In AI search, different routes to a goal state can have different costs. There is a straight-
forward analogy of cost in probabilistic parsing; the cost of a complete parse is simply an
inverse function of its probability, so that the highest probability parse has the lowest cost
and vice versa. A heuristic search is used in cases where the search tree is too big to be gen-
erated in its entirety. This is clearly the case in our probabilistic parsing scenario. The idea
is to estimate the cost of intermediate states in the search space, and first expand those with
the lowest cost. This is known as best first search. One useful algorithm from classical AI is
the A∗ search strategy which is a form of best-first search in which the heuristic evaluation
function is broken into two parts — one part is the cost of getting to the current state, and
the second part is an estimate of the cost of reaching the goal state from the current state. If
two conditions are met, it can be shown that an A∗ search is guaranteed to terminate, and to
find the shortest-cost path to the goal first.
The first condition is that all path costs are strictly positive. This corresponds to the requirement
that all grammar productions have an inside probability strictly less than 1. Given a suitable
smoothing model, this will indeed be the case.
The second condition is that the heuristic evaluation function is optimistic about the cost
of reaching the goal from the current state. This means that the actual cost of reaching the
goal from the current state can never be less than our estimate of the cost. It is possible to
use the inside and outside probabilities so that they meet the requirements of an A∗ heuris-
tic. The inside probability accurately represents the cost to the current state while, at least
theoretically, the outside probability accurately represents the cost to get from the current
state to the goal. In practice the method used to evaluate the outside probability is just an
estimate, but it is possible to write this estimate so that it is always optimistic. For instance,
Klein and Manning (2002) developed such a parser.
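In this setting the A∗ evaluation function for an edge is just the sum of two negative log probabilities, as in the sketch below. It assumes the outside-probability estimate is always at least as large as the true outside probability, so that the corresponding cost term never overestimates the remaining cost; the function name and arguments are illustrative.

    from math import log

    def a_star_priority(inside_prob, outside_estimate):
        # A* evaluation f = g + h: g is the exact cost accumulated so far
        # (the negative log of the inside probability) and h is an optimistic
        # estimate of the remaining cost (the negative log of an over-estimate
        # of the outside probability).  Lower values are expanded first.
        return -log(inside_prob) - log(outside_estimate)

    # An agenda-based parser would repeatedly pop the edge with the smallest
    # f value, expand it, and push any newly created edges back onto the agenda.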
5When we talk about ‘parent nodes’ and ‘child nodes’ in a parsing algorithm, there is a potential ambiguity,
since we could be talking about relationships in the search space, or in the syntactic structures being built. I will
restrict the terms ‘parent’ and ‘child’ to refer only to relationships in syntactic structures.
2.6.3 Viterbi optimisation
It is also possible to construe probabilistic parsing as a walk through a Markov model. A
Markov model specifies a set of states, and a transition probability between each pair of
states. In the parsing scenario, states are partial parses (just as they were when parsing was
interpreted as AI search). The transition probability between two states is the probability
of applying a grammar rule which makes the transition between the two states. Thinking
of probabilistic parsing in this way allows us to use standard techniques from probability
theory for reducing the complexity of probabilistic reasoning tasks.
The AI search algorithms described in Section 2.6.2 are still intended to find every pos-
sible parse of the input sentence. All A∗ search does is provide a useful ordering on nodes
to be expanded so that the most probable complete parse is the one found first. However,
we are frequently not interested in all parses of a sentence, but only in the single most likely
parse. The Viterbi algorithm is a general statistical technique for finding the most likely
state in a Markov model (Viterbi, 1967). When applied to parsing, the algorithm states we
can discard any interpretation which we know will not form part of the final parse (Manning
and Schütze, 1999). Formally, we have previously been attempting to compute the probability of every parse tree given some input sentence, i.e. P(T|S) for each tree T. Under the Viterbi optimisation we instead compute only the single most likely parse tree, arg max_T P(T|S).
The most obvious optimisation from the Viterbi algorithm is to remove ‘duplicate’ con-
stituents. Recall that when the parser considers how a given constituent might combine with
other constituents in the chart, it does so on the basis of the parent node of this constituent.
The parent node carries some information about the internal structure of the constituent (for
example, its headword and head constituent), but by no means everything. This means that
it is possible that the chart parser produces two edges which have different internal struc-
tures, but identical parent nodes. Since they are different, these edges will almost certainly
have different inside probabilities, but since they happen to have the same parameters, the
probability model will always give the same probability for any further grammatical pro-
ductions in which they appear. The probability of the full trees created from these new
productions is the inside probability of their constituents times the probability of the combi-
nation, so the constituent which had the highest inside probability to begin with will always
result in trees with higher probability than the constituent with the lower inside probability.
If we assume the goal of parsing is simply to find the single best parse, we can discard the
less likely subtree and know for certain that we are not discarding the best parse. This op-
timisation occurs very frequently in practice, because often local ambiguity relates to how
constituents are attached, but the final generated constituent must always be the same type.
For instance, consider Figure 2.18.
This figure shows how the phrase saw the girl with the telescope has two possible interpretations.
[Figure 2.18: Two alternate interpretations of saw the girl with the telescope (one with probability 0.001, the other 0.003), showing the effect of the Viterbi optimisation]
Both of these interpretations have the same general structure; they are both VPs
headed by saw that do not need any more arguments. Since they are equivalent in the hy-
pothetical grammar that has been used throughout this section, the lower probability inter-
pretation will be discarded. If a different grammatical formalism were used which included more internal information, such as Black et al.'s with its h2, or Bod's, which includes the entire subtree, then the less likely interpretation could not be discarded.
Another optimisation related to the Viterbi algorithm is to note that if no constituent ends at a particular location, there is no point looking for constituents starting at that location, since they cannot possibly be used in an analysis spanning the whole input.
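The duplicate-elimination step can be sketched as follows. This is a generic illustration rather than Collins' code: edges are keyed by exactly the fields the probability model will later condition on, and only the highest-probability edge for each key is kept in a chart cell.

    def add_edge(cell, edge, prob, signature):
        # cell maps a signature (span, parent label, headword, head tag, ...) to
        # the best (edge, prob) pair seen so far with that signature.  An edge
        # whose signature already has a higher-probability entry can be thrown
        # away: by the Viterbi argument it can never appear in the best parse.
        key = signature(edge)
        if key not in cell or prob > cell[key][1]:
            cell[key] = (edge, prob)
            return True       # edge kept
        return False          # edge pruned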
2.7 Three statistical parsers
Having discussed statistical parsing in abstract terms, I will now describe three real sta-
tistical parsers in order to show how the techniques interact. The parsers which will be
described are Klein and Manning (2003) because it is a high-performance non-lexicalised
statistical parser, Bod and Scha (1996) because it shows how a large amount of information
can be used, and Collins (1999) because it strikes a nice balance between theory and prac-
tice. Magerman’s parser SPATTER will not be discussed because it is a precursor of Collins’
system, and has been superseded by it. We focus here on complete systems, rather than on
grammatical formalisms, some of which have already been discussed.
2.7.1 Klein and Manning’s statistical parser
Recall that Klein and Manning’s parser is unlexicalised, and obtains its high performance
instead through the use of features such as whether the attachment is internal or external,
whether it dominates a verb or contains a gap, and so on. Another interesting decision Klein and
Manning made was to use no smoothing or interpolation. This is only possible because of
the unlexicalised grammar, though even so it is a little surprising. Because of this, Klein and Manning had to be careful about how much each extension reduced the available counts, just as lexicalised grammars must, and so they found the best cell in Table 2.6 was v ≤ 2, h ≤ 2 rather than the locally better performing v = 3, h ≤ 2. This same trade-off is ex-
tremely common throughout statistical grammars — extra information is usually useful, but
it results in decreased counts, which prevents other information being considered. A backoff
strategy that could intelligently choose which information to use would be extremely useful
both to lexicalised statistical parsers, and to Klein and Manning’s parser.
Final results from Klein and Manning's paper are very good: precision and recall are 86.3% and 85.1% respectively. So, despite being unlexicalised, Klein and Manning obtained
higher accuracy than Collins’ 1996 lexicalised statistical parser.
2.7.2 Bod’s statistical parser
To recall from Section 2.4.3, Bod and Scha’s grammatical representation stores a sample of
every permutation of every training tree from the corpus. Bod’s parser is called TREE-DOP.6
As in most statistical parsers, the parsing process is derived from chart parsing and involves
viewing the trees as rewrite rules.
Having built the grammar, TREE-DOP is able to parse by choosing trees from the rule
book that are likely to match the sentence. At each stage of the parsing process it will have
a ‘partial parse’ representing the sentence which has been matched so far. This, and the
current input word, are matched against trees in the rule book.
If only one tree matches the current partial parse then this is selected as the final parse.
Typically more than one tree matches the input word and the tree which is selected is the
one which has occurred most frequently with this input word. In Figure 2.19 the word Mary
is being added to a rule in the rule book to derive the partial parse shown. One nice feature
of this approach is that whole phrases such as idioms are easily recognised because their
tree matches the input perfectly.
[Figure 2.19: Using DOP to parse Mary likes]
Parsing the input produces all parse trees, which Bod and Scha (1996) refer to as the
derivation forest. The actual process is shown in Figure 2.20. Most applications are only
6‘DOP’ stands for ‘data-oriented parsing’.
repeat until one derivation is clearly best
    for k := 1 to n do
        for i := 0 to n - k do
            for chart-entry(i, i+k) do
                for each root-node X do
                    select at random a subderivation of root-node X
                    eliminate the other subderivations
    add derivation to possibilities
Figure 2.20: Idealised pseudocode for Bod’s statistical parser
interested in the most likely parse tree, so Bod would like to use some sort of Viterbi op-
timisation. However, his state-space is too large to search for the optimal solution as we
discussed above. Instead TREE-DOP uses Monte-Carlo sampling (Hastings, 1970) to esti-
mate the most likely parse in O(n3) time. Monte-Carlo sampling involves deriving multiple
parses from the derivation forest and selecting the most commonly occurring parse. The
weak law of large numbers guarantees that the most likely parse will be the one most likely
to be selected by Monte-Carlo sampling. Unfortunately, sampling requires extremely large numbers of samples to be reliable, which makes parsing using TREE-DOP impractically slow.
One solution to the slow parsing time was proposed by Goodman (1996) in which Bod’s
trees are transformed into equivalent PCFG rules. Goodman’s reimplementation is five hun-
dred times faster than Bod’s version (Goodman, 1996). However, Bod takes issue with call-
ing Goodman’s approach equivalent, stating that it does not find the globally most likely
parse (Bod, 1996); a revised version is presented in Scha and Bod (2003).
Because of the slow parsing time, Bod was unable to provide precision and recall figures
derived from exactly the same testing corpus as used by other researchers. On the sub-
corpus he used, he obtained 80% precision and recall.
2.7.3 Collins’ statistical parser
Michael Collins’ parser (Collins, 1996) has been an extremely influential contribution to the
field of statistical parsing. Since 1996 it has been the best performing statistical parsing
system, with an initial precision and recall of 86.3% and 85.8% respectively. Later improvements in Collins (1997) and Collins (1999) kept it ahead, with precision and recall of 88.7% and 88.6%. Collins' system
is essentially an extension of Magerman’s approach, and it performs better due to a number
of tweaks such as BaseNPs.
I will base my own implementation on Collins’ system, since it remains the state-of-the-
art statistical parser. A detailed overview of Collins’ system will be given in Chapter 3.
2.8 Summary and future direction
We have seen that statistical parsing is the most promising direction for natural language processing. It
provides much more flexibility, without the brittleness associated with deterministic pars-
ing. However, there are some questions about where statistical parsing should be headed.
The performance of current statistical parsers appears to be levelling off, with the difference between a generic implementation and the best implementation amounting to only a four percent reduction in errors.
Further, Klein and Manning have shown that the approach being taken for lexical informa-
tion is not working — including lexical information leads to far more work and complexity,
yet is providing us with only a tiny reduction in errors. Despite this apparent problem,
there has been no serious work yet to shift statistical parsers out of the field of carefully
edited financial text and into useful domains.
The performance of current statistical parsing systems
A simple PCFG parser obtains about seventy-five percent accuracy, while a top statistical
parser obtains a little over eighty-five percent. However, since Collins released his first
model in 1996, the state of the art has only improved about one and a half percent. To put
this in real terms, the best statistical parsers make about three parsing errors on a typical
input sentence, and parse perhaps twenty percent of sentences with no errors. Further-
more, these disappointing figures are when the parser is both trained and tested on care-
fully edited financial text. It is extremely unlikely that these performance figures would
be maintained if we were to test on less carefully edited text such as the Usenet group
biz.marketplace.investing or text outside the WSJ domain, such as an arts maga-
zine.
There are a number of approaches that can be taken to improving parser performance.
Perhaps the most logical next step is to concentrate on increasing their ability to generalise to new domains of text. Later, we could look into methods requiring less supervision, so that parsers could be trained on raw text. This is largely uncharted territory; there are no corpora available in other fields for cross-validation, and there are not even any metrics for measuring parser generalisation. The golden rule has been always to train on the same kind of corpus as is used for testing. Regardless of the approach taken, it is clear that something has to be
done. So this thesis will start with an existing state-of-the-art parser, and then look at ways
in which it can be improved with different domains in mind.
Software engineering is important
A number of statistical parsers have been written, and the source code for at least three is
available as free software. However, they are all documented at a theoretical level rather
than an implementation level. Since much of the work in designing and building a lexi-
calised statistical parser relates to implementation details, this presents a significant hurdle
to anybody attempting to write a new statistical parser.
To address this, the next two chapters describe in considerable detail exactly how Collins’
lexicalised statistical parser was implemented. It is hoped that this will give readers not
only enough information to re-implement the parser, but also a description of the techniques and software engineering issues that are important in implementing a statistical parser.
What is the future of lexicalised parsing?
Klein and Manning note that most improvements are not lexical in origin; in experiments
disabling lexical information for Collins’ parser, only a tiny reduction of model accuracy
was seen. However, the intuitive appeal of lexical information remains; the difficulty is in
combating Zipf's law: the WSJ corpus is simply too small to provide useful lexical counts.
The main goal of my thesis is to combat Zipf's law by looking at ways of clustering words into categories. It is expected that when we start using word categories instead of words,
we will have sufficient counts for useful statistics while retaining useful lexical information.
This is the topic of Chapter 6. Another important aspect is looking at how these clusters
should be used in backoff. The linear nature of deciding what to back off next is wasteful of
useful information, and a smarter approach with possible applications in almost all areas of
probabilistic modelling will be discussed in Chapter 7.
Chapter 3
A description of Collins’ parser
In order to address the issues identified in the previous chapter, I need a parser to experi-
ment with. At the start of my project, there were no publicly available lexicalised statistical
parsers, so I had to write my own. I decided to reimplement an existing system, rather
than design my own parser from scratch. An existing parser provides a good guide: we can
expect that a parser designed along the same lines should be able to achieve similar levels
of performance. As for the choice of a parser to implement, I decided to reimplement the
parser developed by Michael Collins, simply because this parser is the one which performs
best. A specific version of Collins’ model had to be implemented, and Model 2 from Collins
(1997) was chosen as striking the best balance between complexity and accuracy.1
This chapter provides a detailed overview of Collins’ grammar formalism and parser.
The chapter serves two purposes. Firstly, it provides the background necessary to under-
stand my reimplementation of Collins in Chapter 4. Secondly, and equally importantly, it
provides a description of Collins’ system which in places is more detailed than that given
by Collins himself. Collins’ own publications about his system tend to concentrate on moti-
vations for the design of his system, and do not always provide a good introduction to the
system for programmers intending to work with the code, or make alterations. My descrip-
tion of Collins will focus more on these latter issues.
3.1 Collins’ grammar formalism and probability model
Collins’ grammar is an HPSG-inspired approach very similar to the examples given in Section
2.4.1. This means that it is a top-down model (although the parser is bottom-up) in which
the head of each category is generated first and then constituents to the left and right are
1The addition of gapping (Model 3 from the same paper) led to increased complexity without increasing
accuracy significantly. Later, Collins (1999) revised his model but it was decided that the benefits this provided,
particularly in punctuation, were not sufficient to justify any modifications.
added in any order. It differs principally by modifying the training data to make it more
suitable for training a statistical parser, and also by supporting subcategorisation frames.
3.1.1 New nonterminal categories: NPB and TOP
Perhaps the most obvious addition in Collins’ grammar is the creation of two abstract non-
terminal categories, one for non-recursive noun phrases and one for a distinguished top-
level sentence phrase.
A base NP (NPB) is an NP that does not include any NPs as children. The advantage
of distinguishing these from recursive NPs is that they tend to have very different usage
patterns. Consider the NP shown below, and the NPB it contains:
[NP [NPB Pierre Vinken] , [ADJP 61 years old]]
An NP which has an NP embedded inside it is likely to take arguments such as ADJP,
while the NPB is much more likely to decompose directly into proper or common nouns.
Similarly, there is room for improvement with the sentence category (S). Along with
SBAR, it is frequently used in the WSJ for recursive sentence substructures. This leads to
several problems with using S as the root of the tree. Firstly, it complicates prior probabilities
because we cannot assume the prior probability of S heading a tree is one. Secondly, S may
be generated with subcategorisation frames (see Section 3.1.2) which are then impossible to
discharge for the root node. While neither problem is critical, Collins created a new root
node which he calls TOP to avoid these problems.
3.1.2 A distance metric: adjacency and verbs
Recall that with the HPSG grammar model, each modifier to a head is modelled as a separate
event. However, languages, and especially English, tend to have strong preferences for
modification by nearby words; a model which fails to take this into account will perform
poorly. A distance metric is an attempt to encode the preference for nearby attachments
without significantly reducing counts.
In Collins’ theory, the distance metric is implemented as a number of simple heuristics.
The first two heuristics are tests to see whether the words are adjacent, and whether there are any verbs in
between. These heuristics are easy to motivate by noting that three quarters of attachments
are to neighbouring words and only one in twenty attachments has an intermediate verb
(Klein and Manning, 2003). An advantage of these heuristics over more sophisticated ones is
that they can each be represented using a single bit, and so each will only halve the expected number
of counts for each event. There are two other heuristics used by Collins: two further bits
to represent the presence of coordination or punctuation, and subcategorisation frames to
represent a nonterminal’s complements. These will be discussed next.
Coordination and Punctuation
Coordination and punctuation are treated as a special case by Collins’ model. In particular,
the model as implemented strips punctuation out of the sentence before parsing and adds it
back in when printing the final parse. That is not to say it has no effect on the parser; instead
a Boolean flag is set whenever an operation would cross over certain types of punctuation
so that the only counts used are events from the corpus in which this flag is also set. The
handling of coordination is similar, except that occasionally and is not tagged as coordina-
tion, and so it is not stripped out of the sentence but instead carefully skipped whenever it
is tagged as coordination.
Subcategorisation frames, complements and adjuncts
As already discussed in Section 2.1.2, head words carry with them a subcategorisation list
specifying what complements they require. For instance, John chased sounds a little odd as
a complete sentence, whereas John chased the dog is natural. Collins’ approach to subcat lists
is to include a (typically empty) bag of nonterminals that each constituent requires to be
complete. A bag is used instead of a set because a few words, such as thought, require two
arguments of the same type. In order to reduce its impact on counts, only a few classes of
nonterminal are counted.
Sometimes the phrases that modify a word are optional. For instance, in John chased the
dog in the park, the PP in the park need not have been provided. Such phrases are called
adjuncts.2 There is no distinction between complements and adjuncts in the Penn treebank.
However, Collins uses a simple algorithm to preprocess the treebank to distinguish between
complements and adjuncts. The preprocessor turns every NP node into either an NP-A
(denoting an adjunct NP), or an NP-C (denoting a complement NP). This algorithm will be
discussed in Section 3.1.3.
3.1.3 Preprocessing the Penn treebank
The Penn treebank was discussed in Section 2.3.1. As well as the syntactic markup already
discussed, the treebank markup attaches a significant amount of semantic information to
nonterminals, such as -LOC for locative; nonetheless it does not include all the fields which
are required by Collins’ probability model. For instance, head words are not explicitly
2The term argument will be used to denote both complements and adjuncts.
identified as such. To address this issue, the very first step in parsing is to transform the
distributed WSJ into Collins’ format — deleting extra semantic information while adding
headwords, arguments, NPB and TOP. After adding headwords, etc. the final preprocess-
ing step is to convert the treebank into an HPSG-style event file. Figure 3.1 shows how a
simple tree can be converted into an event.
S → NP VP PP   ⇒   Parent: S,  Head: VP,  Left: [NP],  Right: [PP]
Figure 3.1: Conversion from a WSJ style tree to head driven
While this figure shows a conversion of PCFG into head driven, it makes a number of
assumptions that must be resolved in writing an algorithm to perform the conversion. For
example, how did we decide that the VP was the head child? And how do we decide which arguments are complements?
As with the code for the parser, it is worth describing the preprocessing in more detail
than a high level algorithm. Even after releasing his code, Collins did not release his pre-
processor to perform these tasks.3 The pseudocode provided by Collins left a number of
decisions undocumented and impossible to reproduce; for example, it is unclear under which circumstances certain of the rules apply to prepositions. Because of this question, and similar ques-
tions, my reimplementation of Collins’ preprocessor will be discussed in Section 4.2.
3.1.4 Collins’ event representation
So far we have described events in slightly abstract terms such as a tree being attached to
the left of the head phrase. At some point it is necessary to discuss exactly how trees are
represented so that events in the treebank can be counted. It is also necessary to describe
how events are simplified when counts are too low, and how the counts are combined to
give a probability. After this we can derive the probability of any given tree.
All events in Collins’ model are productions associated with a reference parent. There
are three types of production: a unary production is the generation of the head constituent
of the parent, a dependency production is the generation of a left or right sibling of the head
of the parent, and a subcategorisation production is the generation of a subcategorisation
frame for the head of the parent. To explain Collins’ representation of trees, we once again
3Collins recently placed a note on his website stating that Bikel’s parser can perform the necessary prepro-
cessing Bikel (2005).
refer to a simple example tree — see Figure 3.2. In this figure, only the nodes with arrows
take part in evaluating the probability.
[Figure 3.2: Collins’ representation of a left production event, for the tree of The cat chased a mouse (S → NP-C VP). The labelled fields are: Parent (P) = S, Head nonterminal (H) = VP, Head word (w), Head tag (t), Left nonterminal (L_NT) = NP-C, Left word (L_w), Left tag (L_t), Distance (∆): adjacency = true, verb = false, and Left subcat (LC): {NP-C}.]
This figure describes the fields used to represent the dependency event of an NP-C at-
taching as a left sister to the head VP of a parent S node. Collins stores this event by en-
coding the terms pointed to with arrows in the figure. At this point, the event simply is the
co-occurrence of these values for these data fields. To represent a unary production, such
as the generation of VP from the parent S node, we only need the terms on the right-hand
side of Figure 3.2. To represent a subcat production, such as the generation of the VP’s left
subcategorisation frame (in this case, a bag containing one item, NP-C), we use the same
terms as the unary event, plus one additional field: a bag of nonterminals.
Having specified exactly which parts of a tree to use, we can now say exactly how to com-
pute the probability of unary attachments (Equation 3.1), subcategorisation frames (Equa-
tion 3.2) and dependencies (Equation 3.3).
Punary(H | P, w, t) (3.1)
Psubcat(LC |H, P, w, t) (3.2)
Pdep(LNT, Lw, Lt | P, H, w, t,∆, LC) (3.3)
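One straightforward way to realise this, sketched below, is to count each event as a tuple of exactly the fields shown in Figure 3.2, with one table per production type. This is an assumption-laden illustration; Collins' own event files use their own encoding, which is not reproduced here.

    from collections import Counter

    unary_counts  = Counter()   # keyed by (H, P, w, t)
    subcat_counts = Counter()   # keyed by (LC, H, P, w, t), LC stored as a sorted tuple
    dep_counts    = Counter()   # keyed by (L_nt, L_w, L_t, P, H, w, t, delta, LC)

    def count_left_dependency(l_nt, l_w, l_t, p, h, w, t, delta, lc):
        # Record one left-dependency event read off a training tree; the subcat
        # bag lc is stored as a sorted tuple so that equal bags compare equal.
        dep_counts[(l_nt, l_w, l_t, p, h, w, t, delta, tuple(sorted(lc)))] += 1

    # Probabilities such as Equation 3.3 are then relative frequencies of these
    # tuples against the counts of their conditioning contexts.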
3.1.5 Backoff and interpolation
As in all statistical parsers, deriving the probabilities just mentioned is complicated by low
counts. To address this, events are grouped at different levels of backoff, as discussed in
Section 2.5.
Collins’ parser uses three levels of backoff for most events. A complete list of the prob-
abilities derived, at all three levels of backoff, is given in Tables 3.1 and 3.2, with the more
unusual events in Tables 3.3 and 3.4. Columns in the table describe the different event types
which are computed. (I have omitted the right events, which are symmetrical to their left
counterparts.) Cells in a given column specify probability estimates for the event at a given
level of backoff.
Table 3.1 contains backoff strategies for unary and subcat productions. Note that as the backoff level increases, we throw away elements from the right-hand side of the conditional probability terms.
Back off level Unary Left Subcat
1 P(H | P, w, t) P(LC | P, H, w, t)
2 P(H | P, t) P(LC | P, H, t)
3 P(H | P) P(LC | P, H)
Table 3.1: Collins’ unary and subcat events
For dependency events, given in Table 3.2, the situation is somewhat more complicated.
Firstly, a dependency event includes both the left word and the head word, so Zipf’s law
plays an even greater role in reducing counts. To address this issue, Collins splits depen-
dency events into two separate events whose probabilities are multiplied together. Recall
from Equation 2.14 that the joint dependency event can be split into two conditional events, each of which can then be backed off independently. A
second complication in dependency events is the addition of the Boolean flags representing
coordination and punctuation (see Section 3.1.2), represented in Table 3.2 by the terms c and
p.4
Back off level    Left1    Left2
1 P(Lnt, Lt, c, p | P, H, w, t,∆, LC) P(Lw | Lnt, Lt, c, p, P, H, w, t,∆, LC)
2 P(Lnt, Lt, c, p | P, H, t,∆, LC) P(Lw | Lt, c, p, P, H, t,∆, LC)
3 P(Lnt, Lt, c, p | P, H,∆, LC) P(Lw | Lt)
Table 3.2: Collins’ dependency events
There are several other probabilities that Collins derives. Firstly the probability of stop-
ping parsing (generating TOP) is a variation on unary productions, presumably because
Collins found the basic model produced TOP nodes incorrectly. Again, these probabilities
are computed using two separate events. The backoff strategies are given in Table 3.3. (Note
4Previous versions of Collins’ model do not include these at all, and later versions appear to place these inside the distance metric, and generate them as a separate event altogether.
Backoff level TOP1 TOP2
1 P(H, t | P = TOP) P(w |H, t, P = TOP)
2 P(H, t) P(w |H, t)
Table 3.3: Collins’ TOP events
that at level 2 backoff of TOP1, we end up using an unconditional probability to estimate a
conditional probability.)
Secondly, outside probabilities, which are used as heuristics during parsing (discussed
in Section 2.6.1) also need to be computed. Collins does not provide details for how these
probabilities are derived, but by examining the output of Collins’ parser, I have derived
Table 3.4. Note that we again use two sub-events to derive these probabilities. Prior2 in this
case is actually an unconditional probability, which at level 2 is approximated using another
unconditional probability.
Backoff level Prior1 Prior2
1 P(H |w, t) P(w, t)
2 P(H | t) P(w)
Table 3.4: Collins’ Prior events
3.1.6 Smoothing
The previous section shows how three probabilities are derived for each event type, at each
level of backoff. The next problem is to combine estimates at different levels of backoff
into a single probability estimate. The process of smoothing has been discussed in Section
2.5.2. To recap briefly: multiple values are combined using a weighted average based on the
confidence in the probabilities, and counts are modified slightly to provide a few counts for
events unseen during training. The smoothing algorithm that Collins implements appears
to be different to the one discussed in Collins (1997). The implemented version will be
presented in Figure 3.3. Here we will discuss the equations that the paper presents.
Equation 3.4 is used to combine the probabilities.
p = λ1e1 + (1− λ1)(λ2e2 + (1− λ2)e3) (3.4)
In this equation λi is the weighting at backoff level i, and ei is the probability estimate at
level i. This equation simply says to compute a weighted average, the interesting part is in
the computation of λi.
The equation Collins uses to compute λi appears to have changed throughout his work.
For instance, the equation given in (Collins, 1996) is presented in Equation 3.5; Collins (1997)
does not state how λi is computed but, as shown in Figure 3.3 it is different to the 1996
implementation, and finally Collins (1999) describes several methods, including Equation
3.5.
λ1 = δ1 / (δ1 + 1)          λ2 = (δ2 + δ3) / (δ2 + δ3 + 1) (3.5)
In these equations, δi is the count of the conditioning context (the denominator of the relative frequency) at backoff level i. Collins notes that these equations correctly increase λi as the denominator for the more specific event increases, but they are not as sophisticated as the approaches discussed in Section 2.5.2.
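Putting Equations 3.4 and 3.5 together gives something like the following sketch. It follows the published equations rather than the reverse-engineered version in Figure 3.3; e1 to e3 are the estimates at backoff levels 1 to 3 and d1 to d3 the corresponding denominator counts, all supplied by the caller.

    def collins_smooth(e1, e2, e3, d1, d2, d3):
        # Combine the three backoff estimates using Equation 3.4, with the
        # weights of Equation 3.5: a level whose conditioning context was seen
        # more often (larger denominator) receives more weight.
        lam1 = d1 / (d1 + 1.0)
        lam2 = (d2 + d3) / (d2 + d3 + 1.0)
        return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)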
In implementing Collins’ model as just described, it seemed the probabilities being gen-
erated always differed slightly from those produced by Collins’ parser. Eventually this was
found to be because the smoothing algorithm described in the thesis (Collins, 1999) differed
from that implemented in the parser that Collins released. I do not know whether Collins
forgot he changed his code when he wrote that section of the thesis, or if the implementation
was based on an older algorithm which was published elsewhere, or something else, but the
actual implemented version was reverse–engineered. It is presented in Figure 3.3.
e = ε
for i = 3 to 1
bot = 5ui
top = bot * e + fi
bot = bot * fi
e = top / bot
endfor
Figure 3.3: Collins’ smoothing function as implemented
3.2 Collins’ parsing algorithm
Collins’ parser is a statistical chart parser, along the same lines as Magerman (1995), and
those described in the previous chapter. Like any chart parser, constituents or phrases are
found bottom-up and all phrases are found regardless of their position in the sentence. Re-
call from Section 2.6 that using an HPSG style grammar requires some modifications of the
chart parsing algorithm. We are now in a position to discuss these modifications.
As described in Section 2.2.1, a chart parser is essentially three nested for loops. A
greatly simplified version is presented in Figure 3.4. (Note that no chart parser would be
implemented quite this inefficiently; much of the rest of this chapter will discuss optimisa-
tions.)
for start = 0 to length
    for end = start + 1 to length
        for split = start + 1 to end
            left = get_edges_spanning(start, split)
            right = get_edges_spanning(split + 1, end)
            combine(left, right)
        endfor
    endfor
endfor
Figure 3.4: Simplest possible chart parser pseudocode
The key to this algorithm is the combine function, which is given in Figure 3.5.

combine(left, right) {
    foreach (l in left)
        foreach (r in right) {
            joined_edge = join_two_edges(l, r);
            expanded_edges = add_singles_stops(joined_edge);
            chart.add(expanded_edges);
        }
}
Figure 3.5: Pseudocode for combine

combine
takes two sets of edges, and adds a new set of edges produced from these edges to the chart.
This happens in two steps. Firstly, we attempt to join every pair of left and right edges
to create a set of compound edges, using dependency productions (in the sense defined in
Section 3.1.4). Secondly, we expand each of these compound edges by adding parent nodes,
using unary and subcat productions (again as defined in Section 3.1.4). These two operations
will be described in turn in the next two subsections.
3.2.1 Dependency productions, and the use of a reference grammar
The core operation in the parser is the joining of two edges. In an HPSG approach, one edge is
the parent which means that its head will be the head of the new phrase, and the other edge
is a child. The child edge must be complete (it cannot be enlarged after it is grafted into the
parent), while the parent edge must be incomplete (because it grows in this operation). A
simple figure showing the operation is presented in Figure 3.6.
join_two_edges( [VP [V likes]] , [NP-C Mary] )  →  [VP [V likes] [NP-C Mary]]
Figure 3.6: Simple example showing how an (incomplete) VP can have a
(complete) NP-C added as a right sibling.
One aspect of the join operation that is significantly different with a head-driven gram-
mar is that siblings can be added either to the left or to the right of the head. This contrasts
with normal chart parsing where the grammar is normally written to be either left-recursive
or right-recursive, but not both.
Pragmatically, we can note that every join operation takes a complete edge and an in-
complete edge, so there is no point trying to join two complete edges. Having noted this,
we can split the left and right edge sets based on whether or not they are complete and
then we only have to merge the complete left edges with the incomplete right edges (the
left edges become children), followed by the operation of merging the incomplete left edges
with the complete right edges (the left edges become parents). This split halves the num-
ber of compatibility checks we have to make. Collins refers to adding a sibling to the left
of the head as join_two_edges_precede and adding a sibling to the right of the head as join_two_edges_follow.
Either way, the core operation is the creation of a new edge. This new edge is identical to
the old parent except that it has another child. The probability of the new edge also has to be
calculated. It will be equal to the probability of the parent multiplied by the probability that
this child node is added in. The probability of the child node being added can be computed
by looking up the probability of the appropriate dependency production in the probability
model discussed in Section 3.1.4.
There is one additional modification necessary to support subcategorisation frames. If
the head nonterminal of the child is marked as a complement then it must be removed
from the edge’s subcategorisation bag. If it is not present in the bag, then the new edge
is invalid and can be rejected. This whole operation actually works out to very little code,
and Figure 3.7 presents it in pseudocode form. The get_dep_prob function is presented to
make it clear how the data fields in a node become the parameters in looking up the event
probability. This code shows an important optimisation: if the new edge has an invalid
subcategorisation frame, then it is immediately rejected. This operation is optional; without
60
join_two_edges_follow(parent, child) {
    new_edge = new edge(parent);                 /* Copy the parent */
    new_edge.add_child(child);
    retval = new_edge.rc.remove(child.parent);   /* discharge the complement from the subcat bag */
    if (retval == k_failure) return NULL;        /* not subcategorised for: reject the edge */
    new_edge.prob += get_dep_prob(parent, child, k_follow);
    return new_edge;
}

get_dep_prob(parent, child, direction) {
    delta = make_delta(parent, child, direction);
    return prob(child.parent, child.headtag, child.headword,
                parent.parent, parent.headnt, parent.headword,
                parent.headtag, delta, parent.rc);  /* parent.rc: the subcat frame being satisfied */
}
Figure 3.7: Pseudocode for joining two edges (dependency events).
it, a probability of zero would be generated (since this frame has never been seen) and the
edge would ultimately be rejected; but by shortcutting the expensive probability calculation,
we save a lot of time. There is another optimisation, not shown, in which we also skip the
probability calculation and immediately reject the edge: this applies when the combination
of nonterminals being considered was never seen in the training corpus. To implement
this, we need to build a grammar of all combinations of nonterminals seen in the corpus:
if the combination being considered 'violates' this grammar, the edge is not created. One
way of understanding this operation is in relation to backoff events. Recall that at backoff
level three, events are basically combinations of nonterminals. So the decision to filter using
a grammar of nonterminals effectively means that at level three, we assume there are no
unseen events. If we implemented this principle explicitly, we would build the edge being
considered and then assign it a probability of zero; again, consulting an explicit grammar
just gets rid of the edge at an earlier stage.
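As an illustration only, such a reference grammar can be held as a set of nonterminal combinations observed in training. The C++ sketch below is not the parser's actual code: the class name, the choice of a (parent, head-child, modifier-child) triple as the key, and the bit-packing scheme are all my assumptions.

#include <cstdint>
#include <unordered_set>

// Sketch of the reference grammar used to reject impossible joins before any
// probability is computed.  Nonterminals are assumed to be the integer codes
// produced by the preprocessor.
class ReferenceGrammar {
public:
    // Called once per dependency event seen in the training corpus.
    void observe(std::uint32_t parent_nt, std::uint32_t head_nt, std::uint32_t child_nt) {
        seen_.insert(pack(parent_nt, head_nt, child_nt));
    }
    // Called in the parser's inner loop: if the combination was never seen,
    // the edge is not even created (equivalent to assuming no unseen events
    // at backoff level three).
    bool allowed(std::uint32_t parent_nt, std::uint32_t head_nt, std::uint32_t child_nt) const {
        return seen_.count(pack(parent_nt, head_nt, child_nt)) != 0;
    }
private:
    // Illustrative packing; assumes nonterminal codes fit in 21 bits.
    static std::uint64_t pack(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
        return (std::uint64_t(a) << 42) | (std::uint64_t(b) << 21) | std::uint64_t(c);
    }
    std::unordered_set<std::uint64_t> seen_;
};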
Finally, it is worth stating explicitly that all of these operations create new edges rather
than modifying existing edges; and we most certainly do not discard edges just because we
have used them to create another. Competition between the new edges and the existing
edges is how ambiguity is resolved. For instance, when we say The child cannot be enlarged
after it is grafted into the parent, it is likely that elsewhere in the chart is an incomplete version
of the child which will in due course be enlarged, marked as complete and then grafted with
the parent. To permit modification of children would lead to unnecessary duplication of
work.
3.2.2 Unary productions
The combine algorithm is intended to merge increasingly large spans of the sentence. How-
ever, in an HPSG approach we need an additional step to ‘grow’ individual edges upwards
by giving them parent edges. To see why this is necessary, consider the steps involved in
parsing the man as a noun-phrase. First we tag the as a determiner, and man as a noun. If we
next join the to the left of man then we would build a two-word noun, not a noun phrase. In-
stead, we should first create a unary production over man, yielding an incomplete noun-phrase. Later
we can attach the to the left end of this noun-phrase, to create the correct interpretation. This
example is illustrated in Figure 3.8.
[DT the] [NN man]   →   [DT the] [NP [NN man]]   →   [NP [DT the] [NN man]]
Figure 3.8: Simple example showing the steps in building an NP con-
stituent the man.
The function Collins uses for unary productions is called add_singles. It works by
creating all possible parents for the input edge and assigning a probability to each. As with
dependency events, unlikely edges are rejected and a grammar is used to avoid generating
impossible edges; pseudocode is presented in Figure 3.9.
It was noted above that the man noun-phrase was incomplete. But how do we know
this? And in sentences such as dog bites man — common in newspapers — there are no other
words in the phrase, so man by itself forms a complete noun-phrase. This ambiguity is handled
by allowing for the possibility of stopping the expansion of a phrase at any point. The function Collins uses to create
a new complete edge for each incomplete edge is called add_stop. Note that the concept of
complete nodes is used outside unary productions. This function is actually implemented
as two dependency events, as shown in the pseudocode in Figure 3.10.
Frequently multiple unary productions will be required without any intervening combine
operations. The most obvious example is with noun-phrases because of Collins’ NPB cat-
egory, but co-ordination and TOP provide other examples. This process of interleaving
add_singles with add_stop is called add_singles_stops. Pseudocode for this process
is presented in Figure 3.11.
add_singles(child) {
    foreach (parent possible_parents(child.parent)) {
        foreach (lc possible_lc(child.parent, parent)) {
            foreach (rc possible_rc(child.parent, parent)) {
                new_edge = new edge(child);          /* copy the child */
                new_edge.stop = false;
                new_edge.headnt = new_edge.parent;   /* the old parent becomes the head */
                new_edge.parent = parent;
                new_edge.lc = lc;
                new_edge.rc = rc;
                new_edge.prob += get_unary_prob(new_edge);
                new_edge.prob += get_subcat_prob(new_edge, k_left);
                new_edge.prob += get_subcat_prob(new_edge, k_right);
                result_edges.add_one(new_edge);
            }
        }
    }
    return result_edges;
}
Figure 3.9: Pseudocode for add singles. The previous parent is demoted
to a head, and new parents are generated.
add_stop(edge) {
new_edge = new edge(edge);
new_edge.stop = true;
new_edge.prob += get_dep_prob(new_edge,stop_edge,k_precede);
new_edge.prob += get_dep_prob(new_edge,stop_edge,k_follow);
return new_edge;
}
Figure 3.10: Pseudocode for add stop
add_singles_stops(edges, depth) {
if (depth == 0) return edges;
foreach (e edges)
edgeset.add_many(add_singles(e));
foreach (e edgeset)
edgeset2.add_one(add_stop(e));
return add_singles_stops(edgeset2,depth - 1);
}
Figure 3.11: Pseudocode for add singles stops
3.2.3 Search strategy in Collins’ parsing algorithm
It is clear that the above processes for generating new edges can result in a very large number
of edges being produced. In particular, the call to add_singles contains three nested for
loops, so calling it inside what is essentially another for loop (to generate recursive unary
productions) results in an algorithm with a high degree of complexity.
To constrain the state-space expansion which add_singles_stops creates, Collins men-
tions that he makes use of a beam search. Beam search is a variation on best-first search in
which the list of nodes in the search graph which have to be expanded (referred to as the
fringe in Section 2.6.2) is kept ordered by decreasing heuristic value, and truncated to a
fixed maximum length after each new node expansion (Ney, Mergel, Noll, and Paeseler,
1992). (The resulting truncated list of nodes is termed the beam.) Restricting the number
of nodes in this way enables beam search to avoid the combinatorial explosion of breadth-
first search. The goal of a beam search is to constrain the search space to a manageable size
without a significant loss in accuracy, by throwing away the nodes with the lowest heuristic
scores no matter when they were generated during the search.
The problem of developing a good heuristic is more serious for beam search than for other
search algorithms. Beam search discards any elements that are locally given a low probability,
which means that they can never become part of the overall solution, even if the locally
likely solution is later rejected. This contrasts with normal search techniques, which would
come back to the locally unlikely solution after considering alternatives. The heuristic which
Collins implements is, naturally enough, related to the probability of nodes. But recall from
Section 2.6.1 that we can assess the probability of a construction both using the inside prob-
ability that it is locally a likely production to apply, and the outside probability that it fea-
tures in a complete parse tree. Collins’ heuristic value for a node during parsing is simply
the product of its inside and outside probabilities (estimated as shown in Section 3.1.5).
Despite the attractions of beam search, there are some difficulties when it is applied to
lexicalised statistical parsing. The problem is that a very large number of new nodes are
created by the innermost operation in the parsing algorithm — add_singles_stops.
And it happens that most of these nodes have very similar heuristic evaluations, especially
at higher backoff levels. The effect of this is either that all of these nodes fall off the end of the
beam, because they are not likely enough, or that the nodes generated ‘swamp’ the beam,
and throw everything else off, discarding alternative explanations before the local situation
has a chance to be tested over the whole sentence. What we would like is to select a range
of the most likely nodes generated by add_singles_stops to be considered for further
processing by combine. Collins' solution is effectively to implement two beam searches:
one locally within add_singles_stops, and one globally within combine.5
3.2.4 Summary of Collins’ parsing algorithm
We have now covered the entire parsing process, but in a piecemeal fashion. Figure 3.12
(loosely based on page 189 of Collins' thesis) presents the same operations we have already
described, but as a single coherent function. This is the pseudocode that will be referred to
when discussing implementation details.
Very briefly, in words, here is what happens. We initialise the chart with a set of edges,
each of which is one word. Then, we attempt to join every pair of adjacent edges using a
dependency production, to create complex edges. We add each of these edges to the chart
twice, once as an unfinished edge (which needs more daughters) and once as a finished edge
(all of whose daughters have been found). We then expand all the new finished edges using
unary productions (considering both a single unary production and chains of two or three
unary productions). The resulting edges are added to the chart. Then we iterate by again
attempting to join every pair of adjacent edges (except those pairs that have already been
processed). Search heuristics are applied in two places: firstly, only a selection of the most
likely unary productions are generated; secondly, we only store the most likely edges for
each span of the input string. The algorithm finishes when there are no more pairs of edges
to consider. If there are edges spanning the whole input string whose parent nonterminal is
TOP, the one with highest probability is returned as the parse of the string. If not, the parse
fails. This is the essence of Collins’ parsing algorithm.
5In fact, Collins only uses the term ‘beam search’ to refer to the local search strategy. To constrain the number
of nodes generated by the combine operation, he uses an algorithm for constraining the maximum number of
edges for each span of the input string. But again, he uses the same heuristic, related to the probability of nodes.
Effectively, he implements a number of independent beam searches at this outer level, one for each span in the
input string.
edge parse(sentence) {
    initialise(sentence);                              // init all word/tag pairs
    for span = 2 to n {                                // n = number of words in the sentence
        for start = 1 to n - span + 1 {
            end = start + span - 1;
            for split = start to end - 1 {             // combine pairs of edges
                foreach e1 in incomplete_chart[start][split]
                    foreach e2 in complete_chart[split+1][end] such that
                            check_grammar(e1,e2,follow) {
                        e3 = join_two_phrases(e1,e2,follow);
                        set1.insert(e3);
                        for i = 1 to max_unary {       // add_singles_stops
                            incomplete_chart.insert(set1);
                            foreach e in set1 {
                                set2 = add_stops(e);
                                complete_chart.insert(set2);
                            }
                            foreach e in set2
                                foreach P in nonterminals    // (in the grammar)
                                    foreach LC in subcat     // (in the grammar)
                                        foreach RC in subcat // (in the grammar)
                                            set1.insert(add_singles(e,P));
                        }
                    }
                foreach edge e1 in complete_chart[start][split]
                    foreach edge e2 in incomplete_chart[split+1][end] such that
                            check_grammar(e1,e2,precede) {
                        e3 = join_two_phrases(e1,e2,precede);
                        set1 = {e3};
                        ...                            // add_singles_stops as described above
                    }
            }
        }
    }
    return complete_chart[0][n].best("TOP")
}
Figure 3.12: Collins’ parsing algorithm
Chapter 4
A reimplementation of Collins’ parser
This chapter describes my reimplementation of Collins’ statistical parser. The goal of the
chapter is to document my system at a level of detail which permits a precise reimplemen-
tation, including descriptions of pseudocode and data structures.
It is impossible to present this information at the appropriate level of detail for every
reader; a much shorter version of this chapter has been published (Lakeland and Knott,
2004). In the other direction, additional technical information is presented in Appendix B,
which includes a more precise definition of the data flow, data structures and class hierarchy
as actually implemented. Naturally, the ultimate reference for any implementation is the
source code itself. This is too large to include in its entirety, but key portions (with a link to
the rest) are included in Appendix C.
Much of the published work on statistical parsing describes systems at a fairly high level
of abstraction — but many of the challenges in implementing a lexicalised statistical parser
relate to implementation-level details. This is an area which has not been well covered in the
literature on lexicalised statistical parsing, and yet I would estimate that most of the work
involved in building a lexicalised parser is in addressing the software engineering issues
which arise. Before continuing, I will briefly motivate this idea.
4.1 The complexity of Collins’ parsing algorithm
From reading Collins’ pseudocode in Figure 3.12, it is hard to understand why implement-
ing a statistical parser is difficult. However the problems become clear counting the loops:
for instance Collins’ algorithm iterates over every possible span, start and split for O(n3);
within this loop he iterates over every edge on the chart which has an unclear asymptotic
complexity but is at least O(m) (m is the number of rules in the grammar); within this loop,
Collins iterates over every unary rule in the grammar, again O(m). This results in a final
time complexity of at least O(n3m2), making sentences unparsable without a supercomputer.
67
Similar problems occur throughout statistical Computational Linguistics. Consider a de-
scription of Finch’s unsupervised thesaurus generation algorithm (Finch, 1993): “Count the
co-occurrences of each word with every other word and merge words with similar counts”.
Or again, consider Bod’s pseudocode in Figure 2.20 (page 48): all of these algorithms are
simple to describe, but hard to implement efficiently.
The problem is not restricted to time complexity. Memory usage is also unacceptable
with a naive implementation of such algorithms. For instance, many potential phrases need
to be stored for a long time before they can be rejected. Another area where memory usage
is a problem is the storage of training data. Consider how many events need to be derived
from the training corpus if we are building a lexicalised probabilistic grammar. There are
many words in the language, and (naturally) many more pairs of words, and thus a huge
number of events involving pairs of words in specific grammatical constructions.
Efficiency is not just an important factor in the parser we finally produce; it is also cru-
cially important during the process of developing the parser. Debugging a system which
takes half an hour to parse a sentence is an extremely tortuous process. This means that the
program must be efficient from the very outset, rather than developed without concern for
efficiency and then optimised afterwards.
The conclusion from these considerations is that the difficulties in statistical parsing relate
not so much to linguistics or artificial intelligence as to software engineering. How
can a system reject incorrect interpretations before they swamp resources, and how can it
handle so much data efficiently? The remainder of this chapter will discuss not just the
final implementation of the parser but also the process that was used in developing this
implementation. It is hoped that the former will be of use to people interested in the details
of Collins' parser, while the latter will be of use to people intending to write a statistical parser or
similar system.
A data flow diagram for my reimplementation of the entire parsing system, encompass-
ing both the preprocessor for the WSJ and the parser itself, is given in Figure 4.1. The first
step is to convert the WSJ treebank into an event file suitable for fast com-
putation of probabilities, and a grammar file used for pruning the search space. During
actual parsing, a sentence is input to the system, and a part-of-speech tagger is used to
initialise the chart. The parsing algorithm is then executed. It is important to note that there
are two separate loops occurring in the parsing algorithm. A large loop (shown in blue) cor-
responds to the combine operation given in the pseudocode in Figure 3.12, with the chart
passing edges to the parser’s control structure, which then joins and expands them, and
puts the result back into the chart. Within this loop is a smaller loop (shown in green) cor-
responding to add_singles_stops in the pseudocode in Figure 3.12. This loop expands
individual edges, adding all plausible parents.
Figure 4.1: Simplified data flow diagram for my implementation of
Collins’ parser
It is in these two loops that the system spends most of its time. Therefore much of
implementing the parser comes down to implementing these two loops efficiently.
In the remainder of this chapter, I describe the components of this diagram in more
detail. Section 4.2 describes my implementation of the preprocessor; Section 4.3 describes
my implementation of Collins’ probability model; Section 4.4 describes my implementation
of a (new) part-of-speech tagger using this probability model; and Section 4.5 describes my
implementation of the ‘chart’ data structure. Within the parser, there is nothing especially
interesting in the control structure implementing the combine loop; it very closely follows
the pseudocode already presented in Figure 3.12. However, it is worth discussing beam
search in more detail; Section 4.6 describes my implementation of the beam search function
used by add_singles_stops. In Section 4.7, I provide some practical software engineering
advice about how to go about building a lexicalised statistical parser. (This mostly concerns
debugging and program verification, and it was mostly learned the hard way.) Finally, in
Section 4.8, I present the results of my parser.
4.2 Implementation of the treebank preprocessor
Before any parser can be implemented, the WSJ must first be converted from a corpus of
trees to a simple enumeration of events.
The WSJ as distributed by the LDC is a large collection of parse trees. These trees do not
include information about headwords, or about complements or adjuncts, both of which are
fundamental to Collins’ approach. The trees also include a lot of information, such as the
semantic locative mark, which would be useful in parsing, but leaving it in reduces counts by
too much. So the first step in preprocessing the treebank is to add the information Collins’
model needs, and delete any extra information. The second step is to transform the trees
into a flat file of events, more suitable for insertion into a hash table.
Transforming the treebank into Collins' model turns out to be harder to do accurately
than it appears, largely because Collins under-specifies his model in several places.
The transformation code is implemented in Lisp because it seemed a natural choice
for tree processing, although Perl or Python would also be good choices. The high level
code is given in Figure 4.2. This shows that the steps can be performed independently and
so can also be discussed in isolation.
(defun process (tree)
  (output-for-collins 0
    (add-headword
      (first (add-npb
               (add-complement
                 (first (drop-none
                          (convert-to-numbers tree)))))))))
Figure 4.2: Actual high-level code for preprocessing the treebank
There are five functions in this algorithm, which will be discussed in the order in which
they process the sentence. To begin with, convert-to-numbers transforms words, tags
and nonterminals into enumerated types, which can be stored and processed more effi-
ciently. (From now on, everything in the preprocessor and parser will be based on numbers
instead of words and grammatical symbols.) drop-none then removes gapping informa-
tion. (If model three was implemented rather than model two, then a -G marker would
have to be added here instead of simply deleting gaps.) add-complement decides which
children are complements and adds a -C to their head nonterminal. add-npb finds the ba-
sic noun phrases and adds an extra level in the tree — i.e. (NP ...) goes to (NP (NPB ...)).
add-headword chooses a head child for each phrase and also transforms the tree so that
information about this head child is stored in the parent. Finally, output-for-collins
produces output in a format that is easy to parse.
Of these steps, adding complements and headwords are sufficiently complicated to war-
rant further discussion. The algorithms Collins uses are based on Magerman (1995)’s al-
gorithm for identifying headwords. The idea of using an algorithm to automatically select
headwords is interesting and empirically it appears to work well; interested readers are
referred to Magerman (1995). The basic method is to take each sequence of sister nodes
in a tree, and determine which is the head, and which are the complements and adjuncts.
Collins decomposes this process into two separate stages: somewhat surprisingly, he begins
by identifying complements, and only after this identifies heads. The remaining nodes are
classified as adjuncts. Implementing the algorithm involves fairly complex tree manipula-
tion, and so detailed pseudocode is provided in Figure 4.3 to give a more precise description.
(defun add-headword (tree)
  (if (terminal-p tree)                              ; are we down to words?
      (list (first tree) (second tree) (first tree)) ; a word is its own headword
      (add-headword-internal (first tree)            ; do the work
                             (mapcar #'add-headword (cdr tree))))) ; recurse

(defun add-headword-internal (head children)
  (let* ((left-to-right (get-direction head))
         (priority-list (get-priority-list head))
         (search-children (if left-to-right children (reverse children)))
         (found (remove nil
                        (mapcar #'(lambda (item)
                                    (find item search-children :key #'first :test #'equal))
                                priority-list))))
    (if found
        (append
         (cons head
               (list (second (first found)) (third (first found))))
         children)
        (append          ; headword not found: assume the first/last child is the head
         (cons head
               (list (second (first search-children))
                     (third (first search-children))))
         children))))
Figure 4.3: Pseudocode to implement Magerman’s headword algorithm
We add a complement tag to any nonterminal matching any of several constraints relat-
ing to its NT and its parent's NT. One constraint, for instance, is that "[the] nonterminal must
be: an NP, SBAR, or S[,] whose parent is an S". As already mentioned, Collins derives com-
plement information before deriving headword information. It seems counter-intuitive to
derive the information in this order, but again the algorithm seems to work empirically, perhaps
because the two processes are so similar. Because the complement algorithm so closely parallels
the headword algorithm, its pseudocode will not be presented here.
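Purely as an illustration, the single constraint quoted above could be checked as in the C++ sketch below (the preprocessor itself is written in Lisp, and the function name and rule table here are mine, not the actual implementation's); the remaining constraints in Collins (1999) follow the same (nonterminal, parent nonterminal) pattern.

#include <string>
#include <unordered_set>

// Sketch of the complement test, using only the constraint quoted above.
bool is_complement(const std::string& nt, const std::string& parent_nt) {
    static const std::unordered_set<std::string> np_sbar_s = {"NP", "SBAR", "S"};
    // "[the] nonterminal must be: an NP, SBAR, or S whose parent is an S"
    if (parent_nt == "S" && np_sbar_s.count(nt)) return true;
    // ... further (nonterminal, parent) constraints are checked in the same way ...
    return false;
}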
I have implemented a preprocessor based on Collins’ description of his algorithms. How-
ever, as mentioned in Section 3.1.3, there are many areas in which Collins’ description of his
preprocessor is incomplete. For instance, the kinds of arguments taken by prepositions are
not given (perhaps implying that they take none), but it is clear that they do take arguments, be-
cause the event file contains numerous examples of them with arguments. To circumvent
these problems, my preprocessor is also designed to reproduce the event file which Collins
distributes. Rather than present the minutiae of my preprocessor code here, Section C.4
presents it in full.
4.3 Implementation of the probability model
The core of Collins’ probability model is a function called genprob1 (for ‘generate prob-
ability’), which takes an event and returns a probability computed by calculating relative
frequencies of events found in the WSJ at different levels of backoff, and smoothing/inter-
polating between these appropriately (see Section 3.1.6). The key data structure supporting
this function is therefore a table of these WSJ events associated with counts. Each event is di-
vided into nine sub-events (numerator, denominator and weighting, each at three different
levels of backoff). Each sub-event is associated with a count of how frequently it occurs.
How should we store the table associating sub-events with counts? We could store it as
an array, but this would be impossibly large and very sparsely populated. It is obviously
better to use a hash table to store these counts. In a hash table, we use the event to generate
a hash key, which operates as an index into the table.
The table itself can be implemented as a simple array. More sophisticated hashing algo-
rithms exist, usually based on a tree for the first few bits of the key followed by an array for
the rest, but they are unnecessarily complex here since their benefit is dynamic resizing of
the hash-table, and here the size of the hash tables does not change during parsing. Even
if events were to be added to the hash table during parsing, it is unlikely so many events
would be added as to justify a complex data structure. So my hash tables are simple arrays.
There are two different ways of organising the hash table data structure. We could sim-
ply have one big hash table, in which the key specifies not only the event's description, but
also what kind of event it is (i.e. a unary numerator); or we could use a separate hash table
for each kind of event. The latter approach leads to slightly reduced efficiency because it
becomes impossible to keep the hash tables at the same density, but it makes debugging
easier since each entry can be verified. I adopted the latter approach.
1It is our convention throughout this thesis to refer to functions in teletype; while genprob is a function, it is
referred to so often that it is easier to read in a roman typeface.
It is normal when discussing a hash-table implementation to justify the collision resolu-
tion algorithm used. Since events are many bytes long, but array indices are only two bytes
long, it is inevitable that different events will map to the same array index. The normal
method for resolving such collisions is to store the full event along with the value (in this
case, the frequency) in the hash table and then either store different events in a linked list,
or use a separate hashing algorithm to look in a different part of the hash table. However,
neither approach was used here; instead we silently ignore collisions. This leads to counts
being incorrectly combined whenever a collision occurs.
Why is it expedient to ignore collisions? Storing the count takes perhaps four bytes,
but storing the key takes at least ten bytes, and so any collision detection would triple the
parser’s memory requirements. It would also significantly decrease the parser’s speed since
a ten byte comparison would need to be performed on every lookup. Furthermore, the
benefit gained by avoiding false collisions is small. Occasionally a collision will cause an
incorrect interpretation to be given a higher probability, or the correct interpretation to be
given a lower probability. These cases occur very rarely, perhaps just a few times per sen-
tence. Even when they do occur, the effect is almost always a tiny change in the probability
since most events have a very low count, and so will not affect the final parse of the sen-
tence. Finally, in the few situations where a significant effect is seen, the second-best parse
usually has only slightly lower precision and recall than the best parse and so the error does
not significantly affect accuracy.
The final implementation issue in hash-tables is how to transform the event into an array
index; this is known as key generation. Ideally this process should be at least partially
reversible so that surprising lookup results can be traced back to the event which caused
them, but at the same time performance is critical. My key generation function takes the
event’s components (e.g. head tag, head word etc), which are each represented by an integer
(as mentioned in Section 4.2) and computes their product, modulo the hash table size. This
function achieves a reasonable tradeoff between performance and verifiability. Collins’ own
implementation uses XOR on the event’s components, leading to increased efficiency but a
process that is harder to debug. If the parser’s efficiency becomes an issue, this would be an
easy way of increasing speed.
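The essentials of this scheme can be sketched as follows. The class name, the +1 offset on each component, and the container choice are my own illustrative details, not the actual implementation; what the sketch shows is the multiplicative key generation and the deliberate absence of any collision resolution.

#include <cstdint>
#include <vector>

// Sketch of one of the count tables: a fixed-size array of counts, indexed by
// the product of the event's integer-coded components modulo the table size,
// with collisions silently ignored (no keys are stored).
class CountTable {
public:
    explicit CountTable(std::size_t size) : counts_(size, 0) {}

    std::size_t key(const std::vector<std::uint32_t>& components) const {
        std::uint64_t k = 1;
        for (std::uint32_t c : components)
            k = (k * (c + 1)) % counts_.size();   // +1 so that zero-valued codes still contribute
        return static_cast<std::size_t>(k);
    }

    void increment(const std::vector<std::uint32_t>& components) {
        ++counts_[key(components)];               // colliding events simply merge their counts
    }
    std::uint32_t count(const std::vector<std::uint32_t>& components) const {
        return counts_[key(components)];
    }

private:
    std::vector<std::uint32_t> counts_;
};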
A further hash-table was used in addition to the nine just described. This tenth hash-table
caches the final probability computed by previous calls to genprob. Recall that each of the nine
hash-tables just holds the raw counts of numerators, denominators and weightings for sub-events;
to compute a probability, nine of these counts must be looked up, and a smoothing/interpolation formula
applied. Since over 99% of hash table lookups are performed more than once, it makes sense
to store the results of this computation in a separate hash table. This means only one lookup
is needed almost all the time. This single optimisation increases the parser’s performance
by almost two orders of magnitude.
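The caching step itself is simple, as the hedged sketch below shows. Here std::unordered_map stands in for the real cache, which is another fixed-size, collision-tolerant table of the kind just described, and genprob_cached/genprob_raw are illustrative names rather than the actual function names.

#include <cstdint>
#include <unordered_map>

// Sketch of the probability cache: the fully smoothed probability returned by
// genprob is memoised under a key derived from the event, so the nine count
// lookups and the interpolation are done only once per distinct event.
double genprob_cached(std::uint64_t event_key,
                      std::unordered_map<std::uint64_t, double>& cache,
                      double (*genprob_raw)(std::uint64_t)) {
    auto it = cache.find(event_key);
    if (it != cache.end()) return it->second;   // over 99% of calls end here
    double p = genprob_raw(event_key);          // nine lookups plus smoothing
    cache.emplace(event_key, p);
    return p;
}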
4.4 Implementation of a POS tagger
Before parsing can begin, we must first perform part-of-speech (POS) tagging on the input
sentence. A POS tagger maps words to their parts-of-speech. For instance, walked might be
tagged VBD to say it is a verb in the past tense. For most words this is a trivial process since
they can only be interpreted as one particular part-of-speech. But a simple lookup table does
not suffice since many words are ambiguous, and some words were unseen during training.
To obtain higher accuracy than a lookup table can achieve, the context of previous words
and/or their tags is used by various different methods.
In this section, I discuss the POS tagger I developed on top of my reimplementation of
Collins’ probability model. I begin in Section 4.4.1 by discussing the relationship between
tagging and statistical parsing in general. In Section 4.4.2, I describe the algorithm used by
my tagger — a standard hidden Markov model — and in Section 4.4.3 I describe how the
algorithm was implemented. Finally, in Section 4.4.4, I present some results. The tagger I
developed is also documented in a separate publication (Lakeland and Knott, 2001).
4.4.1 The relationship between POS tagging and lexicalised statistical parsing
When parsing, a tagger is needed to initialise the parser from the sentence. This is be-
cause parsers do not have sufficiently large grammars to write specific grammar rules for
every word, and so write their grammar rules in terms of POS tags instead. Non-lexicalised
parsers then discard the sentence entirely and just parse the tag sequence, while lexicalised
parsers still use the words to guide the parser. Collins' parser uses the words when statis-
tical counts are sufficiently high, but it is clear that the tags control the parse much more
than the words do (discarding the words entirely only slightly decreases parser accuracy, as we
discussed in Section 2.4.4). Because the parser is so dependent on the tags, it is important
the tags are correct.
The normal process of POS tagging involves resolving the ambiguity in the sentence
and selecting a particular tag for every word. However most parsers, including Collins’,
support ambiguity in the way the sentence is tagged. Being a statistical parser, Collins’
parser also allows us to assign a confidence to each tag for a word being tagged ambigu-
ously. Essentially, a normal tagger will produce one tag per word, but the optimal input for
a parser like Collins’ is a probability distribution of tags for each word. Somewhat surpris-
ingly, Collins himself does not use a tagger which provides such probability distributions.
In fact, it turned out to be quite hard to find exactly what Collins did for POS tagging since
it did not appear to be documented anywhere. By reading the code, I found that his
parser supports two different modes: in 'oracle' mode the parser assumes the input has
been perfectly tagged and consequently assumes no ambiguity, while in ‘evaluation’ mode
the parser assigns every word all the tags it was seen with in training at equal probabil-
ity. I therefore decided to build my own POS tagger to initialise the chart, which provides
probability distributions on POS tags.
There are several good reasons for developing a POS tagger of my own rather than using
an existing one. Firstly, it ensures that there is a smooth coupling between the
parser and the tagger; in particular, we can ensure that the tagger uses the same tagset
and tokenisation scheme as the parser. Secondly, in writing the skeletal structure of the
parser we have already implemented a lot of the code that a tagger needs, such as reading
sentences from a file, mapping words and nonterminals into enumerated types, hash-table
lookup, backoff and smoothing. Moreover, the tagger contains none of the computational
complexity of a parser and so is extremely simple to debug. Not only does this make the
tagger a useful practice step towards developing the parser, it also significantly reduces the
amount of code in the parser which might contain bugs. Finally, if we implement a tagger
based on the existing probability model, it is very simple to make it return a probability
distribution for each tag, rather than simply the best tag or an undifferentiated set of possible
tags.2
4.4.2 Part of speech tagging using hidden Markov models
Consider a simple sentence such as The can can hold water. The job of the tagger is to deter-
mine that The is a determiner, the first use of can is a noun while the second is as a verb,
and so on. A simple mapping between words and their most common POS tag will usually
obtain the correct answer, but such an approach would get one of the uses of can wrong
in the example sentence. To obtain better results than a simple lookup, we use the context
of the word in order to better predict the tag of the current word. For instance, can is usually
a verb (99.5% of the time) but verbs almost never occur after determiners, especially at the
start of a sentence (only 5% of the time). Further, identical verbs almost never follow one
another. So through context we can determine when to use the conventional tag, and when
to make an exception.
2The POS tagger was one of the components of the implementation which I completed early on; the publication
describing it appeared in 2001 (Lakeland and Knott, 2001). At the time, a literature search did not find any POS taggers that
produced a distribution of tags — or even that produced a set of tags. However, I have since found an excellent
paper by Charniak, Carroll, Adcock, Cassandra, Gotoh, Katz, Litman, and McCann (1996) which describes such
a tagger. Charniak et al.’s tagger performs similarly to the one I describe.
Aside from the simple lookup, there are two classes of approaches to POS tagging: rule-
based and stochastic. There is some debate in the literature about which of these approaches
is more suitable. However, we have already addressed an identical question when deciding
to produce a statistical parser rather than a deterministic one, and so it makes sense to be
consistent. As Charniak et al. (1996) puts it: “we are simply more familiar with the statistical
tagging technology.”
Within stochastic tagging, there are a number of approaches but the most common is
to use a hidden Markov model (HMM). In this approach, the sentence is viewed as the
product of some model. Because we cannot see exactly what is going on we say that the
model is hidden, but we can observe its behaviour and based on these observations make
assumptions about the internal state of the model. Specifically, we assume that the internal
states correspond to the sequence of parts-of-speech and words already seen, and the model
has a probability of going from the current state to the next state based on the likelihood of
producing the next word given the current internal state. More formally, we say:
P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}, t_1, \ldots, t_i)
Since this model has far too many parameters to compute for sentences beyond a couple of
words, we make a Markovian assumption that only a certain amount of context is necessary.
The amount of context retained is known as the order of the Markovian assumption, and in
practice it is chosen according to the amount of available training data: if we have a
lot of training data then we do not need to make such broad assumptions. The important thing to
note is that we are discarding all of the context words from the history and only using
the previous tags.
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_{i-2}, t_{i-1}, t_i)
Another important point to note is that we are predicting the probability of each tag in
isolation; we are not maximising over the sentence. This means that since an unknown word
at the start of a sentence is best interpreted as a proper noun, we will interpret it as such
even if looking at the later words implies that it is really a verb. This problem has been re-
solved elsewhere in POS tagging by considering candidate paths, rather than simply committing
to the most likely history. The code modifications necessary to perform argmax over the sentence
are small: essentially, a very simple chart is needed instead of just an array. An evaluation of
the tagger's errors suggests this enhancement would result in a significant improvement in the
tagger's results, approximately halving the errors. We did not implement this improvement
because an analysis of the parser's errors (see Section 4.8) implies the tagger's accuracy has
virtually no bearing on the parser's.
As with the equations in the parser, it is probably more useful to present these equa-
tions in terms of the events that must be counted rather than mathematically. This is pre-
sented in Table 4.1. Consider again the sentence The can can hold water. The first step will
Backoff level   Event
1               P(tag | word, prev_tag, prevprev_tag)
2               P(tag | prev_tag, prevprev_tag)
3               P(tag | word)
Table 4.1: Part of Speech event representation
be to take The, and compute for every tag P(tag | "The", "#STOP#", "#STOP#"). This
means we will compute the counts of the tuples (DET, "The", "#STOP#", "#STOP#"),
(NN, "The", "#STOP#", "#STOP#"), etc. Since a determiner typically starts a sentence
(follows "#STOP#", "#STOP#"), and The is almost always a determiner, we will select de-
terminer as the best tag by a large margin. Next we shift the window and consider can with
the context P(tag | "can", "DET", "#STOP#"). Because it is following a determiner, we can
classify can as a noun instead of the more common verb.
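Schematically, the three backoff levels of Table 4.1 are combined by interpolation, as in the sketch below. The interpolation weights in the real tagger come from the smoothing formula of Section 3.1.6; the lambdas here, like the function name, are placeholders for illustration only.

// Schematic combination of the three backoff levels of Table 4.1.
double P_pos_interpolated(double p1,   // P(tag | word, prev_tag, prevprev_tag)
                          double p2,   // P(tag | prev_tag, prevprev_tag)
                          double p3,   // P(tag | word)
                          double lambda1, double lambda2) {
    return lambda1 * p1 + (1 - lambda1) * (lambda2 * p2 + (1 - lambda2) * p3);
}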
4.4.3 Implementation Details
The work involved in implementing a basic HMM tagger is quite small, comparable to a
senior undergraduate assignment. As with the parser, the steps are: (i) preprocessing a
tagged corpus into a normalised structure and saving the events in this corpus into a simple
event file, and (ii) implementing a probability model which uses the events and writing
some control structure to read sentences and pass them one word at a time to the probability
model.
Since the tagger I implemented is designed to work with a parser based on the WSJ,
it makes much more sense to use the WSJ as a training corpus instead of a larger tagged
corpus such as the BNC. While using the BNC would enable the development of a more
accurate tagger, it would necessitate the use of a different tagset and tokenisation scheme
and the mapping between these would destroy any benefits gained by the extra accuracy.
Instead, we take the WSJ treebank and strip out all nonterminals to obtain a flat structure.
Again, we could obtain higher accuracy by further processing, such as proper name detec-
tion, but such transformations would again make the parser more complex. Converting this
tagged sequence into an event file is simply a matter of iterating over the sequence, convert-
ing words and tags to their enumerated types, counting events in a hash-table and saving
the hash-table to a file. This process is so close to the pseudocode for the tagger’s control
structure presented in Figure 4.4 that it will not be presented separately. The only difference is that
instead of using P_pos to derive a probability, the hash-table values which it would look up are
incremented.
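As a hedged sketch of that training loop (not the actual code), the numerator counts for the three backoff levels of Table 4.1 might be accumulated as follows; the std::map container is used purely for brevity, where the real implementation uses the fixed-size hash tables described in Section 4.3, and the corresponding denominator counts are accumulated in the same way.

#include <cstdint>
#include <map>
#include <vector>

using Count = std::map<std::vector<std::uint32_t>, std::uint32_t>;

// Slide a window over the tag/word sequence and count each backoff level.
void count_tag_events(const std::vector<std::uint32_t>& words,
                      const std::vector<std::uint32_t>& tags,
                      std::uint32_t stop,          // enumerated code for #STOP#
                      Count& level1, Count& level2, Count& level3) {
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::uint32_t prev     = (i >= 1) ? tags[i - 1] : stop;
        std::uint32_t prevprev = (i >= 2) ? tags[i - 2] : stop;
        ++level1[{tags[i], words[i], prev, prevprev}];  // P(tag | word, prev, prevprev)
        ++level2[{tags[i], prev, prevprev}];            // P(tag | prev, prevprev)
        ++level3[{tags[i], words[i]}];                  // P(tag | word)
    }
}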
Implementing the tagger’s probability model involves exactly the same steps as in the
parser. A request is made by the control structure to derive the probability of a tag event
using the parameters provided. The parameters are then converted to nine separate hash-
table keys using the standard modulus approach, looked up in the nine tagger hash-tables,
and the results are combined using the smoothing algorithm already presented in Section
3.1.6. Finally, the control structure is little more than a read-eval-print loop. Pseudocode is
presented in Figure 4.4.
output[0] = output[1] = "#STOP#";
for (i = 2; i < len+2; i++) {
    prevprev_tag = output[i-2];
    prev_tag = output[i-1];
    current_word = sentence->word_as_enum(i-2);
    maxp = 0;                                    /* reset the best probability for this word */
    for (tag_nr = 0; tag_nr < num_tags; tag_nr++) {
        current_tag = possible_tags[tag_nr];
        prob = probs->P_pos(current_tag, current_word,
                            prev_tag, prevprev_tag);
        if (prob > maxp) {
            maxp = prob;
            maxt = current_tag;
        }
    }
    output[i] = maxt;
}
Figure 4.4: The tagger’s control structure
4.4.4 Results
This tagger was never intended to be state-of-the-art, but it is still expected to perform well.
Specifically, it was intended to give the parser greater accuracy than simply selecting the
best tag, without giving the parser the complexity of selecting all plausible tags. We will
defer an analysis of the tagger’s effect on the parser’s efficiency until our analysis of the
parser’s results in Section 4.8. However it is worthwhile here at least contrasting the tagger
to others that are available.
The tagger is evaluated based on its accuracy on the testing section (Section 23) of the
WSJ. This method could be argued to be slightly optimistic, since it means the training and the
testing data are very similar; but since the tagger's job is to provide tags to the parser, and
the parser is going to be tested on the WSJ, this is a reasonable test. On this test the tagger
achieves an accuracy of 95.6%. This is above a basic tagger, but below the very best taggers
which can achieve 98% accuracy. Predictably, this figure is close to Charniak’s figure for his
basic model of 95.9%, which was generated using a very similar method.
As a baseline, it is useful to measure a POS tagger’s accuracy when it is given no context
— that is, how accurately can you predict the current tag given just the current word. It
turns out that such a simple model obtains about 88% accuracy when tagging the WSJ using
the WSJ tagset. The inclusion of several simple rules such as a tagging dictionary raises
this accuracy to 94%. Since the parser is not especially dependent on the tagger’s accuracy,
it would be reasonable to conclude that the baseline approach would have been perfectly
adequate. This is the conclusion Charniak comes to in Charniak et al. (1996), but it is worth
noting that the amount of extra work required to implement a simple HMM based tagger is
very small.
Normally the evaluation of a POS tagger concentrates only on whether or not its chosen
tag is correct; the rank of the correct tag is rarely presented. However, in this case the tagger's
job is to pass a set of tags on to the parser, and it is highly desirable for this set to include
the correct tag while preferably being as small as possible. Therefore Table 4.2 presents the
relative position of the correct tag.
Rank    Percentage
1       93.1%
2       3.9%
3       0.9%
4       0.4%
5       0.3%
6-9     0.7%
10-19   0.4%
20+     0.1%
Table 4.2: Actual position of the tag that should be in first position
Clearly the tagger gets the right answer almost all the
time, but what about the cases where the tagger gets it wrong? If the correct tag is given a
high probability then we may be able to detect the possibility of an error and compensate
for it using ambiguity. Figure 4.5 presents the probability associated with the correct tag
in the cases where the tagger has made an error. Clearly a significant number of the errors
have a probability over 0.2, and yet Table 4.2 tells us that very few incorrect tags are given
probabilities over 0.2. Based on these results, we can conclude that using just the tags returned
by the tagger with a probability over 0.2 will almost always give us the correct tag. In those
few cases where this does not work, the correct tag is usually assigned a probability near zero,
and so we simply accept that the parser will make an error on this word.
Figure 4.5: Histogram of the probability assigned to the correct tag, in the
cases where the tagger chooses a wrong tag as best
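To make the hand-over to the parser concrete, a minimal sketch of the thresholding step is given below. The TagProb struct, the function name and the chart-side interface are illustrative assumptions; only the 0.2 threshold and the use of P_pos come from the discussion above.

#include <cstdint>
#include <vector>

struct TagProb { std::uint32_t tag; double prob; };

// Return every tag whose probability exceeds the threshold, with its
// probability attached, ready to initialise the chart for this word.
std::vector<TagProb> tags_for_word(std::uint32_t word,
                                   std::uint32_t prev_tag, std::uint32_t prevprev_tag,
                                   const std::vector<std::uint32_t>& possible_tags,
                                   double (*P_pos)(std::uint32_t, std::uint32_t,
                                                   std::uint32_t, std::uint32_t),
                                   double threshold = 0.2) {
    std::vector<TagProb> kept;
    for (std::uint32_t t : possible_tags) {
        double p = P_pos(t, word, prev_tag, prevprev_tag);
        if (p > threshold) kept.push_back({t, p});   // almost always includes the correct tag
    }
    return kept;
}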
The computational efficiency of the POS tagger was not important here since it is only
used in the initialisation phase of the parser, rather than inside any loops. This means the
difference between a fast and a slow tagger will be lost as noise when measuring the parser’s
performance. However, for the sake of completeness it is worth mentioning that the tagger's
complexity is linear with respect to the length of the input, with the tagging of a single word
requiring approximately thirty hash-table lookups. The initialisation of the tagger takes
one minute. After initialisation, the tagger processes a little under five hundred words per
second, which is significantly slower than average.
As always, it is useful to compare my approach to Collins’. In this case, his approach is
not as effective as mine. Since the parser's probability model is written using conditional
probabilities, his approach means that it will take the parser a long time to reject a poor
choice of POS tag. This is because the probability of any given parent is computed using
P(parent | tag) × P(tag), and this can be extremely high for a poor choice of tag, since all tags
have by definition a probability of one in Collins' model, and that parent may be the only
parent ever seen with this tag. The effect of this is that a locally likely constituent will be
inserted into the chart, and only rejected when it is found not to work well globally. By
contrast, my method will assign a very low probability to a poor choice of tag, and so even
though the conditional probability P(parent|tag) is high, the overall structure is given a low
probability and the parser can reject it much faster. It is hard to imagine situations where
my approach will increase overall parser accuracy since these local errors are easy to correct
at a global level, but my approach will allow the parser to reject incorrect interpretations
much faster.
4.5 Implementation of the chart
We now come to the parsing algorithm itself. In this section, we discuss implementation of
the chart, the core data structure used by the parser.
The goal of the chart is to store all the nodes for each span. When describing determin-
istic parsers, our description of the chart data structure was rather vague. This is because
we have been talking about systems with small grammars and so the chart data structure
is just an implementation detail where any design would work. However for a statistical
parser the grammar is so huge that the wrong choice of data structure will make parsing
impossibly slow. Given our emphasis on software engineering, it is important to describe
the chart data structure in detail.
The most natural way of implementing the chart would be as a three-dimensional ar-
ray, in which the first two dimensions specify the start and end of the span, and the third
dimension stores the actual edges. Unfortunately, we do not know how many edges will
be needed for any given span of the input string. There are two possible strategies in this
situation. Firstly, we could preallocate a third dimension of some fixed size. But the third
dimension must then be big enough to hold the maximum number of edges stored for any
span. Secondly, we could replace the third dimension with a linked list, in which there is
no need to preallocate memory. However, allocating memory during run-time is compu-
tationally extremely inefficient. Moreover, the indirection involved in traversing a linked
list is also a little inefficient. To get around both these problems, note that the control struc-
ture of the parsing algorithm means that edges with a given start and end position in the
input string are added consecutively. This means that we can store the chart as a huge
one-dimensional array of edges, with a two-dimensional index array of pointers indicating
where the set of edges associated with each span are stored. There is then no wasted space
in the chart, and we still have constant time access to any span.
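A minimal sketch of this layout is given below. The class and field names (ChartStorage, offset_, length_) are my own, the bookkeeping is simplified, and bounds checking is omitted; the point of the sketch is only the single preallocated edge array plus the two-dimensional index of spans.

#include <cstddef>
#include <vector>

struct ChartEdge { double prob; /* ... */ };

class ChartStorage {
public:
    ChartStorage(std::size_t max_words, std::size_t max_edges)
        : edges_(max_edges), next_free_(0),
          offset_(max_words, std::vector<std::size_t>(max_words, 0)),
          length_(max_words, std::vector<std::size_t>(max_words, 0)) {}

    // Edges for a given (start, end) span are always added consecutively.
    void begin_span(std::size_t start, std::size_t end) {
        offset_[start][end] = next_free_;
        length_[start][end] = 0;
        cur_start_ = start; cur_end_ = end;
    }
    void add(const ChartEdge& e) {
        edges_[next_free_++] = e;                 // no run-time allocation
        ++length_[cur_start_][cur_end_];
    }
    // Constant-time access to the edges of any span, with no wasted space.
    const ChartEdge* span(std::size_t start, std::size_t end, std::size_t& n) const {
        n = length_[start][end];
        return &edges_[offset_[start][end]];
    }

private:
    std::vector<ChartEdge> edges_;
    std::size_t next_free_, cur_start_ = 0, cur_end_ = 0;
    std::vector<std::vector<std::size_t>> offset_, length_;
};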
There are four further optimisations made in the chart data structure. Firstly, recall that
there is an optimisation which requires consultation of a grammar to avoid generating pro-
ductions which never occur in the training corpus. The grammar lists all the nonterminals
that are allowed to combine with each other in dependency productions. To implement this
optimisation, we add a third dimension to the index array, to hold the parent nonterminal
associated with each edge. The system can then ask for only edges which are ‘grammatically
possible’.
A second optimisation comes from noting that the control structure of the combine
function means that we always process one complete edge and one incomplete edge, so it
would be more efficient if we could loop over all complete edges and all incomplete edges
separately. It thus makes sense to have two separate charts, one for complete edges and one
for incomplete edges.
A third optimisation is to avoid adding every generated node to the chart. The use of cutoffs
by Collins is discussed in Section 4.6.1; but we can also reject nodes that our probability
model says are equivalent to a higher probability node already in the chart. This is the
Viterbi algorithm which was discussed in Section 2.6.3 and it speeds up the code by two
orders of magnitude. This is a nice example of the adage that it is better to optimise the
algorithm than the code.
A final optimisation in the chart data structure is to store only the most probable edges
for any given span in the chart. While the Viterbi algorithm guarantees not to delete the very
best parse, this final optimisation is more of a heuristic, simply ensuring that it is unlikely
that the best parse ends up being deleted. This optimisation is implemented by treating
the list of edges stored for any given span as a priority queue, ordered by the probability
of edges. For any given span, we can define the best edge as the edge at the head of this
queue. The number of edges on the queue is determined by a global cutoff threshold, which
specifies how close to the best edge’s probability an edge’s probability needs to be in order
to remain in the chart. Edges whose probability is beyond this threshold are marked as
unusable, and are not considered in further productions. A further benefit of the priority
queue is that add_singles_stops is called with the best edges first, and can therefore
more quickly reject edges in later calls.
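The per-span bookkeeping can be sketched as follows. The SpanEdge struct, the std::list container and the function name are stand-ins for the real data structures; the sketch only illustrates keeping the span's edges ordered by probability and marking edges beyond the cutoff threshold as unusable rather than erasing them.

#include <list>

struct SpanEdge { double logprob; bool usable = true; };

// Insert an edge into its span's ordered queue, then mark any edge whose
// (log) probability falls more than `threshold` below the best edge.
void insert_into_span(std::list<SpanEdge>& span, SpanEdge e, double threshold) {
    auto pos = span.begin();
    while (pos != span.end() && pos->logprob >= e.logprob) ++pos;
    span.insert(pos, e);
    double best = span.front().logprob;          // head of the queue is the best edge
    for (SpanEdge& other : span)
        other.usable = (other.logprob >= best - threshold);
}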
This completes the description of my 'chart' data structure. To contrast my approach
with the one Collins used: he represented the chart as a single huge array, but used a two-
dimensional array of pointers into this array to iterate over it, providing the next node with
the correct start and end points. This representation gets most of the benefits of the one just
described — for instance it is trivial to preallocate the array. The main disadvantage of his
representation compared to mine is that it did not include the complete/incomplete opti-
misation, or the ability to iterate only over a particular nonterminal. However it did make
his implementation significantly easier; my chart management code is around a thousand
lines of code.
4.6 Implementing add_singles_stops and beam search
The add_singles_stops function has the task of finding parents for each new edge and
creating more new edges, one for each of the possible parents, which it adds to the chart. Its
input comes from join_two_edges, which produces incomplete edges, and so its first task
is to see if these edges could be considered complete, using add_stop. All of the edges that
can be considered complete are added to the 'complete' chart ready for the next combine
loop, and they are then extended to new parents using add_singles. The output from
add_singles is then added to the 'incomplete' chart, and also processed again recursively
in order to support edges with no siblings.
The difficulty in implementing add_singles is that it includes three nested loops and is
itself called recursively by add_singles_stops about five times. While none of these loops
is dependent on the size of the input sentence (i.e. the function is O(1)), an unconstrained
implementation would result in approximately 2000^5 edges being created. Even if these
edges were discarded by the chart on creation, the time taken to create them would make it
impossible to parse a simple sentence. To resolve this, Collins only expands edges likely to
be part of the final parse. As already mentioned, Collins uses a constrained best-first search
known as a beam search for this process. The edges produced by add_singles are not
added directly to the chart, but to a fixed-size data structure called the beam, which holds
a list of edges ordered by decreasing probability. A new edge generated by an invocation
of add_singles is inserted into the beam at an appropriate point, to keep it ordered. If
its probability is lower than that of the last edge on the beam, it is not added to the beam
at all — which is what happens most of the time. The benefit of this is that instead of an
unmanageable number of nodes being created, perhaps only a few hundred are created (of
which optimisations in the chart will still discard all but a handful). I will discuss the beam
search algorithm in more detail in Section 4.6.1.
Searching generally involves creating new nodes for each child being expanded. As was
mentioned in Section 4.5, allocating memory at run-time is a computationally expensive op-
eration and is undesirable in a program where efficiency is critical. Preallocation of memory
to store nodes is clearly a preferable option. Since a beam always has exactly n nodes on
it, it seems intuitively obvious that beam search could be implemented with preallocated
memory — but it proves surprisingly difficult to do efficiently. To support preallocation,
my implementation of beam search uses skiplists (Pugh, 1989). I will discuss skiplists in
Section 4.6.2.
4.6.1 Beam Search
The search space traversed by the beam search for a given application of add_singles_stops
has as its root node the edge generated by join_two_edges. Children of this node are all
the possible productions featuring parents of the edge, as found by add_singles. Each
child node itself has children, corresponding to the possible productions adding grandpar-
ent edges — and so on. The complexity of the search space is in its huge branching factor,
not in its depth.
To manage the search space we use a heuristic search algorithm called beam search.
Unlike exhaustive best-first search, this sacrifices the requirement of finding the best parse
in order to reduce the computation required. The search is initialised by putting the root
node at the head of the beam. We then iterate, taking the node at the head of the beam and
generating its children, and adding these into the beam so as to keep it ordered. The general
idea of a beam search is that, just like best-first search, we add all children generated at each
step to a priority queue. But since the priority queue is a fixed size in beam search, some
children fall off the end and are silently discarded. Code for a beam search is given in Figure
4.6.3
It is worth mentioning explicitly that beam search implements a sort of cutoff, whereby
any new edge whose probability is further than the width away from the best edge will be
automatically discarded.
3Programmers not used to functional programming are reminded that lazy evaluation means the outermost take
will prevent generate_children from generating unneeded children.
beamsearch :: Int [Node] -> [[(Node,Real)]]
beamsearch width nodes =
  [ nodes :
      beamsearch width
        (take width                // only keep some children
          (sort snd                // get comparison on second field
            (fold (++) []          // [[1,2],[3],[4]] -> [1,2,3,4]
              (map generate_children nodes))))]
Figure 4.6: Code for a beam search
4.6.2 Skiplists for implementing beam search
From an implementation perspective, it would be desirable to use an array to implement
beam search since the beam is a fixed size, but insertion into a sorted array involves moving
a lot of data. A linked list is another possibility, although it reduces most operations to
O(n). In the literature, there appears to be very little discussion on how to implement beam
search more efficiently. This is perhaps because most beam searches are implemented with
a beam of perhaps ten or twenty elements, for which the difference between a computation-
ally efficient implementation and a simple implementation is insignificant. However, in a
statistical parser the beam needs to be very large because the heuristics are not especially ac-
curate. Collins notes that his beam is ten thousand nodes long; at this point considerations
of efficiency certainly override considerations of simple coding.
In order to implement beam search efficiently there are two operations that need to be
fast. Firstly, we need to find quickly the appropriate place in the beam at which to add a child node that
has just been generated. Secondly, we wish to avoid allocating new memory to store the set
of child nodes generated at each iteration before they are placed on the beam. The search
generates a large number of children which are immediately thrown away, and if care is not
taken, this will waste a lot of time in calls to malloc and free .
To address both of these slow operations, I implemented the beam data structure as a skiplist.
Skiplists (Pugh, 1989) are a variant on linked lists in which a number of ‘next’ pointers are
kept on each node instead of just one. The number is a function of the length of the list: for
a list of length n, we keep lg n pointers. These extra pointers allow the algorithm to ‘skip’
along the list. An analogy often used is that a skiplist is like a highway, with the different
‘next’ pointers providing high speed lanes along the list, see Figure 4.7. How much benefit
do we gain from skipping? Following Figure 4.7 we can see the process for finding the next
item is directly isomorphic to binary search, and code to do so is presented in Figure 4.8.
From this we know that access to any item in the list can be achieved in O(lg n), while access
to the best item is O(1).

Figure 4.7: A simple skiplist showing the first sixteen items

Insertion into a skiplist is similar to insertion into a linked list,
except that the ‘next’ pointers need to be tidied up, as is shown in Figure 4.9. One side-effect
of the pointer management is that insertion is no longer an O(1) operation. Each node has
on average lg(lg n) pointers, and these need to be reattached during insertion and removal,
making it an O(lg(lg n)) operation. However, even for huge values of n this is so close to
O(1) as to be indistinguishable. Throughout the rest of this section, O(lg(lg n)) functions
will be referred to as O(1). Insertion into an arbitrary location of the list is O(lg n + lg(lg n))
which can be simplified to O(lg n).
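To make the code in the next two figures easier to follow, the sketch below shows one possible layout for the skiplist nodes they assume. The field names priority, depth and next follow Figures 4.8 and 4.9; the constant and everything else are illustrative assumptions rather than the layout actually used.

const int MAXDEPTH = 16;     // plenty of levels for a beam of tens of thousands of nodes

struct node {
    double priority;         // the edge's probability, used to order the beam
    int    depth;            // how many 'next' pointers this node carries
    node  *next[MAXDEPTH];   // next[i] skips ahead at level i; only the first depth entries are used
    // ... the edge data itself would be stored here ...
};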
node * skiplist::find(node * n) {
    node *cur, *next;
    cur = next = head;
    for (int step = logn; step >= 0; step--) {
        do {   // step forward while the next node's priority is still greater than n's
            cur = next;
            next = cur->next[min(step, cur->depth)];
        } while (next->priority > n->priority);
    }
    return cur;
}
Figure 4.8: Code to find the highest node in a skiplist with priority ≤ n
In most applications, the size of the skiplist is unknown at construction time. This com-
plicates the process for deciding how many ‘next’ pointers to use. To resolve this, Pugh
assigns the number of ‘next’ pointers as a random variable with a probability distribution
designed to give the correct number of pointers on average if the final list size was n. This
clever hack is not needed here since the beam size is known and does not change.
void skiplist::insert(node * n) {
    node * place = find(n);
    int i, curdepth, depth;

    // splice n in at every level that place and n share
    for (depth = 0; depth < min(place->depth, n->depth); depth++) {
        n->next[depth] = place->next[depth];
        place->next[depth] = n;
    }

    // fill up the remaining levels of n->next by walking along the list
    if (n->depth > place->depth) {
        node * cur;
        curdepth = place->depth + 1;   // we've filled this far
        cur = n->next[0];
        do {
            if (cur->depth > curdepth) {
                for (i = curdepth; i <= cur->depth; i++)
                    n->next[i] = cur;
                curdepth = cur->depth;
            }
            cur = cur->next[0];        // advance to the next node in the list
        } while (curdepth < n->depth);
    }
}
Figure 4.9: Code for inserting a node into a skiplist
Doubly-linked skiplists
As a small but extremely useful extension to Pugh’s idea, I implemented doubly-linked
skiplists (analogous to doubly-linked lists). This gives O(1) access and insertion to both
the start and end of the list rather than just the start. As with conventional skiplists, doubly-
linked skiplists have O(lg n) insert and lookup. Because the first item can be popped in O(1),
the list can be iterated over in O(n). This iteration is the main operation in add singles stops .
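As a rough sketch of what the doubly-linked variant adds (the names are invented for illustration, not the actual class), each node gains a level-0 backward link and the list keeps a tail pointer alongside the head pointer:

struct dnode {
    double priority;
    int    depth;
    dnode *next[16];         // forward links, one per level
    dnode *prev;             // level-0 backward link added by the doubly-linked variant
};

struct dskiplist {
    dnode *head;             // best element: popped in O(1) when iterating
    dnode *tail;             // worst element / spare slot: also reachable in O(1)
};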
Evaluation of doubly-linked skiplists for beam search
Having developed a suitable data structure, we apply it to beam search. The benefit of
having a pointer to the last element in the skiplist is in memory management. The definition
of a beam search means we always need at most n elements in the beam, so the back pointer
gives O(1) access to an unused element. The operation of the beam search is then to pop this
unused element, fill it with the new child, and push it back to the beam. Because most new
children turn out immediately to be below the threshold, no elements in the beam need to be
moved and so constant time performance is usually obtained. By allocating n + 1 nodes for
a beam of length n, we can provide add singles with an empty node in almost constant
time by simply returning the last node in the list. This node is filled with values of the child
being generated, including a priority. It is then reinserted, most commonly at the start or
end of the list which is almost constant time, but even an insertion into another location is
O(lg n).
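A minimal sketch of this recycling scheme is given below; the types and method names (pop_back, insert and so on) are assumptions made for the illustration rather than the actual interface.

struct edge { /* the parse edge being scored */ };

struct node {
    double priority;
    edge   data;
    // skiplist links omitted for brevity
};

struct beam_t {
    node *pop_back();        // O(1): detach the spare/worst node at the tail
    void  insert(node *n);   // O(lg n) in general, usually O(1) at either end
};

// Recycle the spare (n+1)-th node instead of allocating a new one per child.
void beam_insert(beam_t &beam, const edge &child, double priority) {
    node *spare = beam.pop_back();   // take the unused node in O(1)
    spare->data = child;             // fill it in place -- no malloc or free
    spare->priority = priority;
    beam.insert(spare);              // usually lands straight back near the tail
}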
For a more quantitative analysis, Figure 4.10 shows the time taken by the skiplist to pop
an element off the front of the beam and insert it with a random priority. Normally it would
be appropriate to show a number of different graphs, examining the time if elements are
popped off the back, or inserted with non-random priority. But this graph clearly demon-
strates that the timings are so close to being independent of beam size, and linear in the
number of insertions, that there is no benefit in examining performance further. Similarly,
profiler results are as would be expected: three-quarters of the time is spent inside find ,
with approximately one-eighth of the time spent inside both push and pop . When used in
beam search, these figures are expected to change so that push and pop take a greater pro-
portion of the time. This is because the heuristic evaluations are typically either high or low
rather than uniformly distributed, which will make execution of find significantly faster,
without affecting push or pop .
Another parser-related issue is how to interleave the calls to add stops with the calls to
add singles . One choice is to embed the call to add stop within the call to add singles .
This will mean that children are inserted onto the beam ahead of nodes at the first layer, so
that some probable grandchildren will be generated before all the children are generated.
Another choice is to place all generated children onto a second beam and then process them
Figure 4.10: Time taken by the skiplist to insert random elements with
different beam sizes (time plotted against beam size and number of insertions)
all at once using add stops . I took the latter choice mainly because it makes it easier for
the probability model to cache intermediate values.
Overall, doubly-linked skiplists have proven to be a novel and efficient method of imple-
menting beam search for large n. Contrasting this approach to the standard in the literature,
a heap-based approach should outperform the skiplists for small values of n, but be outper-
formed by the skiplist for large values of n. As with the chart, it is interesting to compare
my approach to Collins’. It turns out Collins does not actually implement classical beam
search, but instead uses an array of edges being expanded with a threshold — if an edge is
a certain amount worse than the best edge then it is discarded. This is significantly simpler
and somewhat more efficient than my approach. However, calling it beam search is stretch-
ing the definition. In order to more accurately simulate Collins’ results I also implemented
an array approach. From a wall-clock perspective the two approaches perform identically
— that is, inserting n items into the skiplist is sufficiently close to linear that the difference
between it and true linear is not a performance concern.
4.7 Some software engineering lessons learned
This concludes discussion of my reimplementation of Collins’ parsing system. However,
before presenting some results to summarise the performance of my system, it is useful to
summarise some of the software engineering principles which are important in the devel-
opment of a piece of software as complicated as a lexicalised statistical parser. Most of these
principles were learned the hard way, and anyone who wants to implement a similar sys-
tem of their own would do well to take them on board from the outset. This section doesn’t
address coding issues, but rather focuses on the software engineering processes used. It
largely concentrates on the mistakes made, and how to avoid them, but occasionally men-
tions something which worked well, especially if Collins did it differently.
There is a famous quote: “Plan to throw one away; you will anyhow.” (Brooks, 1982).
I do not have time to write the system again, but I will note here how I would implement the
system better. This is not how I would improve on the system (see Section 8.2.1 for that) but
how the results of this chapter could have been achieved with less work.
4.7.1 Programming languages for statistical parsing
Initially my parser was implemented in Lisp, with the naive implementation taking only
slightly more lines of code than the pseudocode. It was far too slow; millions of incorrect
interpretations were created, rejected and garbage collected for every correct phrase found.
It would have been possible to preallocate structures to reduce the garbage collection but
this would have defeated the point of using Lisp. Since time spent preprocessing the corpus is
irrelevant, the Lisp preprocessor from this implementation is still used.
The second incarnation was in the language Clean (Plasmeijer, 1998), a functional lan-
guage similar to Haskell. This language supports lazy evaluation and it was hoped this
could be used to avoid generating unlikely phrases. It appears this is possible and it would
be a very promising project, but while it took virtually no time to hack up prototypes in
Clean and writing efficient Clean is entirely possible, it took me longer to write good code
in Clean than it took to write good code in other languages. My conclusion from this is
that Clean is naturally suited to tasks where you understand what you have to do, and un-
suited to tasks where you have little idea how you are going to approach the problem. One
interesting component completed before this implementation was abandoned used the C in-
terface to Clean to access a SQL database. This approach turned out to be an excellent way
of performing tasks Clean was not well suited to (I/O and memoisation). Were I to reimple-
ment the parser again I would use Clean with the C interface to handle memoisation. This
is because I now know exactly how to design the parser and well designed Clean code is a
joy to read compared to well designed C code.
A number of languages were then considered for a third reimplementation now that
some properties of the problem were known. The language had to be capable of fast execution
and had to permit breaking the rules in the core loops. At the same time the language had
to be relatively readable and modular since my goal after writing the parser was to make
extensions and so I did not want to be locked into Collins’ design. Java was rejected because
it provided no means of avoiding the garbage collection problems that plagued the Lisp
implementation. Specifically, it is easy to write simple Java code and it is possible to write
efficient Java code with my own memory management but it is certainly not easy to write
simple and efficient Java code. The same argument would apply to C#, Python, Perl, and
any similar language that abstracts away pointers.
The final program was written in C++. The object oriented approach was chosen over
ANSI C since it was more suited to reimplementing components. Having already written
the program twice I hoped to only have to rewrite components rather than the entire system.
4.7.2 Revision control
Anybody building a nontrivial program will use a source code control system such as CVS
or Subversion. We found that simply using version control is insufficient since, for instance,
improvements to the preprocessor would often break the parser, which depended on the
older format for the data files. What became necessary was to branch the code so that
the parser was developed with a stable version of the preprocessor while extensions to the
preprocessor were developed in a separate branch. Later, when the preprocessor and the
parser were relatively stable, the preprocessor could be switched to a new version and all
resulting incompatibilities fixed.
Another related step was the development of a build script. There are a large number of
steps involved in converting the treebank and other data into a format suitable for parsing.
It is relatively easy to perform these steps sequentially. However that means any change to
one of the earlier steps (such as a tweak to the tokeniser) requires every subsequent step to
be repeated. Since there is usually output from the previous version lying around, it was
often the case that output files from different versions of the code would be used at the same
time, leading to subtle errors.
Finally, version control applies to files rather than subroutines. I often found that I
needed to write almost identical blocks of code, but the differences were such that I could not
write a general function, perhaps because the differences could not be expressed as function
arguments in C++. Initially I simply wrote the same code twice, but this invariably led to
bugs being fixed in one version and not in another. My solution was to use source
code preprocessing so that our single ‘meta’ version generates multiple functions, each with
slightly different logic. I used the tool funnelweb (Williams, 1992) for this purpose. A nice
advantage of this over general functions is that the resulting code is much easier to read
than a highly generalised function full of if statements for its various options.
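Purely as an illustration of the 'one meta version, several generated functions' idea (my actual implementation used funnelweb, whose syntax differs, and the functions below are invented for the example), the same effect can be sketched with the C preprocessor:

// One 'meta' definition expands into both scan directions, so a bug fixed in
// the macro is automatically fixed in every generated function.
#define DEFINE_SCAN(NAME, FIRST, COND, STEP)        \
    int NAME(const int *deps, int n) {              \
        int count = 0;                              \
        for (int i = (FIRST); (COND); i += (STEP))  \
            count += deps[i];                       \
        return count;                               \
    }

DEFINE_SCAN(count_left_deps, 0, i < n, +1)        // scan left to right
DEFINE_SCAN(count_right_deps, n - 1, i >= 0, -1)  // scan right to left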
4.7.3 Efficiency and debuggability
Premature optimization is the root of all evil
– C. A. R. Hoare
Tony Hoare’s quote is frequently used to discourage optimisation before profiling. Through-
out this implementation, I found the opposite to be true. Every time I wrote code for cor-
rectness instead of efficiency I found the parser could not complete a single sentence. Even
when the parser still worked, the decreased efficiency made the parser harder to debug be-
cause it took longer to test other parts of the unoptimised parser than it would have taken
to optimise the current part. To take one random example, the grammar was initially im-
plemented using the STL set class since it saved me having to write my own set classes.
However, grammar lookups are performed in both join two edges and add singles
and so while the STL implementation took virtually no time to verify, it took two hours to
parse a sentence which made it harder to concentrate on other parts of the parser than if
I had written and then debugged my own. Essentially, because I am pushing the limits of
what can be achieved with available technology, it is essential to optimise prematurely, even
though it is still undesirable to do so. It is because of this that throughout this chapter I have in-
terleaved implementation and optimisation techniques; perhaps some of the optimisations
are superfluous but delaying optimisation was not possible.
A closely related point is that the most efficient data structure is harder to debug. For
instance, my hash keys in Lisp are arbitrary precision, making it very easy to map keys back
to data values and detect bugs in key generation. However in C, and therefore in Collins’
implementation, we are constrained to thirty two bits which is likely to be more efficient
but cannot be mapped back so easily. Similarly, Collins uses array offsets to refer to edges
in the chart where we use pointers. Depending on compiler optimisations our code may be
slightly faster as a result, but tracking an edge through parsing is much easier in Collins’
system.
‘Magic numbers’ are another area in which bugs can easily creep into the system — for
instance, setting the maximum number of nonterminals to 100 might be correct at first, but
later adding -C complements could easily overflow this and lead to data corruption. I man-
aged to avoid many of the problems here by automatically generating the declarations of
constants from the input files, so any change to the input files will automatically appear in
the source code. Similarly, many functions in the probability model take a dozen or so parameters,
and getting these in the wrong order will not cause any type errors since they are all integers;
it will just generate invalid output. This problem was avoided by implementing the basic
datatypes as distinct classes, so that an incorrect argument order does result in a type
error. Curiously, Collins uses magic numbers everywhere.
4.7.4 Debugging methodology and test suites
Debugging the parser turned out to be extremely difficult. It is not so hard to detect the
presence of a bug, but isolating where in the process this bug is introduced could take a
week. In a normal program a bug can be isolated by stepping through its operations on
simple input but with a statistical parser there are far too many operations to do this for even
the most trivial input. The best approach I found was to spend a lot of effort detecting bugs
as soon as possible after they are introduced. For instance, if a bug in the tokeniser leads to
a small number of events not being generated then it is critical to detect this problem during
the generation of the event file rather than during the execution of the parser.
In order to facilitate this, after testing every function I wrote an automated test suite that
rechecks functions every time the system is built. For example, the probability model can
be checked by comparing the counts it derives to those produced with grep . If a bug is
later introduced in the input to this function then it will likely cause some test-case to fail.
Similarly, the system is liberally scattered with assert statements that perform everything
from internal bounds checking to checking that the skiplist is in sorted order and still has n
elements. As a last resort, I also made extensive use of the mprotect system call to lock
any data that was not currently being edited (such as the hash tables). This allowed me to
catch a number of bugs where I had missed an assertion.
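A sketch of the memory-locking idea is given below, assuming a Linux/POSIX system; the table here stands in for the parser's hash tables, and the helper names are invented for the example.

#include <sys/mman.h>
#include <cstddef>
#include <cstdlib>

// Allocate a table in whole pages so it can be write-protected whenever it is
// not being edited; a stray write then faults immediately instead of silently
// corrupting the data.
void *alloc_table(std::size_t bytes) {
    return mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

void lock_table(void *table, std::size_t bytes) {
    if (mprotect(table, bytes, PROT_READ) != 0) std::abort();              // read-only
}

void unlock_table(void *table, std::size_t bytes) {
    if (mprotect(table, bytes, PROT_READ | PROT_WRITE) != 0) std::abort(); // writable again
}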
A related technique that proved to be useful was to design with debugging in mind.
For instance, the first implementation of the chart did not include any functions to query
or summarise the chart because such functions were not needed by the parser. However, in
debugging other aspects of the parser I frequently found myself wondering if certain edges
had made it into the chart or if they had been culled before then. By implementing these
extra print features it is easier to debug other parts of the program.
A final comment is that I found high-level debugging to be much less useful than low-
level debugging. For instance, by examining the sentences the parser performs poorly on
it may be possible to infer it has a problem, perhaps one related to coordination. But this
approach turned out to be significantly more time-consuming than simply verifying every
function independently, mainly because the parser was too big to locate a bug once the
high-level approach had established that one existed.
4.7.5 Naming of variables and parameters
Collins’ code is frequently hard to read, because functions are called with parameters whose
meanings are hard to remember. For instance, to access a left dependency event, we call the
function get dep prob with the fifth parameter set to 1, while to access a right depen-
dency event, we set this parameter to 0. The code would be easier to read if constants were
declared for ‘left’ and ‘right’, and used in invoking the function. There are many other ex-
amples in Collins’ code: start nonterminals, subcategorisation frames, nonterminal names
are all accessed using nonintuitive parameter names. Collins also has a mapping between
nonterminals/words and numbers which is inaccessible outside the program — the com-
puter doesn’t care, but the person using the debugger has very little idea which NT is meant
by 53. I did my best to store all magic numbers in explicit global constants. Later I put them
into environment variables to make it even easier for different programs to use the same
numbers.
A similar point is the explicit use of types. These aren’t supported particularly well by
current programming languages but the ability to have words stored in a ‘Word’ type that
is explicitly made incompatible with the ‘Tag’ type prevents a lot of errors such as calling the
left production with the arguments (lw,lt,lnt,P,H,w,t,delta,lc) instead of (lw,lt,lnt,w,t,P,H,delta,lc).
This is a very painless way of catching otherwise very hard-to-find bugs.
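A minimal sketch of this idea in C++ follows; the wrapper types match the text, while the function is a hypothetical stand-in for the real probability-model calls.

// Distinct wrapper types make a swapped word/tag argument a compile-time
// error rather than a silently wrong probability.
struct Word { int id; explicit Word(int i) : id(i) {} };
struct Tag  { int id; explicit Tag(int i)  : id(i) {} };

double left_dependency_prob(Word lw, Tag lt, Word headword, Tag headtag);

// left_dependency_prob(Tag(12), Word(53), headword, headtag);  // wrong order: will not compile
// left_dependency_prob(Word(53), Tag(12), headword, headtag);  // correct order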
One area in which Collins got it right and I got it wrong, was in referencing edges. I used
pointers, while Collins used array indices. Again, either is equally simple for the computer
but Collins’ indices make it easy for a person debugging the code to trace where every node
comes from — the operation of finding which operations Collins’ performed in the best
parse is trivial in his parser, but quite a chore in mine.
4.8 Results of the parser
We are finally at a point where we can evaluate the parser. All evaluation will be performed
using the ‘evalb’ tool that Collins produced to evaluate his own parser. We start with a
brief evaluation of Collins’ parser and then examine both the preprocessor and the parsing
algorithm using my system, evaluating after each step. At this point we have my complete
system, and so we analyse its time complexity; finally we analyse the
errors made by my parser, starting with the differences between my parser and Collins’,
and then examining the errors made by both parsers. For all of the evaluations, we will use
Collins’ convention of ignoring sentences whose length is greater than forty.
4.8.1 A re-evaluation of Collins’ parser: precision and recall
Before we begin to analyse the performance of my system, I will present an analysis of
the performance of Collins’ system in Table 4.3. This table was generated using the parser
Collins released to support (Collins, 1997), parsing Section 23 of the WSJ. These results do
include the effects of some modifications I made in order to ease later modifications. These
modifications included the addition of tracing and debugging code, as well as hooks for
future extensions. The benefits of these extensions will be examined later, but one effect of
them is that the results presented here do not precisely match those published in (Collins,
1997). My goal throughout this chapter has been to produce a parser which can reproduce
this table.
Later Collins improved on his model, largely by correcting the generation of coordina-
tion and punctuation. Results from running the improved version are presented in Table 4.4.
Number of sentences = 2245
Number of Error sentences = 2
Number of Skip sentences = 0
Number of Valid sentences = 2243
Bracketing Recall = 85.18
Bracketing Precision = 85.05
Complete match = 24.83
Average crossing = 1.01
No crossing = 65.05
2 or less crossing = 85.38
Tagging accuracy = 96.50
Table 4.3: Results from Collins’ 1997 parser including my code hooks
(Predictably, these are almost identical to the performance given in (Collins, 1999, p. 190)).
Because my parser was already designed before this version was released, it was decided
not to modify my parser to match the newer version. Some initial work has been completed
Number of sentences = 2245
Number of Error sentences = 2
Number of Skip sentences = 0
Number of Valid sentences = 2243
Bracketing Recall = 88.52
Bracketing Precision = 88.68
Complete match = 36.07
Average crossing = 0.92
No crossing = 66.70
2 or less crossing = 87.12
Tagging accuracy = 96.74
Table 4.4: My evaluation of the parser in Collins’ thesis (Collins, 1999)
in copying improvements in the probability model to Collins’ newer version, but for now I
note that the goal was to build a parser which could later be modified and so it was not es-
pecially important that the best version of Collins be used (for example, Model 2 was chosen
rather than Model 3 as Model 2 is simpler to implement).
4.8.2 Evaluation of my preprocessor and parser: precision and recall
The preprocessor and the parser are two totally separate pieces of code. Because there were
a lot of decisions in the development of the preprocessor that Collins did not mention (see
Bikel (2004)), it is important to evaluate it independently of the parser. For instance there
is ambiguity in the headword rules and if these are executed incorrectly then this can be
expected to lower the parser’s accuracy but it does not mean there are any errors in the
parser. In order to test this I evaluated my system using Collins’ event file; the results are shown
in Table 4.5. It would be desirable to also evaluate Collins’ parser using my preprocessor, but in
doing so I was unable to obtain over 60% precision/recall so there is clearly still a bug in my
preprocessor. It is interesting that even with well over ninety percent of the events generated
correctly, the parser’s performance is abysmal.
Number of sentence = 2245
Number of Error sentence = 4
Number of Skip sentence = 0
Number of Valid sentence = 2241
Bracketing Recall = 84.91
Bracketing Precision = 85.30
Complete match = 24.59
Average crossing = 1.06
No crossing = 64.66
2 or less crossing = 85.10
Tagging accuracy = 96.44
Table 4.5: Results from my parser using Collins’ preprocessor
Since Table 4.5 is extremely similar to Table 4.3, we can conclude that the parser is work-
ing correctly. To verify this, we also evaluated my parser’s results while using Collins’
parser’s output as a gold standard; the results are shown in Table 4.6. Essentially this table shows that my parser
works almost identically to Collins’. Unfortunately, I was unable to precisely reproduce
Collins’ preprocessor output. Since the important part of a statistical parser is the parser, it
was decided to simply continue reusing Collins’ preprocessed events.
There are still a few errors highlighted by this table (a perfect reimplementation would
obtain a 100% exact match). However, an analysis of these errors showed they occur almost
exclusively in sentences containing awkward coordination and punctuation. This strongly
implies the discrepancies are caused by conditions that Collins treats as special cases. We
are not interested here in reproducing Collins’ parser to the level of detail of having to reim-
plement all special cases, and so we consider this implementation more than accurate enough
Number of sentences = 2245
Number of Error sentences = 6
Number of Skip sentences = 0
Number of Valid sentences = 2239
Bracketing Recall = 94.33
Bracketing Precision = 94.92
Complete match = 76.33
Average crossing = 0.57
No crossing = 84.77
2 or less crossing = 91.69
Tagging accuracy = 98.32
Table 4.6: Results from my parser using Collins’ output as a gold standard
as a basis for extensions.
One curious result that I discovered in the creation of the above tables was that precision
and recall drops as more of Section 23 is parsed. That is, the performance on the first quarter,
half, etc. of Section 23 is invariably higher than the final precision and recall figures. I
do not know if this is coincidental, a side effect of people developing the treebank being
inconsistent, or a side effect of Collins’ probability model being tuned based on results from
earlier parts of Section 23. Regardless, it is an interesting result which has not been reported
elsewhere. It is also a result worth remembering when debugging the parser since the parser
needs to be doing better than expected at first in order to end up at the expected value.
4.8.3 The complexity of Collins’ and my parsers
As well as showing that my parser is able to reproduce Collins’ results, it is important to
show it has the same time complexity. Efficiency is not a major concern in this project,
except inasmuch as is necessary for the parser to produce results fast enough for debugging
and evaluation. However if the parser has different complexity then it implies the system
will not scale.
Figure 4.11 shows a scatter-plot of time taken versus sentence length for my parser. It is
plausible that this figure shows the O(n^3) behaviour that Collins predicts, but it is hard to be sure
without linear regression. Plotting the logarithm of the time against the logarithm of the sentence
length, as shown in Figure 4.12, produces a graph which is clearly linear, showing that the
complexity is polynomial — had the graph been nonlinear, the complexity would not have been
polynomial (it might, for example, have been exponential). Linear regression on the log values
gives k = 3.5 with very little error (residual = 0.55, R² = 0.93), so we can conclude the
complexity is O(n^3.5). Performing the same calculation
Figure 4.11: Scatter-plot of time taken by my parser to parse sentences of
different lengths
Figure 4.12: Scatter-plot of log(time) versus log(sentence length) — the
gradient is the parser’s complexity
on Collins’ parser gives a similar but slightly better result, k = 3.05 at a similar confidence
level. It is likely that Collins’ slightly better complexity is due to my reimplementation being
much more cautious in its dynamic programming, keeping edges which do not take part in
the final result in order to make it easier to debug any errors.
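For reference, the exponent reported above is simply the slope of an ordinary least-squares fit to the logged data; a minimal sketch of that calculation (not the script actually used for the analysis) is:

#include <cmath>
#include <cstddef>
#include <vector>

// Fit log(time) = k*log(length) + c by least squares; the slope k is the
// estimated complexity exponent.
double complexity_exponent(const std::vector<double> &lengths,
                           const std::vector<double> &times) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(lengths.size());
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        double x = std::log(lengths[i]);
        double y = std::log(times[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);   // the slope k
}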
4.8.4 Evaluation of my parser with my new POS tagger
In the evaluation of the POS tagger, I noted that its true benefit was not in its accuracy,
but in how it reduced the number of edges in the chart. Any errors in accuracy can be
trivially resolved by selecting the second-best tag, while making the chart smaller should
make virtually every loop in the parser shorter and so speed up parsing. Since Charniak has
published results showing the inclusion of multiple tags did not significantly increase his
parser’s accuracy (Charniak et al., 1996), I did not expect to see any improvement outside of
efficiency.
This hypothesis proved to be correct, with the number of edges in the chart dropping by
around fifteen percent which leads to a corresponding drop averaging thirty-five percent in
parsing time. The primary reason my parser is significantly slower than Collins’ is that it
keeps many more candidates and so this relatively large increase in efficiency would be ex-
pected to be much lower if the same technique was applied to a parser with less conservative
dynamic programming.
Parsing precision and recall were 84.98 and 84.91, a tiny and almost certainly coinciden-
tal increase over the parser without the tagger. Precision and recall can also be measured
against the old output of the parser without the tagger, which gives 98.48 and 98.39. This
result tells us that the inclusion of the tagger really does make very little difference, rather
than causing the parser to make a similar number of different errors.
4.8.5 An analysis of the errors in Collins’ parser
Collins’ parser may be the current state-of-the-art, but it still makes many errors; the results
just presented show that the parser makes some error in three out of every four sentences.
So, what are those errors, and what can we do to eliminate them?
Intuition would say that longer sentences lead to more errors. If this is so, then we should
concentrate on extending the parser’s beam and otherwise keeping alternatives available
for longer. However, as Figure 4.13 shows, there is very little correlation between sentence
length and parsing accuracy. More formally, we can state that the correlation coefficient
between sentence length and accuracy is somewhere between -0.22 and -0.14, so there is a
slight correlation, but not enough to justify further work (approximately three percent of the
variance in accuracy is caused by sentence length).
Figure 4.13: Parsing accuracy versus sentence length (precision against sentence length).
Instead we have to look for other sources of errors. Eyeballing the output implies that
sentences containing rare words are frequently parsed incorrectly. See for instance Figure
4.14 which shows the sentence He’s a NOUN parsed correctly and incorrectly respectively.
Figure 4.14: Two parse trees showing that changing ‘bore’ to ‘fool’ corrects
the parse. (In the correct tree the verb VBZ/’s heads the VP and takes NP-A/fool,
i.e. DT/a NN/fool, as its object; in the incorrect tree VBD/bore is taken as the
main verb and He ’s a is grouped into a single NP-A/’s.)
The principal difference between these two sentences is that in the correct parse the parser
had available accurate statistics for all the words used, while in the second parse it had to
resort to using the POS tag for the key word bore. Further anecdotal evidence for this hypoth-
esis is provided in Tables 4.7 and 4.8 which show a random sample of correctly parsed and
poorly parsed sentences respectively. In these tables, id is the reference number for the sen-
tence, Rank is a simple metric for measuring the frequency of the words in the sentence (the
rank of a word is its position in a list of the words ordered by decreasing frequency, so the
most frequent word has a rank of one). Finally, Cause (in Table 4.8) is my interpretation of the most fundamental
error in the parse. It is clear from these tables that rare words are a significant problem. The
id      Sentence                                                                          Rank
1838    When selling is so frenzied, prices fall steeply and fast.                        2657
704     Revenue gained 6% to $2.55 billion from $2.4 billion.                            12770
105     The $409 million bid is estimated by Mr. Simpson as representing 75% of
        the value of all Hooker real-estate holdings in the U.S.                          4455
391     Although Mr. Pierce expects that line of business to strengthen in the next
        year, he said Elcotel will also benefit from moving into other areas.               21
135     The Boston firm said stock-fund redemptions were running at less than
        one-third the level two years ago.                                                  76
837     In the previous quarter, the company earned $4.5 million, or 37 cents a
        share, on sales of $47.2 million.                                                12770
1354    But when something is inevitable, you learn to live with it,” he said.           21455

Table 4.7: A selection of correctly parsed sentences
advantage of this metric over frequency stems from Zipf’s law: with raw frequencies, extremely
frequent words would tend to swamp all comparisons and the difference between a word occurring
once and ten times would be invisible. Note that this metric causes the exponentially decreasing
frequency of words to produce a linear sequence which is the same effect as computing the
logarithm of the word’s frequency. For sentences, the most natural approach would be to
define the sentence in terms of its highest ranked word but this turns out to be a poor choice:
non-headwords are not used in unary productions and are only used once in dependency
productions. So instead we look for the highest ranked headword.
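For concreteness, the sentence-level metric just described amounts to the following (the function and its inputs are illustrative; the ranks would come from the frequency-ordered lexicon):

#include <algorithm>
#include <vector>

// The rank of a sentence is the rank of its least frequent (highest-ranked)
// headword; headword_ranks must be non-empty.
int sentence_rank(const std::vector<int> &headword_ranks) {
    return *std::max_element(headword_ranks.begin(), headword_ranks.end());
}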
In order to test this hypothesis more formally, we need to measure how the rank corre-
sponds to parse accuracy over the whole corpus instead of just a few sentences. A graph of
this is presented in Figure 4.15. It turns out that graphing this to highlight the trend is quite
difficult; there are more sentences with low-ranked headwords and so it is natural to see a
greater variation in accuracy. However, the peak on the left side grows disproportionately
faster than would be expected. This growth means that sentences whose headwords are all
relatively frequent are parsed correctly more often.
Since it is hard to see the trend, it is useful to analyse the data more formally by per-
forming linear regression on the data points. It may seem counter-intuitive to use linear
regression on data that is clearly nonlinear, but linear regression is still useful on nonlinear
data because it tells us how the mass of the data changes as we increase one variable, which
gives us the global trend we are looking for. Specifically the gradient is −120 (with an error
of up to 40) which proves that less frequent words lead to a worse parse. If the least frequent
id    Sentence                                                                    Prec.  Rank   Cause
94    Earlier the company announced it would sell its aging fleet of Boeing      26%    9044   because PP very poorly attached
      Co. 747s because of increasing maintenance costs.
1758  My colleagues and I fully realize we are not a court . . . etc.”           31%    13742  adverbial completely misinterpreted
1111  Call it the “we’re too broke to fight” defense.                            42%    12365  to fight not attached inside the quote!
1767  Of course, Mr. Lantos doth protest that his subcommittee simply seek       43%    46203  protest... interpreted as a NP
      information for legislative change.
1550  Here’s what Ronald Reagan said after the 1987 crash: “The underlying       47%    32234  Not interpreted as two phrases around the colon
      economy remains sound.
1370  But that was all of three months ago.                                      50%    20701  Many errors, including adverbial

Table 4.8: A selection of poorly parsed sentences
headword in the sentence is one hundred ranks less frequent, we would expect the final
precision/recall to be, on average, one percent worse. From this, we can finally conclude
that sentences containing rare words are parsed less accurately. Note that this result does
not contradict Klein and Manning’s results; for instance one hypothesis that supports both
results is that tags are usually sufficient but when they are not, words are necessary.
4.9 Summary
At this point it is reasonable to ask what has been achieved. In brief summary: this chapter
described a reimplementation of Collins’ parser. The reimplementation takes twice as much
code, obtains very slightly lower performance and is significantly slower. On the other hand,
the modular way in which the parser has been written makes it easier to change how the
system works.
The remainder of this thesis will describe one such change, which addresses the difficul-
ties with rare words just outlined in Section 4.8.5.
Chapter 5
Thesaurus-based word representation
The results of the previous chapter show Collins’ algorithm performs sub-optimally with
uncommon words. The reason for this can easily be seen by examining the backoff rule in
Equation 3.4. Put simply, if a word is uncommon it is discarded and the parser essentially
reverts to using a probabilistic context-free grammar. The significance of this in practice
is unknown, but since Zipf’s law says most words are uncommon and to date the parser
has only been tested in the training domain it has the potential to be very significant. I have
therefore decided to concentrate the second phase of my PhD on resolving this problem. My
hypothesis is that poor backoff is the largest remaining problem in statistical parser accuracy
on sentences in the same genre as the training corpus, and a significant reason why statistical
parsers do not perform well with unedited text or outside the domain of their training data.
To solve the sparse data problem, it is not possible simply ‘to build a bigger training cor-
pus’. Firstly, this would be extremely expensive; WSJ-style corpora need to be constructed
by human annotators. Secondly, Zipf’s law means even a slight improvement would require
a much larger corpus. Thirdly, it does not help with shifting domain. And finally it has little
academic interest. The main suggestion I want to pursue is that we should back off rare
words by grouping words into categories of semantically related words — in other words, by
adopting a level of backoff in between single words and parts-of-speech.
In this chapter and the following one, I consider the question of how to generate rep-
resentations of words which allow the semantic similarities between words to be made ex-
plicit, so that they can be grouped for the purposes of backoff. (The process of using these
word representations in a backoff scheme for the parser will be dealt with in Chapter 7.) The
present chapter motivates the ideas of thesaurus-based backoff, and gives a review of the
literature on automatic generation of word representations. In Section 5.1, I provide support
for the general idea of backing off by grouping semantically related words by giving some
concrete examples. In Section 5.2, I provide some criteria for kinds of measures of semantic
relatedness which will be useful for our purposes. In Section 5.3, I will survey the literature
about how to produce measures of semantic relatedness between words, and evaluate sev-
eral different proposals according to these criteria. I will argue that the best measure for our
purposes is one devised by Hinrich Schutze (1993).
5.1 An example of the benefits of grouping similar words
The normal method for representing words in a statistical parser is as a simple enumeration.
This might be alphabetical or in the order the words happened to occur in the training cor-
pus, so that for instance cat might be encoded as 4424 , while cataclysms is encoded as 4425 .
Using this method there is clearly no semantic correlation between words with a similar en-
coding and so the parser does not use the word encoding as anything more than part of the
hash-table key.
Since many words do not occur frequently in the training corpus, we do not have useful
counts for them. For instance, cataclysms only occurs once in the WSJ: ... such short-term cata-
clysms are survivable ... and yet forcing every use of cataclysms to this grammatical structure,
or even strongly favouring it, would be completely incorrect. Collins’ probability model,
and that of every other statistical parser, would replace cataclysms with its POS tag NNS, (or
to be more technically correct, generate a new hash-table key which does not include the
word, but still includes the encoded POS tag). However, discarding the word discards a
significant amount of information useful to parsing. For instance, cat has the same POS tag
but being both concrete1 and animate has significantly different usage.
An alternative encoding might still give cataclysm the value 4425 but give 4426 to
calamity , 4427 to convulsion , and so on. A more general word, such as disaster may
get an encoding of 442 . Such an encoding would mean that if the counts for cataclysms
were insufficient, it could be replaced by or grouped with those for disasters or calamity ,
etc. depending on the context. This new encoding provides more information than discarding
the word entirely, while significantly increasing counts.
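Purely to illustrate the idea (the codes, counts and threshold are invented for the example), such an encoding could be backed off by truncating the code one digit at a time:

#include <map>

// If the count for the full code is too small, back off to progressively
// shorter prefixes (4425 -> 442 -> 44) until enough evidence is available.
int backoff_code(int code, const std::map<int, long> &counts, long threshold) {
    while (code >= 10) {
        std::map<int, long>::const_iterator it = counts.find(code);
        if (it != counts.end() && it->second >= threshold)
            break;          // enough evidence at this level of granularity
        code /= 10;         // drop the last digit: move to a broader class
    }
    return code;
}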
It may seem that this example is somewhat contrived; after all, how much does mis-
understanding cataclysm affect real-world performance? However this view ignores many
things. Firstly, Zipf’s law shows these rare words make up the bulk of the language, and secondly,
the WSJ is a terrible sample of real-world language usage – for instance, it only has seven
uses of the word cat, and kiss occurs exactly once: Unamused, residents burned Rand McNally
books and wore t-shirts that said: “Kiss my Atlas.” It would be hard to infer useful generalisa-
tions about the usage of kiss based on this example, but replacing it with VB wastes a lot of
usage information compared to replacing it with a simple category word.
1While the dictionary has the first definition of cataclysm as concrete, it appears to be used in the abstract
more often.
5.2 Criteria for semantic relatedness measures
In this section, I consider what kinds of semantic representation of words we want to pro-
duce. Clearly, we want a representation from which we can read off the degree of semantic
relatedness of any two words in the system’s lexicon. But there are several different schemes
which would allow this, and not all of them are equally relevant for the purposes we have
in mind.
5.2.1 Attention to infrequently occurring words
For the most common words we already have excellent usage statistics, and so there is no
need to back off. Noting this, the area that better word representations can improve is with
less common words. That is, we are interested in good representations for the less-used
portions of the lexicon. However, to say we are trying to improve the representation of rare
words is simplistic: we do not have sufficient counts for accurate statistics on perhaps ninety
percent of the lexicon. Unfortunately, as will be seen throughout this chapter, concentrating
on less frequent words is directly contrary to almost all popular word representations, which
concentrate on producing extremely good representations for common words and ignore
rare words entirely. Since the use of the representation here is to deal with low word counts,
that makes most representations useless. Therefore our main criterion in deciding which
representation to use will be how well it represents rare words.
5.2.2 Multidimensional representations of word semantics
One problem with the simple example presented was that replacing a word by its category
only works for words with one clear category. Cataclysm , like many other words, has at least
two categories. The approach also assumes that there is only one dimension of backoff
when for instance it would very likely be useful to back off between singular and plural
independently from semantic type. Another useful dimension, though not as important in
English, would be formality. Because of this, a vector-based (or n-ary) representation seems
potentially much more useful than a simple mapping between words and categories. This is
not a strict requirement, but given the choice of a vector-based representation and a simple
hierarchical representation, we would prefer the vector.
Another consideration is choosing a word representation which is appropriate for the
particular backoff technique we choose to implement in the parser. At the time when the
word representations were being developed, there were a few alternative techniques being
considered: keeping everything simple by discarding information in a fixed order; adding
a new level of backoff between words and tags much like the categories already discussed;
and replacing genprob with a neural network using the vector as input and the existing gen-
prob function to generate training data. I wanted to be able to compare different techniques,
which means the word representation should ideally support all three approaches.
5.3 A survey of approaches for computing semantic similarity be-
tween words
5.3.1 Hand-generated thesauri: WordNet and Roget
One solution is simply to use a manually compiled thesaurus like Roget (Chapman, 1992).
Consider the word formation. If the counts for formation are insufficient then instead of sim-
plifying formation to noun and including many unrelated words, we can simplify it to only
include members of its superordinate category (constitution, setup, build-up, etc.), which are
likely to indicate the correct usage of formation much more accurately.
However, Roget is not especially careful at ensuring all members of a category are inter-
changeable. This is because the thesaurus was written for humans, and so it is reasonable to
assume that readers are not going to make poor substitutions. WordNet (Miller, 1995) pro-
vides a similar approach to Roget in that it is a hand-written thesaurus, but it was designed
for processing on a computer and so is more careful to ensure all members of a category are
interchangeable.
For our purposes, both Roget and WordNet can be immediately rejected because they
have no entries for rare words, which are precisely what the system is intended to handle (or, cu-
riously, for extremely common words like he). What this means is that we have to move
towards methods for learning word clusterings.
5.3.2 Unsupervised methods for thesaurus generation
Since there are no suitable thesauri available, it is necessary to generate one automatically.
The field of unsupervised thesaurus generation is currently a big topic in NLP, probably
larger than statistical parsing, as will be demonstrated by the large number of approaches
examined below. The review in this chapter is not exhaustive, but covers the main alterna-
tive approaches, with a focus on those potentially suitable for our backoff application.
The basic method behind all thesaurus generation techniques is to use statistics about the
words which occur in a window around a given word to derive information about which
words are similar to one another. This is very similar to how the previous chapter used
counts for determining how likely an event is. We are using bigram statistics (counting how
often two words occur together) to estimate the mutual information between two words,
and looking for words with high mutual information. For example the words cat and dog
should be similar because you expect them both to occur close to words like collar, food, pet,
vet, and so on. The techniques in this chapter could be described as representing words
by separating out their mutual information from their dissimilar information. For example
we would first encode the common information between dog and cat, and then encode
whatever is peculiar to cat. While the techniques being discussed do not use mutual infor-
mation directly, it frequently underlies the technique and therefore it is useful to explain the
technique now.
Information Theory is a branch of statistics which studies the amount of useful informa-
tion contained within an event. Its relevance here is that a word can be viewed as an event
and so the tools from information theory can be applied to help decide on the best format.
In particular, Information theory includes the concept of a bit of information, where each
bit can be either true or false. All information can be represented by a number of bits, and
so every word can be represented as a string of bits. Intuitively, this is obvious since we are
already representing words as either a sequence of letters or a number, both of which are
just a string of bits.
Imagine for instance a HMM trained on the WSJ sentences. Running with no input, the
model would generate a random sequence of words that almost appear to be from the WSJ.
But by providing just a small amount of information at each decision point we can gently
push the model into generating any sentence from the WSJ. This extra information provided
could be considered to be an extremely efficient encoding of the sentence, and the more the
sentence we want to generate differs from the WSJ, the more information we will need to
provide to the model. Viewed this way, it is easy to see that if the model was most likely to
produce cat next, it should require very little extra information to produce dog instead, but
a significant amount to produce essence. The relevance of this is that many of the techniques
use an extremely similar approach, measuring the amount of surprise at seeing the next
word, and the amount of surprise is exactly the same as the minimum number of bits to
encode the event.
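For concreteness, the pointwise mutual information between two words can be estimated from the kind of window counts just described as follows; this is the textbook formula rather than the exact quantity any one of the methods below uses, and the parameter names are illustrative.

#include <cmath>

// 'both' is the number of windows containing both words, 'count1'/'count2'
// the individual word counts, and 'total' the number of windows in the corpus.
double pmi(double both, double count1, double count2, double total) {
    double p_joint = both / total;
    double p1 = count1 / total;
    double p2 = count2 / total;
    return std::log(p_joint / (p1 * p2)) / std::log(2.0);   // in bits
}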
5.3.3 Finch
Finch (1993) implemented the first successful approach to thesaurus generation. He pre-
sented an algorithm for automatically building dendrograms (i.e. hierarchical cluster dia-
grams) based on the similarity of usage in a corpus. Unlike the hand–generated approach,
Finch’s method derives a representation for every word in the input lexicon. The general al-
gorithm is to start with a two-dimensional array of bigram counts, such as in Table 5.1. Then each row
in the array is considered for similarity and all rows within a certain (hamming) distance are
combined. Since each row corresponds to a word, combining rows is equivalent to combin-
ing words to form a cluster. This method leads fairly easily and naturally to a dendrogram
(tree) representation, several subtrees of which are given in Figure 5.1. The algorithm is very
simple and is presented in Figure 5.2.
              bought   company    large   yesterday
computer         313      7825     1386         388
new             1174     19430     3386        7929
traded            63       849       66         500
at              1905     28881    10401        6508

Table 5.1: An example of bigram counts
An interesting property also present in Table 5.1 is that some words have high bigram
counts but low mutual information, because both of the words are very common. Take
for example at and company, one of the highest bigram counts in the table. If this value
was missing from the table and we were to infer it from the other values, we would note
that fifty-five percent of bigrams are in the company column and sixty-five percent in the at
row. Based on this we would estimate the ‘missing’ value at 34,700 ±190. The actual value
of twenty-nine thousand is significantly less than this, showing that seeing at reduces the
chance of seeing company. This shows the importance of scaling values, and a very similar
argument can be used to show the importance of using the hamming distance rather than
absolute distance.
Unfortunately, Finch’s algorithm could not be used to solve our backoff task be-
cause it could not scale sufficiently. Ignoring time complexity, Finch’s algorithm uses a
thousand-by-thousand matrix for a thousand word lexicon. Since counts in the cells have to
be added, four bytes per cell is an absolute minimum, leading to a four megabyte matrix in
Finch’s original paper. However the WSJ has a lexicon of fifty thousand words, leading to
nine gigabytes of required RAM to store co-occurrence statistics. At the time this approach
was investigated this was an impossibly large amount of memory, and no way of perform-
ing the task in two passes could be conceived. What is more, a dendrogram is not exactly
the form of word representation we want; as already discussed, we would prefer vectors
supporting multiple independent ways of clustering words. So Finch’s approach was not
pursued.
5.3.4 Brown et al.
Another early approach which showed considerable promise was that of Brown et al. (1992).
Their approach was to attempt to predict the next word based on history in exactly the
same way as a Markovian POS tagger works (such as the one described in Section 4.4).
Very briefly, we are predicting the current word wk given all the previous words w1 . . . wk−1, and we
are making the standard Markov assumption that this can be approximated by an n-gram.
Figure 5.1: A figure from Finch’s thesis showing the internal structure
from several parts of the dendrogram (one subtree groups personal pronouns
and related forms; another groups prepositions and particles)
For every word in the corpus
    For every word near this word
        Increment the count of these two words co-occurring
    Endfor
Endfor
For every row (=word) in the table
    For every row x in the table
        For every row y in the table
            If the hamming distance between x and y is small
                Add y to x's combine list
            Endif
        Endfor
        Create a new row z
        For every row y in x's combine list
            Add y's columns to z
            Delete y's row/column from the table
        Endfor
        Add z as a new row/column to the table
        Also save to the dendrogram that z is
            the parent of x and its combine list
    Endfor
Endfor
Figure 5.2: Finch’s dendrogram generation algorithm
Brown et al. used a deleted-interpolation trigram model and a corpus of a third of a billion
words.
This model is only directly useful in generating nonsense text which appears superfi-
cially like English. However, Brown et al. note that similar words will have a similar prob-
ability model, and that if we create some classes then we can assign words to the classes so
that words with a similar probability model get placed in the same class. They start with one
class per word, and then merge classes based on their similarity until only one class is left.
Results were good, as shown in Figure 5.3.
[Figure 5.3: Sample clusters from Brown et al.’s algorithm; one cluster contains question, charge, statement, draft, case, memo, request, letter and plan]
Expanding an approach to handle a large vocabulary is always problematic. Figure 5.3
was based on a vocabulary of just one thousand words. To handle larger vocabularies,
Brown et al. first cluster the one thousand most common words, and then assign every
other word to the category where it best fits. Again, results were good but this approach
was not used here as it does not provide any measure of difference between words, and the
hierarchical classification only applies to the most frequent words. Also, we would prefer to
derive a vector-based word representation rather than trees, as mentioned in Section 5.2.2.
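The toy fragment below sketches the bottom-up merging just described. Brown et al.’s actual criterion chooses the merge that loses the least average mutual information in the class bigram model; the sketch substitutes a much cruder similarity (smallest L1 distance between the classes’ next-word distributions), so it only illustrates the shape of the procedure, not their algorithm.

from collections import Counter
import itertools

def greedy_merge(bigrams, words):
    # bigrams: Counter over (w1, w2) pairs from a corpus; words: the initial one-word classes.
    classes = {w: {w} for w in words}

    def next_word_distribution(members):
        c = Counter()
        for (w1, w2), n in bigrams.items():
            if w1 in members:
                c[w2] += n
        total = sum(c.values()) or 1
        return {w: n / total for w, n in c.items()}

    merges = []
    while len(classes) > 1:
        dist = {name: next_word_distribution(members) for name, members in classes.items()}
        a, b = min(itertools.combinations(classes, 2),
                   key=lambda pair: sum(abs(dist[pair[0]].get(w, 0.0) - dist[pair[1]].get(w, 0.0))
                                        for w in set(dist[pair[0]]) | set(dist[pair[1]])))
        classes[a] |= classes.pop(b)       # merge class b into class a
        merges.append((a, b))              # record the merge, as in a dendrogram
    return merges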
5.3.5 Smrz and Rychly
Smrz and Rychly produced an approach similar to Finch’s in that it forms dendrograms based
on hierarchical clustering. The key difference is that they demonstrated their approach
working on a lexicon of forty thousand words (Smrz and Rychly, 2002). From the perspec-
tive of this thesis, this is an extremely useful improvement – there are thirty-two thousand
lexical entries in the WSJ so there is a strong likelihood that this approach can be used di-
rectly.
From a technical perspective, the algorithm used is presented in Figure 5.4 (copied from
Smrz and Rychly (2002)). Contrasting this with Finch’s, there are only minor differences. The
reason that Smrz and Rychly were able to process so much more data is a more careful
approach to the data representation. Rather than simply implementing a two-dimensional array
as Finch did, they used a full corpus processing tool supporting sparse arrays called CQP
(Christ, 1994).
function locateclust(id):
    Path ← ∅
    while clusters[id] not closed:
        Path ← Path ∪ {id}
        id ← clusters[id]
    foreach i ∈ Path:
        clusters[i] ← id
    return id

function hierarchy():
    foreach 〈rank, id1, id2〉 ∈ sortbgr:
        c1 ← locateclust(id1)
        c2 ← locateclust(id2)
        if c1 ≠ c2:
            clusters[c2] ← c1
            hierarchy[c1] ← hierarchy[c1] ∪ {〈c2, rank〉}
            hierarchy[c2] ← hierarchy[c2] ∪ {〈c1, 0〉}
    return hierarchy
Figure 5.4: Pseudocode of Smrz and Rychly’s clustering algorithm
Since Smrz and Rychly’s results were representations of Czech words, they could not
be used directly for backing off the WSJ. Instead the algorithm had to be run again on an
English corpus. Rychly sent me their code, which was a python wrapper around the corpus
toolkit CQP. CQP is a large toolkit that is useful for a number of areas in language analysis.
I eventually managed to get CQP working but was unable to get Smrz and Rychly’s code to
interface with it. It seems CQP had changed too much since the code was written. Given that
Smrz and Rychly’s method also only produces a dendrogram, rather than the vector-based
representations we are seeking, I eventually abandoned this method.
5.3.6 Lin
Lin has developed an automatic word clustering algorithm with some very impressive re-
sults (Lin, 1997). The distinguishing feature of Lin’s approach is that syntactic dependency
is used to resolve lexical ambiguity. For example, fence (sword fighting) and fence (selling
stolen goods) are different words. Different words with the same spelling are referred to as
either homographs (unrelated meanings) or polysemes (related meanings). Lin’s hypothesis
is that an automatically constructed thesaurus should have different entries for each sense,
much like manually written thesauri do.
Studies of human language use have shown that people can distinguish between word
senses using very little context (Choueka and Lusignan, 1985). The local context used by
Lin is defined in terms of the syntactic dependencies between the word and other words in
the same sentence. Lin uses an HPSG-inspired approach similar to Collins’ which looks at
the word’s subject, adjunct(s) and complement(s). For all of these he stores the result as a
triple containing the word, the relationship (sbj, adj, cmp), and the word it has this relation-
ship with. For example, in the sentence The boy chased a brown dog, the context stored about
boy is [boy, sbj, chase], and for dog [dog, adj, brown], [dog, cmp, chase].
As with word senses, which must be selected given just the spelling, this dependency information is not
present in the training corpus and must be automatically derived. Lin implemented a
broad–coverage parser for this purpose.2
With the output from the parser, Lin is finally ready to derive the thesaurus. The ap-
proach is to look at the raw triple list produced by the parser. Remembering that the result-
ing triples are of the form [word, relationship, object] = count, the similarity
between two words is defined as the number of identical triples, divided by the number of
triples that are not present for the other word. This gives the proportion of the time that the
two words are used in the same way. Below is the output from running the algorithm on the
word brief (Taken from Lin (1998)):
brief(noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, de-
position 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04, . . .
brief(verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, em-
power 0.05, summon 0.05, overrule 0.04, . . .
brief(adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended
0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06, . . . .
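As a rough sketch of the similarity computation described above (this is one reading of the definition given in the text, not Lin’s published information-theoretic measure), a word can be represented by the set of (relationship, other-word) contexts its triples record, and two words compared by the contexts they share versus the contexts only one of them has. The words and triples below are hypothetical, invented purely for the example.

def contexts(triples):
    # Drop the word itself, keeping only its (relationship, other-word) contexts.
    return {(rel, other) for (_, rel, other) in triples}

def similarity(triples_a, triples_b):
    a, b = contexts(triples_a), contexts(triples_b)
    shared = a & b
    unshared = (a | b) - shared
    return len(shared) / len(unshared) if unshared else float("inf")

# Hypothetical parser output for two words:
brief_triples     = {("brief", "cmp", "file"), ("brief", "adj", "legal"), ("brief", "adj", "short")}
affidavit_triples = {("affidavit", "cmp", "file"), ("affidavit", "adj", "sworn")}
print(similarity(brief_triples, affidavit_triples))   # 0.33...: one shared context, three unshared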
In a ‘future–work’ section, Lin describes how his algorithm can be used to form a kind
of lexicalised dendrogram in which every word has its own dendrogram, of which it is the
head. However Lin’s approach is still fundamentally for finding a word’s nearest neigh-
bours and it is not obvious how to use it for backoff. If the set of words above (affidavit,
petition, memorandum, motion, . . . ) occurred frequently in the results then a token could be
2Note something interesting: we are using a parser to get better word representations so as to get a better
parser. There is no paradox here; Lin’s parser only has to do a rough job of identifying word senses, rather than
delivering high precision and recall in every respect.
used to represent and count it. However affidavit will have a similar but subtly different
representation, which makes counting concepts impossible. Lin’s word similarity metric is
impressive: the resolution of homographs and polysemy means it avoids many of the mis-
takes made by other approaches. Ultimately, however, the need to redesign genprob from scratch just
to use this information resulted in this approach being abandoned.
5.3.7 Elman/Miikkulainen/Liddle
Elman
The first application of neural networks to thesaurus generation used Elman networks, a
modification of back-propagation to use a context layer (Elman, 1990). A neural-network is
an algorithm for learning arbitrary function mappings, such as between a sequence of words
and the next word in the sequence. Elman trained one of these networks on the task of
predicting the next input word: each word in the sentence is presented to the network
sequentially (words have random representations). After training he was able to show that
the network’s internal representation (its hidden layer) contained semantic information, as is
shown in Figure 5.5. Elman produced this figure by providing different words on the input
and measuring the hamming distance between the different activations produced. These
distances were then clustered.
[Figure 5.5: Analysis of the weights in Elman’s network, showing the linguistic knowledge which had been learned; verbs are split into transitive and intransitive groups, and nouns into animates (humans, animals) and inanimates (breakables, food)]
The demonstration was a very significant result because it
shows that a supervised training algorithm is able to learn the unsupervised task of word
representation. One obvious problem with this approach is that either every new lexical item
requires an extra output node, or an impossibly large hidden layer is needed. Either way,
it limits the maximum lexicon size to the maximum number of learnable outputs (perhaps a
hundred).
While Elman’s network has several good properties, it is not appropriate for use here as
a basis of my word representations. There are two main reasons for this: the knowledge is
not encoded in the representation, and the representation requires a node per lexeme and so
cannot scale.
Miikkulainen
Miikkulainen solved a very similar task to Elman’s, using a neural network to extract linguis-
tic information by presenting a sequence of words. The difference is that instead of analysing
the hidden layer, Miikkulainen had an extra input layer (the word representation). The idea
is that if the network can learn a better representation, then it can use this to learn an even
better one (Miikkulainen, 1993). Initially the system has no idea what the optimal represen-
tation is, so it gives every word a random representation. Next it is trained in the same way
as any other feed-forward network, feeding forward activation, comparing the output to
(its current representation of) the target, and backpropagating errors. Since one of the lay-
ers is the word mapping, we have implicitly updated our representation. Miikkulainen has
found an extremely elegant way of learning the representation using the same mechanism
as is usually used to predict the output. Miikkulainen was able to get excellent results using
his method, but closer investigation showed his approach had some serious problems. The
largest of these is that Miikkulainen’s approach does not use any form of recurrent network.
This means the system is only able to process sentences using a set of rigid template struc-
tures, such as Det NN VT Det NN (which would fit the sentence the dog ate a steak). This is
a regression from previous approaches such as Elman’s, which predict the next word in the
sentence and so can cope with any sentence structure.
This makes Miikkulainen’s network unusable in this project because we have hundreds
of different sentence templates. It should be noted that Miikkulainen has moved on since
then and done some very interesting work in using a neural-network for parsing rather
than sentence representation. This will be discussed in the future work section of this thesis
(Section 8.2.5).
Liddle
A solution was presented by Liddle who combined Miikkulainen’s extra input layer with
Elman’s network architecture (Liddle, 2002). The combined architecture is given in Figure
5.6.
Liddle’s experiment was moderately successful; some output from his program has been
clustered as a dendrogram and is shown in Figure 5.7.
[Figure 5.6: Liddle’s network architecture; an Elman-style network with an extra modified-input layer alongside the input, context, hidden and output layers]
He was able to expand on both Miikkulainen’s and Elman’s lexicons. Furthermore, he was able to show that the basic prop-
erties like the noun–verb distinction are quickly learned, but given sufficient time his algo-
rithm can learn quite subtle distinctions, such as the difference between meat and sandwich,
and the animate – inanimate ambiguity of chicken. While Liddle’s approach produces excel-
lent results and scales better than Miikkulainen, I was unable to make it scale well enough to
be used with the WSJ. A network with only ten nodes cannot hope to represent every word
in the WSJ (since 2^10 is much less than 50,000), but networks with more than ten nodes
showed no signs of even beginning to train. I made investigations into seeding Liddle’s
network with the representation I produced, but these have not been successful to date.
5.3.8 Bengio
While Miikkulainen’s approach is a simple proof of concept, Bengio developed a neural
model that was intended to scale (Bengio, Ducharme, Vincent, and Jauvin, 2003; Bengio and
Bengio, 2000). Rather than treating Bengio’s approach as a thesaurus generation technique,
it is easier to treat it as a part-of-speech tagger in which a thesaurus is generated as a side
effect. With this mindset, consider a classic HMM-based tagger, perhaps using trigrams,
such as the one implemented in Section 4.4 of the previous chapter. When training this
tagger on the sentence the cat is walking in the bedroom we would store triples like: [[the,cat,is],
[cat,is,walking], [is,walking,in], [walking,in,the], [in,the,bedroom]].
As discussed in the previous chapter, even with a hundred million word corpus we
would not have enough events for an accurate probability model — it is a basic corollary
of Zipf’s law. Wouldn’t it be wonderful if instead of just these training examples we store
{the,a} {cat,dog} {is,was} {walking,running} in {the,a} {bedroom,room,bathroom}? A sin-
gle training sentence has suddenly become a hundred. Bengio refers to this as turning the
curse of dimensionality against itself.
So how do we achieve this? Bengio believes the best approach is to replace the Markov
model with a neural-network. This allows the word representation to be distributed and
so allows learning of similar events to occur automatically through their similar distributed
representation. The approach can be summarised as follows (a minimal sketch is given after the list):
1. Map each word into a feature vector.
2. Express the joint probability function (that a HMM simulates) in terms of these feature
vectors.
3. Learn simultaneously the feature vectors and the probability function.
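A minimal numpy sketch of the kind of model these three steps describe is given below. It is an illustration only, not Bengio et al.’s implementation; all of the sizes and names are invented for the example, and in real training the three weight matrices would be learned rather than left random.

import numpy as np

V, d, n, h = 50_000, 60, 3, 100              # vocabulary, feature size, context length, hidden units
rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(V, d))      # step 1: a feature vector for every word
H = rng.normal(scale=0.01, size=(n * d, h))  # hidden-layer weights
U = rng.normal(scale=0.01, size=(h, V))      # output-layer weights

def next_word_distribution(context_ids):
    # Step 2: the joint probability function, expressed through the feature vectors.
    x = C[context_ids].reshape(-1)           # concatenate the context's feature vectors
    a = np.tanh(x @ H)
    scores = a @ U
    scores -= scores.max()                   # numerical stability for the softmax
    p = np.exp(scores)
    return p / p.sum()

# Step 3 (training) would backpropagate the prediction error into U, H and C together,
# so the feature vectors and the probability function are learned simultaneously.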
The key difference between this work and that of Elman, Miikkulainen, or Liddle is that
Bengio concentrates on an approach that scales, and instead of predicting the next word he
learns a statistical model.
[Figure 5.7: Clusters of Liddle’s output; a dendrogram over a sample of words, with word distance (from 0.0 to 2.0) on the horizontal axis]
Unfortunately, despite developing an approach that is intended
to scale and using a large computer cluster, Bengio’s approach still does not quite scale high
enough to be useful here. Specifically, the largest corpus that has been successfully used is
around a million words (about the size of the WSJ). This is fine for deriving representations
for the common words but not useful for deriving the representation for rare words (which
is our concern here). Bengio mentions that they are attempting to expand to fifteen million
words but they have not yet managed to scale the algorithm sufficiently to achieve this
(Bengio, 2003). Even if they are to achieve this, my experiments implied that a corpus of
five hundred million words was not adequate, and one and a half billion words is about the
minimum. So in a few years I expect Bengio’s approach to outperform the one presented
here, but in the meantime we will need to look for another solution.
5.3.9 Honkela (Self Organising Maps)
Self organising maps (SOMs) seem like the obvious solution to the problem since they use
unsupervised learning. This is the logical approach because there is no obvious training
data and no fixed requirement on the output format. Additionally, their output is in vector
format, making them ideal for all the backoff methods being considered. However, the
correct design of the SOM is not obvious. The intended output is a vector which means only
one word is represented but all of the training methods are based on showing two words
co-occurring. To explain, a simple method of training the SOM would be that when two words
co-occur, they should be presented simultaneously to the SOM. The algorithm should then
generalise between the bigrams it is trained on, to produce an internal representation of
bigrams that can predict the probability of any two words co-occurring. However, we want
an encoding for words, not for bigrams. Another problem with using a SOM is that the
number of necessary hidden nodes is unknown. Since I couldn’t work out the representation
or how to train the network, this approach was also abandoned.
Of course, I am not the only person to investigate the use of a SOM in language pro-
cessing. For instance Finch experimented with a SOM in Chapter 8 of his thesis (1993),
using K-means to build the categories. Mayberry and Miikkulainen have also done some
work with respect to dependencies (Mayberry III and Miikkulainen, 1999). Probably the
most complete work in this area is by Honkela, whose thesis contains a number of different
applications for SOMs, including word representation (Honkela, 1997b).
The approach taken by Honkela is to provide the SOM with a wide window, rather than
simply the next word, although his earlier work used a much smaller window (Honkela,
Pulkki, and Kohonen, 1995). The wide window approach is remarkably similar to how
word bigrams are computed. Some of the parameters used are discussed in Honkela (1997a).
Results from this approach look very promising; a word map generated using this technique
has been copied from Honkela et al. (1995) and is presented in Figure 5.8, although it is worth
noting that this figure only shows the most common words and it is unknown how well the
algorithm performs on the less common words.
Overall, I decided not to pursue this approach simply because my goal was to use exist-
ing word representation technology more than to research new word representation meth-
ods. The vector representation I was trying to derive differed somewhat from Honkela’s in that
I wanted a number of dimensions instead of two. It is obvious how to modify Honkela’s to
produce more dimensions but not obvious if the algorithm would continue to work so well.
Compounding this with the concern that my corpus is over a thousand times larger than the
one Honkela used, and my lexicon five times larger, the approach seemed too risky. Hav-
ing said that, Honkela’s results look better than mine in many ways and it would be a very
useful approach to try.
5.3.10 Joachims (Support Vector Machines)
Support Vector Machines (SVMs) (Vapnik, 1997) are a classification tool that has proven use-
ful when the number of independent parameters is too high for a neural-network. Their
input is multidimensional data that either has or doesn’t have some property. With this data
they build a classifier for deciding if new data does or does not have the property. They
work by finding the best hyperplane through the input data, so that all the data that has
a property is on one side and all the data that doesn’t is on the other side. Since few real
problems are neatly linearly separable like this, they first transform the data using a kernel
function into another (typically higher) dimensional space where the data is hopefully lin-
early separable. For any given input data, a number of different kernel functions may need
to be tried. SVMs also tolerate some degree of training error, allowing a limited number of
training examples to be incorrectly classified.
SVMs are much better known in the field of information retrieval than thesaurus gener-
ation, but there have been some exploratory attempts at applying them to thesaurus gener-
ation, such as the work of Joachims (2001). In this work, Joachims discusses how an SVM
works and then examines a number of properties common in text processing. Given this, he
discusses the sort of problems in text processing for which an SVM is appropriate, and the
sort of problems for which they would be inappropriate.
Consider a set of training examples Sn = ((x1, y1), . . . , (xn, yn)) where x is an n dimen-
sional vector and y is either true or false. The task that the SVM solves is to select the
hyperplane with maximum Euclidean distance from the closest training example, subject to
the condition that at most ρ training examples are incorrectly classified.
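For reference, the standard soft-margin formulation of this optimisation can be written as follows. It is the textbook version rather than anything specific to this thesis: it bounds the training error through per-example slack variables rather than through an explicit count ρ of misclassified examples, and assumes the labels are coded as yi ∈ {+1, −1}.

\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to} \quad & y_i\,(w\cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
\end{aligned}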
Text classification appears to be an excellent application area for an SVM. It has very
high dimensionality since each word is generally considered a different dimension, and it
is highly redundant.
[Figure 5.8: A sample of Honkela’s word map; common words, annotated with part-of-speech tags such as prep, adv, cnj, pron and detposs, laid out across the two-dimensional map]
However, what property is the SVM supposed to predict? There is no
obvious binary classification going on. Joachims avoids this problem by marking by hand
some documents as being related or not related to corporate acquisitions. Given this concept
he is able to show that a SVM is easily able to list the words most useful in deciding if a doc-
ument is related to corporate acquisitions (‘assignment’ implies it is, while ‘college’ implies
it isn’t, and ‘lunchtime’ provides little useful information either way). Essentially he has a
system for producing excellent mappings between words and concepts, but no method of
automatically generating the concepts.
SVMs are an excellent technique with a lot of power. They can quickly learn quite com-
plex tasks. However they are not appropriate for solving the current task because it is a
‘clustering’ task rather than a ‘classification’ task. It is possible that future work will extend
their ability to cluster concepts.
5.3.11 Schutze
Another automatic thesaurus generation system was developed by Hinrich Schutze (1993).
His algorithm is quite similar to that of Finch, which was discussed in Section 5.3.3. Schutze’s
algorithm is of more interest here because it was designed to scale to a large lexicon.
One problem with modifying approaches to work on a large lexicon is that the less fre-
quent words result in very sparse matrices. Schutze used a clever trick to increase counts:
rather than counting bigrams over words, he tokenised the text into overlapping four-letter
sequences (fourgrams). For example, Baghdad forms the following fourgrams: Bagh, aghd, ghda, hdad. We can
then represent a word as a set of fourgrams — specifically, the smallest set needed to differ-
entiate this word from all other words. There are around half a million possible fourgrams,
which is far too big for clustering algorithms. However Schutze found only one hundred
thousand different fourgrams occurred in a large corpus and most of these were rare (occurring
fewer than a thousand times) or redundant, where an entry is redundant if the word containing it con-
tains another unique fourgram. After removing rare and redundant fourgrams, we are left
with only five thousand fourgrams, well within the limits of computability.
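Generating the fourgrams themselves is straightforward; the fragment below is a small illustrative sketch (not Schutze’s code) of the decomposition described above.

def fourgrams(word):
    # Break a word into its overlapping sequences of four letters.
    return [word[i:i + 4] for i in range(len(word) - 3)]

print(fourgrams("Baghdad"))   # ['Bagh', 'aghd', 'ghda', 'hdad']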
Next Schutze had to derive a vector representation for fourgrams. As with Finch and
others he started with a collocation matrix (which is almost the same as Finch’s bigram
matrix, but is renamed here to avoid confusion with Schutze’s fourgrams). One potential
difference between Schutze’s and Finch’s matrices is in the columns. In Finch’s approach
the columns correspond directly to the rows in the matrix while in Schutze’s approach the
columns, or feature words, may correspond to anything. In practice, they also usually cor-
respond to the rows. A window of two hundred fourgrams was used, significantly larger
than Finch’s. Next Schutze ran the principal component analysis (PCA) algorithm on this
matrix. PCA will be discussed in Section 6.3.3, but for now PCA can be approximated as
sorting the matrix based on the importance of each column. The less important columns are
then discarded, and the matrix can now be read one row at a time, with each row providing
the vector representation for that word. By varying the number of columns kept, the length
of the resulting vector can be adjusted so that for instance a two-dimensional map can be
created by keeping just two dimensions. For instance, Figure 5.9 (taken from Schutze (1992))
shows a map of words related to the target word supercomputing.
[Figure 5.9: Two-dimensional version of Schutze’s output; a map of computing-related words (supercomputing, supercomputer, Cray, IBM, workstations, microprocessors, and so on) placed by their first two dimensions]
Schutze describes two useful extensions of his original word-clustering work. In Schutze
(1995), the availability of cheaper memory and disk space enabled Schutze to work with
larger corpora and so to cluster words directly instead of fourgrams. In Schutze’s later work
(Schutze, 1998), he described a method of automatically deriving word sense information
for ambiguous words.
The most important point about Schutze’s results is that the approach has been demon-
strated to work on a large lexicon. Furthermore, the approach generates word vectors rather
than dendrograms. Because this approach was the only one which meets both of my criteria
from Section 5.2, I decided to base my approach on Schutze’s.
5.4 Summary
We began this chapter with an explanation of why it is important to look at word represen-
tation, and noted that it would be beneficial to represent words in a way which allows the
grouping of events involving ‘similar words’. We were not specific about how this group-
ing should be done, but we noted that a vector representation would enable more methods
of combining words than a simple dendrogram. We also noted that the benefits are in ob-
taining a better representation for rare words since the parser already has sufficient usage
information for common words.
Having decided to look at word representation, we surveyed existing approaches and
found a diverse range of approaches depending on the intended use of the resulting the-
saurus. Some approaches close to the field of information retrieval only represent words as
a special case; others are designed to give excellent representations for very common words,
but very few are designed to give a vector representation to every word in a large lexicon.
As well as being large, the field is dynamic; some approaches were developed years ago and
contain quirks to overcome limitations of slow hardware, while other approaches are clearly
very much in development.
For our purposes, the approach taken by Hinrich Schutze (1995) is the most appropriate.
It provides a vector representation and so we can defer the choice of how to use the word
representation until later. Additionally, it is well established that the approach works; our
aim is simply to apply this existing technique to the backoff task at hand.
Chapter 6
A derivation of word vectors
At the end of the last chapter we decided Schutze’s approach to word representation was the
most appropriate for extension to a larger lexicon, because it delivers a vector representation
of words, and it is well-established research that can be relied on to work in a variety of
situations. This chapter describes the implementation of Schutze’s method, the extensions
that were necessary to build a representation for every word in the WSJ, and it includes an
extensive section where different parameters are varied in order to obtain results that are
suitable for use in a statistical parser.
The general process Schutze followed was to take a corpus of text, find words within
a certain neighbourhood and save this information as a bigram matrix, take this bigram
matrix and reduce its dimensionality using PCA, and finally cluster the reduced matrix as
a dendrogram. In Sections 6.1 and 6.2, I will describe choosing a corpus and preprocessing
it to a suitable format. In Section 6.2.2, I survey some off-the-shelf tools which could be
useful in clustering words. The algorithm I implement for computing word representations
is identical to that described earlier in Section 5.3.11. However, the lexicon I need to compute
word representations for is considerably larger than that computed by Schutze, and this
requires some additional tricks for dealing with large matrices; these will be described in
Section 6.2.3. The process of deciding the best parameters is explained in Section 6.4 and the
final results are presented in Section 6.5.
6.1 Obtaining a training corpus: Tipster and Gutenberg
The WSJ corpus that has been used throughout this thesis cannot be used as a basis for
deriving word representations. The reason is simple: our goal is to replace rare words and
the only way to get an accurate representation for rare words is to have quite a few examples
of their usage in the dendrogram table. The only possible solution is to use a corpus which
is larger than the WSJ. (Recall that we do not need to hand-parse this corpus, since we
will be deriving word representations automatically from it.) So, which corpus should we
choose? The main criterion should be that it is large, and also that it features text in a register
similar to that of the WSJ. This is important — as always, we want the training corpus to
be as representative as possible of the test corpus. However, as mentioned in Section 2.8,
we are also interested in developing word representations which broaden the coverage of
the parser to domains other than the WSJ. With this in mind, the best corpus for developing
our word representations would be one that includes text in the WSJ style, but also contains
texts from other genres.
The Tipster corpus (Harman, 1992) seems ideal for our purposes, since it includes the
WSJ as one of its component corpora while being much bigger. For an indication of size, the
WSJ is approximately six megabytes while Tipster is approximately one gigabyte.
Even after replacing the WSJ with Tipster, it was found the counts were still insufficient
for about half the words in the WSJ, so an even larger corpus was sought. I considered
writing a web-crawler to generate a large corpus but this involved even more stripping
of markup language along with many problems of non-sentence-like text. So instead it
was decided to use Project Gutenberg (Hart, 2005) — a huge collection of public-domain
books. Every English book in the project produced from 1993 to 2004 was concatenated
into a single huge file. There are obvious problems with this corpus; some texts are in Old
English, others are incorrectly identified as English when they are Latin, and at least one
dictionary is included in the corpus, but none of these errors is significant since all we are
looking for is neighbourhood information. In total, Gutenberg is around triple the size of
Tipster, leading to a combined total of about two billion words, or over one thousand times
the size of the WSJ. In the remainder of the thesis, I will refer to the combined Tipster and
Gutenberg corpus as the T/G corpus.
The T/G corpus is designed to be similar to the WSJ, except much larger. For this reason,
all words not present in the WSJ are replaced by UNKNOWN WORD. This process was
also performed because the workstation does not have enough memory to compute bigram
counts for a lexicon much larger than forty thousand words. In order to visually ensure T/G
conforms to the same basic word frequency distribution as the WSJ, Figure 6.1 presents a plot
of the frequency of every word in the WSJ against that in T/G. This plot is approximately
linear at higher word frequencies, so we can conclude that T/G is approximately a larger
WSJ. At lower word frequencies the graph looks less linear, but at these frequencies the high
standard error of measurement in the WSJ frequencies makes the result little more than
noise. 1
1It may appear from visual inspection that this graph is not going to intersect the origin, as of course it should.
This appearance is a side effect of the log-log scale which over-emphasises the difference between one and zero.
[Figure 6.1: Graph of the frequency of every word in the WSJ against that word’s frequency in T/G, plotted on log-log axes]
6.2 Preparing the corpus for clustering
6.2.1 Processing the corpus
Stripping markup
Tipster is not a raw text corpus. It is segmented using a markup language and this markup
must be stripped before anything useful can be done with co-occurrence, or else we end up
with computer co-occurring with .tt because it is frequently marked up as teletype. Unfortu-
nately every section of Tipster is marked up differently so half a dozen different scripts had
to be written to strip markup from each section. Gutenberg was already in text format, so
only needed the preamble deleted.
Tokenisation
Next we find a problem that has not occurred in the WSJ — what is a word? In the WSJ
all words are space delimited, even full stops and the like, making tokenisation implicit.
However T/G includes many hyphenated words, full stops following words, and so on,
and writing code to correctly tokenise these turns out to be surprisingly difficult. Another
problem is the high frequency of numbers in T/G – for example, with 3 as a different word
to 4, the lexicon is hard to manage. The initial approach taken was to replace all numbers
by a generic number symbol. It would also be extremely advantageous to perform a similar
operation for proper nouns but publicly available software to perform this was not found
during a brief search. In the end the tokeniser is only two hundred and fifty lines of Perl, but
the code is quite fragile, with reordering lines causing a large number of errors. It seems there
is no good way of writing a tokeniser. Of course, since the only use of the tokeniser here is
to improve bigram co-occurrence, it does not need to be anywhere near as good as a tokeniser
used in a tagger or similar.
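To give a flavour of the rules involved, the fragment below is a greatly simplified sketch in Python (the thesis tokeniser itself was a few hundred lines of Perl; the abbreviation list and patterns here are invented for the illustration and are nowhere near complete).

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}          # kept as single tokens

def tokenise(line):
    tokens = []
    for chunk in line.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Collapse numbers to a single generic symbol, as described above.
        chunk = re.sub(r"^\d[\d,.]*$", "NUMBER", chunk)
        # Split trailing punctuation such as full stops and commas into their own tokens.
        match = re.match(r"^(.*?)([.,;:!?]*)$", chunk)
        if match.group(1):
            tokens.append(match.group(1))
        tokens.extend(match.group(2))
    return tokens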
After finally getting the tokeniser to produce acceptable results, such as treating Mr. as
one word, it was found that the tokeniser’s idea of what constitutes a word differed signifi-
cantly from the WSJ’s. This is completely unacceptable since the whole point of the exercise
is not to get good representations for the words in T/G, but for the words in the WSJ. It
seems that sometimes the WSJ allows hyphenated words as lexemes and sometimes it does
not, with perhaps semantic or frequency information being the deciding factor, properties
that cannot be evaluated with regular expressions (the simple rules which the tokeniser is
built from). This problem was first tackled by deliberately decreasing the performance of
T/G’s tokeniser in order to considerably increase the similarity to the WSJ, and then the
WSJ was retokenised to match the tokens for T/G. This was performed automatically with
new parts-of-speech being generated when new tokens are created. Obviously rewriting the
WSJ in this way also significantly affects the probability model and so a reverse tokenisation
step has to be performed after parsing, before comparing to the gold standard.
This approach appears to work when we eyeball the results, but quantitative analysis
shows it does not work. Specifically, the retokenised WSJ obtains significantly lower preci-
sion and recall than the nontokenised version (84.5% rather than 86%). Since our evaluation
method is to measure the difference in the performance of the parser, this drop in perfor-
mance makes evaluation of word vectors impossible — it could be that any improvements are
simply counteracting losses due to tokenisation — and so the retokenisation of the WSJ was
abandoned. The best solution left was to use the simpler version of the tokeniser which does
not include number tokenisation and makes a number of mistakes, but will reproduce the
WSJ fairly accurately. I was able to evaluate the recall of this simplified tokeniser by reto-
kenising the WSJ and comparing it to the untokenised version, since any differences would
be an error. It was found that about 98% of the tokens did not change, a promisingly high
result. However, since any errors will be consistent we also ensure there are at least some
occurrences in T/G for every word in the WSJ by adding the WSJ directly to T/G, bypassing
the tokeniser.
6.2.2 Off-the-shelf tools for clustering: a brief survey
Schutze mentions that he did not write the PCA code used by his algorithm, but instead
used the Buckshot algorithm developed separately. So I attempted to find an off-the-shelf
implementation of bigram counting and PCA that would be suitable for such a large prob-
lem.
There are a number of statistical packages available that perform at least some of the
necessary tasks. Among others, the following packages were considered: ‘bow’ is a power-
ful natural language library that is integrated with the programming language rather than a
separate tool (McCallum, 1996), which makes shifting it to a particular task painless. ‘BSP’
is a bigram counting package with support for a number of different tests and extensions
in Perl (Banerjee and Pedersen, 2003). ‘R’ is a very powerful statistical package modelled
after the S language (R Development Core Team, 2004); it is very popular with statisticians.
Schutze suggested the use of Weka (Garner, 1995) or Autoclass (Cheeseman, Kelly, Self,
Stutz, Taylor, and Freeman, 1990), both general purpose clustering toolkits.
Bow
Bow, or its various components, most notably Arrow, form a large C program. It is primarily
intended for information retrieval and contains tools for inverse document frequency (IDF)
rather than simple bigram counting. However Bow is implemented as a powerful library
with associated programs, so the library can be reused without having to modify any code.
Unfortunately I found the library functions were not sufficiently powerful. It contains no
useful statistical tools and its ability to convert a corpus of words to a corpus of integers,
while useful, can be implemented easily without Bow.
BSP
BSP is a bigram tool developed by Banerjee and Pedersen (2003). It is very easy to use but it
is written in Perl and was unable to cope with the large corpus being used.
Weka and Autoclass
Weka (Garner, 1995) and AutoClass (Cheeseman et al., 1990) are two clustering suites, not
specific to language processing. They include a number of high level algorithms. They
have been used successfully by many different projects. Both suites were tested with my
training data but both performed too slowly and were unable to complete training on even
the simpler cases. It may be possible to refactor the training data so that Weka or Autoclass
could classify it, but it seems the interfaces they present are too high level for making this
easy.
R
The statistical toolkit R is very popular throughout statistics and is taught to undergradu-
ates as the standard way of performing statistical analysis on a computer. It has a very wide
array of functions built in, including literally dozens of clustering algorithms. Addition-
ally its use on research projects means it has been designed to go moderately fast and scale
relatively well. Finally, being open–source means portions can be replaced if they are per-
forming too slowly. R had already demonstrated that it was relatively fast on dendrogram
generation; by using R I was able to hierarchically cluster Liddle’s word vectors several
orders of magnitude faster than Liddle’s java implementation could.
Another benefit of R is that it provides direct access to basic algorithms, rather than the
higher-level programs in Weka, which means they could be reimplemented in C if they could
not scale sufficiently. I thus decided to use R for PCA.
6.2.3 Dealing with large matrices
The WSJ contains approximately fifty thousand distinct words. Even after the retokenisa-
tion just mentioned, there are still thirty-two thousand distinct words. As has already been
mentioned, PCA strongly prefers square matrices. However a thirty-two thousand by thirty-
two thousand matrix is unrealistic on the computer hardware available — it would require
four gigabytes of RAM per matrix, for a total of at least twelve gigabytes. This problem is
compounded by various remaining inefficiencies in R.
A simple solution would be to split the bigram matrix into manageable chunks. If the
thirty-two thousand words are split into four thousand word chunks then we could count
their co-occurrence with the four thousand most common words to obtain manageable
square matrices. Unfortunately this approach does not work: every run of PCA transforms
its input into a reduced space ideal for that data — so there would be no correlation between
the vectors produced for different chunks.
This problem can be overcome by noting that the output of PCA is not the transformed
vectors, but a rotation matrix which, when multiplied by the input matrix, gives the trans-
formed vectors. This is significant because this rotation matrix can be multiplied not only
by the input matrix, but by any other input matrix, transforming it into the optimal space
for the first data set. By multiplying the bigram vectors for the remaining twenty-eight thousand words by this
transformation matrix we keep all data in the same space. Of course, this is not as good
as using all the words in the first place — not least because the transformation matrix im-
plicitly ends up optimised for placing frequent words well — but the method given works
and given a more powerful computer it would be worthwhile running this section of the
program again.
Another problem was noted when eyeballing the output data. It seemed that very large
counts were not being scaled correctly: similar but less frequent words were being clustered
apart. So the scaling normally built into PCA was removed and all scaling is done by a
separate Perl program. This also had the advantage that many parameters (logarithm, shift-
ing probability mass, RMS, centring on zero, and column normalisation) could be adjusted
easily.
6.3 An implementation of Schutze’s algorithm for word clustering
We have already presented an informal overview of how Schutze clusters words (or four-
grams), in Section 5.3.11. In this section, we will describe the algorithm in more detail. There
are three steps: firstly processing the corpus to build a table of bigram counts; secondly scal-
ing the counts in this table to normalise them; and thirdly running the PCA algorithm on
the normalised table.
6.3.1 Building a table of bigram counts
The first step is to take a corpus of text and transform it into a long sequence of words. Next
each word is mapped into a number, which will serve as the reference for this word into the
arrays. In the previous chapter the mapping between words and numbers was somewhat
arbitrary, and I used the order that the words happened to be seen by the preprocessor2.
This method proved to be a poor choice here because there were far too many words to
enumerate them all. Instead the words were sorted by frequency and then enumerated
using their relative frequency. This makes it much easier to vary the cutoff points.
Having obtained a sequence of numbers, we compute bigram counts by looking, for
each word, a certain number of words to the left or the right and incrementing the count of
this co-occurrence. Pseudocode is given in Figure 6.2. This pseudocode skips a number of
for WordPos = 0 to CorpusLength
    for WindowPos = max(0, WordPos - WindowSize) to WordPos
        count[corpus[WordPos], corpus[WindowPos]]++;
    endfor
endfor
Figure 6.2: Pseudocode to count all co-occurrences in the corpus
implementation details, such as it being impossible to store the corpus in memory due to its
size (or even to store it using mmap!) Because of this, the corpus is loaded from the file into a
2Incidentally, Collins used the same method, although his different preprocessor design meant words were
seen at different times and so his mapping differs significantly.
circular array. None of these details are particularly surprising, and would just complicate
the figure if included here.
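For concreteness, a minimal sketch of the same loop is given below. It is an illustration only: it assumes the corpus has already been converted to a file of word numbers, one sentence per line, and it ignores the cutoffs and other details of the real implementation mentioned above.

from collections import defaultdict, deque

def count_cooccurrences(path, window_size=2):
    # A minimal version of Figure 6.2: count each word against the words in the
    # window just to its left, streaming the corpus so it never sits in memory.
    counts = defaultdict(int)
    window = deque(maxlen=window_size)          # plays the role of the circular array
    with open(path) as corpus:
        for line in corpus:
            for word_id in map(int, line.split()):
                for previous_id in window:
                    counts[(word_id, previous_id)] += 1
                window.append(word_id)
    return counts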
6.3.2 Normalising the bigram table
The PCA algorithm described next assumes that every row in the matrix is a centred unit
vector (that is, it has a mean of zero and is normalised to a length of one). There are a number of methods by which we
could modify the bigram table to achieve this. The simplest would be to treat every
row separately, sum it to find the mean, subtract this from every cell to give a mean of zero,
and then divide every cell by the row’s root mean square to give a radius of one. However, this method was
found not to work especially well because the co-occurrence counts for very frequent words
dominated the results. Because of this, several alternative methods were examined. These
will be discussed in Section 6.4.3 along with their effects on the results. However, they are
briefly summarised here (a minimal sketch combining them follows the list):
1. Add one to every single count, as an estimation of held-off probability mass.
2. Compute the logarithm of every count. This is meant to counteract Zipf’s law so that
instead of counts increasing exponentially, they will increase linearly.
3. Centre the data on zero by subtracting the mean of each row from every cell. This is
required for PCA, but may not be required for other clustering algorithms.
4. Whether row normalisation should use RMS or simply divide by the mean. It is probably
desirable to always use RMS, but when this code was first implemented it contained an
error, and this parameter allows reproduction of the error for the recreation of previ-
ously published results.
5. Control whether or not to normalise columns. Column normalisation divides every
column by the RMS of that column. It was implemented after it was noted that the
features which occur extremely frequently or extremely infrequently were controlling
the final output too much (because PCA attempts to reproduce every result, and these
results differ by more). After normalising columns, every feature can be expected
to have equal weighting.
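A minimal numpy sketch that combines these options is given below. The thesis performed this scaling in a separate Perl program; the function here simply mirrors the list as an illustration, and the order in which the steps are applied is an assumption rather than a description of that program.

import numpy as np

def normalise(counts, add_one=True, take_log=True, column_norm=True,
              centre=True, row_rms=True):
    # The flags mirror the five options above.
    m = counts.astype(float)
    eps = 1e-12                                           # guards against all-zero rows or columns
    if add_one:                                           # 1. held-off probability mass
        m += 1.0
    if take_log:                                          # 2. counteract Zipf's law
        m = np.log(m)
    if column_norm:                                       # 5. give every feature equal weight
        m /= np.sqrt((m ** 2).mean(axis=0)) + eps
    if centre:                                            # 3. centre each row on zero
        m -= m.mean(axis=1, keepdims=True)
    if row_rms:                                           # 4. divide each row by its RMS
        m /= np.sqrt((m ** 2).mean(axis=1, keepdims=True)) + eps
    return m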
6.3.3 The PCA algorithm
The idea of the PCA algorithm is to transform an n by n matrix into another space, so that
it is still n by n but the first dimension of the transformed space is the best discriminator between the rows, the
second dimension is the second best, and so on. For instance, the first dimension may have positive values for
noun-type words and negative values for verb-type words.
More formally, the algorithm first finds the vector through the n dimensional data with
the most variance, which is called rot1. Because this vector has the most variance with
respect to the input matrix, it encompasses a lot of the information that was present in the
whole matrix, but of course it cannot encompass it all. Perhaps the best way of visualising
this is to view the matrix as a large number of points in n dimensional space, and rot1 as the
hyperplane that best splits those points into two. If n was just three, then the points could
be viewed as points in a cube, and rot1 is then a simple plane through this cube. Next the
algorithm finds the vector perpendicular to rot1 that encompasses the most variance, calling
it rot2. This process is repeated until rotn. All rotational vectors are unit vectors, it is only
their direction that is of interest.
To determine which angle encompasses the most variance we compute a covariance
matrix. Next we compute the eigen decomposition of this matrix to produce a matrix of
eigenvectors and a diagonal matrix D composed of a list of eigenvalues3. In mathematical
notation, our square matrix A is decomposed into eigenvalues Λ1 . . . Λn and eigenvectors
R such that
AR = RD
It is perhaps best to view D as simply a scaling factor representing the relative impor-
tance of each row in the eigenvector table. That is, the highest eigenvalue corresponds to
the eigenvector with the highest covariance, or the principal component in the matrix. Sim-
ilarly, the second highest eigenvalue to the second most important component, and so on.
There are very many textbook explanations of principal component analysis available. One
that is specifically written for computer scientists is Smith (2002).
Returning to clustering words, we can multiply our bigram counts by R to transform
the counts in such a way that the first row of the output matrix corresponds to the best dis-
criminator for differentiating words. Reading down the column of this output matrix then
gives us an excellent vector representation for the word, where we can cut this vector at any
point up to its full length n and still get a good approximation of how the word differs from
other words. That is, words that have very similar usage will have very similar values for
their first components. I refer to the output of this transformation as the word’s position in
word space; two words that are nearby in word space will have similar (normalised) bigram
counts.
Another important result is that the matrix R can be used independently of the input
bigrams. This means that if we compute bigram counts for any new words, we can multiply
them by R to compute the new word’s position in word space. This position will not be
exact, in that if this word had been present in the original matrix A, its variations would
3Recall that all values in a diagonal matrix are zero except along the diagonal
have resulted in a very slightly different rotation matrix R# being computed, but the position
should be extremely close. The relevance here is that it is impossible, given the current level
of computer power, to compute R for a matrix A the size of the whole input lexicon. This is
because a single matrix A the size of the whole lexicon would have around 50,000^2 entries,
or roughly two and a half billion. At eight bytes per entry this would take roughly twenty
gigabytes of RAM. We would also need to compute R in RAM for another twenty gigabytes
of memory, and it would be extremely hard to complete the process without storing the
output matrix for a further twenty gigabytes. No machine available in my department has
sixty gigabytes of memory and even if such a machine were available, these values are optimal
cases and unlikely to be realised inside an interpreted language like R.
However, we can temporarily forget about A and instead work with a sample of A,
which I will refer to as A′. Performing PCA on A′ gives a rotation matrix R′. Multiplying A
by R′ will give AR′ which, because of the mathematical property outlined above, is a very
good approximation of the uncomputable AR.
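The fragment below sketches this trick in numpy (an illustration, not the R code actually used in the thesis): the rotation is computed from a sample A′ of rows, for example the most frequent words, and then applied to every row of the full matrix so that all words land in the same space.

import numpy as np

def rotation_from_sample(A_sample, keep):
    # PCA via SVD of a (normalised and centred) sample of rows A'; the first
    # `keep` right singular vectors give an approximate rotation R'.
    _, _, vt = np.linalg.svd(A_sample, full_matrices=False)
    return vt[:keep].T

def project_all(A_full, r_prime):
    # AR' approximates the uncomputable AR, so every word lands in the same space.
    return A_full @ r_prime

# e.g. A_sample = normalised bigram rows for the 4,000 most frequent words
# r_prime = rotation_from_sample(A_sample, keep=50)
# vectors = project_all(A_all_words, r_prime)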
Within the R environment there are a number of different ways of computing PCA. These
differ mainly in the iterative process used to compute the eigenvectors, but also in the nor-
malisation process applied to the input matrix. The simplest method available is singular
value decomposition (SVD). This does not perform any normalisation and so makes it eas-
ier for me to perform normalisation before passing the matrix to R.
Readers may have noticed a number of the algorithm’s parameters implicitly included
above. For instance, the window in the pseudocode looks to the left but looking to the right
gives subtly different results. Other parameters that can be varied are the window size, the
cutoff at which words are considered so rare they are best treated as unknown, the size of
the matrices used in PCA, the method for converting the bigram counts to unit vectors, and
several more. These, along with their effects on the results, will be discussed next.
6.4 Tuning the clustering process
We are now in a position to evaluate the word representations generated by the clustering
algorithm, and to decide on the best values for the various parameters which are defined. In
this section I will present results for various different parameter values, and discuss which
values are likely to be best.
6.4.1 Evaluation methodology
Evaluating a set of word representations is difficult on its own. Most self-contained studies
on word clustering use the measure of perplexity: effective clustering solutions reduce the
perplexity of a language model (see for example Goodman (2001)). However, since our word
vectors are intended to improve the performance of a parser, it is more appropriate in our
case to evaluate clustering solutions indirectly, by observing their effects on the precision
and recall of the parser. A more formal evaluation of word representations in these terms
is thus deferred until Section 7.2.6. In the present chapter, we will nonetheless provide an
informal evaluation of the results of different parameter combinations.
Our informal method is to look at the results generated: do words with similar meanings
receive similar representations? Since eyeballing the raw feature vectors generated is essen-
tially impossible, we pass these vectors to a hierarchical clustering algorithm, and generate
a dendrogram containing fifty randomly-chosen words to express the results. Word dendro-
grams are relatively easy to eyeball to get a rough impression of the quality of word vectors,
pending the more formal analysis in Chapter 7. For all evaluations described in this chap-
ter (unless otherwise stated), the input to the clustering algorithm is bigram counts based
on the T/G corpus using a window of two words to the left plus the current word and the
output is a dendrogram created for fifty randomly chosen words. We look two words to the
left because the main use of words is in right dependency events and so we need our word
representation to look from the perspective of what headwords the dependent words will
see to the left. The fifty randomly chosen words are the same for each dendrogram which
has the advantage that it is easier to compare outputs, but the disadvantage that it is easy
to tune the algorithms based on small samples. (More extensive testing was also performed
using larger samples, but these are too large to present in the thesis.)
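In outline, the evaluation step looks like the sketch below. The thesis used R’s hclust for this; the random vectors, lexicon size and word names here are placeholders, since the point is only the shape of the procedure (sample fifty words, cluster their vectors hierarchically, and draw the dendrogram).

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
lexicon = ["word%d" % i for i in range(32_000)]            # hypothetical lexicon
vectors = rng.normal(size=(len(lexicon), 50))              # stand-in for the word vectors

sample = rng.choice(len(lexicon), size=50, replace=False)  # fifty randomly chosen words
tree = linkage(pdist(vectors[sample]), method="complete")  # complete linkage, as in hclust
dendrogram(tree, labels=[lexicon[i] for i in sample], no_plot=True)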
6.4.2 Dimensions of the bigram matrix
The input to the PCA algorithm is a two-dimensional bigram matrix. The rows in this matrix
correspond to the words, while the columns correspond to features of words. Since the
features are themselves words and we are counting co-occurrence statistics, it would be
reasonable to assume the matrix will be diagonal — the number of times a co-occurs with b
should be the number of times b occurs with a.
However, there are a few complicating factors. Firstly, the PCA algorithm only works
on square matrices. For non-square matrices, it is conventional to convert the matrix to
square by wrapping data around. I found this transformation always led to extremely poor
results and very quickly discarded the use of non-square matrices. Secondly, the amount
of memory on the most powerful computer I had available limited the number of matrix
elements to approximately twenty thousand. Had I hand-coded the PCA algorithm instead
of using R, it is likely this limit could be increased significantly, to perhaps one-hundred
thousand.
Because of the limit to twenty thousand cells, a 4000-by-4000 matrix was used. Smaller
matrices are possible, and can be computed significantly faster. However, smaller matrices
lead to inferior results and so presumably larger matrices would lead to superior results,
were hardware available that could process such matrices.
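As an illustrative sketch (not the thesis implementation, which performed PCA in R), the dimensionality reduction can be written as follows; the choice of fifty output components is an assumption made here purely for illustration.

import numpy as np

def pca_reduce(bigrams, n_components=50):
    # `bigrams` is a square word-by-feature count matrix (rows are words)
    centred = bigrams - bigrams.mean(axis=0)
    # the right singular vectors of the centred matrix are the principal axes
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    rotation = vt.T[:, :n_components]          # projection ("rotation") matrix
    return centred @ rotation, rotation        # reduced word vectors, plus the rotation

A rotation matrix of this kind is the object that Section 6.5.2 later reuses to project the rows of the full lexicon.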
6.4.3 Normalising bigram vectors
Before PCA is run on the bigram vectors, it is important to normalise them. This normalisa-
tion can be achieved in several different ways. In this section, I will consider several different
approaches in succession, culminating in the one I eventually use. The intermediate (and ul-
timately discarded) approaches are presented in more detail than in other parts of the thesis
for several reasons. Firstly, we are moving into unfamiliar territory and so the approaches
that did not work are of almost as much interest as the approach that eventually did. Secondly, in
order to perform useful qualitative analysis of the final dendrograms, it is illustrative to see
the ways in which earlier iterations of the approach produced inferior output.
Normalising word counts
The most natural way of normalising the bigram counts involves two stages. First, we gen-
erate centred vectors by subtracting the mean of every row from each cell. Second, we
generate unit vectors by dividing the centred counts by the root mean square (RMS) of every
row. This process is illustrated in Table 6.1, which shows a matrix of bigrams generated
for a window of two words to the left, plus the current word. (In this table, rows are words
and columns are features.)
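The two stages can be written down directly. The sketch below is not the thesis code, but it reproduces the arithmetic of Table 6.1; the scaling factor, labelled √Σ(x²) in the table, is the length of the centred row.

import numpy as np

def normalise_rows(bigrams):
    counts = np.asarray(bigrams, dtype=float)
    # stage 1: centred vectors (subtract the row mean from every cell)
    centred = counts - counts.mean(axis=1, keepdims=True)
    # stage 2: unit vectors (divide each row by its RMS-style scaling factor)
    scale = np.sqrt((centred ** 2).sum(axis=1, keepdims=True))
    return centred / scale

For the computer row of Table 6.1 this gives a mean of 2478, centred values of -2165, 5347, -1092 and -2090, a scaling factor of 6232, and unit values of -0.35, 0.86, -0.18 and -0.34.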
In this section we are more concerned with the technique for deriving the values than
their meaning, but it is useful to recall their meaning because a good technique will make the
correct meaning more pronounced. A positive number shows a positive correlation between
two words, so when we see company, it is likely that we have seen computer somewhere
within the previous three words. A negative number shows a negative correlation, so when
we see bought it is unlikely that we have seen at within the previous three words. If the cell is
zero then the presence of one word does not affect the probability of seeing the other word,
so after seeing yesterday we should not be surprised by the presence or the absence of new
within the previous three words.
A dendrogram generated from the unit vectors in the final section of Table 6.1 is shown
in Figure 6.3.
This dendrogram has a number of positive features; it is starting to develop a sensible
hierarchical shape, connectives and numbers are both detected and are kept away from other
more common categories, and plurals are separated out quite well.
[Figure omitted: hierarchical cluster dendrogram of fifty randomly-chosen words (hclust, complete linkage)]
Figure 6.3: Word dendrogram with RMS scaling
Vectors of raw bigrams    bought   company    large   yesterday     mean
computer                     313      7825     1386         388     2478
new                         1174     19430     3386        7930     7980
traded                        63       849       68         500      370
at                          1905     28881    10402        6508    11924

Centred vectors           bought   company    large   yesterday   √Σ(x²)
computer                   -2165      5347    -1092       -2090     6232
new                        -6806     11450    -4594         -50   14,090
traded                      -307       479     -302         130      657
at                        -10019     16957    -1522       -5416   20,483

Unit vectors              bought   company    large   yesterday
computer                   -0.35      0.86    -0.18       -0.34
new                        -0.48      0.81    -0.33       -0.00
traded                     -0.46      0.72    -0.46        0.19
at                         -0.49      0.83    -0.07        0.26

Table 6.1: Bigram counts in the process of being normalised
Normalising feature counts
Looking again at Table 6.1, it should be noted that using RMS to normalise word counts
results in the relative weight being strongly decided by the feature with the highest counts.
For example, company occurs much more than any other feature in the table so, while the
positive correlation between company and computer is undeniably correct, it is perhaps un-
desirable for this correlation to dominate the row simply because company is a very common
word in the corpus. This property is undesirable because it means that words with low
counts are treated as virtually irrelevant and will all cluster together. What is needed is to
normalise each column by the frequency of its feature before normalising the word
co-occurrence counts row by row.
Two different methods were examined for normalising feature frequencies. The first
method was to use RMS on the features, in the same way as was just demonstrated on words.
The second method was to take the natural logarithm of each count before normalising the
words. Neither method can be especially well justified using theory: using RMS on the
features means we are simply looking at the relative surprise at seeing the word rather
than the mutual information, and computing the logarithm is justified by noting that the
frequency of words is distributed exponentially (Zipf’s law) and so by taking the logarithm
we can move to a linear distribution, significantly reducing any bias caused by high-count
features. We also experimented with combining these techniques, so that columns were
normalised and then the logarithm was taken.
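As a sketch, the two column normalisations considered here look as follows; the use of log(1 + x) rather than log(x) is an assumption made so that zero counts remain zero.

import numpy as np

def log_scale(counts):
    # flatten the Zipfian spread of feature frequencies before row normalisation
    return np.log1p(np.asarray(counts, dtype=float))

def column_rms_scale(counts):
    # the alternative that was tried and later abandoned: RMS per feature column
    counts = np.asarray(counts, dtype=float)
    rms = np.sqrt((counts ** 2).mean(axis=0, keepdims=True))
    return counts / np.where(rms == 0.0, 1.0, rms)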
A dendrogram showing log scaling is shown in Figure 6.4.
[Figure omitted: hierarchical cluster dendrogram of fifty randomly-chosen words (hclust, complete linkage)]
Figure 6.4: Word dendrogram with log applied to all counts before processing
This figure is much more promising. It is relatively easy to break the dendrogram into nouns (restrictions through to
models), verbs (through to plunge), adjectives (through to severe) with proper names, prepo-
sitions and numbers all being nicely separated out.
Because the log scaling removes the emphasis from frequently occurring features, an-
other test was to apply both log and RMS normalisation of the columns. However, it was
found that this gave no significant advantage and so RMS scaling of features was aban-
doned.
6.4.4 Choice of feature words
How should we pick our feature words? Should the four thousand most common words
be used as features, or a randomly chosen four thousand? I think the better answer is the
most common four thousand. The criticism of this answer is that it is non-representative, but
while the first five hundred words, say, are very different from normal English in that they
contain few nouns, the first four thousand include quite a few words of every type. More
importantly, the advantage of using frequent words as features is that they occur more often
with infrequent words, and so the relative counts are accurate. A dendrogram generated
using random feature words is slightly inferior to Figure 6.4, and so we will continue to use
the most common words.
6.4.5 Window size
Another question is how big the window should be before two words are no longer consid-
ered neighbours. With a small window, the algorithm generates good representations of a
word’s syntactic characteristics, while a larger window brings in semantically related words
and increases overall counts which is useful for rare words. The small window has already
been shown (Figure 6.4). A dendrogram with a window of fifty words was generated but,
because semantic relationships are harder to verify than syntactic ones, cannot be usefully
presented here. The dendrogram did show potential, with much better semantic relation-
ships in the clusters, but at the expense of syntactic relationships. For instance, promised,
promises are a category here where they would not be in the previous dendrograms. Other
categories (for example illegal, lawyers) also show good semantic relationships with limited
syntactic relationships. Overall, the significant loss of syntactic information means that it
cannot be used to aid the parser and so we must return to our very small window.
A similar question is which direction the window should face. During experimentation I
was unable to find either direction to be measurably better than the other, and chose to look
backwards because this fits better with the right-branching structure of English.
6.4.6 Iterated clustering
The dendrograms that have been presented so far look quite good. However, if we examine
a greater range of the input, it is apparent that there is no coherent high level structure. In
particular, several small clusters that really should be next to each other are a long way apart.
For example, after the first large category of about two dozen words, some of the numbers
are represented. However, in the middle of the first category we find a mini cluster of larger
numbers. Either of these clusters looks good, but they really should be joined immediately
instead of joining first with words like Warsaw.
It would be very useful if we could assign meaningful semantic or syntactic labels to
nonterminal nodes high up in a dendrogram. The lack of high level structure is not espe-
cially important if a later system using the word representations can be trained to use the
low level data correctly, because there are enough counts in the local structure. However
the lack of high level structure is likely to make training a future system very hard, if not
impossible.
One new approach I attempted, to try and impose high-level structure on the data, was
iterative clustering. The approach used in iterating the training is similar to that used by
Miikkulainen, as discussed in Section 5.3.7. After generating the vectors in the manner
already discussed, the whole process of bigram counting is repeated. However this time
whenever a word is found to be within the window of a feature, not only is the bigram
between this feature and this word incremented, but also the bigrams between this feature
and every word that is ‘similar to’ this word. ‘Similar to’ is defined in terms of the Hamming
distance between the current vector representations of the words.
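A sketch of one pass of this iterated counting is given below. The names are illustrative, not taken from the implementation: `windows` stands for a stream of (context words, current word) pairs from the corpus, and `similar_to` stands for whatever routine returns the words judged close to the current word under the distance measure just described.

from collections import defaultdict

def recount_bigrams(windows, feature_words, similar_to):
    counts = defaultdict(float)
    features = set(feature_words)
    for context, word in windows:
        for feature in context:
            if feature not in features:
                continue
            counts[(word, feature)] += 1.0            # the ordinary bigram count
            for neighbour in similar_to(word):        # plus a count for each similar word
                counts[(neighbour, feature)] += 1.0
    return counts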
To give a better idea of the problems with local structure, a dendrogram produced from
iterating SVD four times is presented in Figure 6.5. Contrasting this with Figure 6.4 shows
some improvements as well as some regressions. The plural category is now complete, but
the -ing category has been split in two. Art, George have incorrectly slipped into the number
category, but illegal probably makes more sense with failing than it did with severe. Overall,
it is hard to make a definitive statement but perhaps the iterated result is slightly inferior.
6.4.7 Integrating POS tag representations
The iterative clustering has improved the global structure somewhat, but it has led to prob-
lems with the local structure. What is really needed is to impose on the dendrogram an
explicit hierarchy such as the part-of-speech of the words. In the next chapter (Section 7.6)
we also need a hierarchy of POS tags and so here we have ‘reused’ the hierarchy created in
the next chapter. To integrate the hierarchy with the existing bigram counts, the new data
is added as additional features associated with each word. A total of fifteen features from
tags was used, which in the next chapter will be seen to be too few to completely capture all
tag information. However, every tag feature we include here means one fewer word feature,
and experimentation showed that using too many tag features resulted in the dendrograms
over-emphasising POS information.
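As a sketch of the integration, assuming a hypothetical lookup `tag_vector(word)` that returns the tag-based representation built in the next chapter, the extra columns can simply be appended to each word's bigram row; fifteen tag features follows the value chosen in the text.

import numpy as np

def add_tag_features(bigram_rows, words, tag_vector, n_tag_features=15):
    # append a fixed number of tag-derived columns to every word's bigram row
    tag_cols = np.array([tag_vector(word)[:n_tag_features] for word in words])
    return np.hstack([bigram_rows, tag_cols])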
An alternative method for integrating the counts would be to tag the word corpus and
then use this tagged corpus for training since the correct tag would then precede the current
word in the bigram counts.
[Figure omitted: hierarchical cluster dendrogram of fifty randomly-chosen words (hclust, complete linkage)]
Figure 6.5: Dendrogram from iterating SVD four times
The latter approach would almost certainly lead to better results
since it would elegantly solve the problem of homographs causing ambiguity, but it was not
undertaken because tagging a multi-gigabyte corpus takes too long with my tagger.
A dendrogram with POS tags is shown in Figure 6.6.
[Figure omitted: hierarchical cluster dendrogram of fifty randomly-chosen words (hclust, complete linkage)]
Figure 6.6: Dendrogram where POS tags are used as extra features
This dendrogram shows considerable promise, with the inclusion of tags strongly encouraging words with the same POS to
cluster. Not only that, but where we previously had two good sub-clusters that did not join,
such as two sets of proper nouns, we now have one larger cluster. Essentially, all of the errors
noted in the previous figures have now been corrected, and the global structure is relatively good. There
Row normalisation RMS
Feature normalisation Log
Features used First four thousand words
Number of iterations One
Integrated POS tags Yes
Window size Twenty words
Window direction Left only
Table 6.2: Parameters chosen for the generation of word vectors
are still two faults in this figure: the global structure is still suboptimal, and polysemous
words are categorised poorly.
6.4.8 Windows revisited
Previously we decided that smaller windows work better, and hypothesised this was be-
cause large windows overemphasise semantic information at the expense of syntactic infor-
mation. However in the last section we produced syntactic information through separate
features and so it is appropriate to revisit the question of window size. If the window size
can be enlarged then this should have two major benefits: it should increase our robust-
ness with low frequency words due to increasing their counts, and it should increase the
amount of semantic information. Preliminary investigations showed that the window size
can indeed be enlarged if POS information is included.
I experimented with many combinations of window size and window direction. In each
case, I examined the resulting dendrogram, and also calculated the nearest neighbours in
word space for a random selection of words. Interestingly, these two measures did not al-
ways coincide. A dendrogram produced for fifty words is obliged to create relations between
words, even if these words are not close in word space. However, since we are ultimately
interested in neighbours rather than dendrograms, the neighbours measure was preferred.
The best window scheme I found was twenty words to the left. A dendrogram for this
scheme is presented in Figure 6.7.
6.5 Results
The previous dendrograms were intended to give the reader an idea of why particular pa-
rameter values were chosen. In summary, we decided to adopt the values given in Table
6.2. A dendrogram produced using these parameters has already been presented in Figure
6.7. In the remainder of this chapter, we will look at the quality of the output generated by
this combination of parameters in some more detail.
[Figure omitted: hierarchical cluster dendrogram of fifty randomly-chosen words (hclust, complete linkage)]
Figure 6.7: Dendrogram using the final parameters (a window of twenty words and tag information).
Rather than looking at dendrograms at
this point, we will move to looking at nearest neighbours, because when we are backing off
from individual words, we will effectively be looking at neighbours in word space.
6.5.1 Results for the first four thousand words
An alternative method of evaluating word vectors is to print their nearest neighbours in
Euclidean space. So, for instance, the closest neighbour to ship is, naturally, ship itself, but
the next few closest neighbours are train, road, foot, hour, spot, sea and boat. Most of these seem
quite good as alternative forms of transportation, although hour and spot are peculiar. For
comparison, a manually compiled inverse dictionary gives send, address, consign, dispatch,
forward, remit, route, transmit which are more like synonyms of send. The key difference
is the inverse dictionary concentrates strongly on synonyms, while the bigram approach
seems to identify words which occur in texts on the same topic. The nearest neighbours for
a small selection of the four thousand most common words are presented in Table 6.3. (Note:
we are selecting neighbours from the whole set of words, not just the first 4000.) There are
some interesting properties in this table. Firstly it is unsurprising that numbers are clustered
together, but it is nice to see that large numbers are clustered away from twenty-something
numbers, which are also clustered away from numbers including decimal points. It is also
nice to see that the months are clustered seasonally. It is also useful to note clear errors in the
table, such as set having a nearest neighbour of called since this might have been put down
to coincidence in the dendrogram. It is also useful to note that the further away neighbours
tend to be less related, so that while heads is trivially a perfect match for itself, its more distant
neighbours leaves and turns are poor matches.
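The neighbour listings in Tables 6.3 to 6.5 amount to a ranking by Euclidean distance; a minimal sketch, assuming a dictionary `vectors` of word vectors, is:

import numpy as np

def nearest_neighbours(word, vectors, k=5):
    target = vectors[word]
    distances = {w: float(np.linalg.norm(v - target)) for w, v in vectors.items()}
    # the word itself is at distance zero and so always comes first
    return sorted(distances, key=distances.get)[:k + 1]

So nearest_neighbours("ship", vectors) would be expected to return ship followed by words such as train and road.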
6.5.2 Results for the second four thousand words
Up to now, we have been producing dendrograms that were generated directly from the
output of SVD. We noted back in Section 6.3.3 that it should be possible to perform SVD
on a sample of the bigram matrix, store the rotation matrix, and then apply this rotation
matrix to the entire bigram matrix, effectively achieving an approximation of applying SVD
to the entire bigram matrix. However, that approximation should be near perfect for the first
four thousand words and significantly worse for later words. Table 6.4 presents the nearest
neighbours for a selection of these later words. Since these look very good, we can reasonably conclude that
applying the rotation matrix is working.
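A sketch of the approximation, assuming the rotation is learned from the rows of the most frequent words only and then applied to every row of the bigram matrix:

import numpy as np

def project_all_words(sample_rows, all_rows, n_components=50):
    means = sample_rows.mean(axis=0)
    _, _, vt = np.linalg.svd(sample_rows - means, full_matrices=False)
    rotation = vt.T[:, :n_components]     # learned on the frequent-word sample only
    return (all_rows - means) @ rotation  # applied to the entire lexicon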
Moreover Nevertheless Hence Therefore However
popular powerful successful personal free
mortgages borrowers premiums dividends certificates
NASA SPAN AT&T Telecom BellSouth
29 27 26 23 28
By Of From With On
mill packing wash lighting diamond
7.5 8.5 9.5 10.5 0.2
heads faces stands leaves turns
formal immediate strict permanent temporary
set called made given changed
employee accounting filing insurance coverage
deficit inflation unemployment economist budget
associate responsible frank formal junior
December January November February September
facilities sites standards areas centers
100,000 30,000 50,000 20,000 200,000
well only now rather even
collapse crash crisis speculation uncertainty
continues remains suggests represents believes
Table 6.3: A sample of nearest-neighbour words from the first four thou-
sand words
editorial editor magazine bureau publication
prosecutors defendants indictments allegations attorneys
stocks traders investors declines Stocks
benefits employers standards improvements costs
returns numbers accounts years Others
different single particular simple useful
rose fell climbed jumped dropped
minority majority membership voting initiative
pact agreement timetable impasse moratorium
know tell let come go
lose keep happen break suffer
nine six seven eight five
book story reading picture writer
comments recommendations requests regulations reviews
holdings assets stockholders partnerships shareholders
slow steady strong weak rapid
chairman executive president vice director
minutes hours seconds feet yards
ought Would Will might shall
genetic biological clinical reproductive therapeutic
Table 6.4: A sample of nearest-neighbour words from the second four
thousand words
abounding teeming overflowing laboring imitating
thrusts twists mazes scars props
disapproves deplores displeases persuades errs
disillusionment disfavor rancor passivity savior
functional dynamic static analytical numerical
spores anthers contractions thermometers hybrids
halfhearted self-congratulatory unpolitical Influential earthbound
rumble crackle scamper graze hurl
newscast anchorman newsroom talk-show footage
scratched whistled tucked smelt plucked
cheerleading fifth-grade biking moonlighting scorecard
activism racism backlash homelessness environmentalist
grenades Witnesses gunmen commandos loudspeakers
les se deux des jour
profiled Located patterned latched Coupled
ham chocolate roast jam steak
crucially Collectively meaningfully Insofar Conceivably
Table 6.5: A sample of nearest-neighbour words from the last four thou-
sand words
6.5.3 Results for the last four thousand words
The first eight thousand words all occur extremely frequently in the T/G corpus. Even the
eight thousandth most common word (consume) has half a million co-occurrence counts with
the four thousand features. This compares to the word cataclysms which occurs exactly once
in the WSJ and only has fifteen thousand co-occurrence counts. Other words occurring once
fare even worse — Reykjavik has only five thousand co-occurrence counts.
Since the whole point of the word vectors was to generate quality results for rare words,
it would be very desirable for the least frequent words to cluster well. Table 6.5 presents
nearest neighbours for a selection of these least frequent words. There are still errors in this
table, but the results are surprisingly successful for such rare words. (This is probably due
to the fact that rare words are less polysemous than common words, which suggests that
finding a solution to the polysemy issue for common words would make a large difference
to the quality of word vectors.)
6.6 Summary
In this chapter, we have described a method for generating vector-based word represen-
tations using n-gram statistics, which generates similar vectors for semantically and syntactically
similar words, and which on informal inspection seems to capture similarities between words quite
successfully.
nal matrix in singular value decomposition to support a larger lexicon than was previously
possible using this technique; and secondly, the inclusion of part-of-speech tags to encode
syntactic similarities. There are a number of further improvements which could be made
to the vector generation algorithm, in particular the use of a preprocessor to differentiate
between separate senses of polysemous words; see Section 8.2.2 for more discussion of this
issue.
In the next chapter, we will discuss how these word representations can be integrated
into my parser.
Chapter 7
Improving backoff using word
representations
At this point in the thesis, we have identified the need for improved backoff in statistical
parsing (in Chapter 3), and we have developed a vector representation of words (in Chap-
ter 6). In the current chapter, we will consider how this representation of words can be
of benefit in improving backoff in statistical parsing. But before beginning, it is worth re-
calling Klein and Manning’s (2001a) paper discussed in Section 2.4.4, which suggests that
representing words is not as important for the success of a parser as it might appear.
Klein and Manning’s result could be taken to mean that studying lexicalised probabil-
ity models is not worthwhile. However, there are still several good reasons for considering
word representations. Firstly, Klein and Manning’s paper did show that words increased perfor-
mance, just by less than expected. Secondly, maybe the reason words are not useful is simply
that their counts are too low. It might still be the case that grouping several words into a sin-
gle word-like category provides a useful level of representation for the parser; this is some-
thing which has yet to be determined empirically. Thirdly, the real benefit of words may be
in allowing a parser trained on the WSJ corpus to generalise to other domains of text. Klein
and Manning were looking at improvements to the parser on the same WSJ domain as it was
trained on, but it may be that deriving word representations from a big corpus including the
WSJ as well as other topics improves the parser’s performance on some of these other topics
too. Finally, I view the Neural Network technique described in this chapter as being much
more general than a word representation. For instance, it could be used to decide whether to
include word information or some other kind of syntactic information, depending on which
is more useful.
So, having concluded that there are good reasons for considering grouping word repre-
sentations in parsing, we now need to consider how we can use the vector-based word rep-
resentations we derived in Chapter 6 to improve backoff in a statistical parser. Essentially,
we will be considering modifications to Collins’ genprob function described in Section 4.3
— the function which takes an event representation as input and returns an estimate of its
probability.
There are two obvious approaches to modifying genprob using vector-based word rep-
resentations. Firstly, we could alter the function so that instead of computing the probability
of an event containing a word, it computes the probability of an event containing this word
or any semantically related words — that is, words whose vector representations are close in
vector space. Secondly, we could replace the whole genprob function with one more suited
to a distributed input, such as a neural network. The first approach will be considered in
Section 7.2, and the second will be considered in Section 7.3. As a preliminary to either
approach, however, it is important to ask how tolerant Collins’ parsing algorithm is to a
revised version of genprob. We begin in Section 7.1 by investigating this.
7.1 Feasibility study: Noise in backoff
If we are to modify the backoff algorithm, we can expect (and hope!) to get different re-
sults. Before making these modifications it is important to know how tolerant the system is
to errors in these results. For instance, when the output returned by genprob is completely
wrong it is likely to cause the parser to go down the wrong track and produce the wrong
parse, but it is also possible that the error will be isolated to the current constituent and so
have only a minor effect on the parser’s accuracy, especially if catastrophic errors are ex-
tremely rare. Similarly, our modifications may result in slight random shifts, and so it is
important to know how these will affect precision and recall. Presumably some of these
results will be better, and some will be worse. In order to determine genprob’s tolerance
to different results we can experiment by adding noise and measuring the parser’s perfor-
mance. If genprob is finely tuned so that even tiny changes in probabilities result in large
changes to the parser’s accuracy then we must be much more careful than if the probability
model is quite robust.
There are many different ways in which noise could be added to the parser. Noise could
be added to every probability, or only a certain proportion of probabilities. Also, the noise
could either be additive, so that a probability of say 0.7 gets transformed to 0.70 ± 0.01, or
else multiplicative, to give [0.70/1.01, 0.70 × 1.01].
Naturally, it would be desirable to test with noise of the same type as will later be added,
but since that is not yet known we have to make an educated guess as to its properties. The
most obvious property is that we are transforming the probability for words and since every
probability derivation includes a word, we should be adding noise to every derivation. As
for the type of noise, it is less obvious if it should be additive or multiplicative. It seems
likely that very low probability events should stay as low probability, implying multiplica-
tive; at the same time what we are doing is essentially adding counts which should lead to
(scaled) additive noise. Since there is no clear answer, additive noise was arbitrarily chosen.
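A sketch of the experiment: every probability returned by genprob is perturbed with additive white noise. The names are illustrative, and the `floor` argument anticipates the tweak described just below, which leaves very low probabilities untouched.

import random

def noisy_genprob(genprob, event, level=0.005, floor=0.001):
    p = genprob(event)
    if p <= floor:
        return p                              # leave near-zero probabilities alone
    p += random.uniform(-level, level)        # additive white noise
    return min(1.0, max(0.0, p))              # keep the result a valid probability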
The effect of adding noise is measured by looking at changes to the final precision and
recall figures which it causes. This is because we hope making an error at one stage in the
parse derivation will, on average, get corrected at later stages. Our initial test results showed
a disastrous intolerance of any noise. Genprob was modified to add white noise of ±0.005
to every probability generated, and precision/recall dropped from 85% to just 15%! Further
investigation implied the noise was causing the beam to overflow with unlikely parses, and
changing the noise to only affect probabilities over 0.001 led to significant improvements.
Therefore, the problem is more with Collins’ implementation of beam search than with the
probability model.
After this tweak, which lets the parser continue generating (near-)zero probabilities, a graph of noise against error
was derived, presented in Figure 7.1.
[Figure omitted: "Effects of noise on parser accuracy" — precision and recall (0 to 100) plotted against noise level (0.001 to 1, log scale)]
Figure 7.1: Graph of noise against parser accuracy
This shows that any integration of the new word
representations must keep white noise to within 0.01 to have any chance of a reasonable
precision/recall. In other words, any modifications to Collins’ system are going to be hard. It
seems the probability model is very fragile, with even slightly incorrect probabilities causing
major drops in parser performance.
7.2 Parsing by grouping nearest-neighbour words
The previous study tells us that only minimal changes to genprob are safe; anything else
is likely to destroy parser performance. Based on this, the first approach taken to integrate
the word vectors was simply to tweak genprob so that the counts for rare words are sup-
plemented with the counts from similar words (where ‘similar’ means ‘close in Euclidean
distance’). The motivation for this approach is that there are a large number of words, such
as lions which Collins discards entirely, but that we really should know something about.
Recall from the previous chapter that one of our measures of the quality of a vector-based
representation was to look at the nearest neighbour words for a set of test words. In our
final scheme, nearest neighbours seemed to be identifying genuinely similar words a fair
proportion of the time; see for example Table 6.3. We can reuse this result here by grouping
the neighbours of rare words to assist backoff.
7.2.1 Integrating neighbours in parsing
It is undesirable to make any significant changes to the parser. Such changes would risk
breaking the statistical correctness of the probability model, introducing bugs, or otherwise
affecting the parser’s performance in a way that is independent of the neighbours. Therefore
we wish to integrate neighbours a little way away from the core parsing code. One effective
method of doing this is to load the neighbour information into the parser as a simple map-
ping between single word (w) and a set of words (W). Then, whenever an event occurs that
involves a word, we can generate a number of pseudo-events involving the members of W.
This technique has a number of advantages. Firstly, it is simple, requiring no changes to
the backoff or the smoothing algorithm. Since w is guaranteed to be a member of W, we are
guaranteed not to lose any counts. Slightly less obvious is that for backoff levels other than
the most detailed, the pseudo-events will have exactly the same properties as the old real
events and so while the numerators and denominators will change, the ratio between them
will not, and so the probability of any given event will not change. For example, doubling the
size of the corpus means that any event matching a|b now occurs twice; every numerator and
denominator in the estimate of P(a|b) doubles, and so no probability changes.
Extending this to incorporate smoothing also works provided that alpha is independent
of absolute counts. Collins defines alpha in terms of the cardinality of the set of events that
co-occur with b, and since doubling the corpus will not result in any novel events, the set’s
cardinality will remain unchanged.
The most logical way of integrating the new pseudo-events is to add a fourth level of
backoff, so that levels one, two, and three remain the same but the new level ‘two and a half’
contains pseudo-events. Curiously, we found this did not work. Modifying the smoothing
equation (Equation 3.3 on page 58) to take four inputs is easy, but we found the parser’s
performance dropped below 80%, even when the new fourth level is just a duplicate of
Collins’ third level (or a duplicate of his second level, for that matter). Since we are trying
to improve the parser’s performance, we cannot afford to have the performance drop before
any useful modifications have been made, and so we resorted to directly modifying the third
level of backoff.
Modifying the third level to load pseudo-events is quite simple to implement. The main
change involves the parser’s initialisation phase, when it creates a hash table of events from
the file of raw events in the WSJ. In the new algorithm, we read in the event file as be-
fore. However, before simply inserting the event into the hash-table, we consult the nearest-
neighbours information to transform the event, by replacing w with each of the members of
W. Since w is itself a member of W, this is guaranteed to include all the hash-table entries as
the old method, but will result in the same effect as if we had also seen this event with each
of the neighbours of w in the place of w. A few other minor tweaks to the parser were also
required to cope with the larger hash-tables that resulted.
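The change to the initialisation phase can be sketched as follows, with an event reduced to a (word, rest) pair for illustration; `neighbours` maps a word to its set W, which always contains the word itself, so every real event is still inserted.

def load_events(events, neighbours, insert_event):
    for word, rest in events:                     # `rest` stands for the other event fields
        for substitute in neighbours.get(word, {word}):
            insert_event((substitute, rest))      # the real event plus its pseudo-events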
7.2.2 Reversing the neighbours
For any given rare word w, we need to create pseudo-events featuring w modelled on real
events where w could be substituted. That is, we must look for events containing neighbours
of w and substitute in w rather than looking for events containing w and substituting the
neighbours of w (which would lead to the wrong output).
Since it is easier to modify the neighbours file than it is to tweak the parser, we reverse
the neighbours by loading the mapping between words and their neighbours into a hash-
table backwards — that is, if a has a neighbour of b then we store in the hash-table that b is
a neighbour of a. We then iterate over this hash-table to produce a reversed neighbour file.
Loading the reversed neighbours into the parser now gives us the results we expect: that
when we see a word we can replace it with its (reverse) neighbours to generate the correct
pseudo-events.
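Reversing the file is a one-pass transformation; a minimal sketch:

from collections import defaultdict

def reverse_neighbours(neighbour_map):
    # if b is listed among the neighbours of a, then a should be generated whenever b is seen
    reversed_map = defaultdict(set)
    for word, nbrs in neighbour_map.items():
        for nbr in nbrs:
            reversed_map[nbr].add(word)
    return reversed_map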
7.2.3 How to select a group of neighbours for a word
Having decided how to integrate neighbours, exactly what should be considered a neigh-
bour? The principal goal is to improve the probability estimate for events involving rare
words and so, especially in light of the results on the effect of noise, it would be safest to
leave the probabilities of events involving common words alone. We consider two methods
for determining the set of neighbours, which are discussed in the remainder of this section.
Using a Euclidean cutoff
One method is simply to define the neighbours of a word w to be all words whose vector
representations are within a threshold Euclidean distance of w. What should this threshold
be? It is useful to recall that we do not want to form groups around common words, as these
have high enough counts by themselves. If it happened to be that common words are more
isolated in vector space than rare ones, we could use this fact to help determine a Euclidean
cutoff which led to only rare words being grouped together.
In Figure 7.2 we plot a word’s frequency against the distance to its nearest neighbour.
Before analysing this figure it is worth mentioning that the frequency of words is derived
using their frequency in the T/G corpus and this is why we do not see the ordinary Zipf dis-
tribution. (It would be possible to generate this graph using frequencies from the WSJ, but
the resulting trend is less clear.) The trend we want to see is that there is a cutoff Euclidean
[Figure omitted]
Figure 7.2: Graph of the log of a word’s frequency versus the distance to its nearest neighbour
distance at which rare words still have neighbours, but common words do not. Since the
scale between ‘rare’ and ‘common’ is continuous, this is always going to be impossible to
achieve completely, but it appears impossible to achieve at all.
There is a clear trend that increasing the word’s frequency leads to a greater distance
to the nearest neighbour, which is a good start. There is also a clear cutoff that ‘common’
words almost always have no neighbours with a Euclidean distance less than 0.001. While
‘rare’ words do frequently have neighbours within this boundary it is absolutely not the
case that ‘rare’ words all fall within this boundary. It seems that while the graph has similar
properties to what we want, we must choose a slightly different technique.
While we cannot find a minimum threshold for common words that all rare words are
within, we can find a maximum threshold (of 0.004) that all rare words have a neighbour
within. Many common words also have neighbours within this threshold, but applying it
will significantly reduce the number of common-word neighbours we introduce. Empirical
analysis of the quality of the neighbours generated supports this approach: it seems almost
all neighbours within 0.002 are appropriate, most neighbours within 0.003 are appropriate,
as are over half of the neighbours within 0.004.
Using the N-best neighbours
The Euclidean approach generally works correctly, but still leads to some problems. As an
example of a borderline case at the threshold, is has a distance of 0.0039 to does and so it is good
that the system (just) decides to classify it as a neighbour. Similarly, aid has a nearest neighbour
of assistance, a perfect choice, but with a distance of 0.03998. However the closest neighbour
to subject is certain at a distance of 0.039. It is impossible to accept the neighbours for is
and aid without accepting the neighbour of subject. The only way to avoid such errors is to
improve the quality of the word clustering.
An alternative approach is to note that while distance is not always a good measure of
neighbour quality, it is true that the closer neighbours are generally better than neighbours
that are further away. Therefore, if we just accept the closest five neighbours of any word
then we can reject many unrelated words that have a close Euclidean distance merely by
virtue of coming from a dense area of word-space.
Naturally, we must now include an explicit condition that a word’s count is only sup-
plemented with counts from its n best neighbours if it is a ‘rare’ word. Thus an arbitrary
threshold for rareness is required.
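Putting the two schemes together, a sketch of the selection might look like the following; the cutoff, the number of neighbours kept, and the rareness threshold on corpus counts are the tunable values discussed above, with the rareness threshold itself a placeholder value.

import numpy as np

def select_neighbours(word, vectors, corpus_counts, cutoff=0.004, n_best=5, rare_below=100):
    if corpus_counts.get(word, 0) >= rare_below:
        return set()                          # common words keep their own counts
    target = vectors[word]
    ranked = sorted((float(np.linalg.norm(v - target)), w)
                    for w, v in vectors.items() if w != word)
    # keep at most the n closest neighbours, and only those inside the distance cutoff
    return {w for d, w in ranked[:n_best] if d <= cutoff}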
7.2.4 Avoiding swamping counts
Despite all the precautions we have taken, the neighbours we have generated are not perfect
and using them to generate pseudo-events could harm the probability distribution of the
target word. This is true both for common words which have a very accurate probability
distribution already, and also for words with very common neighbours, since while they
need their counts increased we do not want to increase them so much as to replace the
meaning of the word entirely with that of its neighbour.
To mitigate this concern while still increasing the counts of rare words, we keep track of
the number of pseudo-events we have created for every word. Once this reaches a certain
threshold, we skip this word as a neighbour, and so only increase counts for events with the word
itself. Some experimentation showed a threshold between one hundred and five hundred
makes a slight improvement to parsing accuracy, but the exact threshold within this range is
less important; apparently the main effect of the threshold is to prevent extremely common
neighbours of a word from swamping the counts of the word itself. To give an example, a
word that occurs once in the WSJ, such as eked, generates fifteen real events. A word that
generates one hundred events will occur at least ten times in the WSJ, an example would be
dive. Therefore the approach could be viewed as expanding the corpus to the point that the
parser’s old understanding of dive is comparable to its new understanding of eked.
This solution is much more effective than simply preventing common words having
neighbours, since it does allow slight tweaks to common words and more importantly, it
eliminates the need for a sharp distinction between rare and common words. Were we
able to incorporate neighbours as a separate level of backoff, this step would have been
unnecessary.
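A sketch of the cap, extending the event-loading sketch from Section 7.2.1: a running tally of pseudo-events per target word, with real events always kept.

from collections import defaultdict

def load_events_with_cap(events, neighbours, insert_event, cap=100):
    created = defaultdict(int)
    for word, rest in events:
        insert_event((word, rest))                            # the real event, always
        for substitute in neighbours.get(word, set()) - {word}:
            if created[substitute] >= cap:
                continue                                      # this word already has its quota
            created[substitute] += 1
            insert_event((substitute, rest))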
7.2.5 Summary
In conclusion, we take the neighbours generated in the previous chapter and build a file giv-
ing for each word the words which consider this word as one of their nearest neighbours.
Then we delete any neighbours that are not within a Euclidean distance of 0.005 or within
the closest five neighbours. When loading the neighbours into the parser, we keep count
of the number of pseudo-events we have generated for every word and stop once one hun-
dred pseudo-events have been created. During experimentation, we found these parameters
were quite robust, so even quite large changes in them would lead to only slight changes in
overall performance.
7.2.6 Results and discussion
It is easy to find examples where the modified parser performs significantly better than it
did before modifications. For example, consider the sentence:
But always, in years past, they have bucked the trend and have been able to pick
up a fifth vote to eke out a number of major victories in civil rights and liberties
cases.
When parsing this sentence, the unmodified parser will come across the word bucked, which
occurs exactly twice in the WSJ. Words which occur less than five times are replaced by
#UNKNOWN#, and are almost invariably nouns. Shortly afterwards, the parser will en-
counter eke. This occurs zero times in the WSJ.¹ So, we will again replace the word with
#UNKNOWN# and struggle to interpret the phrase sensibly. It is therefore unsurprising that
the parser parses this sentence incorrectly.
¹ Technically, it occurs once, in the testing section. However, recall that all training is performed with the
testing section deleted.
In the modified parser, we will not have to replace bucked or eke since their counts are
supplemented by their neighbours: yanked and outstrip, respectively. While these are not the
perfect neighbours, they give the parser enough knowledge to parse the sentence correctly.
Naturally, it is also easy to find counterexamples where the extra counts coupled with
a poor neighbour result in a slightly worse parse. It is necessary to measure the effects
of the changes at a corpus level. To measure the results, Section 23 of the treebank was
parsed using the modified parser. We evaluated the neighbours extension on both our own parser
(which reimplements Collins (1996)) and on Collins’ own (1999) parser; the results of both
evaluations are presented below.
Modifications to Collins (1996)
The results of adding neighbours to our reimplementation of Collins (1996) are shown in
Table 7.1. There are a few points that need to be made about this table. Firstly, the table
Criteria                 Unmodified   One    Two    Three   Four   Five   (neighbours)
Bracketing Recall 85.18 85.07 85.08 85.07 84.92 84.91
Bracketing Precision 85.05 84.99 84.97 84.93 84.78 84.76
Complete match 24.83 24.70 24.83 24.74 24.57 24.57
Average crossing 1.01 1.01 1.01 1.02 1.03 1.03
No crossing 65.05 65.05 65.00 64.96 64.78 64.82
2 or less crossing 85.38 85.60 85.47 85.33 85.02 85.11
Tagging accuracy 96.50 96.56 96.53 96.52 96.53 96.55
Table 7.1: Performance of Collins’ 1996 parser over Section 23 before and
after integrating neighbour information
includes a column for an unmodified version of Collins, and a column for only one neigh-
bour. Since the neighbour code considers every word its own nearest neighbour, these two
columns could be expected to be identical. They are not identical because the unmodified
version does not include any of the tweaks used to load neighbours — the largest of which
was declaring all words as frequent.² In later columns we see only slight changes, all show-
ing decreasing performance as more neighbours are incorporated.
² Which, incidentally, means that Collins’ code to measure the frequency of words in the WSJ is of virtually
no benefit.
None of these results would be significant on its own, but the probability of four independent
results all following the same downward trend is much lower than the probability of any one of
the results being lower. Surprisingly, it turns out that this is still not enough for significance
and therefore we must conclude that the effects on performance are inconclusive. About
all we can conclude is that results did not change significantly, and if anything there was a
slight loss in performance. However, it is important to ask if this loss in performance also
caused an increase in generalisation, because if so then the loss is probably acceptable but if
not we should look at alternative measures of integrating words.
The area in which we expect to see the biggest improvement due to neighbours is when
parsing sentences containing rare words, because this is when generalising is most impor-
tant. In practice, the effects will be most clearly visible if the rare words are head words,
because Collins’ probabilities are only conditioned on head words. By sorting Section 23 of
the treebank by the frequency of the least frequent verb in the sentence we can create a sub-
corpus of arbitrary length of ‘sentences containing rare head words’. A sub-corpus of two
hundred sentences was created and results from parsing it using different parsers is given
in Table 7.2. There seems no support from this table for the hypothesis that the neighbours
Criteria                 1       3       5    (neighbours)
Bracketing Recall 83.28 82.89 82.75
Bracketing Precision 82.98 82.74 82.65
Complete match 14.00 14.5 14.50
Average crossing 1.64 1.69 1.70
No crossing 57.50 57.50 57.50
2 or less crossing 81.00 80.00 79.50
Tagging accuracy 96.29 96.25 96.23
Table 7.2: Performance of Collins’ 1996 parser over a sub-corpus of two
hundred sentences containing rare verbs, before and after integrating neighbour
information
code leads to improved performance for rare head words.
Modifications to Collins 1999
One concern with the above results is that we have used Collins 1996 as a baseline, since
that was the parser we reproduced in Chapter 4. A valid question is whether the same
properties hold when applying the technique to Collins’ later work, since it obtains much
better performance (88% instead of 85%). Since Collins has now released the source code
of his 1999 parser, modifying it to create virtual events for neighbours is relatively easy. In
Table 7.3 we show the performance of Collins’ 1999 parser both without modification and after being
tweaked to load neighbours.
Criteria                 Unmodified   One    Two    Three   Four   Five   (neighbours)
Bracketing Recall 88.52 88.50 88.47 88.54 88.40 88.41
Bracketing Precision 88.68 88.72 88.66 88.74 88.61 88.61
Complete match 36.04 35.81 35.63 35.63 35.46 35.55
Average crossing 0.92 0.90 0.91 0.90 0.91 0.91
No crossing 66.68 66.99 66.99 67.13 66.99 67.08
2 or less crossing 87.13 87.44 87.17 87.12 87.08 87.17
Tagging accuracy 96.74 96.82 96.79 96.79 96.79 96.80
Table 7.3: Performance of Collins’ 1999 parser over Section 23 before and
after integrating neighbour information
While the neighbours extension to Collins (1996) showed a gradual decline in perfor-
mance, the extension to Collins (1999) seems to show no change; the fluctuations in scores
are well below the significance threshold, and are best attributed to noise. This is somewhat
more promising. Table 7.4 shows the performance of the neighbours extension on the rare-
verb corpus. In this table, we finally see some support for our hypothesis. Examining these
Criteria                 1       3       5       7    (neighbours)
Bracketing Recall 87.49 87.66 87.71 87.51
Bracketing Precision 87.78 87.80 87.92 87.84
Complete match 27.00 26.00 26.00 26.00
Average crossing 1.48 1.40 1.47 1.51
No crossing 60.00 59.50 60.00 59.00
2 or less crossing 83.00 81.50 82.00 82.00
Tagging accuracy 96.59 96.59 96.65 96.61
Table 7.4: Performance of Collins’ 1999 parser over a sub-corpus of two
hundred sentences containing rare verbs, before and after integrating neighbour
information
differences using Dan Bikel’s compare.pl program for significance gives a confidence of
80% that each result is significant, somewhat less than the 95% that is necessary. However,
again we are viewing results in isolation. If we ask instead if increasing the number of neigh-
bours tends to increase performance then we do get a statistically significant result. From
this we can conclude that the integration of neighbours improves the performance of the
parser in Collins (1999) on sentences containing rare verbs, though by less than we would
have hoped.
7.3 Parsing using a neural network probability model
While the previous approach was well justified in that it made safe modifications to a fragile
model, it did not have much scope for significantly improving the performance of Collins’
system on test data from the same genre as the training data. Among the faults in Collins’
model is its inability to identify which of its parameters is causing it to lose high counts and
to smooth over that particular parameter. Essentially, the model implements an indefeasible rule about
the order in which information is to be thrown away. But there is no reason to suppose that
this order will be the same for every construction in the grammar. For instance, sometimes
the identity of the head word of a construction might be more important than details about
its subcategorisation frame, while other times the subcategorisation frame might be more
important than the word.
While this is easy to see as a problem, it is not obvious how the problem can be resolved:
a hash-table capable of storing every permutation of every event would be inconceivably
large. However, hash-tables are not the only way of encoding large amounts of highly re-
dundant information. Neural networks have been used for this purpose by a large number
of people (see for example Stuart, Cha, and Tapper (2004)).
A neural network is an adaptive algorithm that attempts to mimic the functional map-
ping between some input and output by minimising the error between its approximation
and the target (training) output. Neural networks are particularly suitable when the map-
ping is too complex for a specific algorithm to be designed, while at the same time the
mapping has a number of properties, the most important of which is that it is continuous.
The potential advantages of applying a neural network to genprob are significant. It
would allow an arbitrarily large input vector to be used, so researchers could experiment
with far more complex statistical models than have been considered to date. NNs are well
known for interpolating well between data points, and so degrading gracefully as we move
between points of training, such as on a novel combination of words.
Replacing genprob by a neural model involves several tasks. The first is that all of the
inputs must be mapped onto a continuous space, so that small changes in the input vector
only lead to small changes in the output vector. This will be discussed in Sections 7.5, 7.6 and
7.7. The second task is to decide on suitable training data and parameters for the network;
this will be discussed in Section 7.8.1. The training of the resulting networks is discussed in
Sections 7.9 and 7.10. Having assembled all of the trained network, the method of integrat-
ing the neural network into the parser is discussed with associated results in Section 7.11.
We begin in Section 7.4, however, by introducing the Cascade Correlation neural network
architecture.
7.4 Cascade Correlation
The Cascade Correlation architecture was chosen as the most suitable for all of the neural
networks. Cascade Correlation is a supervised, constructive neural network architecture,
meaning that it requires direct training data, and that it grows to match the complexity of
the problem. Because it uses supervised learning, Cascade is faster and easier to train than
networks relying on indirect feedback. Being a constructive neural network architecture,
Cascade requires fewer parameters than other algorithms. Within constructive neural net-
works, Cascade is the most popular due to its extremely fast training. While a full descrip-
tion of the Cascade Correlation learning algorithm takes too long to present here, a simple
overview is useful. For a full description, see Fahlman and Lebiere (1990).
Cascade Correlation can best be understood by contrasting it to Quickprop (which is
itself just a more complex version of the basic Backprop algorithm). The algorithm starts
with no hidden units and attempts to learn a mapping directly from the input to the output
using the Quickprop algorithm. Since this is training the connections to the output, it is
known as the output training phase.
Learning a mapping directly from the input to the output will be impossible for all but
the simplest of problems and so Cascade will usually fail. Assuming it fails, Cascade will
create n candidate hidden units, each connected to all the input units, and train each can-
didate unit such that the unit is most active when the network’s error is at its highest. This
is known as the candidate training phase. Essentially, this phase is learning feature detec-
tors for the situations that cause errors. The candidate training phase completes either when
the best candidate unit has reached a plateau (the training is said to stagnate), or too many
epochs have passed (called a timeout). Of these, stagnation is an indication that the network
is still learning, while a timeout is an indication that the problem is too hard.
Regardless of why the candidate training phase stops, the best candidate unit is then
added to the network as a new hidden unit, with the other candidates being discarded. The
new hidden unit is connected to every unit in the network, including the output units, and
finally the connections to the output units are trained. (This is the output training phase
again.) The process of training a single unit, adding it, and then training the outputs is
repeated until either the maximum number of units have been added, or a victory error
criterion is achieved.
7.5 Testing the vector representation of words
The development of a vector representation of words was the goal of Chapter 6. Here we
will examine if the representation that was developed is going to be suitable for input to a
neural network. The way we will do this is by training the neural network to perform rote
memorisation of word mappings and then test it on both perfect recall and on its ability to
generalise.
While the theoretically interesting parts of a statistical parser are undoubtedly its ability
to generalise with ungrammatical input and/or poor knowledge, a great deal of what the
parser does in practice is essentially looking up mechanical rules. Writing 1.000 next to
these rules and calling them probabilities does not change the fact that the parser is acting
as a glorified state transition machine. Because of this, the neural network must be able
to perform as a very accurate lookup table when required. This point cannot be stressed
strongly enough: from a research perspective, the interesting aspects of a statistical parser
are the fuzzy areas, but from a pragmatic perspective, parsing is almost always a simple
rule-based system.
This study is a simple test of whether a neural network can be trained to perform lookup
on the word vectors. That is: can it reliably map word vectors into words? A related and
harder test is whether, when presented with a novel word vector, it can map it to a good choice of
word; this test is extremely similar to the word generalisation we are hoping to achieve.
There are a number of published results demonstrating that neural networks can be trained
to perform both lookup and generalisation using the same network (see for example Garfield
and Wermter (2003)), but it is important to perform this test with my word vectors because
they may well be harder to memorise than those used in previously published work. It is also
important to perform the test using the same neural network architecture as is used in the
final parser implementation. My parser implementation uses Cascade Correlation, and so
Cascade was also used here.
7.5.1 Mapping words to words
Neural networks that map between one representation and another are an effective method
of determining if a neural network can extract salient features from the representations.
In Chapter 6 we developed a vector representation of words. Here we take the first fifty
elements from the vector representations of each word and use this as an approximation
of the word’s vector. The choice of fifty cells is based on previous neural network studies
which show that networks have significant trouble with more than a few hundred units; with fifty
units per word, the most complex of the networks (dependency) has a little over
two hundred input units.
One avenue of research that was not examined was reducing the size of the input vector
and investigating whether the network could still learn. This was not pursued because Cascade
has previously demonstrated that it is extremely good at ignoring irrelevant information, and so
smaller vectors are very unlikely to improve results here. However, decreasing the vector size
would be an interesting avenue to explore from a theoretical perspective, because it would
provide some insight into the amount of complexity in the English lexicon, or at least the
complexity that was captured in the vector representation.
Output from the network is one unit for every possible target word, and so four hundred
units for this first network. This is a large number of units, but this is balanced by their
simplicity.
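To make the setup concrete, training patterns of this kind could be assembled roughly as follows; the word_vectors dictionary and helper names are illustrative, and this is a sketch rather than the actual experimental code.

import numpy as np

def build_lookup_patterns(word_vectors, words, n_dims=50):
    """Input/target pairs for the word-to-word rote memorisation test.

    word_vectors: dict mapping each word to its full vector representation.
    words: the (say, four hundred) words the network is trained on.
    The input is the first n_dims elements of a word's vector; the target has one
    output unit per word, with only that word's own unit switched on."""
    X = np.array([word_vectors[w][:n_dims] for w in words])
    Y = np.eye(len(words))            # one-of-N target: a single unit on per word
    return X, Y

def nearest_words(output_activations, words, k=2):
    """Read the 'Nearest' and 'Next nearest' columns of Tables 7.5 and 7.6 off the
    output layer: the most active output units name the chosen words."""
    order = np.argsort(output_activations)[::-1]
    return [words[i] for i in order[:k]]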
Sigmoidal activation functions are used for all hidden units. They are the default unit
in Cascade, and there is no reason to believe they are inappropriate here. For the output
layer, an asymmetric sigmoidal activation function is used instead because values of zero
to one are more intuitive than values between −0.5 and 0.5. This is also in line with the
documentation. The default pool size of eight candidate units was used.
This brings us to the error measure. The error measure used was bits, which means there
is an error of one bit if a unit is on when it should be off, or off when it should be on.
This differs from the standard error measure which Cascade refers to as index, in which
the amount of error is a scaled measure of the difference between the actual output and the
target. The difference between these two measures is significant: measuring error using
the normal RMS approach encourages the network to get outputs extremely close to their
targets while tolerating occasional large errors, whereas the bits measurement gives the network
no benefit for choosing the right answer with higher confidence, and a
significant penalty for each wrong answer. The maximum tolerated error was set to zero,
meaning the network is not allowed to incorrectly classify any words. It is believed this
combination of bits with no tolerance to error will result in a network that is fully trained
but not over-trained, although this was not extensively investigated.
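The distinction between the two error measures can be stated directly in code. The functions below paraphrase the behaviour described above rather than Cascade's own implementation, and the 0.5 on/off threshold is an assumption.

import numpy as np

def index_error(outputs, targets):
    """The 'index' style of error: a scaled measure of the difference between the
    actual outputs and the targets, so confident correct answers reduce the error."""
    return float(np.mean((np.asarray(outputs) - np.asarray(targets)) ** 2))

def bits_error(outputs, targets, threshold=0.5):
    """The 'bits' style of error: one bit of error whenever a unit is on when it
    should be off, or off when it should be on; extra confidence earns nothing."""
    on = np.asarray(outputs) > threshold
    should_be_on = np.asarray(targets) > threshold
    return int(np.sum(on != should_be_on))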
7.5.2 Evaluation
Results from the initial network were successful, with the network making zero errors af-
ter a little over a minute3 and using seven hundred and fifty epochs and six hidden units
on average. As a rough estimate of the complexity of the problem, ninety percent of the
words were correctly classified without any hidden units, implying that the input repre-
sentation was almost good enough to make the problem linearly separable. Generalisation
after learning just four hundred words was remarkably good, with a small sample being
presented in Table 7.5. Given the tiny training space, this result is quite surprising. In this
table, the left column being identical to the middle column is unsurprising since this is the
result we trained the system to produce. However, there is a strong relationship between
the left column and the right column which is very promising for the network’s ability to
3 Times taken throughout this section are informal, based on a single run on a multi-user system under
varying load. They are intended as ballpark figures.
Word Nearest Next nearest
the the this
is is does
as as through
has has was
will will would
market market value
But But and
share share income
shares shares securities
Inc. Inc. Corp.
prices prices markets
interest interest real
earlier earlier ago
buying buying selling
Table 7.5: Learned mapping of words to words from four hundred words
generalise.
A much harder test of generalisation is evaluating it on words it has never seen. In this
test we derive the best mapping for the first four hundred words that the network was not
trained on, that is, words 401 to 800. Predictably, it performed much worse on this, as is
shown in Table 7.6. While results in this table are not especially good, it is remarkable that there
is any generalisation at all. This bodes very well for the later tests. In subsequent runs, I
used a lexicon of one, four and ten thousand words with equally good results.
In presenting these results we have not concentrated significantly on how hard it was
for the network to learn the task, or how well it learned the task, that is, the ability of the
neural networks to generalise. This omission is deliberate; we have not concentrated on ease
of learning because we do not know how closely this task correlates with genprob, and in
order to fully test the network’s ability to learn the mapping we set the error tolerance to zero
which will cause the network to significantly overfit the data and sacrifice generalisation for
accuracy. Were we interested in testing generalisation, we would have tolerated some error
and so stopped training before the network overfitted its training data.
Word Nearest Next nearest
production industry use
total small foreign
third-quarter major revenue
President American president
notes bonds interest
25 20 1989
London West Mr.
latest financial late
credit debt loss
earthquake loss past
almost far only
Table 7.6: Evaluation of network generalisation after learning from the
first four hundred words
7.6 A vector representation of tags
While the goal in shifting to a neural network was principally to smooth more finely over
words, it is essential that all inputs to the neural network are vectors and so we need a
vector representation of POS tags. By far the easiest method would be a simple enumeration,
with one bit for every POS tag. There are fifty tags in the WSJ so a simple enumeration
would lead to a fifty bit vector. This would be within the limits of the number of nodes that
a network can process, but since a single dependency event uses five tags, two hundred
and fifty nodes is probably too many for just the tags in the input. Apart from a simple
enumeration pushing the boundaries, it seems very wasteful that, say, NNP and NNPS share
no information in common. Again, we have a problem of dimensionality reduction.
My first approach to generating a vector representation was to create one by hand. With
only fifty tags my intuition was that it would be faster to hand-encode a representation than
to write software to derive a representation automatically. Unfortunately, writing a vector
by hand proved even more difficult than examining the raw word vectors directly. Based
on this observation, the second approach was to create a dendrogram by hand. This proved
slightly more successful than writing the vectors directly, but the only way I had of encoding
this dendrogram as a vector was as a binary classification tree. Such a representation
requires a bit for every single branch in the tree, so this approach ends up
with the same number of nodes as the simple enumeration; we have not gained anything.
Because they have not resulted in reduced dimensionality, manual approaches were
abandoned and an attempt was made to recreate the process used in generating vectors
for words. Again we need a corpus of ‘words’ (in this case, the words are tags) and that
corpus must be much larger than the ‘lexicon’ (the set of all tags). While the lexicon for
words had a cardinality of around forty thousand, there are only fifty tags and so it seems
reasonable that a corpus of around one thousandth of the size of T/G will be adequate, that
is, one million words. We already have such a corpus: after the words are removed, the WSJ
contains a little over a million sequential tags (once Section 23 is excluded).
Running the code from Chapter 6 over the corpus of tags required a few minor tweaks,
such as using RMS scaling instead of logarithmic, but nothing significant. A total of twenty-
two dimensions were selected from the output of SVD because this is where the contribution
of each dimension starts to drop off rapidly. In Section 6.4.7 we used these tag vectors to
assist in giving the word vectors a global hierarchy. At that stage, we used vectors with a length of
just fifteen, but analysis of the standard deviations from SVD shows that fifteen is perhaps
a little too few. The fifteenth component has a contribution of 0.03 (out of a total of one),
but it is not until we get to the early twenties that the contributions fall away: the twenty-
first component has a contribution of 0.004, the twenty-second of 0.002 and the twenty-third
of 0.0000004.
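In outline, the computation is the same as in Chapter 6 but run over tags rather than words. The sketch below (bigram counts over the tag sequence, a scaling step, then SVD) is a simplified stand-in for that code; in particular the scaling shown only approximates what was actually used.

import numpy as np

def tag_vectors(tag_sequence, n_dims=22):
    """Derive tag vectors from bigram statistics over a corpus of tags.

    tag_sequence: the corpus as one long list of tags (around a million items).
    Returns a dict mapping each tag to an n_dims-dimensional vector."""
    tags = sorted(set(tag_sequence))
    index = {t: i for i, t in enumerate(tags)}

    # One row per tag; the columns count left-neighbour and right-neighbour bigrams.
    counts = np.zeros((len(tags), 2 * len(tags)))
    for prev, curr in zip(tag_sequence, tag_sequence[1:]):
        counts[index[curr], index[prev]] += 1                 # left context of curr
        counts[index[prev], len(tags) + index[curr]] += 1     # right context of prev

    # RMS-style column scaling (a stand-in for the scaling used in Chapter 6).
    counts /= np.sqrt(np.mean(counts ** 2, axis=0)) + 1e-9

    # SVD sorts the dimensions by their contribution; keep only the leading ones.
    U, S, _ = np.linalg.svd(counts, full_matrices=False)
    vectors = U[:, :n_dims] * S[:n_dims]
    return {t: vectors[index[t]] for t in tags}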
As with the words, raw vectors are impossible to interpret and so a dendrogram is given
in Figure 7.3. To explain some of the relationships in this figure: SVD has found that full
stops are closely related to colons and the end of sentence marker. For nouns it found that
common nouns are similar to plural common nouns, and that proper nouns are similar to
plural proper nouns. It also found that foreign words are more similar to proper nouns
than to common nouns. Verbs are also well clustered, with modal verbs, past tense verbs
and third-person present tense verbs forming a cluster, and so on. Definitions for all of the
elements in this figure are given in Appendix A.
There are very few studies in the literature which can be usefully compared to this
dendrogram. An extremely large number of works, such as (Ushioda, 1996; Powers, 2001;
Schutze, 1998; Finch, 1993) produce dendrograms of words and note the POS structures that
are being formed as a side-effect, but I am not aware of any attempts to produce such a den-
drogram directly. This makes evaluation a little more complex since we have no reference
points, but overall the dendrogram looks quite good.
Returning to neural networks, the first test of the tags is to attempt to learn a mapping
between their vector representation and an enumeration of all tags, much as we did for
words in Section 7.5.1. The results from this test were remarkably successful. With only
fifty training patterns the network would be expected to learn the mapping; what was
surprising was that the network was able to generalise from only fifty training points. The
neighbours generated using this method are significantly better than those generated using
the Euclidean approach.
[Figure 7.3: Dendrogram of tag representation, plotting tags by cluster distance]
A more neural-network oriented approach for deriving tag vectors would have been to
implement a simple predicting neural network, much like a POS tagger that has no words,
and then use the hidden units for the representation. This approach was not taken prin-
cipally because I already had perfectly working code for generating vectors from words
and so it was easier to reuse the more complex but existing code than to build the simpler
approach from scratch.
7.7 A vector representation of nonterminals
As with tags and words, it is necessary to derive a vector representation for nonterminals.
Like tags, a raw enumeration would be possible but is perhaps undesirable since there are
close to one hundred nonterminals. Also, it would be good if, for example, NP and NPB
have a similar representation. One problem with deriving this representation is that the par-
ent of a word is its part-of-speech tag, which makes it necessary to treat all part-of-speech
tags as nonterminals. This is a problem because it doubles the number of nonterminals,
significantly increasing the complexity of building the nonterminal representation. Another prob-
lem is that while the method for deriving tag representations was fairly obvious since tags
occur sequentially, the method for deriving nonterminal representations is not obvious. Re-
viewing the literature I was unable to find anybody who had attempted to derive a vector
representation for nonterminals, so this representation can be considered a somewhat naïve
first step.
Ideally the vectors should have similar representations for nonterminals when they are
interchangeable. Furthermore, it is known that the bigram algorithm used for tags and
words produces similar representations for events with similar bigram counts. Based on this,
the approach is to derive an event corpus which lists all nonterminals used as defined by
Collins’ event concept. For example, Collins looks at unary events, where a nonterminal
selects its parent, so all such events are extracted from the WSJ, simplified to just include
the nonterminals, and stored in a table. There are six such tables: head → parent, parent
→ head, head → left (adjacent), head → right (adjacent), head → left (non-adjacent), head
→ right (non-adjacent). This approach generated about five thousand events, and output of
acceptable quality, as shown in Figure 7.4.
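The reduction of events to nonterminal co-occurrence tables can be pictured with the following sketch; the event-tuple format shown is invented for illustration and is not the parser's actual event file format.

from collections import Counter

def nonterminal_tables(events):
    """Reduce simplified Collins-style events to nonterminal co-occurrence tables.

    events: an iterable of tuples (kind, head_nonterminal, other_nonterminal),
    where kind is one of 'parent', 'left_adj', 'right_adj', 'left_nonadj',
    'right_nonadj'. Returns the six tables described in the text."""
    tables = {
        'head->parent': Counter(), 'parent->head': Counter(),
        'head->left_adj': Counter(), 'head->right_adj': Counter(),
        'head->left_nonadj': Counter(), 'head->right_nonadj': Counter(),
    }
    for kind, head, other in events:
        if kind == 'parent':
            tables['head->parent'][(head, other)] += 1
            tables['parent->head'][(other, head)] += 1
        else:
            tables['head->' + kind][(head, other)] += 1
    return tables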
However, the importance of a good nonterminal representation means attempts were
made to improve the results. This was done by asking a linguist4 to hand encode something
approximating the desired output and using this as extra training data. Hand encoding a full
vector representation proved too difficult, but instead a simple classification of tags was
made, which is presented in Table 7.7. For nonterminals, a more complex classification was
developed and this is given in Figure 7.5. Note: this figure includes the tag classifications
just mentioned.

4 Thanks to my supervisor, Alistair Knott.

[Figure 7.4: Dendrogram of nonterminals produced using only unsupervised training]

Category Members
Nounish CD NNP DT NNPS IN NNS JJ POS NN
Verbish , quotes CC VBD RB DOT VBN TO PRP VBP UH VB MD VBZ VBG EX
Other FW JJR JJS LS PDT WDT RBR symbols RP WRB RBS WBR WP
Table 7.7: Hand encoded categories for POS tags
Figure 7.5: Hand encoded representation of nonterminals
Encoding this tree into the computer is relatively easy, especially by representing it as
a directory tree, which means membership can be computed by running built-in com-
mands and reduces the programming required to simple shell scripts. Once encoded, the data
cannot be integrated by simply adding the information as extra features because it would
result in too many features. Instead, we add the information as an extra type of event and so
can use PCA to choose what information to keep. This proved to work very well at detecting
and removing redundant information. The standard deviations showing the importance of
different dimensions do not have a sudden drop like the tags do, making it hard to decide
the number of dimensions to keep. The first five dimensions capture approximately forty
percent of the variations, the first ten capture approximately fifty percent, the first fifteen
capture almost sixty percent, and the first twenty capture about sixty-four percent. While
this is clearly diminishing returns, it is not obvious where to draw the line, and so fifteen di-
mensions were chosen, as fewer dimensions appear to produce a worse dendrogram. Since
there is no similar work in the literature, it is impossible to compare this result to others.
Overall it appears the approach works quite well.
Returning to the now familiar neural network enumeration test, we are able to learn the
nonterminal mapping in just nine seconds (two thousand epochs), with the resulting net-
work containing sixteen hidden units. This test was as successful as the others: all nonter-
minals mapped correctly, and the network showed some good generalisation, such as the nearest neigh-
bour to a gapped sentence being a prepositional phrase, or that nonterminal complements
are more similar to other nonterminal complements than to nonterminal adjuncts.
In conclusion, results are better than would be expected for a first attempt at a vector
representation of nonterminals. Though they are not as good as the tag representations,
they are likely good enough for our purposes.
7.8 Neural network design
We have now shown that the quality of our input is high enough for processing by a neural
network. This still leaves us with two tasks before we can begin to replace genprob. The
first of these is deciding what to use for training data, and the second is deciding which parameters
we are going to use. These will be discussed below.
7.8.1 Training data
There are two obvious sources of training data for a neural network parser: the raw event
file or actual output from genprob. Using genprob would be easiest — it is a function that is
called many times and so simply logging its inputs and outputs will provide an unlimited
amount of training data. It would be preferable if we could use the event file directly since
we are trying to improve on genprob, which is very hard to do when genprob is the training
data being used.
However, it is not obvious how to use the event file for training. Logging from genprob
will lead to training instances in a format similar to event → probability, the exact format
necessary to replace genprob. Using the raw event file would instead lead to training in-
stances similar to rhs → lhs. Converting this into a probability would require something
like placing an enumeration of all possible lhs on the output and using a winner-take-all
learning strategy.
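Logging genprob amounts to wrapping the function so that every call is recorded. The sketch below assumes a Python-callable stand-in for genprob (the real function lives inside the parser) and an invented log format.

import csv

def logging_genprob(genprob, log_path):
    """Wrap a genprob-like function so every (event, probability) pair is written
    to a log file, giving event -> probability training instances for the network."""
    log_file = open(log_path, 'a', newline='')
    writer = csv.writer(log_file)

    def wrapped(*event):
        p = genprob(*event)
        writer.writerow(list(event) + [p])
        return p

    return wrapped

# Usage sketch: genprob = logging_genprob(genprob, 'genprob-training.log'), after
# which one parsing run over the training sections yields as much data as required.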
Even if we use the winner-take-all approach, there are still a number of serious prob-
lems. The number of outputs is infeasibly large in places; for instance dependency events
generate a word, tag, and nonterminal, giving 50000 × 50 × 100 possible outputs, several orders
of magnitude higher than any network can manage. Restricting ourselves to events seen
during training would reduce the number of possible outputs to an acceptable number but
would prevent the parser generating novel combinations. Even if we are able to somehow
represent the outputs (perhaps using three neural networks where each one generates a
component of the event, analogous to Collins generating the tag and nonterminal before the
word), we run into the problem that we are training with only a tiny number of positive
instances. Ordinary training algorithms are unable to learn under these circumstances and so
we would have to investigate nonstandard training techniques (for example pseudo-events
or training in parts).

[Figure 7.6: Dendrogram of the representation of nonterminals]
Since I was unable to design a network that used the event data for training, I was forced
to use the simpler logs from genprob. The main goal is thus to test the feasibility of imple-
menting a statistical backoff system using a neural network, rather than to make an improve-
ment on the state of the art. Even if this is less than ideal, it might nonetheless be that the
network is able to learn to interpolate better than the algorithm which provided its training
data.
7.8.2 Neural network parameters
Even having decided to train from logged events, there are a number of details that need to
be worked out, and these will be discussed next.
Multiple networks for different genprob calls Recall that Collins’ genprob function is
called with different numbers of parameters for different types of event. It makes sense to
reproduce this by creating a different neural network for each event type. Thus we need
six separate networks, for tag events, unary events, prior events, subcat events, dependency
events and top events. The tag network corresponds to my POS tagger and is used here
as it was in the parser largely as a prototyping tool. (If Collins’ model three were to be
implemented, then an extra network would be needed to generate gapping information.)
Many zero outputs We have already seen that genprob is intolerant to noise. Another
major concern is that genprob is not at all Gaussian in shape. Almost 99% of the calls to gen-
prob result in a generated probability of zero. Even amongst non-zero values, the output is
non-Gaussian, as is shown in Figure 7.7. This graph is best categorised as a bimodal, Poisson-like
distribution: most of the time we should produce a value of approximately
zero, while in about ten percent of the remaining instances we should produce a value of
approximately one.
Representation of input units The vector representation for tags, nonterminals and words
has already been described. If any of these turn out to be insufficiently precise then they
can be supplemented with an enumeration; supplementing them would, however, have the
disadvantage of greatly increasing the number of units. The best input representation thus
needs to be determined empirically. For subcategorisation frames and the distance metric,
the representation used will be discussed where these are used.

[Figure 7.7: Density of the non-zero probabilities produced by genprob (outputs of zero excluded; N = 25981, bandwidth = 0.05026)]
Single vs multiple output nodes The genprob function only has one output, the proba-
bility. However, it is equally possible for the neural networks to have many outputs, with
each output corresponding to one possible value that the parser could be generating. For
instance, one value could correspond to generating a parent of SBAR, while another cor-
responds to S. In practice I found some networks favoured a single output, while others
favoured multiple outputs. These will be discussed in the tuning section (Section 7.9).
Activation function Hidden units use sigmoidal activation while output units use asym-
metric sigmoidal activation.
Number of hidden units A maximum of six hundred hidden units is permitted for each
network. It is more likely that this is too many (so permitting overfitting) than that it is too
few. Later we found this was much too many, and most networks use only a tiny fraction of
the permitted six hundred.
Number of ‘candidate’ units Recall that a candidate unit is a potential hidden unit con-
sidered by Cascade. Due to concerns that with so many training epochs we could easily
have weights going to infinity, sixteen candidate units are permitted instead of the default
of eight.
Volume of training data Even having decided every single property of the training data,
one critical aspect that has not been addressed is, how much data should we use? Since
training data is derived from genprob, it is available in virtually limitless quantities. The
rule of thumb that ‘more is better’ does not apply to neural networks. Increasing the amount
of training data will effectively increase the amount by which weights change every epoch, which
can easily cause a network to fail at learning a task in much the same way as setting the
learning constant too high will.
Unique training data Recall that training data is generated by sampling the genprob func-
tion. Since genprob is frequently called with the same parameters, this means the sample
has a significant number of duplicate entries. For instance, a random sample of one hun-
dred thousand calls resulted in just over six thousand unique calls. This may be a good
thing because it will encourage the neural network to assign more importance to data that
occurs frequently, which is why it has not been discussed up until now. It may also be a bad
thing, since it will both cause overfitting and will result in training on just a tiny sample of
the space at the optimal training size of around one-hundred thousand training instances.
Incremental training A technique for training a neural network with a large amount of
data is to partially train the network with a sample of this data first, to produce a rough
approximation of the weight space, and then incrementally train the resulting network with
larger samples of the data. This approach may be necessary for some of our larger networks.
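This regime can be summarised as follows, assuming a train function that can resume from existing weights; this is a schematic outline rather than the Cascade-specific code.

def incremental_training(train, data, sample_sizes=(10_000, 50_000, 200_000)):
    """Partially train on a small sample to rough out the weight space, then keep
    training the same network on progressively larger samples of the data.

    train(sample, initial_weights) is assumed to return the trained weights; it
    stands in for a Cascade run that can start from previously saved weights."""
    weights = None
    for size in sample_sizes:
        weights = train(data[:size], initial_weights=weights)
    return weights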
Number of training epochs The maximum number of epochs for training connections to
and from each candidate hidden unit was set to five thousand. While many values timed
out before stagnating at five thousand, it was decided that increasing the value would be
as likely to hinder the networks with false progress as it would be to assist with premature
locking of the weights.
Validation data Cascade supports testing data to detect when overfitting is outweighing
the learning of the basic function. This is implemented by running the test set on the inter-
mediate network and outputting the current error index. This facility is used throughout
the results, with the same amount of test data as training data. It would possibly have been
better to use a consistent amount of testing data instead, but varying the amount makes it
easier to cope when the functions being simulated do not have so much data. The validation
approach is somewhat simpler than cross-validation, but again we are looking for successful
networks rather than proof of consistent repeatability.
Error measure The difference between measuring error as an index or in bits has already
been discussed for the enumeration networks in Section 7.5.1. There we decided ‘bits’ were
appropriate since we wish to get binary values correct and are concerned about overfitting.
Here however, ‘index’ is often appropriate since we are both dealing with real-valued data
and we often wish to get very high accuracy on common training instances. We will swap
between these measures for the different networks as appropriate.
Victory criterion The victory criterion is not set due to concerns about overfitting.
Evaluation of trained networks The error measure is not a perfect measure of the quality
of the network, so it is useful to consider other (more accurate and expensive) methods
of evaluating a trained network. A very useful visual aid is a scatterplot of the network’s
output against the output of genprob which it was trained to reproduce. This can help
identify where the network makes its mistakes. Another evaluation method is simply
to run the parser using the network as a replacement for genprob, and see what its effect on
precision and recall is. Obviously both procedures are very time-consuming; they cannot
be used as part of the process of training, but they are useful in helping to decide between
alternative network designs and training regimes.
Evaluation of reproducibility Normal practice in neural networks is to test that results
are reproducible by performing multiple trials. There are two reasons this is not performed
here. Firstly, we are not so much interested in reproducible results as we are in a trained
network, and secondly at several days to train each network it is impractical to wait the
weeks that multiple trials would require.
7.9 Training the tag network
As in the reimplementation of Collins’ parser, I began by considering the task of POS tag-
ging, since this task has all of the complexity of the other parsing tasks, but can be inde-
pendently evaluated. My tag network was therefore a prototype for my other networks;
accordingly, I will motivate its design and training regime in some detail, and present the
other networks more succinctly. In addition, I include discussion of techniques here that are
not necessary for the tag network, such as incremental training, because they can be most
accurately evaluated on the tag network.
Recall from Section 4.4 that the tagger is a simple HMM-based POS tagger. It does not
include the argmax that is standard in any high-performance tagger (see for example Char-
niak, Hendrickson, Jacobson, and Perkowitz (1993)) because it was developed to fit with
the parser as closely as possible rather than as a high performance tagger. The probability
model uses the previous two tags, the current word and the current tag. The exact number
of input and output units this corresponds to in the neural network depends on a few de-
sign decisions. If tags have an enumeration of nonterminals included and a single output
node then this gives a total of four hundred and forty units. If the single output node is
replaced with multiple outputs then this eliminates the need for one of the input tags and so
reduces the network’s size to three hundred nodes. If tags do not include an enumeration
of nonterminals then these values drop to one hundred and sixteen and one hundred and
fifty nodes respectively. So, the choice of representation has a huge effect on the complexity
of the neural network.
Actually integrating the neural network into the parser is quite simple. The distribution
of Cascade already includes code for performing feed-forward operations, and so all that
is required is to encapsulate it into a class so as to hide its global variables from the rest
of the system. A little extra work is required for each operation in that the input is in the
parser's internal format rather than the vector format, but a simple lookup suffices for the
transformation.

[Figure 7.8: Plot of errors (error index) in the tag network against units used]
Before considering alternative network designs and training regimes, we will first present
a single network all the way through to evaluation. This will make it much easier to contrast
alternatives later.
7.9.1 The initial tag network
The first network evaluated uses only tag vectors (no enumerated tags). It uses twenty
thousand non-unique training instances, corresponding to roughly twelve thousand unique
instances. Training was given a victory criterion of 0.10. A graph of the error reducing as
hidden nodes are added is included in Figure 7.8. This figure has a number of interesting
properties. Firstly, the error consistently decreases with time, and while it is approaching
an asymptote, it is not approaching it especially fast, as we would expect given that the
problem is particularly hard. There are two conclusions we can make from this: firstly, it
means the problem surface is amenable to neural network learning, and secondly it appears
that the problem simply takes a long time to learn. This is, of course, concerning for our
other networks since they are up to twice the size, and somewhat more complex.
As can be seen from the graph, victory was not quite achieved with the final network
using six hundred hidden nodes for an error index of 0.1054. Even partially training the
tag network is an extremely slow process. Just loading the training data file before learning
begins takes over a minute, and additional units are trained and added to the network in ap-
proximately linear time, so the six hundred node network took seventeen hours to build on
a two gigahertz AMD CPU. Curiously, using a one gigahertz IBM CPU instead resulted in
almost identical training times. Whether this is because of the IBM machine’s better mem-
ory bandwidth, the use of AltiVec, or other issues was not investigated, since we are only
interested here in the final network.
Since the network was unable to reduce the error to zero, it is obvious that there are some
values which the network is still getting ‘wrong’. A single error index is not an especially
useful indicator of these, so in order to get a better visual picture, a scatterplot of the output
of the network and the values genprob produces is presented in Figure 7.9.
Figure 7.9: Scatter plot of output in the tag network using six hundred
hidden units against genprob’s output
This figure contains some interesting information — for instance the odd horizontal lines
between 0.2 and 0.45 (indicating that the network generates certain values for a relatively
large range of inputs) and the greater error at the extremities of the function. Certainly it
seems that the basic shape of the function has been learned correctly. However, recall that
there is a much more effective method of evaluating the network, which is simply to test
how well it works at replacing genprob. As described in Section 4.4, the genprob-based
POS tagger obtains 93%, and as a baseline figure the tagger can obtain an accuracy of 89.4%
without the use of any historical context. The tag network just described obtains an accuracy
of 87.6%, lower than that which can be obtained by simply using each word’s most common
tag.
Clearly, we need to experiment with alternative architectures and training regimes. These
experiments are presented in Sections 7.9.2 and 7.9.3.
7.9.2 Network architecture
In this section we will select the input format, the number of hidden units, and the output
format.
Number of hidden units
The first trial was to vary the number of hidden units in the tag network. My hypothesis
was that there were too many hidden nodes in the initial network, leading to overfitting
problems, and that a network with fewer hidden nodes would perform better. To test this,
I used the same input and output representations as in the initial network, increased the
number of training instances from twenty to fifty thousand, and experimented with different
numbers of hidden nodes. In Table 7.8 I present the effect of adding hidden units on the
tagger's accuracy, along with the network's error.
Tagger Network error Tagger accuracy
10 units 0.2269 0.9374
20 units 0.2182 0.9363
40 units 0.2025 0.9343
60 units 0.1910 0.9332
80 units 0.1816 0.9327
120 units 0.1644 0.9311
160 units 0.1489 0.9299
Genprob N/A 0.9299
Table 7.8: Tagger accuracy as hidden units are added to the neural net-
work
This result is very interesting. Firstly, I have shown that the minimally trained neural
network generalises better than genprob and so obtains a higher accuracy than its training
data. This is an extremely promising result since the parser will be attempting to perform
a similar generalisation. Another important result is that while the neural network was
continuing to improve, this was having no beneficial effect on the accuracy of the tagger.
It may seem counter-intuitive that we should stop training long before the error reaches a
plateau, but the reason this is correct can be seen by examining the error index in the held-off
test set. For this data, the error drops rapidly to 0.25 after ten hidden units, and then slowly
decreases for about one hundred hidden units before beginning to rise. The conclusion
from this is clear: Cascade will begin to overfit after just ten hidden units, and after twenty
or so hidden units the benefit of further training is outweighed by the inflexibility caused
by over-fitting. A graph showing the error index of both the training data and the testing
data is presented in Figure 7.10.
[Figure 7.10: Graph of the training set and the test set error (error index) as hidden nodes are added to the tag network]
To see exactly what the network has learned after twenty units, a scatterplot showing the
output of the network when compared to genprob is presented in Figure 7.11. This figure
shows that the network has learned when to produce a one, and when to produce a zero,
but within these boundaries pays only lip-service to genprob. We again see the horizontal
lines in this figure, so they are likely a property of using Cascade rather than a quirk in the
initial scatterplot. Another interesting property to note in this figure is that the network has
trouble producing values very near to one. This is possibly caused by a hack in the Cascade
code for sigmoidal activation functions which says if the output is near one it should be
treated as one. (A similar hack rounds near-zero outputs to zero, but there is much more
training data in the near-zero case, which helps it maintain accuracy in this region.)
Since almost all outputs from genprob are zero, the network is able to get a relatively
low error rate just by getting these values correct. This is of concern for the parser since
occasional catastrophic errors are known to be problematic. If necessary, the initial network
shows that over-training can be used to eliminate these occasional large errors in exchange
for very small errors and no generalisation.

Figure 7.11: Scatterplot of output from the tag network against genprob's output, using just twenty hidden nodes
It is hard to see through the cloud of central values in the scatterplot. To determine if the
core shape of the function is correct, a density plot is presented in Figure 7.12. A density plot
is identical to a scatterplot except instead of plotting individual points, a colour is displayed
whose brightness is determined by how many points are nearby.5 This figure clearly shows
that the core of the function has indeed been learned after just twenty nodes; there is a clear
(if blurred) leading diagonal in the graph.
Error measure
We have been using the index method of assessing the network’s error because it discour-
ages many small errors. The reason these errors are concerning is that the initial test of noise
in genprob (Section 7.1) showed even small amounts of noise are unacceptable. However, the
bits error method proved more effective than the index error method for the enumeration
networks since it has much less of a problem with overfitting.
5 I was unable to find code to derive a density plot, and so I wrote it myself. My code views the data as a
three-dimensional surface where the height is the log of the number of points in the scatterplot within a certain
radius. More sophisticated approaches where points are assigned a weighting based on their distance would
also be possible but have not been explored here.
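A reconstruction of that idea, using binned counts as a stand-in for the within-radius count and a log scale for brightness, might look like the following; it is a sketch and not the original plotting code.

import numpy as np
import matplotlib.pyplot as plt

def density_plot(genprob_outputs, network_outputs, bins=100):
    """Density version of the scatterplot: brightness is the log of the number of
    nearby points, so the dense cloud of central values becomes visible."""
    counts, _, _ = np.histogram2d(genprob_outputs, network_outputs,
                                  bins=bins, range=[[0, 1], [0, 1]])
    plt.imshow(np.log1p(counts).T, origin='lower', extent=[0, 1, 0, 1], cmap='gray')
    plt.xlabel('genprob output')
    plt.ylabel('network output')
    plt.show()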
Figure 7.12: Density plot of output from the tag network against gen-
prob’s output, after twenty hidden nodes have been added
We evaluated the tagger network using the bits method and were able to obtain only
slightly lower performance than that derived using the index method. With ten hidden units,
performance dropped from 93.7% to 93.4%, and for twenty hidden units, performance dropped
from 93.6% to 93.2%. Since it is likely the bits method will cause more problems in the parser
and it leads to lower performance, we will not be using the bits method. However, it is worth
noting that performance dropped much less than expected.
Representation of the input layer
So far we have not included an enumeration of tags on the input layer because the larger
network takes longer to both learn and run once trained. However, it is well worth including
if the final accuracy is higher. I therefore experimented with a network which has enumer-
ated tags in the input layer. This raises the number of input nodes from 94 to 324, with a
corresponding doubling in training time.
The error index for any given number of hidden nodes appears to slightly favour enu-
merated tags. Additionally, the enumerated network asymptotes to a significantly lower
total error (0.060 instead of 0.1). However, neither of these details is particularly important:
by any other metric, such as total number of nodes, number of weights, or training time, the
enumerated network has a higher error rate. Further, we have already shown that networks
with a large number of hidden nodes actually perform worse. The key test is how well it
tags, and the answer to this is (within statistical error) identical. Since the enumeration is
not helping, the idea was abandoned.
Representation of the output layer
The next test is whether a single output should be used, instead of the multiple outputs that
have been used until now. If a single output is used then the target tag will be placed on the
input layer, and the output will correspond to its probability (much as genprob takes the tag
as input and generates a probability). Because this will increase the number of training in-
stances by a factor of fifty without significantly speeding up each epoch, its effect on training
time is problematic. The reason this test is important is that the dependency network cannot
enumerate all possible outputs and so we need to investigate the feasibility of placing the
output on the input layer with a simpler test case.
There is one other major benefit of using a single output representation: the probability
no longer has to be generated in one step. Recall that the distribution of probabilities in
genprob is far from normal, with an extremely high number of zero outputs, quite a lot of
one outputs, and just a few values in between. Essentially, the network is initially trained
to produce zero, then learns under which circumstances to produce one instead of zero, and
finally learns when to produce some other output. This is a lot of steps for the network to learn, and it
makes sense to split training instead into two subtasks: deciding if the output is zero, and
if not deciding what the output should be. This technique of splitting complex tasks into
logical subtasks has been shown to be very effective in solving complex problems in neural
networks (see for example, Rueckl, Cave, and Kosslyn (1989)), although it is more normal to
combine the multiple networks in parallel rather than serially.
Testing the splitting hypothesis was done by training both sub-networks. The non-zero
network was able to learn quite accurately, as was the discriminator network. When the
discriminator is supposed to predict zero it does so two-thirds of the time, and when it
is supposed to predict non-zero it does so over 99% of the time. (Training patterns were
weighted to discourage false zeros since the non-zero network correctly produces zero about
95% of the time, as shown on the density plot.)
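In outline, the serial combination of the two sub-networks looks like the function below; discriminator and value_network stand for the two trained Cascade networks, and the 0.5 decision threshold is an assumption made for the sketch.

def split_genprob(event_vector, discriminator, value_network, threshold=0.5):
    """Serial combination of the two sub-networks: first decide whether the event
    should get a probability of zero at all, and only if not ask the second
    network what the value should be."""
    if discriminator(event_vector) < threshold:   # the 'is it zero?' network
        return 0.0
    return float(value_network(event_vector))     # the non-zero value network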
However, integrating the networks with the tagger was less successful. Recall that the
tagger’s accuracy with multiple outputs is around 93%. With a single output in the net-
work, and using a similar number of training instances, tagging accuracy dropped to just
5% (about twice as good as guessing randomly). Investigating why this is the case, a hack
was added to the code causing the probability estimate from the non-zero network to be re-
placed by an estimate from the multiple-output network. This gave an acceptable accuracy
of 92.7%, showing that all the problems are in training the non-zero network.
It is likely the plummeting accuracy is due to the significantly reduced training data. To
test this, the network was trained with half a million training instances (the limit of Cascade,
and yet corresponding to just ten percent of the training data with multiple outputs). This
test resulted in a final accuracy of 89%. Based on this result we conclude that using a single
output is undesirable.
Another potential benefit of using multiple outputs is that it becomes much easier to
ensure that all probability distributions sum to one. Curiously, I found the inclusion of this
normalisation resulted in a slight drop in performance (about half a percent) so, while it is
desirable from a theoretical perspective, it has been left disabled for performance reasons.
Vector length
The length of the vectors has been based on the output from SVD. When SVD sorts the
dimensions, it outputs not just the sorted dimensions, but the amount of information con-
tained in each dimension. Using this information, I have cut off the vectors at the point
where the information content of new dimensions drops significantly.
However, it could well be that this approach is either too conservative or too liberal. For
example, the tag vectors are likely to be extremely important in the tag network, so it may
well be that we need the less informative dimensions. However the words are probably
principally used as keys in a lookup table, and so we may well only need twenty or so
dimensions rather than fifty.
Due to time constraints, the effects of vector length on the tagger’s accuracy were not
investigated. Instead we simply rely on the output from SVD.
7.9.3 Training data
In this section we will decide on exactly how to choose the training data, how much to use,
and how to use the training data to train the network.
Amount of training data
The networks presented so far have been trained with fifty thousand training instances. In
Table 7.9 I compare the tagging accuracy as I vary the amount of training data.
Tagger Network error Units Tagger accuracy Training time
10k 0.1873 20 0.9295 12m
20k 0.2124 20 0.9347 17m
50k 0.2182 20 0.9360 1h
100k 0.2212 20 0.9396 2h10m
200k 0.2241 20 0.9353 8h58m
500k 0.2205 40 0.9359 10h15m
Genprob N/A N/A 0.9299 N/A
Table 7.9: Tagger accuracy as extra training data is provided to the neural
network
This table shows that we can continue to see performance improvements in the tagger
up to about one hundred thousand training instances. Beyond this, results are less obvious
since they depend more on the number of hidden units used, but they do not appear to be
getting better.
Incremental training
For the larger networks, such as dependency, it will not be feasible to train them on all of
their data right from the start. Therefore in Table 7.10 I present the same experiment as in
Table 7.9, but this time the training starts from the learned weights of the previous network
rather than from scratch. In the neural network literature this is known as serial learning and is
generally an extremely ineffective approach. Since my data is much less contradictory than
is typical in neural networks, I am hoping for more success.
The publicly available implementation of Cascade does not support loading and sav-
ing of weights during training, but I implemented it using only slight modifications to the
source code. A side effect of how Cascade trains requires each of these networks to have
more hidden units than the previous row, which quickly leads to very large networks.

Tagger Network error Units Tagger accuracy Training time
10k 0.2011 10 0.9272 7m
20k 0.2130 20 0.9311 16m
50k 0.2150 30 0.9355 36m
100k 0.2116 40 0.9367 1h20m
200k 0.2283 50 0.9360 2h45m
500k 0.2108 60 0.9353 4h50m
Genprob N/A N/A 0.9299 N/A
Table 7.10: Tagger accuracy as extra training data is incrementally provided to the neural network

While these results are not especially interesting for the tagger, they are important for the parser.
It means that for the huge and very complex networks such as dependency, we can safely
train the network on a small number of training instances and then increase the training
data. Another interesting result is that ten neurons appear to be not quite enough to correct the
errors from the previously trained network, implying that we have been overfitting the data.
An alternative interpretation of incremental training is to train a second network on the
output from the first network since the first network outperformed genprob. Roughly, the
idea is that generalisation can be bootstrapped. Predictably, however, this approach did not
work, with performance dropping to 89%.
Using unique training data
Training data has so far been generated by sampling the genprob function. Since the gen-
prob function is frequently called with the same parameters, this means the sample has a
significant number of duplicate entries. A sample of one-hundred thousand calls resulted in
just over six thousand unique calls. This may be a good thing because it will encourage the
network to assign more importance to data that occurs frequently, which is why it has not
been discussed up until now. It may also be a bad thing, since it will both cause overfitting
and will result in training on just a tiny sample of the space at the optimal training size of
around one-hundred thousand training instances.
To determine which hypothesis is correct, we have tested the network when using only
unique training data. Unfortunately, performance dropped from ninety percent to just thirty
percent, and the absolute error in the network also jumped. Because of this, the network
was trained further, until the absolute error reached a value similar to before. This produced a more
acceptable 86.5% accuracy after thirty hidden units. While this is much worse than the
network with duplicate data, it can be shown to make fewer gross errors and so we do not
want to abandon the idea of unique data quite yet.
Recall that the probability distribution from the tagger means it must produce zero in
most instances, and in instances where it is to produce nonzero it must usually produce
one. However, in the unique data we have far fewer probabilities of zero (just 70%).
Therefore it is likely the unique network was being less careful about these zero values than
the network with duplicate data. This is confirmed with a scatterplot in Figure 7.13, which
shows the network does quite well at all areas of the input rather than concentrating on zero
and one.
Figure 7.13: Scatter plot of output from the tag network against genprob’s
output, using unique training data
As noted previously, the duplicate network does well at zero and one but takes a long
time to accurately map the rest of the function. This scatterplot shows a quite different result
with the network performing much more evenly. Perhaps the best approach is a compromise
where half the data comes from duplicated events and half is unique. The results from this
mixed approach are presented in Table 7.11.
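The compromise sample can be drawn as in the sketch below, which assumes the logged genprob calls are available as a list of hashable tuples; the names are illustrative.

import random

def mixed_sample(logged_calls, n_instances):
    """Build a training set in which half the instances are unique events and half
    keep their natural duplication from the genprob log."""
    half = n_instances // 2
    unique = list(set(logged_calls))                 # one copy of each distinct event
    random.shuffle(unique)
    duplicated = random.sample(logged_calls, half)   # preserves natural frequencies
    return unique[:half] + duplicated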
Clearly we have corrected the initial problem with using unique training data, our per-
formance is close to that with only duplicate data. But what have we gained?

Network Units Network error Tagger accuracy
50k 10 0.2876 0.933
50k 20 0.2763 0.932
100k 10 0.2870 0.932
100k 20 0.2729 0.931
200k 10 0.2351 0.937
200k 20 0.2279 0.929
400k 10 0.2888 0.909
400k 20 0.2748 0.904
Table 7.11: Performance of different taggers using half unique and half duplicate training data

While the tagger's accuracy is slightly lower than we were able to achieve using duplicated training
data, this does not tell the whole story. A scatterplot of this semi-unique data in Figure 7.14
shows the network has much tighter control over all areas of the output rather than just
around zero and one. This approach is therefore the safest to use when we wish to avoid
major discrepancies with genprob. We will therefore mix unique training data into future
networks.
Training with raw data
By far the most ambitious experiment was to eliminate genprob from the loop and train on
the raw event file. We have shown that the tag network can be trained based on genprob.
But can it also be trained based directly on the raw data?
Every line in the event file can be viewed as an event occurring with a probability of one,
and all the alternative tags occurring with a probability of zero. This interpretation gives
exactly the same file format as the multiple outputs from genprob and so it is worthwhile at
least trying to train directly on the raw file. The benefits of successful training are obvious:
any weaknesses in the smoothing function are eliminated, and future extensions do not have
to be simulated in genprob before being converted to a neural network, which is particularly
beneficial if they cannot be simulated in genprob.
The obvious flaw in this approach is that we are training a network to learn a function
mapping (or, technically, a relation) that is impossible to learn. Since the probability model is
necessarily incomplete, a given input will sometimes lead to one output and at other times
will lead to another. We are hoping that the neural network will average between these
outputs, effectively producing a probability. Setting this concern aside, converting the event
file into neural network training data was straightforward, requiring just a tweak to the code that adds
unary events to the hash-table.

Figure 7.14: Scatter plot of output from the tag network against genprob's output, using a mix of unique and duplicate training data
There are one million training events available in the training file (because every word
generates a new event). Of these, a quarter of a million are unique, and the rest are duplicates.
The only way, then, to include all unique events and still have a representative sample of du-
plicates is to go to the largest possible neural network (400k training patterns). Surprisingly,
this network was very fast to train; after three hours it had added ten hidden units and was
only about 0.01 away from the asymptote with an error of 0.346. Testing this network results
in a final accuracy of 94.8%. This is the best result achieved and so it is very positive that it
occurs with our most ambitious experiment.
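The conversion from the raw event file to training patterns can be sketched as follows: each observed tagging event gets a target of one on its own output unit and zero everywhere else, and repeated contexts with conflicting tags are left for the network to average over. The function and file format shown here are illustrative, not the parser's actual code.

import numpy as np

def raw_events_to_patterns(events, tag_list, encode_context):
    """Turn raw tagging events into multiple-output training patterns.

    events: iterable of (context, observed_tag) pairs from the event file.
    encode_context: maps a context to its input vector (word and tag vectors).
    The observed tag's output unit gets a target of 1.0 and every other tag gets
    0.0, so repeated contexts with different tags are averaged by the network."""
    tag_index = {t: i for i, t in enumerate(tag_list)}
    X, Y = [], []
    for context, tag in events:
        X.append(encode_context(context))
        target = np.zeros(len(tag_list))
        target[tag_index[tag]] = 1.0
        Y.append(target)
    return np.array(X), np.array(Y)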
7.9.4 Conclusion
The vector representations seem to be adequate for learning neural representations, which
means we should be able to train the other networks to replace genprob. The network’s
output was better than genprob in most of the tests, so it seems likely we will be able to
improve on genprob in the parser.
Apart from a feasibility study, the goal of this section was to determine good parameters
for training later networks. In this regard we found that multiple outputs outperformed
single outputs, and so we will favour them whenever possible, but that the inclusion of
enumerated inputs is unnecessary. The index error method seems the most appropriate, but
the bits method is almost as good.
For the training data, we found that around one hundred thousand training instances was
optimal, although this number can be varied quite significantly without a large change in
accuracy. Just twenty hidden units seems more than adequate, with a near-optimal number
being easily derived from the local minimum of the error index on the test data. We also
found that, where necessary, we can save a lot of training time at only a slight error cost by
training incrementally. Finally, we found that using genprob to derive our training data is
unnecessary, at least for this simple case, and we can obtain better results by attempting to
learn the mapping directly from the raw data.
7.10 Training the other networks
While the tag network is easy to evaluate in isolation, this is not the case for the other net-
works. We will therefore assume that what worked best for the tag network will also work best for
them.
In the previous section we found that while the neural network’s errors will continue
to drop over training, this does not lead to better performance and it is more desirable to
stop training when the evaluation on the cross-validation data approaches its asymptote
(around 20 hidden units). We also found that the more data used, the better results we get,
and so the absolute maximum amount of training data is used in all cases. Because this
amount of training data would make initial learning impossible in several cases, we have in
those cases pre-trained the network on a smaller sample and then tuned it on the full set of
training instances.
There are five different networks to be trained, and their training will be discussed in-
dividually below. In each case we will be comparing the parser against the genprob-based
performance on just the first hundred sentences of Section 23 of the WSJ, which gives a
precision of 85.2% and a recall of 85.6% (remember that precision and recall both drop
slightly over the whole of Section 23).
7.10.1 Training the prior network
The role of the prior network is to predict whether the current edge is likely to be part of
the global parse, or whether it is leading the parser up the garden path. It is perhaps one of
the least interesting networks, since its output does not directly affect the parse, but it is
important in that any errors will make parsing virtually impossible: a poor prior probability
will result in the edge being discarded immediately, while too many high prior probabilities
will swamp the beam and thereby cause parse failure. Because the network is so important,
we need to be more careful than we would normally be in training a network that has only
three hundred units.
Prior probabilities have a very different distribution from other probabilities, resembling a
gamma distribution. Because of this, the zero/non-zero distinction is inappropriate and the
multiple-output approach is best. In the multiple-output approach we must select a single
output to generate a probability distribution over and, somewhat arbitrarily, the nonterminal
was chosen. This leads to a network with two hundred units before the inclusion of hidden
units.
Another complication with the probability distribution is that even in the few instances
where it is non-zero, the probabilities are very close to zero. About 95% of the probabilities
are so close to zero as to be indistinguishable from it, and well over 99% are less than 0.002.
This level of accuracy is likely to be hard to achieve in the neural network. One final
complication is that calls to the prior network are highly redundant; over a million calls
produced just thirty-two thousand unique calls.
Ignoring these concerns, building the network for training is simply a matter of sampling
genprob, just as it was with the tagger. In parsing the first fifteen sections of the WSJ, only
thirty thousand unique training instances were found, so these were all included in training
and supplemented with one hundred thousand duplicate entries to provide a little extra
training data, and an equal amount of testing data.
Training of this network stagnated immediately, with the test data approaching an asymptote
after just five hidden units and reaching a minimum error after thirty hidden units.
When the parser was run with the network version of prior, it was apparent that the beam
was overflowing with too many low-probability nodes, so the parser code was modified
to round approximately-zero probabilities down to zero. With this modification, the parser
achieved 86.0% precision and recall, which is slightly better than the performance of Collins'
genprob version of prior, but the difference is well below statistical significance. The network
obtained similar scores when trained from the raw data rather than from genprob output.
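As a concrete illustration, the modification amounts to clamping near-zero prior estimates before an edge is offered to the beam. The sketch below is mine, not the parser's actual code, and the threshold value is an assumption.

    // Sketch of the rounding modification: prior probabilities that are
    // effectively zero are rounded down to exactly zero, so near-zero
    // network noise cannot flood the beam. The threshold is an assumed value.
    inline double clamp_prior(double network_prior)
    {
        const double effectively_zero = 1e-6;  // assumed cut-off
        return (network_prior < effectively_zero) ? 0.0 : network_prior;
    }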
7.10.2 Training the top network
The top network is somewhat similar to the prior network. As with the prior network it is
simple, with just one word, one tag, and one nonterminal involved in each production. It
also has a similar probability distribution, although it is slightly easier to manage as the
highest probability is 0.1 instead of 0.02.
The role of the top network is to determine when a parse is complete, since the WSJ uses
sentence nonterminals both for complete sentences and for encapsulated sentences. This
does not matter to the neural network, which is only concerned with the inputs, outputs,
and their mapping. In this case there is only one tag, one nonterminal and one word involved
in the network, for a total of two hundred units. Multiple outputs were used with the
nonterminal on the output, so the network has roughly the same number of inputs as outputs.
As with the prior network, the first fifteen sections of the WSJ lead to just thirty thousand
unique calls to genprob, and so these were supplemented with one hundred thousand
duplicate calls to make sixty-four thousand patterns each for training and testing.
When the parser was run using the network version of top, with ten hidden nodes, it
obtained a precision of 85.4% and a recall of 85.8%. Using more hidden nodes resulted in
the same precision and recall, as did increasing the amount of training data. This result is
about the same as that obtained by the hash-table approach; it is slightly lower, but with only
one hundred sentences we cannot say whether the difference is statistically significant. We
also experimented with training the top network directly from the event file, with performance
dropping slightly to 83%. Since our purpose here is to find out what works, we will defer
more detailed analysis to Section 7.11.
7.10.3 Training the unary network
The unary network differs from the top network in three ways. Firstly, it is more compli-
cated; compared to the top network it takes an extra nonterminal (the parent to produce),
leading to about three hundred input units. Secondly, it has a wider range of inputs; the
range of words, tags and nonterminals is significantly higher than that of the prior and top
networks. (Previously we used the full set of unique training instances; for the unary net-
work we have to use a subset of the full set.) Finally, while the top network is only used
once per sentence in sorting the parses, the unary network is used about once per word, so
the quality of the unary network is much more important to the parser than either of the
previous two networks. In effect, this is the first ‘serious’ network we have trained.
For the first attempt, a total of one hundred thousand training patterns were used, fifty
thousand from the unique subset and fifty thousand from the duplicate subset. An equivalent
number of both were used to form the testing patterns. However, watching the training, it
became clear that there was insufficient training data, as evaluating on the test patterns gave
a similar error to the training patterns rather than asymptoting. Even with large numbers
of hidden units, Cascade learnt only the outline of the function, as shown in Figure 7.15.
In this figure, the horizontal banding that has been noted before is much more pronounced.
Despite this, precision and recall of the parser using the unary network were quite good at
84.6% and 84.7% respectively (regardless of the number of hidden units tested). Given the
weakness in the graph, and the fact that cross-validation had not asymptoted, it makes more
sense to increase the amount of training data than to increase the number of hidden units. The
maximum increase is a factor of four, which both uses all the unique data at the current ratio
and is just shy of Cascade's limit (since the network file is just under C's 2GB limit).
However, this increase did not change precision or recall significantly.
A separate property first noticed in this network is a large increase in parsing time. The
parser previously ran at slightly slower than one sentence per second, but when using Cascade
for all unary evaluations the time per sentence increases to roughly five seconds. This is
unfortunate, and will be particularly annoying when evaluating the dependency network, but
there is little that can be done about it. Another interesting property is that the network with
fewer hidden nodes actually ran more slowly. Since the feed-forward operation is slightly
more complex than linear in the number of hidden nodes, this result is somewhat counter-
intuitive, but it is perhaps caused by the increased ambiguity in the less precise network.
Training the unary network from raw events proved significantly harder; with 400k
training patterns it took eleven hours just to add ten hidden nodes. However, it did appear
to learn, with the error index coming out lower than it did with genprob's data. Evaluating
on the one-hundred-sentence corpus gives a precision of 85.1% and a recall of 85.0%,
showing that we have replaced genprob with a neural-network-based function that gives
approximately the same performance. Since this network is not derived from genprob, it is
interesting to graph it against genprob, particularly to see how tightly it follows genprob.
This graph, presented in Figure 7.16, clearly shows that the neural network is reproducing
genprob only in the most cursory manner, and yet we know it performs as well as genprob.
Either unary events are irrelevant, or else we do not have to follow genprob closely to get
good results. If the latter, then the conclusion from the preliminary noise experiment
described in Section 7.1 was incorrect.
7.10.4 Training the subcategorisation network
The subcategorisation network determines the number of arguments each new phrase takes.
Since this is rarely obvious based on just the head, I had anticipated that it would be one of
the hardest networks to train.
Subcategorisation events also introduce a new type of output: how should the set of all
subcategorisation frames be enumerated? The solution I decided to use was to enumerate
frames by the number of noun phrases (up to 6), sentences (up to 2), SBARs (up to 2), verb
phrases (up to 1), and others (up to 1) that each contains. This leads to 252 different outputs,
making this the largest network trained so far.
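To make the enumeration concrete, the sketch below shows one way of mapping a frame's counts onto a single output index using a mixed-radix encoding; the function name and argument order are illustrative rather than the parser's actual code.

    // Counts range from 0 up to their maxima (6, 2, 2, 1, 1), so the radices
    // are 7, 3, 3, 2 and 2, giving 7 * 3 * 3 * 2 * 2 = 252 possible indices.
    int subcat_output_index(int nps, int ss, int sbars, int vps, int others)
    {
        int index = nps;             // 0..6
        index = index * 3 + ss;      // 0..2
        index = index * 3 + sbars;   // 0..2
        index = index * 2 + vps;     // 0..1
        index = index * 2 + others;  // 0..1
        return index;                // 0..251
    }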
Only twenty of these outputs were actually seen during training, but that just means the
neural network learns to always produce zero on the unseen outputs, which is extremely easy
to learn. An alternative approach would be to enumerate the outputs in the order they
were seen during training. This would give the same result and use significantly fewer
connections, but it would mean that any extension to the probability model would break
the enumeration.
Figure 7.15: Scatter plot of output from the unary network against genprob's output, using 100k training patterns and eighty hidden nodes
Figure 7.16: Scatter plot of output from the unary network against genprob's output, using the network trained directly on the raw event file
Training the subcategorisation network was not initially successful, with Cascade timing
out instead of stagnating. After a full day of training and having only added six hidden
units, it finally started stagnating. Even then the network’s training would best be described
as minimal, as shown in Figure 7.17. Predictably, this network does not work well at all,
resulting in parser precision of 51.7% and recall of 52.2%. Clearly we have to do something
different. A first attempt (which proved overambitious) was to try training on the raw event
file, but this caused precision and recall to drop even further (to 25%).
Figure 7.17: Scatter plot of output from the subcat network, trained with ten thousand events and ten hidden units
Recall from the tag network that, where training is too hard, it is possible to train smaller
networks using serial learning. Instead of training directly on half a million patterns, training
commenced with just 2k patterns; the weights from this network were then used to train a
network with 8k patterns, then 32k, then 128k, and finally 512k, as before. This approach
resulted in stagnation, implying that it was working, and led to a final precision of 77.2%
and a recall of 75.8%.
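The schedule itself is simple; the sketch below shows the shape of the serial-learning loop under the assumption of a train_until_stagnation routine standing in for a full Cascade-correlation training run.

    #include <cstddef>
    #include <vector>

    // Placeholder network type; the real parser uses Cascade-correlation nets.
    struct Network { int hidden_units = 0; };

    // Hypothetical stand-in for one round of training on the first n patterns;
    // in the real system this runs until the error on the test set stagnates.
    static void train_until_stagnation(Network& net, std::size_t n)
    {
        (void)n;
        ++net.hidden_units;  // stub: the real routine adds hidden units as needed
    }

    int main()
    {
        Network net;
        // Sample sizes grow by a factor of four; the weights learned on each
        // sample are the starting point for training on the next, larger one.
        const std::vector<std::size_t> schedule = {2000, 8000, 32000, 128000, 512000};
        for (std::size_t n : schedule)
            train_until_stagnation(net, n);
        return 0;
    }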
In summary, the subcategorisation network seems trainable, but is at the limit of what
can be trained using the representation and training schemes we have considered so far.
Exploring variants on these schemes is likely to improve performance, but further variants
will not be considered in this thesis.
7.10.5 Training the dependency network
The dependency network predicts when two phrases should be combined. Since it is effec-
tively generating the new daughter phrase, it is impractical to enumerate possible outputs
and we must resort to the zero/non-zero approach.
This network is by far the most complex. Not only does it have the most parameters (as
many as 566), it is also called at least as often as the unary network. Finally, the parameters
it is called with show much more variation; the training file leads to half a million unique
events, over twice the number of unique unary events. This means that we cannot use our
earlier approach of using the unique information to span the function's range, with a sample
of duplicate information to emphasise the dense areas.
Despite this complexity, it was relatively simple to train the zerotest network. A total of
133k training patterns was used (sampled randomly, with no emphasis on unique patterns)
and training the network until the test set stopped improving took twelve hours and fifty
hidden units. Evaluating these partially trained networks shows that after five hidden units
the network makes the correct prediction about 89% of the time, and with forty-five units
this rises to about 94% of the time.
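The division of labour between the two networks can be pictured as follows; the sketch is a paraphrase of the zero/non-zero approach rather than the parser's actual code, and the 0.5 decision threshold is an assumption.

    #include <functional>
    #include <vector>

    // The zerotest classifier first predicts whether the event's probability is
    // exactly zero; only if it predicts non-zero is the estimator consulted.
    double dependency_probability(
        const std::vector<double>& inputs,
        const std::function<double(const std::vector<double>&)>& zerotest_net,
        const std::function<double(const std::vector<double>&)>& nonzero_net)
    {
        // zerotest_net returns a value near 1.0 when the event is predicted
        // to be impossible.
        if (zerotest_net(inputs) > 0.5)
            return 0.0;
        return nonzero_net(inputs);  // estimate of the non-zero probability
    }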
For the non-zero network, things are not quite so simple and we must plan the representation
more carefully. One possible improvement is in the representation of subcategorisation
frames. The two candidate representations are a simple enumeration of frames, which
requires twenty-one nodes, or a representation of the count of each nonterminal type, which
requires eighteen nodes. The decision about which is best depends on how much data is
shared between different events and, since we do not know how much is shared, we tried
both and found the first representation was slightly better.
Apart from changing the representation, we could alternatively split the training process
into two subtasks, in the same way as Collins does. Collins first generates the dependent
nonterminal and tag ('dep1') and then the dependent word ('dep2'). The problem with this
simplification is that neither dep1 nor dep2 can use enumerated output, and so it turned out
to give slightly worse results.
Despite the additional complexity caused by all the different parameters, training the
network proceeded quite well. With forty thousand training patterns, the network error
approached an asymptote of about 0.2 after twenty hidden nodes. Using the network with
twenty hidden nodes, the parser was very slow. This is because the dependency network
is used as often as the unary network but is unable to benefit from the enumerated output,
which reduces parsing time by an order of magnitude. Setting speed aside, precision and
recall came out at 76.0% and 79.9% respectively. While this is a drop compared to genprob, it
is quite reasonable given the complexity of training the model, and given the less than perfect
word and nonterminal representations. It thus serves as a good proof-of-concept for the
feasibility of a neural network model of genprob.
7.11 Final evaluation
We have been able to replace all calls to genprob with neural networks, but the performance
of those networks has been mixed; some of the networks work as well as or better than
genprob, and some work less well. It therefore makes sense to retain genprob in the latter
cases. We thus need a way of allocating responsibility between genprob and the networks.
There are cases where we can be very confident that the value returned by genprob is
accurate, such as when the numerator and denominator at level one are high. In these cases,
there is no advantage in using the neural network and so, unless the neural network is a
perfect replacement, we do not. By smoothing between the value returned by genprob and
the neural network’s estimate, we can minimise any weaknesses in the neural network. The
equation I used to derive confidence is:
confidence = min(1.0, denom / 50)    (7.1)
This equation was derived by examining the cumulative distribution function of the derived
distribution and adjusting the scaling constant until the hash table averaged a confidence
of 0.5. It is possible that a more sophisticated method would lead to better results, although
empirically I found very little variation.
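Putting the confidence value to work, the smoothing can be read as a simple weighted average of the two estimates. The sketch below reflects my reading of the text (high denominators favour genprob); the names are illustrative, not the parser's actual identifiers.

    #include <algorithm>

    double smoothed_probability(double genprob_estimate,  // hash-table backoff value
                                double network_estimate,  // neural network value
                                double denom)             // level-one denominator count
    {
        // Equation 7.1: confidence in genprob grows with the denominator,
        // saturating at 1.0 once fifty or more events have been seen.
        const double confidence = std::min(1.0, denom / 50.0);

        // Rely on genprob where its counts are reliable; otherwise lean on
        // the network's estimate.
        return confidence * genprob_estimate + (1.0 - confidence) * network_estimate;
    }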
With this smoothing in operation, evaluating the final system on the first half of Section
23 of the WSJ results in a precision of 80.02% and a recall of 80.40%. This contrasts with
genprob, which obtained a final precision and recall of 85.1% and 85.2% on this same test.
Clearly the neural-network-based parser is working, but not working as well as Collins'
parser.
It is also interesting to consider the system's performance on the two hundred 'hard'
sentences featuring rare headwords. For these sentences, the system's precision and recall are
almost identical to its overall figures, at 80.0% and 80.4%. This compares to Collins' 83.0%
and 83.2%. While this result is still lower than Collins', it is interesting to note that
performance has not dropped at all on the harder sentences, where Collins' system dropped
by two percent. This lends some weight to the hypothesis that the neural network approach
generalises better than Collins' parser on rare words. If the networks could be improved to
reach Collins' baseline performance, there are prospects for exceeding Collins' performance
on sentences containing rare words.
Chapter 8
Conclusion
The work described in this thesis has involved the creation of three large programs and
many small programs. Following how various components of these systems tie together
is complicated and so to aid the reader in following the summary, a somewhat simplified
data flow diagram for the entire thesis is presented in Figure 8.1. Relating the figure to the
thesis itself, the rightmost branch corresponds to Collins’ system as described in Chapter
4; the leftmost branch corresponds to the conversion of words to vectors as described in
Chapter 6; the bottom corresponds to the integration of the two principal components by
way of training neural networks as described in Chapter 7; finally the middle of the figure
corresponds to the reuse of the vector code from Chapter 6, with the reuse itself discussed
in Chapter 7.
In the remainder of this chapter, I summarise the work completed in the thesis, and list
some avenues for future work.
8.1 Summary
This thesis began by introducing the topic of statistical parsing, and surveying the current
state of the art. Within the field of statistical parsing I examined a number of different ap-
proaches and noted that the approach taken by Michael Collins obtains the best results.
While Collins’ approach performs well on the WSJ, it is not especially well suited to gener-
alising to other domains and I hypothesised that this problem was due to its backoff algo-
rithm, particularly how it backs off words. I reimplemented Collins’ parser, principally so I
had a parser I knew perfectly that would be suitable for later modifications.
Having settled on word backoff, I surveyed the field of word clustering and decided to
extend one of the oldest approaches, that of Hinrich Schutze, as the most suitable since it
could be scaled to a full lexicon. I reimplemented Schutze’s work, including my extensions
to support a lexicon far too large to process directly, and explored different parameters and
their effects on the quality of results generated.
Figure 8.1: Data flow diagram for the entire thesis (simplified)
Returning to the parser, I incorporated the word clustering first using a simple neigh-
bours approach and then using a much more ambitious neural network approach. The
neighbours approach is simple and assists the parser when generalising, at a small cost in
places where counts are already high. The net effect is approximately zero on the WSJ, but it
is likely that the better generalisation would be helpful in other domains. Replac-
ing probabilistic backoff with a neural network is an approach that has significant potential.
If successful, it would provide innumerable benefits since it would change the process of
developing a language model from a complex process full of compromises to a matter of
simple exploration. It is also a large step towards fully automatic learning in the statistical
parser, which would eliminate our current dependence on the WSJ. While I was able to get
most of the intermediate results necessary, I was unable to get all of them. Until the other re-
sults can be generated, perhaps through better input or better training, the resulting neural
network will not perform as well as the current statistical system.
There are numerous areas in which this thesis incorporated new ideas, and it is the interaction
of all of these that makes it hard to say categorically which aspects of the new system work
well and which need to be rethought. Work was largely divided into
three main areas: reimplementing Collins’ parser, generating word vectors, and integrating
the word vectors. These are discussed in turn below.
8.1.1 Implementing Collins’ 1997 parser
There is very little work in the literature describing the complexities of implementing a
statistical parser; most papers concentrate instead on justifying differences in their probability
models. Despite this, as Bikel (2004) and Klein and Manning (2003) have both shown, the im-
plementation details can easily make more difference to the parser’s performance than the
probability model. In this thesis, I have explained how the parser has been implemented in
enough detail that it can be used as the basis for building another parser or experimenting
with the effects of different implementations. The parser I built performs as well as Collins’
1997 parser.
The implementation itself has a modular design, so it is easy to read and modify. It also
includes several new ideas. Most significant is a new data structure for implementing beam
search that significantly outperforms heap-based approaches for extremely large beams.
This is useful both within statistical parsing, and in other areas of artificial intelligence.
The chart data structure also demonstrates how peculiarities of the problem being solved
can be taken advantage of to provide a much faster implementation. From a programming
perspective, the use of a macro language provides a method for reducing redundancy in the
same way as a function does, in situations where it is impractical to use a function. Finally,
locking data structures so they become read-only in order to detect pointer-related bugs is an
extension over the standard technique of simply locking the pages before and after arrays.
8.1.2 Word and nonterminal representations
While the field of unsupervised thesaurus generation is very extensive, virtually all of the
techniques concentrate on obtaining the best possible results over a small lexicon rather than
developing good representations for all words.
I have concentrated on one existing technique, that of using singular value decomposi-
tion on a matrix of bigram counts. I have demonstrated how to extend this technique for an
arbitrarily large lexicon, and my extension will continue to produce better results as comput-
ers become more powerful. Additionally, I have shown a number of ways that the technique
can be tuned in order to produce different representations, depending on the intended use of
the representations.
This phase also included the development of vector representations for nonterminals
and tags, a task that I have not seen tackled anywhere in the literature. Despite this, the
results showed good generalisations between syntactic structures.
8.1.3 Experiments in using word vectors for backoff
An initial experiment showed that a nearest-neighbours approach to grouping words for
backoff in parsing was basically as good as Collins’ existing backoff scheme, with some
small improvements over Collins for sentences featuring rare headwords. But the more inter-
esting experiment involved the use of neural networks to implement an entirely new backoff
technique.
While neural networks are frequently referred to as function approximators, and are well
known for their excellent interpolation, the idea of using them instead of hash-tables in
backoff is new. It also has considerable merit: neural networks' distributed representations
mean they require only a tiny fraction of the amount of RAM that a comparable hash-table
approach uses. Furthermore, they can
take an arbitrarily large number of inputs and so give the researcher much more flexibility
in designing the probability model. Finally, the problems of deciding how to discard data
are completely eliminated by the use of a neural network. The final neural network version
of Collins’ parser was quite successful; when replacing backoff in the POS tagger, the neu-
ral network performs slightly better than the hash-table based approach, and in the parser
it performs equally well in all but two of genprob’s sub-functions. There is considerable
promise that networks will be able to train directly from raw WSJ events. While it would
be desirable for the neural network to outperform hash-tables in the parser, the close result
justifies the technique used and implies that with tweaks to word representations, network
designs and/or training schemes, it should be possible to outperform genprob.
8.2 Further work
There are many further improvements which could be made to the system described in this
thesis; I outline the most interesting of these below.
8.2.1 Reimplementing Collins
The most obvious and easiest method of significantly improving results would be to replace
Collins’ 1996 parsing model with his later 1999 model. This change is likely to be relatively
simple since the models do not differ very much. However, there are a number of extensions
over Collins’ approach that are worth considering.
Parsing as search
Recall that the parsing algorithm is implemented as two loops. An outer loop iterates over
every possible span, while an inner loop expands edges to add parents and grandparents.
This inner loop uses AI search techniques to find the best edge quickly, but the larger outer
loop does not, leading to a lot of wasted time that is addressed in this optimisation and the
next one.
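For readers who have not recently looked at Figure 3.12, the control structure being discussed has roughly the following shape; complete() and add_singles() are hypothetical stand-ins for the parser's real span-filling routines, and the loop layout is a sketch rather than the actual implementation.

    #include <cstdio>

    // Hypothetical stand-ins for the parser's real span-filling routines.
    static void complete(int start, int end)    { std::printf("complete %d-%d\n", start, end); }
    static void add_singles(int start, int end) { std::printf("singles  %d-%d\n", start, end); }

    int main()
    {
        const int n = 5;  // sentence length (illustrative)
        // Outer loops: every possible span, shortest first.
        for (int length = 2; length <= n; ++length) {
            for (int start = 0; start + length <= n; ++start) {
                const int end = start + length;
                // Inner step: search over ways of building edges spanning
                // [start, end) from smaller edges, then add parents/unaries.
                complete(start, end);
                add_singles(start, end);
            }
        }
        return 0;
    }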
It would be natural to treat the entire parsing process as a single AI search. The reason
Collins used many smaller searches is not clear but may be related to his simplistic imple-
mentation of beam search. Using multiple searches also has the advantage of ensuring the
chart always has some nodes considered at each span, while a single search will concentrate
on spans that are generating many edges. So Collins' approach may be better able to work
with a poor probability model.
Using a single search would require a complete redesign of the internal parsing algo-
rithm. It is possible that Klein and Manning’s (2002) search approach could be used instead,
but this has not been investigated.
Parsing in chunks
A different approach to the same problem is simply noting that the parser spends most of
its time searching for phrases which cannot possibly be part of the final parse. The reason it
does this is that the parser does not exploit the fact that certain parts of the input sentence
must form a phrase (a chunk), and that there is therefore no point searching any span that
crosses these chunks.
I investigated using a chunking parser such as that of Abney, Schapire, and Singer (1999)
before the main parse, to ban any searches which cross brackets with chunks. This approach
was discarded because the chunking parsers available at the time I wrote the statistical parser
did not obtain significantly higher precision than the statistical parser itself (recall is
unimportant here). Since then some advances in the field have been made; Abney has released
a revised version of the chunking parser that supports dependencies, and Kudo and Matsumoto
(2001) have developed a parser obtaining almost 96% precision. If parsing time becomes more
of a problem, then integrating such a chunker could assist significantly.
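The constraint itself is cheap to test. The following sketch shows one way a pre-computed chunk could be used to rule out a span before it is searched; the structure and names are illustrative, not code from the parser.

    // A span is banned if it overlaps a chunk without either containing it
    // or being contained by it, i.e. if it crosses one of the chunk's brackets.
    struct Chunk { int start, end; };  // half-open interval over word positions

    bool crosses_chunk(int span_start, int span_end, const Chunk& c)
    {
        const bool overlaps  = span_start < c.end && c.start < span_end;
        const bool contains  = span_start <= c.start && c.end <= span_end;
        const bool contained = c.start <= span_start && span_end <= c.end;
        return overlaps && !contains && !contained;
    }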
Dynamic updates
Currently, all access to the probability model is static, which is not a problem if we are going
to continue parsing the WSJ but is inappropriate if we wish to parse sentences from other
domains. The psycholinguistics literature has shown for years that people are primed to
prefer phrase structures they have heard recently (for example, Bencini, Bock, and Goldberg
(2002)); it would be useful if the parser could take advantage of priming.
Parsing text with errors
One of the first promises of statistical parsing was its better handling of erroneous text.
Despite this, no work I am aware of has investigated the performance of a statistical parser on
erroneous text. This seems a shame, since a statistical model provides a great deal more
information than the coarse penalty model usually used (for example, Min and Wilson (1998)).
8.2.2 Word vectors
The representation of words worked well, but the results from other people such as Dekang
Lin (1998) are better. While we cannot use Lin’s approach here, the success he had strongly
implies that we could also do better.
Integrate tagging properly
The integration of tags with the word vectors improved them significantly, and yet this
integration was actually very crude. There are numerous errors made by this integration,
such as homographs and polysemous words having their bigram counts merged together.
This has the regrettable consequence that words which can act as both nouns and verbs are
clustered with other such words, when it would be far better to create two separate words in
each case. It is often clear that Lin's results are better than mine because of this
omission.
A solution would be quite easy to implement: tag the T/G corpus and create new words by
combining each old word with its POS tag. However, the tagger I developed would be
inappropriate for this task since its tagset is slightly too coarse and it is
much too slow. It is very likely that a suitable tagger already exists.
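Concretely, the combination step need only rewrite each token before bigram counting. A minimal sketch, with an assumed separator character, is:

    #include <string>

    // Combine a word with its POS tag so that, for example, the noun and verb
    // uses of "walk" become distinct entries ("walk/NN" vs "walk/VB") when
    // bigrams are collected. The "/" separator is an assumption.
    std::string split_by_pos(const std::string& word, const std::string& pos_tag)
    {
        return word + "/" + pos_tag;
    }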
Eliminating R
All of the word vectors are derived using the R statistical toolkit. This toolkit is excellent
for prototyping, presenting hundreds of different tools to the programmer. Furthermore,
it scaled to relatively large amounts of data, a property that was not true of any of the other
off-the-shelf systems investigated. However, it was found to be very inefficient when
compared to writing the same algorithm in C.
For instance, the maximum PCA that can be performed in R on a machine with two
gigabytes of RAM is 4000 by 4000. However, some arithmetic shows that a much larger matrix
should be easily manipulated with this much RAM: PCA requires four matrices to be stored
simultaneously, and assuming eight bytes per cell, we should be able to process a square
matrix with over eight thousand rows. Even if IEEE extended floating point is assumed, we
should still be able to have over six thousand rows. If we scrapped our use of R and rewrote
in C then it is reasonable to expect the much larger matrices to lead to significantly better
word representations.
8.2.3 Backoff
Training from the event file
All but two of my networks already train directly from the WSJ event file. If all networks
could be thus trained, there would be a number of benefits. Most obviously, we avoid hav-
ing genprob’s output as an approximate ceiling on performance. Moreover, if we can train
directly on the event file, we would be able to experiment with the effects on performance
of adding any number of interesting features to the event file. Collins' probability model
was carefully tuned to the amount of available data in the Penn Treebank, and so with better
backoff we can expect to usefully develop a more complex probability model.
8.2.4 Using Maximum Entropy methods instead of a neural network
Neural networks were chosen as our learning algorithm because simple hash-tables were
inappropriate. However, neural networks have their share of problems, especially when
scaling to this much data and when quite accurate lookups are the norm, with interpola-
tion happening only rarely. One option that has come to prominence recently in NLP and
would almost certainly work significantly better is Maximum Entropy modelling (Curran,
2004). Taking the event file and mapping it into vectors using the same approach as was
used for neural networks should lead directly to a data file that is suitable for building a
MaxEnt model. At 200MB, the file size is probably a little too large for MaxEnt on any of the
workstations I have access to, but is within reach for current cluster machines.
8.2.5 Using a different parser
Collins’ parser was chosen at the start of this project because it was, and still is, the highest
performing statistical parser. However, it may well be that smoother backoff is more appro-
priate in other parsing models such as DOP or CCG. A more ambitious idea would be to
use a neural-network-based parser such as that of Lawrence, Giles, and Fong (1996). This
parser obtained excellent performance for a parser using only POS tags, and its authors noted
in their conclusion that words could not be used because the parser was neural-network
based. However,
word vectors such as those developed in this thesis would work extremely well in such a
parser.
8.3 Concluding remarks
In conclusion: I have proposed that the principal weakness in current statistical parsers is
their use of a fragile backoff algorithm. By developing vector representations for all the
fields incorporated in a typical probability model, I have been able to train a neural network
to simulate the probability model. This approach is more extensible, both to new domains
and to more complex probability models, and it is more robust in situations of limited
training data.
References
Abney, S., Schapire, R., and Singer, Y. (1999). Boosting applied to tagging and PP attachment.
In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro-
cessing and Very Large Corpora.
Allen, J. F. (1995). Natural Language Understanding (2nd. ed.). Redwood City, CA: Benjam-
in/Cummings.
Banerjee, S. and Pedersen, T. (2003). The Design, Implementation and Use of the Ngram
Statistics Package. In Proceedings of the fourth International Conference on Intelligent Text
Processing and Computational Linguistics, Mexico City, Mexico.
Bencini, G., Bock, K., and Goldberg, A. (2002). How abstract is grammar? Evidence from
structural priming in language production. In Proceedings of the 15th Annual CUNY Sen-
tence Processing Conference, Queen’s College, City University of New York, NY.
Bengio, S. and Bengio, Y. (2000). Taking on the Curse of Dimensionality in Joint Distributions
Using Neural Networks. IEEE-NN, 11(3), 550.
Bengio, Y. (2003). Personal correspondence.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A Neural Probabilistic Language
Model. Journal of Machine Learning Research, 3, 1137–1155.
Bies, A., Ferguson, M., Katz, K., and MacIntyre, R. (1995). Bracketing Guidelines for Treebank
II Style Penn Treebank Project. Linguistic Data Consortium.
Bikel, D. (2004). Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), 479–
511.
Bikel, D. (2005). Web page.
Black, E., Jelinek, F., Lafferty, J., Magerman, D., Mercer, R., and Roukos, S. (1992). Towards
history-based grammars: using richer models for probabilistic parsing. In M. P. Marcus
(Ed.), Fifth DARPA Workshop on Speech and Natural Language, Arden Conference Center,
Harriman, New York.
Bod, R. (1996). Efficient Algorithms for Parsing the DOP Model? A Reply to Joshua Good-
man. Computational Linguistics Archives: cmp-lg/9605031.
Bod, R. and Scha, R. (1996). Data-Oriented Language Processing: An Overview. Technical
Report LP-96-13, University of Amsterdam.
Booth, T. L. and Thompson, R. A. (1973). Applying Probability Measures to Abstract Lan-
guages. IEEE Transactions on Computers, 22(5), 442–449.
Brooks, F. P. (1982). The Mythical Man-Month: Essays on Software Engineering. Reading, MA:
Addison-Wesley Publishing Company.
Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, S. A. D., and Lai, J. C. (1992). Class-based
n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
Chapman, R. L. (Ed.) (1992). Roget’s International Thesaurus (5th ed.). HarperCollins.
Charniak, E., Carroll, G., Adcock, J., Cassandra, A., Gotoh, Y., Katz, J., Litman, M., and
McCann, J. (1996). Taggers for Parsers. Artificial Intelligence, 85(1–2), 45–57.
Charniak, E., Hendrickson, C., Jacobson, N., and Perkowitz, P. (1993). Equations for Part-
of-Speech Tagging. In Proceedings of the 11th National Conference on Artificial Intelligence,
Washington, DC, 784–789. AAAI Press.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. (1990). AutoClass:
A Bayesian Classification System. In J. W. Shavlik and T. G. Dietterich (Eds.), Readings in
Machine Learning, 296–306. San Mateo, CA: Kaufmann.
Chen, S. F. and Goodman, J. (1996). An Empirical Study of Smoothing Techniques for Lan-
guage Modeling. In A. Joshi and M. Palmer (Eds.), Proceedings of the Thirty-Fourth Annual
Meeting of the Association for Computational Linguistics, San Francisco, 310–318. Association
for Computational Linguistics: Morgan Kaufmann Publishers.
Chen, S. F. and Rosenfeld, R. (2000). A Survey of Smoothing Techniques for ME Models.
IEEE Transactions on Speech and Audio Processing, 8(1), 37–55.
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Choueka, Y. and Lusignan, S. (1985). Disambiguation by Short Contexts. Computers and the
Humanities, 19(3), 147–157.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system.
In COMPLEX’94, Budapest.
Collins, M. (1996). A New Statistical Parser Based on Bigram Lexical Dependencies. In
Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz.
Collins, M. (1997). Three Generative, Lexicalized Models for Statistical Parsing. In P. R. Co-
hen and W. Wahlster (Eds.), Proceedings of the Thirty-Fifth Annual Meeting of the Association
for Computational Linguistics and Eighth Conference of the European Chapter of the Association
for Computational Linguistics, Somerset, New Jersey, 16–23. Association for Computational
Linguistics: Association for Computational Linguistics.
Collins, M. (1999). Head-driven statistical models for natural language parsing. Ph. D. thesis,
Computer Science Department, University of Pennsylvania.
Copestake, A. and Flickinger, D. (2000). An open-source grammar development environ-
ment and broad-coverage English grammar using HPSG. In Proceedings of LREC 2000,
Athens, Greece.
Curran, J. (2004). Maximum Entropy Models for Natural Language Processing. In Aus-
tralasian Language Technology Summer School, Sydney, Australia, 29.
Earley, J. (1970). An efficient context-free parsing algorithm. In K. Sparck-Jones, B. J. Grosz,
and B. L. Webber (Eds.), Readings in Natural Language Processing, 25–33. Los Altos: Morgan
Kaufmann Publishers.
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211.
Fahlman, S. E. and Lebiere, C. (1990). The Cascade-Correlation Learning Architecture. In
D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems: Proceedings of the
1989 Conference, San Mateo, CA, 524–532. Morgan Kaufmann Publishers.
Finch, S. (1993). Finding structure in language. Ph. D. thesis, Edinburgh University.
Gale, W. A. and Sampson, G. (1995). Good-Turing Frequency Estimation without Tears.
Journal of Quantitative Linguistics, 2, 217–237.
Garfield, S. and Wermter, S. (2003). Recurrent Neural Learning for Classifying Spoken Ut-
terances. Neural Language Processing, 6(3), 31–36.
Garner, S. R. (1995). WEKA: The Waikato Environment for Knowledge Analysis. Technical
report, Computer Science Dept., Waikato University.
Ginzburg, J. and Sag, I. A. (Eds.) (2000). Interrogative investigations. Stanford: CSLI Publica-
tions.
Goodman, J. (1996). Efficient Algorithms for Parsing the DOP Model. In E. Brill and
K. Church (Eds.), Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 143–152. Somerset, New Jersey: Association for Computational Linguistics.
Goodman, J. (1998). Parsing Inside-Out. Ph. D. thesis, Harvard University.
Goodman, J. (2001). A Bit of Progress in Language Modeling: Extended Version. Technical
Report MSR-TR-2001-72, Microsoft Research (MSR).
Haegeman, L. (1991). Introduction to Government and Binding Theory. Oxford: Blackwell.
Harman, D. (1992). The DARPA TIPSTER project. SIGIR Forum, 26(2), 26–28.
Hart, M. (2005). Project Gutenberg e-text archive. Web page: http://www.gutenberg.net/.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57(1), 97–109.
Honkela, T. (1997a). Comparisons of self-organized word category maps. In Proceedings
of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4–6, 298–303. Espoo,
Finland: Helsinki University of Technology, Neural Networks Research Centre.
Honkela, T. (1997b). Self-Organizing Maps in Natural Language Processing. Ph. D. thesis,
Helsinki University of Technology, Espoo, Finland. (Some citations give this as 1997,
while others as 1998).
Honkela, T., Pulkki, V., and Kohonen, T. (1995). Contextual Relations of Words in Grimm
Tales, Analyzed by Self-Organizing Map. In F. Fogelman-Soulie and P. Gallinari (Eds.),
Proceedings ICANN’95, International Conference on Artificial Neural Networks, Volume II,
Nanterre, France, 3–7. EC2.
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and K. L. N. (Eds.), Pattern Recognition in Practice,
381–397. Amsterdam : North Holland Publishing Co.
Joachims, T. (2001). A Statistical Learning Model of Text Classification with Support Vector
Machines. In W. Croft, D. J. Harper, D. H. Kraft, and J. Zobel (Eds.), Proceedings of SIGIR-
01, 24th ACM International Conference on Research and Development in Information Retrieval,
New Orleans, US, 128–136. ACM Press, New York, US.
Katz, S. M. (1987). Estimation of Probabilities from Sparse Data for the Language model
Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Pro-
cessing, 35(3), 400–401.
Klein, D. and Manning, C. D. (2001a). An O(n3) Agenda-Based Chart Parser for Arbitrary
Probabilistic Context-Free Grammars. Technical Report dbpubs/2001-16, Stanford Uni-
versity.
Klein, D. and Manning, C. D. (2001b). Parsing and Hypergraphs. In The Seventh International
Workshop on Parsing Technologies.
Klein, D. and Manning, C. D. (2002). A* Parsing: Fast Exact Viterbi Parse Selection. Technical
Report 2002-16, Natural Language Processing Group, Stanford University.
Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of the
41st Annual Meeting of the Association for Computational Linguistics.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological
Cybernetics, 43(1), 59–69.
Kudo, T. and Matsumoto, Y. (2001). Chunking with Support Vector Machines. In Proceedings
of North American Chapter of the ACL.
Lakeland, C. and Knott, A. (2001). POS Tagging in Statistical Parsing. In Proceedings of the
Australasian Language Technology Workshop, Sydney, Australia.
Lakeland, C. and Knott, A. (2004). Implementing a lexicalised statistical parser. In Proceed-
ings of the Australasian Language Technology Workshop, Sydney, Australia.
Lawrence, S., Giles, C. L., and Fong, S. (1996). Can Recurrent Neural Networks Learn Nat-
ural Language Grammars? In Proceedings of the IEEE International Conference on Neural
Networks, 1853–1858. Piscataway, NJ: IEEE Press.
Lee, L. (2004). “I'm sorry Dave, I'm afraid I can't do that”: Linguistics, Statistics, and Natural
Language Processing circa 2001. In Committee on the Fundamentals of Computer Science:
Challenges and Opportunities, Computer Science and Telecommunications Board, National
Research Council (Eds.), Computer Science: Reflections on the Field, Reflections from the Field,
111–118. The National Academies Press.
Li, W. (1992). Random Texts Exhibit Zipf’s Law-Like Word Frequency Distribution. IEEE
Transactions on Information Theory, 38, 1842–1845.
Liddle, M. (2002). Learning Lexical Relations from Natural Language Text. Technical report,
University of Otago.
Lin, D. (1997). Using Syntactic Dependency on Local Context to Resolve Word Sense Ambi-
guity. In P. R. Cohen and W. Wahlster (Eds.), Proceedings of the Thirty-Fifth Annual Meeting
of the Association for Computational Linguistics and Eighth Conference of the European Chapter
of the Association for Computational Linguistics, Somerset, New Jersey, 64–71. Association
for Computational Linguistics: Association for Computational Linguistics.
Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In COLING-ACL,
768–774.
Magerman, D. M. (1995). Statistical Decision-Tree Models for Parsing. In Proceedings of the
33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, 26–30
Jun 1995.
Magerman, D. M. (1996). Learning grammatical structure using statistical decision-trees. In
Grammatical Inference: Learning Syntax from Sentences, 3rd International Colloquium, ICGI-
96, Montpellier, France, September 25-27, 1996, Proceedings, Volume 1147 of Lecture Notes in
Artificial Intelligence, 1–21. Springer.
Manning, C. D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing.
MIT Press.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated cor-
pus of English: the Penn Treebank. Computational Linguistics, 19, 313–330. Reprinted in
Susan Armstrong (Ed.), 1994, Using Large Corpora, Cambridge, MA: MIT Press, 273–290.
Mayberry III, M. R. and Miikkulainen, R. (1999). Combining Maps and Distributed Rep-
resentations for Shift-Reduce Parsing. In S. Wermter and R. Sun (Eds.), Hybrid Neural
Symbolic Integration. New York: Springer.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval,
classification and clustering. Web page: http://www.cs.cmu.edu/mccallum/bow.
Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An Integrated Model of
Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the
ACM, 38(11), 39–41.
Min, K. and Wilson, W. H. (1998). Integrated Control of Chart Items for Error Repair. In
COLING-ACL, Volume 2, 862–868. Morgan Kaufmann Publishers.
Ney, H., Mergel, D., Noll, A., and Paeseler, A. (1992). Data Driven Search Organization for
Continuous Speech Recognition. IEEE Transactions on Signal Processing, 40(2), 272.
Plasmeijer, M. J. (1998). CLEAN: a programming environment based on term graph rewrit-
ing. Theoretical Computer Science, 194(1–2), 246–255.
Pollard, C. and Sag, I. A. (1986). Head Driven Phrase Structure Grammar. Stanford, CA, USA:
Center for the Study of Language and Information.
Powers, D. (2001). Experiments in Unsupervised Learning of Natural Language. Interna-
tional Journal of Corpus Linguistics, 6(1), 8.
Pugh, W. (1989). Skip Lists: A Probabilistic Alternative to Balanced Trees. In WADS: 1st
Workshop on Algorithms and Data Structures.
R Development Core Team (2004). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Rueckl, J. G., Cave, K. R., and Kosslyn, S. M. (1989). Why Are “What” and “Where” Pro-
cessed by Separate Cortical Visual Systems? A Computational Investigation. Journal of
Cognitive Neuroscience, 1(2), 171–186.
Scha, R. and Bod, R. (2003). Efficient Parsing of DOP with PCFG-reductions.
Schutze, H. (1992). Dimensions of meaning. In Proceedings of Supercomputing ’92, Minneapolis,
787–796.
Schutze, H. (1993). Word Space. In S. J. Hanson, J. D. Cowan, and C. L. Giles (Eds.), Advances
in Neural Information Processing Systems, Volume 5, 895–902. Morgan Kaufmann Publish-
ers, San Mateo, CA.
Schutze, H. (1995). Distributional Part-of-Speech Tagging.
Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1),
97–124.
Smith, L. I. (2002). A tutorial on Principal Components Analysis. Technical report, Univer-
sity of Otago, Dunedin, New Zealand.
Smrz, P. and Rychly, P. (2002). Finding Semantically Related Words in Large Corpora. In
TSD, 108–115. Revised 2002, originally published in 2001.
Stuart, I., Cha, S.-H., and Tapper, C. (2004). A Neural Network Classifier for Junk E-Mail. In
Proceedings of 6th DAS 2004, Florence, Italy, 442–450.
Ushioda, A. (1996). Hierarchical Clustering of Words. In COLING, 1159–1162. Expanded
version published as ‘Hierarchical Clustering of Words and Applications to NLP tasks’.
Vapnik, V. N. (1997). The Support Vector Method. Lecture Notes in Computer Science, 1327,
263–273.
Viterbi, A. J. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimum
Decoding Algorithm. IEEE Transactions on Information Theory, IT-13, 260–267.
Williams, R. (1992). FunnelWeb User’s Manual. ftp://ftp.adelaide.edu.au/pub/funnelweb,
University of Adelaide, Adelaide, South Australia, Australia.
Wu, J. and Zheng, F. (2000). On enhancing katz-smoothing based back-off language model.
In ICSLP-2000, Volume 1, 198–201.
Appendix A
Tags and Nonterminals used
Since all the statistical parsers use the tags and nonterminals defined by the Penn Treebank, it
makes sense to define them here. A deep understanding of these tags is much less important
than it would be in a conventional parser, but it is still useful for following the examples
and understanding some of the mistakes. This appendix includes a brief description of every
nonterminal and every terminal.
A.1 Tags
There are forty-five tags used in the Penn treebank. Two extra tags (#STOP# and #UNKNOWN#)
are also used by my parser, with #STOP# being used to terminate phrases and #UNKNOWN#
necessary to keep the probability theory in step with the implementation. The other tags are
described below; they have been split into several tables for ease of reading, but this distinction
is not explicit in the treebank.
The symbols are described in Table A.1; the nounish words in Table A.2; the verbs in Table A.3;
the adjectives in Table A.4; the pronouns in Table A.5; and all others are given in Table A.6.
This section is based very heavily (almost word for word) on the web page:
http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html
A.2 Nonterminals
Note: this information comes from “Bracketing Guidelines for Treebank II Style Penn Tree-
bank Project” (Bies, Ferguson, Katz, and MacIntyre, 1995)
$ (dollar): $ -$ –$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
`` (opening quotation mark): ` “
'' (closing quotation mark): ' ”
( (opening parenthesis): ( [ {
) (closing parenthesis): ) ] }
, (comma): ,
-- (dash): --
. (sentence terminator): . ! ?
: (colon or ellipsis): : ; ...
SYM (symbol): % & ' ” ”. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R
Table A.1: Tags related to symbols
FW (foreign word): gemeinschaft hund ich jeux habeas Haementeria Herr K’ang-si vous lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis ...
NN (noun, common, singular or mass): common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...
NNP (noun, proper, singular): Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ...
NNPS (noun, proper, plural): Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ...
NNS (noun, common, plural): undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ...
Table A.2: POS tags used for nouns
MD (modal auxiliary): can cannot could couldn’t dare may might must need ought shall should shouldn’t will would
VB (verb, base form): ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe bend benefit bevel beware bless boil bomb boost brace break bring broil brush build ...
VBD (verb, past tense): dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted belied figgered speculated wore appreciated contemplated ...
VBG (verb, present participle or gerund): telegraphing stirring focusing angering judging stalling lactating hankerin’ alleging veering capping approaching traveling besieging encrypting interrupting erasing wincing ...
VBN (verb, past participle): multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired ...
VBP (verb, present tense, not 3rd person singular): predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...
VBZ (verb, present tense, 3rd person singular): bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ...
Table A.3: POS tags used for verbs
JJ (adjective or numerical, ordinal): third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ...
JJR (adjective, comparative): bleaker braver breezier briefer brighter brisker broader bumper busier calmer cheaper choosier cleaner clearer closer colder commoner costlier cozier creamier crunchier cuter ...
JJS (adjective, superlative): calmest cheapest choicest classiest cleanest clearest closest commonest corniest costliest crassest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest ...
RB (adverb): occasionally unabatingly maddeningly adventurously professedly stirringly prominently technologically magisterially predominately swiftly fiscally pitilessly
RBR (adverb, comparative): further gloomier grander graver greater grimmer harder harsher healthier heavier higher however larger later leaner lengthier less-perfectly lesser lonelier longer louder lower more ...
RBS (adverb, superlative): best biggest bluntest earliest farthest first furthest hardest heartiest highest largest least less most nearest second tightest worst
Table A.4: POS tags used for adjectives
PRP (pronoun, personal): hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
PRP$ (pronoun, possessive): her his mine my our ours their thy your
WP (WH-pronoun): that what whatever whatsoever which who whom whosoever
WP$ (WH-pronoun, possessive): whose
Table A.5: POS tags used for pronouns
Tag Description Examples
CC conjunction, coordinating & ’n and both but either et for less minus neither nor or plus so
therefore times v. versus vs. whether yet
CD numerical, cardinal mid-1890 nine-thirty forty-two one-tenth ten million 0.5 forty-
seven 1987 twenty ’79 zero two 78-degrees ’60s .025 fifteen
271,124 dozen quintillion
DT determiner all an another any both each either every half many much nary
neither no some such that the them these this those
EX existential there there
IN preposition or conjunc-
tion, subordinating
astride among uppon whether out inside pro despite on by
throughout below within for towards near behind atop around
if like until below next into if beside
LS list item marker A B C First One SP-44001 Second Third Three Two one six three
two
PDT pre-determiner all both half many quite such sure this
POS genative marker ’s
RP particle about across along apart around aside at away back before be-
hind by down ever fast for forth from go high i.e. in into just
later low more off on open out over per raising start that through
under unto up upon whole with you
TO “to” as a preposition or
infinitive marker
to
UH interjection Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Oops
amen huh howdy uh dammit whammo shucks heck anyways
honey golly man baby hush sonuvabitch . . .
WDT WH-determiner that what whatever which whichever
WRB Wh-adverb how however whence whenever where whereby whereever
wherein whereof why
Table A.6: Other POS tags
Tag Description Examples
ADJP Adjective Phrase
CONJP Conjunction Phrase
FRAG Fragment
INTJ Interjection Corresponds approximately to the part-of-speech
tag UH.
LST List marker Includes surrounding punctuation.
NAC Not a Constituent Used to show the scope of certain prenominal modifiers within
an NP.
NP Noun Phrase.
NX Used within certain complex NPs to mark the head
of the NP. Corresponds very roughly to N-bar level
but used quite differently.
PP Prepositional Phrase
PRN Parenthetical Asides to the main sentence, usually delimited by
commas, brackets or dashes.
PRT Particle Category for words that should be tagged RP.
QP Quantifier Phrase complex measure/amount phrase; used within NP.
RRC Reduced Relative Clause A relative clause that does not attach neatly to the
rest of the sentence, e.g. yesterday in I read the books
on the shelf yesterday quickly and the books on the shelf
today slowly.
UCP Unlike Coordinated Phrase Similar to CC but used where the tags do not match,
e.g. big/ADJP and/UCP growing/VP
VP Verb Phrase
WHADJP Wh-adjective Phrase Adjectival phrase containing a wh-adverb, as in how
hot.
WHADVP Wh-adverb Phrase Introduces a clause with an NP gap. May be null (containing
the 0 complementizer) or lexical, containing a wh-adverb such as how or why.
WHNP Wh-noun Phrase Introduces a clause with an NP gap. May be null (containing
the 0 complementizer) or lexical, containing some wh-word, e.g. who, which book, whose
daughter, none of which, or how many leopards.
WHPP Wh-prepositional Phrase Prepositional phrase containing a wh-noun phrase
(such as of which or by whose authority) that either
introduces a PP gap or is contained by a WHNP.
X Unknown, uncertain, or unbracketable X is often used for bracketing typos and in
bracketing the...the-constructions.
Table A.7: The main nonterminal categories
Appendix B
Code specifications for my parser
B.1 Data structures
Pseudocode for Collins’ algorithm has already been given in Figure 3.12. This results in
the data flow shown in Figure B.1, which is implemented using the class structure given in
Figure B.2. The classes are described briefly below. For the more complex classes, references are
given to the sections where they are described fully.
Figure B.1: Data flow diagram of the parser
Figure B.2: Class structure of the parser (classes shown: Main, Globals, Node, Parser,
Grammar, Prob, Chart, Beam, Punc, Sentence, Hash, Nodes, BeamArray, BeamList,
BeamElement)
Arguments was written by Jared Davis. It gets runtime arguments such as the beam size
from the user, which saves recompilation when testing.
Beam performs the beam search. It is described in Section B.3.
Beam array is one implementation of the Beam class, also described in Section B.3.
Beam list is another implementation of the Beam class, also described in Section B.3.
Beam element stores elements in the beam and provides the operations performed on all
beam elements. It is also described in Section B.3.
Chart stores all edges that might be of later use. It is described in Section 4.5.
Convert is used to convert between words and strings, as well as to make words frequent,
drop semantic information from nonterminals, and so on. It is all very simple but
putting it in a separate class made the rest of the code easier to read.
Cutoff stores the minimum probability, below which all edges are immediately rejected. It
is separated from the chart because passing its test does not guarantee entry to the
chart, and it is used in a number of places inside the beam which do not otherwise
need the chart. However it could just as easily have been implemented inside the
chart.
Globals is a dummy class that eliminates the need for global variables. This worked out as
a good compromise between the convenience of globals and the increased maintain-
ability of code not containing globals. This also solved a classic C++ problem of global
constructors executing before their needed data had been loaded.
Grammar implements the grammar checking code. It uses simple lookup tables.
Hash implements a fast and simple hash-table with integer keys and integer values. It is
used by the probability class and for grammar lookups. Even accounting for it having
two different input formats and two different storage mechanisms, it is still under two
hundred lines of code.
Node is the data structure for phrases. It implements all of the operations on phrases, such
as join_two_edges, and provides an interface between the beam and the probability
model. It is described in Section B.2.
Nodes is a preallocated bag of nodes. This is used by the chart, and by some implementa-
tions of the beam, to provide the simplicity of malloc/free without their overhead.
Parser implements the main control loop given in Figure 3.4. It is only four hundred lines
long and the most complicated part is in keeping the innermost loop both fast and
readable. It is described in Section 3.2.
Prob is the probability model. It is a set of functions for estimating the probability of differ-
ent structures. It is described in Section 4.3.
Punc encapsulates all punctuation so that the rest of the system doesn’t have to deal with
it. Collins’ design treats punctuation as second-class lexical items, almost throwing
them away. This class strips punctuation from the sentence but allows the parser to
ask where punctuation is present. This enabled me to experiment with not stripping
punctuation while keeping a lot of code the same.
Sentence is mainly used for reading sentences. It abstracts away the way the sentence is
read, which means the parser can also be used to parse HTML or to parse interactively.
The sentence class also allows its input either to be in words or to be already converted into
numbers¹.
Tagger implements a couple of different tagging models. Separating out tagging meant that
the effect of the tagger’s performance on parsing could be examined. The tagger is described
in Section 4.4.
B.2 The node data structure
The main data structure is the Node. Its fields are given in Table B.1. This table highlights
an interesting weakness with C++, relating to C’s historical use for hardware programming.
In order to ease coding of hardware devices, C++ always allocates class members in exactly
the order they are defined. This means that the intuitively obvious declaration may result in
poor memory efficiency. When millions of such structures are being allocated, this can waste
hundreds of megabytes on alignment padding that is of no use to the program. To avoid
this wastage, all classes have to be declared with the largest variables first, through to one-
byte variables. Doing otherwise was even causing memory corruption when optimisation
flags were used.
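
To illustrate the problem, the following fragment (a minimal sketch written for this appendix,
not code taken from the parser) declares the same four members in two different orders; on a
typical 64-bit compiler the first layout is noticeably larger purely because of alignment padding:

#include <iostream>

// Members in an "intuitive" order: the compiler inserts padding after each
// small member so that the following double is 8-byte aligned.
struct PaddedNode {
    bool   terminal;   // 1 byte + 7 bytes padding
    double info;       // 8 bytes
    short  begin;      // 2 bytes + 6 bytes padding
    double prior;      // 8 bytes
};

// The same members declared largest-first: no internal padding is needed.
struct PackedNode {
    double info;
    double prior;
    short  begin;
    bool   terminal;   // only trailing padding, to round the size up
};

int main() {
    std::cout << sizeof(PaddedNode) << " vs " << sizeof(PackedNode) << "\n";
    // Typically prints "32 vs 24", although the exact figures depend on the
    // compiler and ABI.
}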
The class also contained a number of helper functions to simplify other classes and to
hide the very complex API presented by the probability model. These are detailed in Table
B.2.
All member variables are public. This is normally a very bad design decision because
it greatly decreases flexibility. However, in this case the parser was a reimplementation of
¹ Numbers are used by all internal functions as they are easier to manipulate than strings
Variable Type Description
info double The inside probability of the phrase
prior double The outside probability of the phrase
prob double info + prior, precomputed for efficiency
children Node list An array of children
next Node The next node in the chart
prev Node The previous node in the chart
lc Subcat Arguments to the left still to be found
rc Subcat Arguments to the right still to be found
headtag Tag POS tag of the phrase’s head
headword Word The head word
headnt Head The nonterminal of the phrase’s head
parent Parent The phrase’s parent
begin short The word the phrase starts on
end short The word the phrase ends on
numkids short The number of children
used node_state Used for my own garbage collection
terminal bool True if the phrase is a POS tag
adj_l bool Is the head the leftmost element
adj_r bool Is the head the rightmost element
verb_l bool Is a verb between the head and the leftmost element
verb_r bool Is a verb between the head and the rightmost element
stop bool Is this phrase complete
hasverb bool True if this phrase contains any verbs
Table B.1: Data structure for phrases
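
To make the layout concrete, the fields of Table B.1 might be declared roughly as follows.
This is an illustrative sketch rather than the parser's actual header: the wrapper types
(Subcat, Tag, Word, Head, Parent, node_state) are stood in for by plain typedefs here, and
the declaration roughly follows the largest-first ordering discussed above.

// Stand-ins for the small wrapper types; in the real code these are the tiny
// classes described in Section B.2 rather than plain integers.
typedef int Subcat;
typedef int Tag;
typedef int Word;
typedef int Head;
typedef int Parent;
typedef int node_state;

class Node {
public:
    double     info;       // inside probability of the phrase
    double     prior;      // outside probability of the phrase
    double     prob;       // info + prior, precomputed for efficiency
    Node     **children;   // array of children
    Node      *next;       // next node in the chart
    Node      *prev;       // previous node in the chart
    Subcat     lc;         // arguments to the left still to be found
    Subcat     rc;         // arguments to the right still to be found
    Tag        headtag;    // POS tag of the phrase's head
    Word       headword;   // the head word
    Head       headnt;     // nonterminal of the phrase's head
    Parent     parent;     // the phrase's parent
    short      begin;      // the word the phrase starts on
    short      end;        // the word the phrase ends on
    short      numkids;    // the number of children
    node_state used;       // used for garbage collection
    bool       terminal;   // true if the phrase is a POS tag
    bool       adj_l;      // is the head the leftmost element?
    bool       adj_r;      // is the head the rightmost element?
    bool       verb_l;     // is there a verb between the head and the leftmost element?
    bool       verb_r;     // is there a verb between the head and the rightmost element?
    bool       stop;       // is this phrase complete?
    bool       hasverb;    // true if this phrase contains any verbs
};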
Function Description
make_silly Corrupts the node to try and highlight bugs
clear Wipes all data ready to reuse the node
collins_equal True if the node is equal to another at Collins’ simplified
shallow level
shallow_equal True if the node is equal to another at a shallow level
equal True if the node is equal to another at a deep level, used for
debugging
print Prints the node to the terminal for debugging
print_for_eval Prints the node in the format used to test precision/recall
print_for_gml Prints the node in GML for producing graphs, useful in
testing
verify Performs internal consistency checks for debugging
lc_info, etc. Encapsulate the calls to the probability functions
join_follow Joins two nodes together
join_cc Deals with the special case of coordination
Table B.2: Member functions for phrases
a design that was known to work so there would be no design changes. Making variables
public slightly increased efficiency and code readability.
Nonterminals were also differentiated based on their use. For example, parent nonter-
minals were much more commonly used with other parents than with heads. The only time
parents and heads are interchanged is in add_singles where a head is promoted to a par-
ent. To represent this in the code, the different types were implemented as tiny dummy
classes containing a single integer. This method meant it was impossible to accidentally use
a head where a parent was expected and it caught a number of bugs that would otherwise
have been almost impossible to catch.
Unfortunately it also ran into a weakness in the C++ language — it turns out to be im-
possible to give a class the same privileges as a normal integer. I had hoped to write the
header so that if a DEBUG flag was set then classes would be used, but if it was unset then
the nonterminals would be synonyms for integers to increase speed and decrease memory
consumption. A fully OO language such as Smalltalk might have avoided this problem.
Because of this weakness in the language, swapping between the class representation and
the integer representation required a large amount of mechanical code changes.
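
A minimal sketch of the idea, with assumed spellings, is given below. The real code declares
one such tiny class for each way a nonterminal is used, and the DEBUG switch shown is the
one described above that could not be made to work cleanly.

// Two of the wrapper types (names assumed). Because the constructors are
// explicit, a Head cannot silently be used where a Parent is expected.
struct Head   { int val; explicit Head(int v = 0)   : val(v) {} };
struct Parent { int val; explicit Parent(int v = 0) : val(v) {} };

// The one legitimate conversion, performed in add_singles, has to be spelled out:
inline Parent promote(const Head &h) { return Parent(h.val); }

// The intended DEBUG switch would have looked roughly like this, but a typedef
// to int does not give the same interface as the class, so in practice switching
// representations meant mechanical changes throughout the code.
#ifdef DEBUG
typedef Head   head_t;
typedef Parent parent_t;
#else
typedef int    head_t;
typedef int    parent_t;
#endif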
B.3 The beam data structure
Section 4.6.1 describes how the beam was implemented as a doubly linked skiplist in some
detail. It was noted at the end of that section that Collins’ implementation uses an array
and bears little resemblance to beam search as I understand it. This section describes how
the two interpretations of beam search were abstracted away in the parser so that I could
accurately simulate Collins’ model without having to give up my solution, because I con-
sider it more elegant. The other role of this section is to describe how add_singles_stops
interleaves two independent beams.
In perhaps the only use of true object orientation in the parser, the two interpretations of
beam search were implemented independently as two separate classes: Beam Array uses
an array based representation with probability thresholding in exactly the same way as
Collins’, and Beam List uses a skiplist based representation with length thresholding as
was described in Section 4.6.1. Both of these classes are subclasses of the abstract virtual
class Beam, which allows the entire parser to just use the beam without caring about how it
is implemented. An API showing all the methods provided by the beam class is presented
in Table B.3. This virtual beam class provides all the functions that the rest of the parser
Function Description
Beam(void) Produce a beam of the default size
~Beam(void) Destructor
void store_and_clear(void) Insert all elements onto the chart
and erase the beam. Used after the
last recursive call.
void clear(void) Erase any remaining elements
int length(void) const The number of elements on the
beam
Beamdata * pop(void) Best item on the beam
void push(Beamdata * data) Insert an item
Beamdata * pop_back(void) Worst item
void push_back(Beamdata * data) Discard this item
void process(const int depth, Beam * dest) Perform the recursive add_singles_stops.
void print(void) Print the beam (for debugging).
Table B.3: High level API for the beam
needs, and so they just instantiate beams without knowing if they are going to get an array
implementation or a list implementation. It is even possible to instantiate one beam as an
array and the other as a skiplist. Some readers may have noticed that the elements on the
beam are of type Beamdata rather than of type Node. This distinction was to make the
beam manipulation code simpler. It enables the skiplist to include next and previous point-
ers in the actual elements instead of storing them separately. It also allows syntactic sugar
functions like beam operators (one node is less than another node if it has a lower priority)
to simplify the search routines.
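
To show how Table B.3 translates into a class, a sketch of the abstract interface is given
below. The method names follow the table; the exact declarations, and the spelling of the
subclass names, are assumptions rather than copies of the real header.

class Beamdata;   // wraps a Node together with its skiplist bookkeeping

class Beam {
public:
    Beam() {}                                        // produce a beam of the default size
    virtual ~Beam() {}                               // destructor
    virtual void store_and_clear() = 0;              // move all elements onto the chart, then erase the beam
    virtual void clear() = 0;                        // erase any remaining elements
    virtual int  length() const = 0;                 // number of elements on the beam
    virtual Beamdata *pop() = 0;                     // best item on the beam
    virtual void push(Beamdata *data) = 0;           // insert an item
    virtual Beamdata *pop_back() = 0;                // worst item, returned so it can be reused
    virtual void push_back(Beamdata *data) = 0;      // discard this item
    virtual void process(int depth, Beam *dest) = 0; // the recursive add_singles_stops step
    virtual void print() = 0;                        // print the beam (for debugging)
};

// The two concrete implementations would then be declared along these lines:
class BeamArray : public Beam { /* array with probability thresholding */ };
class BeamList  : public Beam { /* skiplist with length thresholding  */ };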
Efficiency was a major concern in the design of this API. The single process method is
used instead of the more fashionable iterators. Similarly, there are a few functions here that
are unnecessary, such as clear or push_back, but assist efficiency; if, say, add_stop knows an
edge will not meet the threshold then it is a waste of time to perform the same calculation
twice. A related function is pop_back, which returns the worst item in the beam so that it
can be overwritten rather than allocating a new item.
There are two types of nodes used throughout the system: complete nodes and incom-
plete nodes. While from an implementation perspective the only difference between these
nodes is the setting of a single boolean flag, they are conceptually very different because the
kinds of functions that apply to them are different. Within add_singles_stops, nodes of
both types are generated in rapid succession: incomplete nodes come in from initialise
or join_two_edges and are completed using add_stop, with the resulting complete nodes
then being processed by add_singles to generate more incomplete nodes. These are then
passed to add_stop for completion, and so on. The job of the beam throughout this process
is to keep track of the nodes: accepting nodes from the previous stage in the pipeline, insert-
ing likely candidates into the chart, discarding unlikely candidates, and continually passing
nodes on to the next stage in the pipeline.
It is entirely possible to implement add_singles_stops using just one beam, but it re-
quires the priority queue algorithm to be efficient since nodes coming out of a function could
have higher priorities than nodes already in the beam. The list implementation does have
an efficient priority queue, but the array implementation does not and rather than develop
a relatively efficient (i.e. heap based) priority queue for the array implementation, I simply
implemented two beams. Under this model, one beam can be viewed as the add_singles
beam and the other as the add_stop beam. Each beam is always operating in one of two
modes: either accepting edges and inserting (or discarding) them, or processing and re-
moving edges. The bimodal operation makes it easy to implement the array processing
efficiently.
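
The sketch below illustrates the shape of this two-beam pipeline. It is not the parser's
actual add_singles_stops code: the beam is reduced to a bare container, thresholding and
chart insertion are omitted, and add_stop, add_singles and wrap are hypothetical stand-ins
for the routines discussed above.

#include <vector>

struct Node;
struct Beamdata { Node *node; };

// A heavily simplified beam; see Table B.3 for the real API.
struct SimpleBeam {
    std::vector<Beamdata*> items;
    int  length() const { return (int)items.size(); }
    Beamdata *pop() { Beamdata *d = items.back(); items.pop_back(); return d; }
    void push(Beamdata *d) { items.push_back(d); }   // thresholding omitted
    void store_and_clear() { items.clear(); }        // real code moves edges to the chart
};

// Hypothetical stand-ins for the routines discussed in the text.
Node *add_stop(Node *incomplete);                // complete an edge, or return NULL
std::vector<Node*> add_singles(Node *complete);  // generate new incomplete edges
Beamdata *wrap(Node *n);

// stop_beam holds incomplete nodes waiting for add_stop; singles_beam holds
// complete nodes waiting for add_singles. Nodes ping-pong between the two
// until there is nothing left to process.
void add_singles_stops(SimpleBeam &singles_beam, SimpleBeam &stop_beam) {
    while (stop_beam.length() > 0) {
        while (stop_beam.length() > 0) {
            if (Node *complete = add_stop(stop_beam.pop()->node))
                singles_beam.push(wrap(complete));
        }
        while (singles_beam.length() > 0) {
            for (Node *incomplete : add_singles(singles_beam.pop()->node))
                stop_beam.push(wrap(incomplete));
        }
    }
    singles_beam.store_and_clear();   // surviving edges go onto the chart
}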
Appendix C
Relevant source code
The complete source code for programs used in the thesis is over two megabytes, or around
thirty-five thousand lines. Printing such a large amount of source code would be impractical,
almost doubling the length of the thesis. At the same time there are a number of places
where the description of the code given in the text is insufficient for understanding the
method, a problem I know very well from trying to implement the parser based on Collins’
thesis.
To help resolve this, the complete source code is available from:
http://cs.otago.ac.nz/postgrads/lakeland/phd_code.tgz
as well as on several backup mirrors:
http://cs.otago.ac.nz/staffpriv/alik/lakeland_code.tgz
http://go.org.nz/lakeland/phd_code.tgz
Some of the source code is also included here. I have included the following:
• the complete build system since it is one of the easiest ways of understanding how
everything fits together
• the scripts used to drive R since they are short and include a lot of information
• A small sample of funnelweb code since using funnelweb is a novel method of refac-
toring complex code, and readers may be more interested in the idea than the complete
source.
• The source code for converting the Penn treebank into Collins’ format since I do not
believe the process has been adequately documented elsewhere
• The bigram processing code since this process is normally not mentioned in the litera-
ture and yet I found it made a significant difference to the final results
C.1 Build script
The build system is implemented as a Unix makefile. This allows much easier access to shell
commands than more sophisticated build systems, so it is perhaps better to view the use of
make more as a shell script with built-in dependency resolution.
NETS=tag.wgt
NETS+=prior.nz.wgt prior.zerotest.wgt
NETS+=dep.nz.wgt dep.zerotest.wgt
NETS+=subcat.nz.wgt subcat.zerotest.wgt
NETS+=unary.nz.wgt unary.zerotest.wgt
NETS+=top.nz.wgt top.zerotest.wgt

all: $(NETS)

# Approx 20 mins
words: wsj.tagged
nts: wsj.tagged
tags: wsj.tagged
wsj.raw: wsj.tagged
best-tags: wsj.tagged
dict: best-tags
load.mem: wsj.tagged

# Approx 30 mins
test.tagged:
    make -C $$CODEDIR/treebank test-tokenize.done
    cp -u $$CODEDIR/treebank/test-tokenize/test.tagged .
wsj.tagged:
    make -C $$CODEDIR/treebank
    cp -u $$CODEDIR/treebank/{wsj.tagged,words,nts,tags,wsj.raw,best-tags,dict,load.mem} .
readabletest.tagged: wsj.tagged
    make -C $$CODEDIR/treebank/test-tokenize test.tagged
    cp -u $$CODEDIR/treebank/test-tokenize/test.tagged readabletest.tagged
readable.tagged: wsj.tagged
    make -C $$CODEDIR/treebank/combined-tokenize readable.tagged
    cp -u $$CODEDIR/treebank/combined-tokenize/readable.tagged .

# Approx one hour
model2-short: model2-long
    cp -u $$CODEDIR/preproc/model2.02-21 model2-short
model2-long:
    make -C $$CODEDIR/preproc
    cp -u $$CODEDIR/preproc/model2 model2-long

# Approx one hour
fwords: htsizes.h
lc: htsizes.h
rc: htsizes.h
left: htsizes.h
right: htsizes.h
unary: htsizes.h
htsizes.h: load.mem readable.tagged readabletest.tagged model2-long model2-short
    make -C $$CODEDIR/events   # Done twice to avoid a stupid bug (in make?)
    make -C $$CODEDIR/events   # make doesn't realise it has created .data and aborts
    cp -u $$CODEDIR/events/{fwords,lc,rc,left,right,unary,htsizes.h} .

# Probably unneeded with my parser disabled
p_unary_1_n.boot: htsizes.h
    cp -u $$CODEDIR/events/p_*.boot .

# Approx 2 mins
p_pos_1_n.boot: tags.h
    cp -u $$CODEDIR/events/p_*.boot .
tags.h: load.mem
    make -C $$CODEDIR/events
    cp -u $$CODEDIR/events/{tags.h,tagsizes.h} .

# Approx 2 min
model2: load.mem
    make -C $$CODEDIR/events model2.retok
    cp -u $$CODEDIR/events/model2.retok model2
tags.vectors: wsj.tagged
    make -C $$CODEDIR/vectors tags.vectors
nts.vectors: tags.vectors model2
    make -C $$CODEDIR/vectors nts.vectors
rev-neighbours: words.vectors
    make -C $$CODEDIR/vectors/words rev-neighbours
    cp -u $$CODEDIR/vectors/words/rev-neighbours .
words.vectors: tags.vectors wsj.raw best-tags
    make -C $$CODEDIR/vectors words.vectors
    cp -u $$CODEDIR/vectors/words/words.vectors .

## Half way
# The code up to now is relatively stable
# And changes tend to need a full regeneration anyway
# Next we have to generate the NN training data and
# train the NNs.

# Approx 1 min
tokenised.lexicon: tokenised.nts
tokenised.nts: model2
    make -C $$CODEDIR/parser/mcollins empty_input grammars.done nts.done lexicon.done
    cp -u $$CODEDIR/parser/mcollins/grammars/tokenised.{nts,lexicon} .

# Approx 20 minutes
# nts.vectors not explicitly needed but the code loads it anyway
tag.net: tags.vectors nts.vectors words.vectors tokenised.nts tags nts words fwords lc left unary tags.h p_pos_1_n.boot p_unary_1_n.boot dict
    make -C $$CODEDIR/parser/mine empty_input tagger
    cp -u $$CODEDIR/parser/mine/{tag.header,tagger} .

# Approx 2 hours
dep.nz.net: prior.nz.net
unary.nz.net: prior.nz.net
subcat.nz.net: prior.nz.net
top.nz.net: prior.nz.net
dep.zerotest.net: prior.nz.net
unary.zerotest.net: prior.nz.net
subcat.zerotest.net: prior.nz.net
top.zerotest.net: prior.nz.net
prior.zerotest.net: prior.nz.net
prior.nz.net: tags.vectors nts.vectors words.vectors tokenised.nts tokenised.lexicon rev-neighbours
    make -C $$CODEDIR/parser/mcollins
    cp -u $$CODEDIR/parser/mcollins/{dep,prior,unary,subcat,top}.{nz,zerotest}.header .
    cp -u $$CODEDIR/parser/mcollins/*.net .
sec00.mcollins: prior.nz.net
    make -C $$CODEDIR/parser/mcollins sec00.testable
    cp -u $$CODEDIR/parser/mcollins/sec00.testable sec00.mcollins
parser-hash.results: sec00.mcollins
    make -C $$CODEDIR/eval/parser/ empty_input parser-hash.results
    cp -u $$CODEDIR/eval/parser/parser-hash.results .
tag.wgt: tag.net
    make -C $$CODEDIR/train-net empty_input tag.wgt
    cp -u $$CODEDIR/train-net/tag.{wgt,log} .
tagger-net.results: tag.wgt
    make -C $$CODEDIR/parser/mine tagger-net.results
    cp -u $$CODEDIR/parser/mine/tagger-net.results .
prior.zerotest.wgt: prior.nz.wgt
prior.nz.wgt: prior.nz.net
    make -C $$CODEDIR/train-net empty_input prior.{nz,zerotest}.wgt
    cp -u $$CODEDIR/train-net/prior.{nz,zerotest}.{wgt,log} .
dep.zerotest.wgt: dep.nz.wgt
dep.nz.wgt: dep.nz.net
    make -C $$CODEDIR/train-net empty_input dep.{nz,zerotest}.wgt
    cp -u $$CODEDIR/train-net/dep.{nz,zerotest}.{wgt,log} .
top.zerotest.wgt: top.nz.wgt
top.nz.wgt: top.nz.net
    make -C $$CODEDIR/train-net empty_input top.{nz,zerotest}.wgt
    cp -u $$CODEDIR/train-net/top.{nz,zerotest}.{wgt,log} .
subcat.zerotest.wgt: subcat.nz.wgt
subcat.nz.wgt: subcat.nz.net
    make -C $$CODEDIR/train-net empty_input subcat.{nz,zerotest}.wgt
    cp -u $$CODEDIR/train-net/subcat.{nz,zerotest}.{wgt,log} .
unary.zerotest.wgt: unary.nz.wgt
unary.nz.wgt: unary.nz.net
    make -C $$CODEDIR/train-net empty_input unary.{nz,zerotest}.wgt
    cp -u $$CODEDIR/train-net/unary.{nz,zerotest}.{wgt,log} .

clean:
    make -C treebank clean
    make -C preproc clean
    make -C events clean
    make -C tag clean
    make -C vectors clean
    make -C parser/mine clean
    make -C parser/mcollins clean
    make -C train-net clean
    rm -f input
Some points in this are worth commenting on. The preprocessing of the treebank was
principally intended to remove Section 23 so that it is impossible for any later parts of the
system to gain any knowledge from it, but at the same time it provides easy access to Section
23 where that is necessary (such as in evaluation). The preprocessing stage was also used
later when I experimented with tokenising the treebank (which leads to lower precision and
recall but better neighbours).
The ‘input’ directory is used to contain all of the system’s generated files. That way if
something breaks it is trivial to go back to older intermediate files just as it is trivial to go
back to old versions of the code using CVS.
This build script was only rarely perfectly up-to-date, but even so it was invaluable on
numerous occasions. For example: make clean has to be explicitly coded and so can easily
leave important files behind. However, checking out the project from CVS ensures a totally
clean build, and so failures due to missing files lead quickly and easily to finding source
code that I neglected to add to CVS. Similarly, running two identical builds is much easier
using a single huge build script, and because this should in theory lead to identical inter-
mediate files, it provides a method of testing that the toolchain is performing consistently.
For example, if the locale is set to English then the sort command in the shell will not sort
identically to a sort performed in C. This sort of bug is virtually impossible to track down since it
will be identical for most test cases, but running MD5 on intermediate data files finds any
discrepancies very quickly.
C.2 R scripts
The following scripts control the R system. Source code for each of the functions in R is available
inside R by typing the name of the function without arguments.
Bigram generation
library(mva)
wvs <- as.numeric(Sys.getenv("WORD_VECTOR_SIZE"))
data <- read.table("bigram.4000")
pc <- prcomp(data, scale=FALSE, retx=FALSE)   # PCA of the bigram count matrix
rot <- pc$rotation
save(rot, file="rotation")                    # reused for words beyond the first 4000
data <- as.matrix(data)
wordslong <- data %*% pc$rotation
words <- wordslong[,1:wvs]                    # keep only the first wvs components
write.table(words, file="output.1-4000")
Dendrogram generation
library(mva)
words <- read.table(file="words.700")
dist <- dist(words)        # pairwise distances between word vectors
clust <- hclust(dist)      # hierarchical clustering
pdf(file="english.pdf", paper="special", width=200, height=20)
plot(clust, hang=-1)
dev.off()                  # close the device so the pdf is written out
PCA for words beyond the first four thousand
library(mva)
wvs <- as.numeric(Sys.getenv("WORD_VECTOR_SIZE"))
load("rotation")
rot <- as.matrix(rot)
data <- read.table("bigram.8000")
data <- as.matrix(data)
wordslong <- data %*% rot
words <- wordslong[,1:wvs]
write.table(words,file="output.4001-8000")
C.3 Funnelweb code
@$@<valid join@>@(@4@)@M==@{
@! @1 = follow/precede
@! @2 = parent (l for follow, r for precede)
@! @3 = child
@! @4 = right/left
bool Parser::valid_@1_join(const Node * l, const Node * r) const
{
Subcat new_subcat = @2->@3c;
new_subcat -= g->conversions->nt_type(@3->parent);
if (l->dontuseme || r->dontuseme) return false;
assert(g->gram->@4_gram(@2->parent, @2->headnt, @3->parent));
return g->gram->@3c_gram(@2->parent, @2->headnt, new_subcat);
}
@}
@$@<gram rules@>==@{@-
@<valid join@>@(cc@, l@, r@, right@)
@<valid join@>@(follow@, l@, r@, right@)
@<valid join@>@(precede@, r@, l@, left@)
@}
@$@<combine loop@>@(@7@)@M==@{
@! Arguments:
@! 1 = call = follow vs precede
@! 2 = 1 if parent is left
@! 3 = left or right (parent val)
@! 4 = left or right (child val)
@! 5 = cc node (NULL or a real node)
@! 6 = left end (typically split)
@! 7 = right start (typically split + 1)
{
const int left_end = @6;
const int right_start = @7;
int parent_as_int;
int child_as_int;
int numkids, child_nr;
....
}
@$@<combine@>==@{@-
void Parser::combine(const int from, const int to)
{
int start,split,span,end;
int len = to - from + 1;
Chart *left_chart, *right_chart; // for combine loop
....
@<combine loop@>@(@’’follow@’’ @,
@’’1@’’ @,
@’’left@’’ @,
@’’right@’’ @,
@’’NULL@’’ @,
@’’split @’’ @,
@’’split + 1@’’ @)
@<combine loop@>@(@’’precede@’’ @,
@’’0@’’ @,
@’’right@’’ @,
@’’left@’’ @,
NULL @,
@’’split @’’ @,
@’’split + 1@’’ @)
}
While this source code is obviously incomplete, it will hopefully give the reader the idea
of how funnelweb was used. Macro expansion is extremely similar to a function call, but
it can be put in places where a function call cannot, or where refactoring into a function
call would be more awkward than duplicating code. For example, having a function that
calls join_2_edges_follow or join_2_edges_precede based on how it is called would
require a conditional in a really hard-to-read part of the loop. At first, reading the funnelweb
code is awkward, but with practice it reads as easily as ordinary function calls.
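
For comparison, roughly the same refactoring can be expressed in plain C++ with a template
parameter, so that the follow and precede variants are generated separately and the compiler
can resolve the direction at compile time rather than inside the inner loop. The sketch below
uses hypothetical names (Dir, grammar_ok) and is only meant to illustrate the idea, not to
reproduce the macro above.

struct Node;                                              // the parser's phrase structure
bool grammar_ok(const Node *parent, const Node *child);   // assumed grammar check

enum class Dir { Follow, Precede };

// One instantiation per direction: D is a compile-time constant, so each
// instantiation can compile down to straight-line code with no run-time test
// on the direction.
template <Dir D>
bool valid_join(const Node *l, const Node *r) {
    const Node *parent = (D == Dir::Follow) ? l : r;
    const Node *child  = (D == Dir::Follow) ? r : l;
    return grammar_ok(parent, child);
}

// Usage: valid_join<Dir::Follow>(l, r) or valid_join<Dir::Precede>(l, r)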
C.4 Processing the treebank
This section presents a full implementation of Collins’ preprocessor. This is important be-
cause Collins does not define the preprocessor in anywhere near enough detail to reimple-
ment it, and any errors implementing it lead to a significant loss of accuracy. This concern
has been previously reported in Bikel (2004). While Dan Bikel does provide code, it is not
especially easy to follow and is quite tightly tied to his parser.
C.4.1 Transforming the corpus
; add-compliment
;   Input is a tree with semantic information
;   Output is a tree without semantic but with headword information
; add-npb
;   Input is a tree including compliment information but not semantic.
;   Output is a tree with base NPs changed to NPBs
; add-headword
;   Input is a tree including compliment and npb information but not semantic.
;   Output is a tree with compliment npb and headword information.
;   (nt (tag word) (nt (tag word))) ->
;   (nt word tag (tag word tag) (nt word tag (tag word tag)))
; add-sg
;   Input is the final tree (with headword).
;   Output is the tree with S changed to SG when the sentence has
;   no subject (i.e. a NONE)
( defconstant badset
’ ( ”ADV” ”VOC” ”BNF” ”DIR” ”EXT” ”LOC” ”MNR” ”TMP” ”CLR” ”PRP” ) )
( defvar head−match−ht ( make−hash−table : s i z e 2 5 : t e s t # ’ equal ) )
( defvar get−direct ion−ht ( make−hash−table : s i z e 2 5 : t e s t # ’ equal ) )
( defvar co l l ins−events ) ; o u t put o f e v e n t s f i l e
( defun mygethash ( i ht ) ( gethash i ht ) )
( defun getword ( x )
( l e t ( ( r es ( mygethash x words ) ) )
( i f re s re s ( format t ”Eep−WORD: ˜ a˜%” x ) ) ) )
( defun getnt ( x )
( l e t ( ( r es ( mygethash x nts ) ) )
( i f re s re s ( format t ”Eep−NT : ˜ a˜%” x ) ) ) )
; Th i s b r e a k s an i t em i n t o i t s s y n t a c t i c and s e m a n t i c p a r t s
( defun d e t a i l s ( item )
( gethash ( gethash item nts ) n t s−d e t a i l s ) )
; Th i s r e t u r n s an item ’ s key s y n t a c t i c p a r t a s a symbo l
( defun s im pl i f y ( item )
( i f ( not ( ge tnt item ) )
item ; For t a g s j u s t use t h e t a g
( f i r s t ( d e t a i l s item ) ) ) )
( defun tags ( item )
( r e s t ( d e t a i l s item ) ) )
( defun compliment ( item )
( gethash ( + ∗ compliment−diff ∗ ( gethash item nts ) ) nts− inverse ) )
( defun nocompliment ( item )
( l e t ( ( item−as−num ( getnt item ) ) )
( i f ( > item−as−num ∗ compliment−diff ∗ )
( gethash (− item−as−num ∗ compliment−diff ∗ ) nts− inverse )
item ) ) )
( defun is−verb ( nt )
( find ( s im pl i f y nt ) ’ ( ”VP” ) : t e s t # ’ equal ) )
; t e s t e d .
; c h a n g e s i t em t o item−A
; The non− terminal must be :
; ( 1 ) an NP SBAR or S whose p a r e n t i s an S ;
; ( 2 ) an NP SBAR S or VP whose p a r e n t i s a VP ; or
; ( 3 ) an S whose p a r e n t i s an SBAR .
; 2 . The non− terminal must not have one o f t h e f o l l o w i n g s e m a n t i c t a g s :
; ADV VOC BNF DIR EXT LOC MNR TMP CLR or PRP .
( defun make−compliment ( parent item )
( l e t ∗ ( ( simple−parent ( s i mp l i fy parent ) )
( simple−item ( s i mp l i fy item ) )
( compliment−item ( compliment simple−item ) ) )
( cond ( ( find simple−parent ’ ( ”PP” ”PP−A” ) : t e s t # ’ equal )
( i f ( find simple−item ’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ”PP” ”ADJP” ”ADVP” )
: t e s t # ’ equal )
compliment−item
simple−item ) )
( ( i n t e r s e c t i o n ( tags item ) badset : t e s t # ’ equal ) simple−item )
( ( find simple−parent ’ ( ”S” ”S−A” ”SG” ”SG−A” ) : t e s t # ’ equal )
( i f ( find simple−item ’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ) : t e s t # ’ equal )
compliment−item
simple−item ) )
( ( find simple−parent ’ ( ”VP” ”VP−A” ) : t e s t # ’ equal )
( i f ( find simple−item
’ ( ”NPB” ”NP” ”SBAR” ”S” ”SG” ”VP” ) : t e s t # ’ equal )
compliment−item simple−item ) )
( ( find simple−parent ’ ( ”SBAR” ”SBAR−A” ) : t e s t # ’ equal )
( i f ( find simple−item ’ ( ”S” ”SG” ) : t e s t # ’ equal )
compliment−item simple−item ) )
( t simple−item ) ) ) )
( defun makequote ( x )
( format n i l ” \”˜ a\”” x ) )
( defun output− for−col l ins ( output depth t r e e )
( i f ( atom t r e e )
( format output ” ˜ a” ( makequote t r e e ) )
( progn
( format output ”˜%” )
( dotimes ( i depth ) ( format output ” ” ) )
( format output ” ( ˜ a ˜ a ˜ a”
( makequote ( f i r s t t r e e ) )
( makequote ( second t r e e ) )
( makequote ( thi rd t r e e ) ) )
( mapcar # ’ ( lambda ( x ) ( output− for−col l ins output ( 1 + depth ) x ) )
( cdddr t r e e ) )
( format output ” ) ” ) ) ) )
; t e s t e d
; c h a n g e s t h e p a r e n t o f a node t o −A i f t h e c h i l d has c e r t i a n f e a t u r e s
( defun add−compliment ( node )
( l e t ( ( parent ( s i mp l i fy ( f i r s t node ) ) ) )
( cons parent
( mapcar # ’ ( lambda ( c h i l d ) ( add−compliment−internal parent c h i l d ) )
( cdr node ) ) ) ) )
; t e s t e d
( defun add−compliment−internal ( parent c h i l d )
( i f ( atom c h i l d ) c h i l d
( cons
( make−compliment parent ( f i r s t c h i l d ) )
( mapcar # ’ ( lambda ( grandchild )
( add−compliment−internal ( f i r s t c h i l d ) grandchild ) )
( cdr c h i l d ) ) ) ) )
; Re turns a two i t em l i s t . F i r s t i t em i s t h e r e s u l t and s e c o n d i t em
; i s t r u e when a change has been made and n i l o t h e r w i s e .
; A lgo i r thm l o g i c :
; i f any c h i l d r e n changed
; th en a c h i l d c o n t a i n s NPB so we don ’ t change t h e c u r r e n t node
; Otherwi s e
; t h e c h i l d r e n can be d i s c a r d e d ( n o t h i n g was changed ) and we j u s t
; c o n s i d e r t h e c u r r e n t node
; i f non NP then ( X a b c ) −> (X a ’ b ’ c ’ )
; where a ’ means ( add−npb a )
; e l s i f baseNP then ( NP a b c ) −> (NP ( NPB a b c ) )
; e l s e ( NP a b c ) −> (NP a b c )
; t e s t e d
( defun add−npb ( node )
( i f ( atom ( second node ) ) ; t e r m i n a l
( l i s t node n i l )
( l e t ( ( ch i ldren ( mapcar # ’ add−npb ( cdr node ) ) ) )
( i f ( member t ch i ldren : key # ’ second ) ; s ome th ing was c o n v e r t e d
( l i s t ( cons ( f i r s t node ) ( mapcar # ’ f i r s t ch i ldren ) ) t )
( cond ( ( equal ( f i r s t node ) ’ ”NP” )
( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )
( ( equal ( f i r s t node ) ’ ”NP−A” )
( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )
( t ( l i s t node n i l ) ) ) ) ) ) )
; t h i s v e r s i o n r e p l a c e s ( NP . . . ) wi th ( NPB . . . )
; ( cond ( ( e q u a l ( f i r s t node ) ’ ”NP ” ) ( l i s t ( cons ’ ”NPB ” ( c d r node ) ) t ) )
; ( ( e q u a l ( f i r s t node ) ’ ”NP−A” ) ( l i s t ( cons ’ ”NPB ” ( c d r node ) ) t ) )
; t h i s v e r s i o n r e p l a c e s ( NP . . . ) (NP ( NPB . . . ) )
; ( cond ( ( e q u a l ( f i r s t node ) ’ ”NP”)
( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )
; ( ( e q u a l ( f i r s t node ) ’ ”NP−A”)
( l i s t ( l i s t ( car node ) ( cons ’ ”NPB” ( cdr node ) ) ) t ) )
; t e s t e d
( defun add−headword ( node )
( i f ( atom ( second node ) )
( l i s t ( f i r s t node ) ( second node ) ( f i r s t node ) ) ; t a g word t a g
( l e t ( ( head ( nocompliment ( f i r s t node ) ) )
( ch i ldren ( mapcar # ’ add−headword ( cdr node ) ) ) )
( cond
( ( or ( equal head ’ ”NP” ) ( equal head ’ ”NPB” ) ( equal head ’ ”NX” ) )
( add−headword−np ( f i r s t node ) ch i ldren ) )
( ( equal head ’ ”CC” ) ( add−headword−cc ( f i r s t node ) ch i ldren ) )
( t ( add−headword−normal ( f i r s t node ) ch i ldren ) ) ) ) ) )
; Th i s adds headword i n f o r m a t i o n by s e l e c t i n g t h e head c h i l d .
; t e s t e d
( defun add−headword−normal ( head ch i ldren & opt iona l ( mustfind t ) ( nt−head n i l ) )
( l e t ∗ ( ( head−for−output ( i f nt−head nt−head head ) )
( basehead ( nocompliment head ) )
( l e f t− to− r ight ( get−direc t ion basehead ) )
( p r i o r i t y− l i s t ( g e t−p r i o r i t y− l i s t basehead ) )
( search−chi ldren ( i f l e f t− to− r ight ch i ldren ( reverse ch i ldren ) ) )
( found ( remove n i l
( mapcar # ’ ( lambda ( item )
( find item search−chi ldren : key # ’ f i r s t : t e s t # ’ equal ) )
p r i o r i t y− l i s t ) ) ) )
( cond
( found ; f ound t h e headword
( append
( cons head−for−output
( l i s t ( second ( f i r s t found ) ) ( th i rd ( f i r s t found ) ) ) )
ch i ldren ) )
( ( not mustfind ) n i l ) ; headword not found and not ne e ded
( t ( append ; headword not found assume t h e f i r s t / l a s t c h i l d
( cons head−for−output
( l i s t ( second ( f i r s t search−chi ldren ) )
( th i rd ( f i r s t search−chi ldren ) ) ) )
ch i ldren ) ) ) ) )
( defun add−headword−cc ( head ch i ldren )
( append
( cons head
( l i s t ( second ( f i r s t ch i ldren ) ) ( th i rd ( f i r s t ch i ldren ) ) ) )
ch i ldren ) )
; t e s t e d
( defun add−headword−np ( head ch i ldren )
( cond ( ( last−word−pos head ch i ldren ) )
( ( add−headword−normal ’ ”FAKE−1” ch i ldren n i l head ) )
( ( add−headword−normal ’ ”FAKE−2” ch i ldren n i l head ) )
( ( add−headword−normal ’ ”FAKE−3” ch i ldren n i l head ) )
( t ( append
( l i s t head
( second ( f i r s t ( l a s t ch i ldren ) ) )
( th i rd ( f i r s t ( l a s t ch i ldren ) ) ) )
ch i ldren ) ) ) )
; t e s t e d
( defun last−word−pos ( head ch i ldren )
( l e t ( ( l a s t c h i l d ( f i r s t ( l a s t ch i ldren ) ) ) )
(when ( equal ’ ”POS” ( th i rd l a s t c h i l d ) )
( append
( l i s t head ( second l a s t c h i l d ) ( th i rd l a s t c h i l d ) )
ch i ldren ) ) ) )
( defun get−direc t ion ( head )
( multiple−value−bind ( r e s u l t found ) ( gethash head get−direct ion−ht )
( progn ( when ( not found ) ( format t ”Oops : get−direc t ion ˜ a −> NULL˜%” head ) )
r e s u l t ) ) )
( defun g e t−p r i o r i t y− l i s t ( head )
( multiple−value−bind ( r e s u l t found ) ( gethash head head−match−ht )
( progn ( when ( not found )
( format t ”Oops : g e t−p r i o r i t y− l i s t ˜ a −> NULL˜%” head ) )
r e s u l t ) ) )
( s e t f
( gethash ’ ”ADJP” head−match−ht )
’ ( ”NNS” ”QP” ”NN” ”$” ”ADVP” ” J J ” ”VBN” ”VBG” ”ADJP”
” J JR ” ”NP” ”NPB” ” J J S ” ”DT” ”FW” ”RBR” ”RBS” ”SBAR” ”BR” )
( gethash ’ ”ADVP” head−match−ht )
’ ( ”RB” ”RBR” ”RBS” ”FW” ”ADVP” ”TO” ”CD”
” JJR ” ” J J ” ”IN” ”NP” ”NPB” ” J J S ” ”NN” )
( gethash ’ ”CONJP” head−match−ht ) ’ ( ”CC” ”RB” ”IN” )
( gethash ’ ”LST” head−match−ht ) ’ ( ”LS” ” ˆ ” )
( gethash ’ ”NAC” head−match−ht ) ’ ( ”NN” ”NNS” ”NNP” ”NNPS” ”NP” ”NPB” ”NAC”
”EX” ”$” ”CD” ”QP” ”PRP” ”VBG” ” J J ” ” J J S ” ” J JR ” ”ADJP” ”FW” )
( gethash ’ ”FAKE−1” head−match−ht ) ’ ( ”NN” ”NNP” ”NNPS” ”NNS” ”NX” ”POS” ” J JR ” )
( gethash ’ ”FAKE−2” head−match−ht ) ’ ( ”NP” ”NPB” )
( gethash ’ ”FAKE−3” head−match−ht ) ’ ( ”$” ”ADJP” ”PRN” ”CD” ” J J ” ” J J S ” ”RB” ”QP” )
( gethash ’ ”FRAG” head−match−ht ) n i l
( gethash ’ ”INTJ” head−match−ht ) n i l
( gethash ’ ”PRN” head−match−ht ) n i l
( gethash ’ ”UCP” head−match−ht ) n i l
( gethash ’ ”PP” head−match−ht ) ’ ( ”IN” ”TO” ”VBG” ”VBN” ”RP” ”FW” )
( gethash ’ ”PRT” head−match−ht ) ’ ( ”RP” )
( gethash ’ ”QP” head−match−ht )
’ ( ”$” ”IN” ”NNS” ”NN” ” J J ” ”RB” ”DT” ”CD” ”NCD” ”QP” ” J JR ” ” J J S ” )
( gethash ’ ”RRC” head−match−ht ) ’ ( ”VP” ”NP” ”NPB” ”ADVP” ”ADJP” ”PP” )
( gethash ’ ”S” head−match−ht )
’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )
( gethash ’ ”SG” head−match−ht )
’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )
( gethash ’ ”SBAR” head−match−ht )
’ ( ”WHNP” ”WHPP” ”WHADVP” ”WHADJP” ”IN” ”DT” ”S” ”SG” ”SQ” ”SINV” ”SBAR”
”FRAG” )
( gethash ’ ”SBARQ” head−match−ht ) ’ ( ”SQ” ”S” ”SG” ”SINV” ”SBARQ” ”FRAG” )
( gethash ’ ”SINV” head−match−ht )
’ ( ”VBZ” ”VBD” ”VBP” ”VB” ”MD” ”VP” ”S” ”SG” ”SINV” ”ADJP” ”NP” ”NPB” )
( gethash ’ ”SQ” head−match−ht ) ’ ( ”VBZ” ”VBD” ”VBP” ”VB” ”MD” ”VP” ”SQ” )
( gethash ’ ”TOP” head−match−ht ) ; C o l l i n s didn ’ t have a t o p c a t e g o r y h e r e ??
’ ( ”TO” ”IN” ”VP” ”S” ”SG” ”SBAR” ”ADJP” ”UCP” ”NP” ”NPB” )
( gethash ’ ”VP” head−match−ht )
’ ( ”TO” ”VBD” ”VBN” ”MD” ”VBZ” ”VB” ”VBG” ”VBP” ”VP” ”ADJP” ”NN” ”NNS” )
( gethash ’ ”WHADJP” head−match−ht ) ’ ( ”CC” ”WRB” ” J J ” ”ADJP” )
( gethash ’ ”WHADVP” head−match−ht ) ’ ( ”CC” ”WRB” )
( gethash ’ ”WHNP” head−match−ht ) ’ ( ”WDT” ”WP” ”WP$” ”WHADJP” ”WHPP” ”WHNP” )
( gethash ’ ”WHPP” head−match−ht ) ’ ( ”IN” ”TO” ”FW” )
( gethash ’ ”X” head−match−ht ) n i l )
( s e t f ; t i s l e f t t o r i g h t n i l i s r i g h t t o l e f t
( gethash ’ ”ADJP” get−direct ion−ht ) t
( gethash ’ ”ADVP” get−direct ion−ht ) n i l
( gethash ’ ”CONJP” get−direct ion−ht ) n i l
( gethash ’ ”FRAG” get−direct ion−ht ) n i l
( gethash ’ ”INTJ” get−direct ion−ht ) t
( gethash ’ ”LST” get−direct ion−ht ) n i l
( gethash ’ ”NAC” get−direct ion−ht ) t
( gethash ’ ”FAKE−1” get−direct ion−ht ) n i l
( gethash ’ ”FAKE−2” get−direct ion−ht ) t
( gethash ’ ”FAKE−3” get−direct ion−ht ) n i l
( gethash ’ ”PP” get−direct ion−ht ) n i l
( gethash ’ ”PRN” get−direct ion−ht ) t
( gethash ’ ”PRT” get−direct ion−ht ) n i l
( gethash ’ ”QP” get−direct ion−ht ) t
( gethash ’ ”RRC” get−direct ion−ht ) n i l
( gethash ’ ”S” get−direct ion−ht ) t
( gethash ’ ”SG” get−direct ion−ht ) t
( gethash ’ ”SBAR” get−direct ion−ht ) t
( gethash ’ ”SBARQ” get−direct ion−ht ) t
( gethash ’ ”SINV” get−direct ion−ht ) t
( gethash ’ ”SQ” get−direct ion−ht ) t
( gethash ’ ”TOP” get−direct ion−ht ) t ; C o l l i n s doesn ’ t say
( gethash ’ ”UCP” get−direct ion−ht ) n i l
( gethash ’ ”VP” get−direct ion−ht ) t
( gethash ’ ”WHADJP” get−direct ion−ht ) t
( gethash ’ ”WHADVP” get−direct ion−ht ) n i l
( gethash ’ ”WHNP” get−direct ion−ht ) t
( gethash ’ ”WHPP” get−direct ion−ht ) n i l
( gethash ’ ”X” get−direct ion−ht ) n i l )
; True i f t h e t r e e has on ly ”−NONE−” l e a v e s
; t e s t e d
( defun has−only−none ( t r e e )
( i f ( atom t r e e ) ( e r r o r ”OOPS!˜% ” ) )
( i f ( atom ( second t r e e ) )
( equal ( f i r s t t r e e ) ”−NONE−” )
( every # ’ has−only−none ( cdr t r e e ) ) ) )
; I f t h e branch has no ”−NONE−” i t i s k e p t
; I f a S node has any d e c e n d e n t s o f ”−NONE−” i t i s changed t o SG .
;
; The r e t u r n i s o f t h e form ( r e s u l t r e t v a l )
; where r e s u l t i s t h e t r e e wi th c h a n g e s made
; and r e t v a l i s t when a −NONE− has been dropped but
; no S has been changed t o SG .
; t e s t e d
( defun drop−none ( t r e e )
( cond ( ( has−only−none t r e e ) ( l i s t n i l t ) )
( ( atom ( second t r e e ) ) ( l i s t t r e e n i l ) )
( t
( l e t ∗ ( ( newkids ( mapcar # ’ drop−none ( cdr t r e e ) ) )
( dropped ( second ( find t newkids : key # ’ second ) ) )
( nonnullkids ( remove n i l ( mapcar # ’ f i r s t newkids ) ) ) )
( i f ( and dropped ( equal ”S” ( f i r s t t r e e ) ) )
( l i s t ( cons ”SG” nonnullkids ) n i l )
( l i s t ( cons ( f i r s t t r e e ) nonnullkids ) dropped ) ) ) ) ) )
( defun process ( t r e e output )
( output− for−col l ins output 0
( add−headword
( f i r s t ( add−npb
( add−compliment
( f i r s t ( drop−none t r e e ) )
)
) )
)
) )
( defun doi t ( )
( with−open−file ( output ” wsj . c o l l i n s ” : d i r e c t i o n : output )
( with−open−file ( input ” w s j t r a i n . combined” : d i r e c t i o n : input )
( l e t ( ( s t a r t ( get−universal−time ) )
( f i l e− s i z e 5 0 0 0 0 ) )
( do ( ( t r e e ( read input ) ( read input n i l ’ eof ) )
( sentence 1 ( 1 + sentence ) ) )
( ( equal t r e e ’ eof ) ( format t ” ˜ c100% complete ˜% ” #\CR) )
( progn
(when ( zerop ( mod sentence 5 0 ) )
( l e t ( ( sec ( ∗ (− f i l e− s i z e sentence )
( / (− ( get−universal−time ) s t a r t ) sentence ) ) ) )
( format t ” ˜ c ˜ f% complete ˜ a remaining ”
#\CR
( / sentence ( / f i l e− s i z e 1 0 0 ) )
( cond
( ( > sec 3 6 0 0 )
( format n i l ” ˜ d hour ˜ : P” ( round ( + 0 . 5 ( / sec 3 6 0 0 ) ) ) ) )
( ( > sec 6 0 )
( format n i l ” ˜ d minute ˜ : P” ( round ( + 0 . 5 ( / sec 6 0 ) ) ) ) )
( t ( format n i l ” ˜ d second ˜ : P” ( round ( + 0 . 5 sec ) ) ) ) ) ) ) )
( process t r e e output ) ) ) ) ) ) )
; ( d o i t )
C.4.2 Deriving a grammar
Collins’ parser uses an explicit grammar to avoid generating edges that will inevitably have
a probability of zero. Code to derive this grammar is given below:
; ; F inds e v e r y p o s s i b l e p a r e n t o f e v e r y NT.
; ( s e t f ∗PRINT−PRETTY∗ n i l )
( defconstant nts−plus1 ( 1 + ∗num−nts ∗ ) )
( defun take ( n l ) ( i f ( zerop n ) n i l ( cons ( car l ) ( take ( 1− n ) ( cdr l ) ) ) ) )
( defun drop ( n l ) ( i f ( zerop n ) l ( drop ( 1− n ) ( cdr l ) ) ) )
( defvar l e f t−data ( make−hash−table : s i z e nts−plus1 : t e s t # ’ equal ) )
( defvar l e f t− f p
( open ” l e f t ” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )
( defvar right−fp
( open ” r i g h t ” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )
( defvar unary−fp
( open ”unary” : d i r e c t i o n : output : i f− e x i s t s : overwrite : if−does−not−exist : c r e a t e ) )
( defun p r o c e s s− l e f t ( l nt head−tag )
( d o l i s t ( item l )
( format l e f t− f p ” ˜ a ˜ a ˜ a˜%” nt head−tag item ) ) )
( defun process−r ight ( l nt head−tag )
( d o l i s t ( item l )
( format right−fp ” ˜ a ˜ a ˜ a˜%” nt head−tag item ) ) )
; F inds t h e c h i l d with t h e r i g h t word / t a g headword
( defun find−head ( headtag word tag ch i ldren & opt iona l ( l e f t n i l ) )
( i f ( null ch i ldren ) ( warn ( format n i l ”Find−head : Children n u l l ˜ a ˜ a” word tag ) ) )
( l e t ( ( c h i l d ( f i r s t ch i ldren ) ) )
( i f ( and ( equal word ( second c h i l d ) )
( equal tag ( th i rd c h i l d ) ) )
( l i s t l e f t headtag ( f i r s t c h i l d ) ( cdr ch i ldren ) )
( find−head headtag word tag ( cdr ch i ldren )
( append l e f t ( l i s t ( car ch i ldren ) ) ) ) ) ) )
; Adds e v e r y c h i l d / p a r e n t p a i r t o t h e hash t a b l e
( defun process ( parent ch i ldren )
(when ( consp ( f i r s t ch i ldren ) )
( progn
( l e t ( ( head ( find−head ( f i r s t parent ) ( second parent ) ( th i rd parent ) ch i ldren ) ) )
( progn
( format unary−fp ” ˜ a ˜ a˜%” ( thi rd head ) ( f i r s t parent ) )
( p r o c e s s− l e f t ( mapcar # ’ f i r s t ( f i r s t head ) )
( second head ) ( th i rd head ) )
( process−r ight ( mapcar # ’ f i r s t ( fourth head ) )
( second head ) ( th i rd head ) ) ) )
( mapcar # ’ ( lambda ( c h i l d ) ( process ( take 3 c h i l d ) ( drop 3 c h i l d ) ) )
ch i ldren ) ) ) )
; ( t r a c e p r o c e s s f ind−head )
; Saves t h e hash t a b l e a s t h e f i l e p a r e n t s
( defun output ( )
( c lose l e f t− f p )
( c lose right−fp )
( c lose unary−fp ) )
( with−open−file ( f i l e ” wsj . c o l l i n s ” : d i r e c t i o n : input )
( l e t ( ( s t a r t ( get−universal−time ) )
( f i l e− s i z e 5 0 0 0 0 ) )
( do ( ( t r e e ( read f i l e ) ( read f i l e n i l ’ eof ) )
( sentence 1 ( 1 + sentence ) ) )
( ( equal t r e e ’ eof )
( progn ( format t ” ˜ c100% complete ˜ % ” #\CR)
( output ) ) )
( progn
(when ( zerop ( mod sentence 5 0 ) )
( l e t ( ( sec ( ∗ (− f i l e− s i z e sentence )
( / (− ( get−universal−time ) s t a r t ) sentence ) ) ) )
( format t ” ˜ c ˜ f% complete , ˜ a remaining ”
#\CR
( / sentence ( / f i l e− s i z e 1 0 0 ) )
( cond
( ( > sec 3 6 0 0 )
( format n i l ” ˜ d hour ˜ : P” ( round ( + 0 . 5 ( / sec 3 6 0 0 ) ) ) ) )
( ( > sec 6 0 )
( format n i l ” ˜ d minute ˜ : P” ( round ( + 0 . 5 ( / sec 6 0 ) ) ) ) )
( t ( format n i l ” ˜ d second ˜ : P” ( round ( + 0 . 5 sec ) ) ) ) ) ) ) )
( process ( take 3 t r e e ) ( drop 3 t r e e ) ) ) ) ) )
C.5 Processing bigrams
C.5.1 Counting bigrams
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <stdio.h>
#include "amalloc.h"

#define fudge 5                 /* Add to every malloc to fit quirky cases in */
#define maxval 100000
#define maxWordSize 560
#define quote (char)0x22        /* " */
#define numLines 407836244
#define num_rows (numDendro - 1)
#define progress (numLines / 1000)
#define prog_dist (num_rows / 990)
#define max(a,b) ((a > b) ? a : b)
#define tag_offset numFeatures  /* counts[tag_offset..4001] is for tags */

double lowProb;
int neighbours_to_print = 100;  /* No longer #defined bec. tags have < 100 */
int num_cols = 0;               /* Number of dimensions after SVD */
int num_dist_cols = 0;          /* Number of words to consider close */
int numDendro = 0;              /* Read from command line */
int numTagFeatures = 0;         /* Number of features to reserve for tags */
int windowSize = 0;             /* Remember to add one to skip the center */
int numFeaturesTotal = 0;       /* Should be 4001 - always add 1 to desired */
int numFeatures = 0;            /* numFeaturesTotal - numTagFeatures */
int UNKNOWNWORD = -1;           /* No longer const, set by load_convert */
const int numWords = 200000;
const int DEBUG = 0;
int i = 0;
int j = 0;

typedef struct {
    double dist;
    int word;
} dist_entry;

void load_features(int *features);
int load_convert(char **word_strings, int numDendro);
void load_neighbours(double **neighbours);
void scale_neighbours(double **neighbours);
void seed_dist(dist_entry **dist, double **neighbours);
void calc_distance(dist_entry **dist, double **neighbours);
double calc_a_dist(int x, int y, dist_entry **dist, double **neighbours);
void check(dist_entry *dist);   /* Check isort (and, partially, some others) is working */
void print_neighbours(dist_entry **neighbours, char **word_strings, int k);
void derive_counts(const int *features, const dist_entry **dist,
                   int *window, char **word_strings, int **counts,
                   int *seen, int windowSize, const char *corpus_name, int left);

/* Pseudocode
   initialise window[0..windowSize]
   i = windowSize - 1
   while corpus {
     if words[window[i]]
       for j = 0; j < windowSize; j++
         if features[window[j]]
           counts[i][j]++
     window[i] = $
     if ++i == windowSize, i = 0;
   }
*/
void l o a d f e a t u r e s ( i n t ∗ f e a t u r e s ) {FILE ∗ FREQ ;
i n t i , count , word ;
FREQ = fopen ( ”words . f r e q ” , ” r ” ) ;
a s s e r t (FREQ ) ;
for ( i = 1 ; i < numFeatures ; i ++) {f s c a n f (FREQ, ”%d %d” , & count , & word ) ;
f e a t u r e s [ word ] = i ;
}
f c l o s e (FREQ ) ;
f p r i n t f ( s tderr , ”Loaded frequency information \n” ) ;
}
i n t load conver t ( char ∗ ∗ word strings , i n t numDendro ) {FILE ∗ CONVERT;
i n t num = 0 , numCorrect = 1 , oldNum = 1 ;
char ∗ curWord ;
char ∗ l i n e ;
char ∗ number ;
l i n e = malloc ( maxWordSize ∗ 2 ) ;
number = malloc ( 1 6 ) ;
CONVERT = fopen ( ” convert ” , ” r ” ) ;
a s s e r t (CONVERT) ;
f g e t s ( l i n e , maxWordSize , CONVERT) ;
number = s t r t o k ( l i n e , ” ” ) ;
num = a t o i ( number ) ;
oldNum = num − 1 ;
numCorrect = 0 ;
while ( numCorrect < numDendro && ! f e o f (CONVERT) ) {curWord = s t r t o k (NULL, ”\n” ) ;
/ / i f ( ∗ curWord = = ’ \ \ ’ ) { / / ug ly hack , c u r r e n t l y d i s a b l e d
/ / memmove ( curWord +2 , curWord +1 , ( s t r l e n ( curWord ) ) ) ;
/ / curWord [ 1 ] = ’ \ \ ’ ;
/ / }
/ / i f ( ∗ curWord = = ’ | ’ ) {/ / memmove ( curWord +4 , curWord +1 , ( s t r l e n ( curWord ) ) ) ;
/ / curWord [ 0 ] = ’ P ’ ; curWord [ 1 ] = ’ I ’ ;
/ / curWord [ 2 ] = ’ P ’ ; curWord [ 3 ] = ’ E ’ ;
/ / }
/ / i f ( ∗ curWord = = q u o t e ) {/ / memmove ( curWord +5 , curWord +1 , ( s t r l e n ( curWord ) ) ) ;
/ / curWord [ 0 ] = ’Q ’ ; curWord [ 1 ] = ’U ’ ;
/ / curWord [ 2 ] = ’O ’ ; curWord [ 3 ] = ’ T ’ ;
/ / curWord [ 4 ] = ’ E ’ ;
/ / }
i f ( num ! = oldNum ) {word str ings [num] = malloc ( s t r l e n ( curWord ) + 1 ) ;
a s s e r t ( word str ings [num ] ) ;
s t rcpy ( word str ings [num ] , curWord ) ;
i f ( 0 = = strcmp ( curWord , ”UNKNOWNWORD” ) ) {UNKNOWNWORD = num;
}}i f (DEBUG) p r i n t f ( ”%s\n” , curWord ) ;
f g e t s ( l i n e , maxWordSize , CONVERT) ;
oldNum = num;
number = s t r t o k ( l i n e , ” ” ) ;
num = a t o i ( number ) ;
numCorrect ++;
}return numCorrect ;
}
void load neighbours ( double ∗ ∗ neighbours ) {FILE ∗ fp ;
double val ;
char ∗ word str ;
i n t cur row , c u r c o l ;
word str = malloc (max( 4 0 0 , maxWordSize ) ∗ s i ze of ( char ) ) ;
fp = fopen ( ” output ” , ” r ” ) ; /∗ Format : word v a l 1 . . . v a l 5 0 ∗ /
f s c a n f ( fp , ”%s ” , word str ) ;
for ( cur row = 1 ; ! f e o f ( fp ) && ( cur row <= num rows ) ; cur row ++) {for ( c u r c o l = 0 ; c u r c o l < num cols ; c u r c o l ++) {
f s c a n f ( fp , ”%l f ” ,& val ) ;
neighbours [ cur row ] [ c u r c o l ] = val ;
}f s c a n f ( fp , ”%s ” , word str ) ;
}f r e e ( word str ) ;
s ca le ne ighbours ( neighbours ) ;
f p r i n t f ( s tderr , ”Loaded neighbour information \n” ) ;
}
void sca le ne ighbours ( double ∗ ∗ neighbours ) {double sum ;
i n t cur row , c u r c o l ;
for ( cur row = 1 ; cur row <= num rows ; cur row ++) {sum = 0 . 0 ;
for ( c u r c o l = 0 ; c u r c o l < num cols ; c u r c o l ++) {sum + = fabs ( neighbours [ cur row ] [ c u r c o l ] ) ;
}for ( c u r c o l = 0 ; c u r c o l < num cols ; c u r c o l ++) {
neighbours [ cur row ] [ c u r c o l ] / = sum ;
}}
}
void s e e d d i s t a n c e ( d i s t e n t r y ∗ ∗ d i s t ) {i n t x , y ;
for ( x = 1 ; x <= num rows ; x ++) {for ( y = 0 ; y < num dis t co ls ; y ++) {
d i s t [ x ] [ y ] . d i s t = maxval ;
d i s t [ x ] [ y ] . word = −1 ;
}}
}
/∗ I n s e r t c u r d i s t / word i n t o t h e s o r t e d l i s t d i s t ∗ /
void i s o r t ( d i s t e n t r y ∗ dis t , double c u r d i s t , i n t word ) {i n t i = 0 ;
for ( i = 0 ; ( i < num dis t co ls ) && ( d i s t [ i ] . d i s t < c u r d i s t ) ; i + + ) { } ;
i f ( i < num dis t co ls ) {memmove(& d i s t [ i +1] ,& d i s t [ i ] , s i ze of ( d i s t e n t r y ) ∗ ( num dis t co ls − i − 1 ) ) ;
d i s t [ i ] . d i s t = c u r d i s t ;
d i s t [ i ] . word = word ;
}}
void c a l c d i s t a n c e ( d i s t e n t r y ∗ ∗ dis t , double ∗ ∗ neighbours ) {double d ;
f l o a t curPercent = 0 . 0 ;
i n t x , y ;
i n t curLine = 0 ;
s e e d d i s t a n c e ( d i s t ) ;
for ( x = 1 ; x <= num rows ; x ++ , curLine ++) {i f ( curLine >= p r o g d i s t ) {
curLine = 0 ;
f p r i n t f ( s tderr , ”\ r S o r t i n g : %.1 f%%” , curPercent ) ;
f f l u s h ( s t d e r r ) ;
curPercent + = 0 . 1 ;
}for ( y = 1 ; y <= num rows ; y ++) {
d = c a l c a d i s t ( x , y , d i s t , neighbours ) ;
i s o r t ( d i s t [ x ] , d , y ) ;
check ( d i s t [ x ] ) ;
}}
}
void check ( d i s t e n t r y ∗ d i s t ) {i n t cur w = d i s t [ 0 ] . word ;
i n t i = 1 ;
for ( i = 1 ; ( i < num dis t co ls ) & & ( d i s t [ i ] . word ! = − 1 ) ; i ++) {a s s e r t ( d i s t [ i ] . d i s t >= d i s t [ i −1] . d i s t ) ;
a s s e r t ( cur w ! = d i s t [ i ] . word ) ;
cur w = d i s t [ i ] . word ;
}}
/∗ P r i n t t h e k c l o s e s t n e i g h b o u r s f o r e a c h word t o t h e f i l e ” n e i g h b o u r s ” ∗ /
void pr int ne ighbours ( d i s t e n t r y ∗ ∗ neighbours , char ∗ ∗ word strings , i n t k ) {i n t word ;
i n t i ;
FILE ∗ fp = fopen ( ” neighbours ” , ”w” ) ;
for ( word = 1 ; word <= num rows ; word++) {for ( i = 0 ; i < k ; i ++) {
f p r i n t f ( fp , ”%s(% f ) ” , word str ings [ neighbours [ word ] [ i ] . word ] ,
neighbours [ word ] [ i ] . d i s t ) ;
}f p r i n t f ( fp , ”\n” ) ;
}}
/* Returns the Euclidean distance between words x and y */
double calc_a_dist(int x, int y, dist_entry **dist, double **neighbours) {
    double d = 0.0;
    double sum = 0.0;
    int i;

    for (i = 0; i < num_cols; i++) {
        d = neighbours[x][i] - neighbours[y][i];
        d *= d;
        sum += d;
    }
    return sqrt(sum);
}
/* Words are the words to count, the rows
   features are the things to look for, the columns
   window is where the corpus passes through
   word_strings is used for debugging/output
   counts is where the results are stored
   seen is false the first time we see a word, to deal with unknown words
   windowSize is how far to look
   start_pos is where to start the cursor -- to enable the same code to do both
   left and right
*/
void derive_counts(const int *features, const dist_entry **dist,
                   int *window, char **word_strings, int **counts,
                   int *seen, int windowSize, const char *corpus_name,
                   int left) {
    FILE *CORPUS = NULL;
    float curPercent = 0.0;
    int i = 0, j = 0, curLine = 0;
    int word = -1;
    int pos = windowSize - 1;
    int currentWord;
    int featureWord;
    int alsoDoUnknown;
#ifdef NEIGHBOURS
    int k = 0;
#endif
    /* i is the loc that we're going to put the word into */
    /* j is a loop counter for looking at the feature word */
    /* pos is the loc of the word we're looking for neighbours of */
    /* curLine is for debugging, current line in the corpus */
    /* currentWord is the word in position pos
       -- for neighbours it is a neighbour of this word */
    /* actualCurrentWord is only used in neighbours, it is the word in position pos
       -- i.e. same meaning as currentWord in non-neighbour case */
    /* featureWord is the word in position j, it co-occurs with currentWord */
    /* alsoDoUnknown is true iff this is the first time we've seen the current word */
    CORPUS = fopen(corpus_name, "r");
    assert(CORPUS);
    if (left) {
        for (i = 0; i < windowSize; i++) {
            window[i] = 0;
        }
        i = 0;
    } else { /* Unlike left windows, we first have to load the whole context */
        for (i = 0; i < windowSize; i++) {
            fscanf(CORPUS, "%d", &word);
            window[i] = word;
        }
        i = windowSize - 1;
    }
    while (!feof(CORPUS)) {
        if (DEBUG) { /* May break for unknown */
            printf("Found %s\n", word_strings[window[pos]]);
        }
        if (0 == seen[window[pos]]) {
            seen[window[pos]] = 1;
            alsoDoUnknown = 1;
        } else {
            alsoDoUnknown = 0;
        }
        for (j = 0; j < windowSize; j++) {
            if ((j != pos) && (features[window[j]])) {
#ifdef NEIGHBOURS
                for (k = 0; k < num_dist_cols; k++) {
#endif
                /* Sorry about the next line...
                   window[pos] is the current word
                   window[j] is the feature word which is within the window of the
                   current word
                   features[window] is needed to map words into columns
                   dist[word] gives the 20 (+/-) nearest words to the current word
                   counts is the number of times this has occurred
                */
                /* The current input word */
#ifdef NEIGHBOURS
                currentWord = window[pos];
                currentWord = dist[currentWord][k].word;
#else
                currentWord = window[pos];
#endif
                /* Feature that occurred within the window with the current word */
                featureWord = features[window[j]];
                counts[currentWord][featureWord]++;
                if (alsoDoUnknown == 1) {
                    currentWord = UNKNOWNWORD;
#ifdef NEIGHBOURS
                    currentWord = dist[currentWord][k].word;
#endif
                    /* Feature that occurred within the window with the current word */
                    featureWord = features[window[j]];
                    counts[currentWord][featureWord]++;
                }
#ifdef NEIGHBOURS
                }
#endif
            }
        }
        fscanf(CORPUS, "%d", &word);
        /* Get ready for the next word */
        window[i] = word;
        i++; pos++; curLine++;
        if (i == windowSize) {
            i = 0;
        }
        if (pos == windowSize) {
            pos = 0;
        }
        if (curLine >= progress) {
            curLine = 0;
            fprintf(stderr, "\r%.1f%%", curPercent);
            fflush(stderr);
            curPercent += 0.1;
        }
    }
    fclose(CORPUS);
}
int main(int argc, char *argv[]) {
    double **neighbours;
    dist_entry **dist;
    char **word_strings;
    int **counts;
    int *window;
    int *features;
    int *seen;
    int i, j;                               /* loop counters */

#ifdef NEIGHBOURS
    assert(argc == 6);
#else
    assert(argc == 4);
#endif
    /* Next lines will segfault if env not set right */
    numTagFeatures = atoi(getenv("TAG_VECTOR_SIZE"));
    num_cols = atoi(getenv("WORD_VECTOR_SIZE"));
    numDendro = atoi(argv[1]) + 1;          /* + because words[0] unused */
    numFeaturesTotal = atoi(argv[2]) + 1;
    numFeatures = numFeaturesTotal - numTagFeatures;
    windowSize = atoi(argv[3]);
#ifdef NEIGHBOURS
    num_cols = atoi(argv[4]);               /* Number of dimensions after SVD */
    num_dist_cols = atoi(argv[5]);          /* Number of words to consider neighbours */
#endif
    lowProb = 1.0 / numFeatures;
    word_strings = (char **) malloc(sizeof(char *) * numDendro + fudge);
    counts = (int **) amalloc(sizeof(int), NULL, 2, numDendro + fudge, numFeatures + fudge);
    window = (int *) malloc(sizeof(int) * windowSize);
    features = (int *) malloc(sizeof(int) * numWords);
    seen = (int *) malloc(sizeof(int) * numWords);
    neighbours = (double **)
        amalloc(sizeof(double), NULL, 2,
                num_rows + fudge, num_cols + fudge);
    dist = (dist_entry **)
        amalloc(sizeof(dist_entry), NULL, 2,
                num_rows + fudge, num_dist_cols + fudge);
    assert(word_strings);
    assert(window);
    assert(features);
    assert(counts);
    assert(counts[0]);
    assert(dist);
    assert(dist[0]);
    UNKNOWNWORD = numDendro;                /* Will change in load_convert */
    /*# Next, determine which word_strings are words and features
      # a words word is one that it is worth computing a feature vector for (a row)
      # a very words word is one that helps train the feature vector (a column)
    */
    load_features(features);
    numDendro = load_convert(word_strings, numDendro);
#ifdef NEIGHBOURS
    load_neighbours(neighbours);
    calc_distance(dist, neighbours);
    if (neighbours_to_print >= numDendro) {
        neighbours_to_print = numDendro - 1;
    }
    print_neighbours(dist, word_strings, neighbours_to_print);
#endif
    for (i = 0; i < numWords; i++) {
        seen[i] = 0;
    }
    /* load_tags(counts); -- this is done using paste */
    /*# Now load the ability to convert between word_strings and numbers
      # This is used to make pretty graphs
    */
#ifdef TIPSTER
    fprintf(stderr, "Part 1 of 5:\n");
#else
    fprintf(stderr, "Part 1 of 1:\n");
#endif
    derive_counts(features, dist, window, word_strings, counts,
                  seen, windowSize, "corpus.aa", 1);
#ifdef MAORI
    /* Maori may prefer to look right instead of left -- currently disabled */
    /* derive_counts(features, dist, window, word_strings, counts,
                     seen, windowSize, "corpus.aa", 0);
    */
#endif
#ifdef TIPSTER
    fprintf(stderr, "Part 2 of 5:\n");
    derive_counts(features, dist, window, word_strings, counts,
                  seen, windowSize, "corpus.ab", 1);
    fprintf(stderr, "Part 3 of 5:\n");
    derive_counts(features, dist, window, word_strings, counts,
                  seen, windowSize, "corpus.ac", 1);
    fprintf(stderr, "Part 4 of 5:\n");
    derive_counts(features, dist, window, word_strings, counts,
                  seen, windowSize, "corpus.ad", 1);
    fprintf(stderr, "Part 5 of 5:\n");
    derive_counts(features, dist, window, word_strings, counts,
                  seen, windowSize, "corpus.ae", 1);
#endif
    /*# Print the body
    */
    for (i = 1; i <= numDendro; i++) {
        printf("%c%s%c ", quote, word_strings[i], quote);
        for (j = 1; j < numFeatures; j++) {
            printf("%d ", counts[i][j]);
        }
        printf("\n");
    }
    printf("\n");
    fprintf(stderr, "\n");
    return 0;
}
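Both listings in this appendix obtain their two-dimensional arrays from an amalloc helper declared in amalloc.h, whose source is not reproduced in this section. The sketch below is purely illustrative: it shows a minimal two-dimensional allocator matching the call pattern amalloc(sizeof(T), NULL, 2, rows, cols) used above, under the assumption that the second argument is unused and the result is indexed as an array of row pointers. It is not the thesis's amalloc implementation.

/* Illustrative sketch only: a minimal 2-D allocator matching the calls
   amalloc(sizeof(T), NULL, 2, rows, cols) in the listings above.
   The real amalloc.c may well differ (e.g. support more dimensions). */
#include <stdarg.h>
#include <stdlib.h>

void *amalloc(size_t elem_size, void *unused, int ndims, ...) {
    va_list ap;
    int rows, cols, r;
    char **table;

    (void) unused;
    if (ndims != 2)                 /* only the 2-D case is sketched here */
        return NULL;
    va_start(ap, ndims);
    rows = va_arg(ap, int);
    cols = va_arg(ap, int);
    va_end(ap);

    table = malloc(rows * sizeof(char *));
    if (!table)
        return NULL;
    for (r = 0; r < rows; r++) {
        table[r] = calloc(cols, elem_size);   /* zero-initialised row */
        if (!table[r])
            return NULL;
    }
    return table;                   /* cast by the caller, e.g. to int** */
}

With this sketch, counts = (int **) amalloc(sizeof(int), NULL, 2, rows, cols) would yield rows zero-initialised rows of cols ints, which is how the result is used in the listings above.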
C.5.2 Scaling bigrams
/* Rewrite of scale.pl in C due to perl running out of ram
 * Since this is a rewrite, some perl coding conventions are used
 ** PROGRAM LOGIC:
 * Scales a bigram file so that all vectors are unit.
 * To reduce the dependence on frequent words we experiment with log and/or
 * scaling columns. The do_ variables control this
 ** Each line has the format: "word" num num num ...
 **/
#include <assert.h>
#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>             /* for malloc, atoi, atof */
#include "amalloc.h"

#define max_word_size 600       /* for strcpy, all words fewer bytes than this */
#define max_line_size 65536     /* for fgets, all lines fewer bytes than this */
#define fudge 5                 /* Add to every malloc to fit quirky cases in */
#define print_it() \
    if (do_print) { \
        fprintf(stderr, "PRINTING THE MATRIX\n"); \
        for (cur_row = 0; cur_row < num_rows; cur_row++) { \
            fprintf(stderr, "%s ", word_strings[cur_row]); \
            for (cur_column = 0; \
                 cur_column < num_cols; cur_column++) { \
                fprintf(stderr, "%f ", cells[cur_row][cur_column]); \
            } \
            fprintf(stderr, "\n"); \
        } \
    }
/* ALGORITHM:
 * arg 1 = bigram file name
 * arg 2 = number of rows
 * arg 3 = number of columns
 * arg 4 = do_add1     -- add one to every cell
 * arg 5 = do_log      -- log every cell
 * arg 6 = do_center   -- center rows on zero
 * arg 7 = do_sqrt     -- use RMS instead of linear
 * arg 8 = do_colscale -- center columns on zero
 STEP 0: Read in the data
 STEP 1: Add one to every cell (simple maximum-entropy countermeasure)
 STEP 2: Log every cell (to counter Zipf)
 STEP 3: Center the rows on zero
 STEP 4: Scale columns to 1 (to underemphasise frequent words)
 STEP 5: Scale rows to 1 (normalise the vectors)
 I decided to ignore do_sqrt in do_colscale deliberately. I forget why
*/
int main(int argc, char *argv[]) {
    // Input
    FILE *IN;
    char *input;
    char *inptr;                /* Pointer into input */
    char *spaceptr;             /* Pointer into input -- at loc of next token */
    // Arrays
    float **cells;              // 2D array of all cells
    char **word_strings;        // The words as strings
    // What parts of the code to enable
    int do_add1 = 0;            // Add one to every cell (required for log)
    int do_log = 0;             // Log every cell before processing
    int do_center = 1;          // Center on zero
    int do_sqrt = 1;            // RMS instead of simple addition
    int do_colscale = 0;        // Equalise all columns
    int do_print = 0;           // Print debugging
    // Sizes of things
    double row_sum;             // Sum of cells in this row
    double col_sum;             // Sum of cells in this column
    int num_cols;               // Number of columns
    int num_rows;               // Number of rows
    int cur_column;             // Current column index being processed
    int cur_row;                // Current row index being processed
    assert(argc == 9);
    fprintf(stderr, "Reading bigrams from %s\n", argv[1]);
    num_rows = atoi(argv[2]);
    num_cols = atoi(argv[3]);
    do_add1 = atoi(argv[4]);
    do_log = atoi(argv[5]);
    do_center = atoi(argv[6]);
    do_sqrt = atoi(argv[7]);
    do_colscale = atoi(argv[8]);
    /* Sanity */
    assert(0 < num_rows);
    assert(1000000 > num_rows);
    assert(0 < num_cols);
    assert(10000 > num_cols);
    input = (char *) malloc(max_line_size);
    cells = (float **) amalloc(sizeof(float), NULL, 2,
                               num_rows + fudge, num_cols + fudge);
    word_strings = (char **) amalloc(sizeof(char), NULL, 2,
                                     num_rows + fudge, max_word_size + fudge);
    IN = fopen(argv[1], "r");
    assert(IN);
    //
    // STEP 0: Read in the data
    //
    cur_row = 0;
    fprintf(stderr, "Reading data\n");
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
        fgets(input, max_line_size, IN);
        assert(input);
        inptr = input;
        spaceptr = strchr(inptr, ' ');
        assert(spaceptr);
        *spaceptr = '\0';
        strcpy(word_strings[cur_row], inptr);
        inptr = spaceptr + 1;
        for (cur_column = 0; cur_column < num_cols; cur_column++) {
            cells[cur_row][cur_column] = atof(inptr);
            spaceptr = strchr(inptr, ' ');
            assert(spaceptr);
            inptr = spaceptr + 1;
        }
        assert(*inptr == '\n');
    }
    fclose(IN);
    print_it();
    //
    // STEP 1
    // Add one to each cell
    //
    if (do_add1) {
        fprintf(stderr, "Adding one to every cell\n");
        for (cur_row = 0; cur_row < num_rows; cur_row++) {
            for (cur_column = 0;
                 cur_column < num_cols; cur_column++) {
                cells[cur_row][cur_column]++;
            }
        }
    }
    //
    // STEP 2
    // Log each cell
    //
    if (do_log) {
        fprintf(stderr, "Computing logarithm for each cell\n");
        for (cur_row = 0; cur_row < num_rows; cur_row++) {
            for (cur_column = 0;
                 cur_column < num_cols; cur_column++) {
                cells[cur_row][cur_column] =
                    log(cells[cur_row][cur_column]);
            }
        }
        print_it();
    }
    //
    // STEP 3: Center the values on zero
    //
    if (do_center) {
        fprintf(stderr, "Centering rows on zero\n");
        for (cur_row = 0; cur_row < num_rows; cur_row++) {
            row_sum = 0;
            for (cur_column = 0; cur_column < num_cols; cur_column++) {
                row_sum += cells[cur_row][cur_column];
            }
            row_sum /= num_cols;
            for (cur_column = 0; cur_column < num_cols; cur_column++) {
                cells[cur_row][cur_column] -= row_sum;
            }
        }
        print_it();
    }
    //
    // STEP 4: Scale columns to 1 (to underemphasise frequent words)
    //
    if (do_colscale) {
        fprintf(stderr, "Computing column counts\n");
        for (cur_column = 0; cur_column < num_cols; cur_column++) {
            col_sum = 0;
            for (cur_row = 0; cur_row < num_rows; cur_row++) {
                col_sum += cells[cur_row][cur_column];
            }
            col_sum /= num_rows;
            for (cur_row = 0; cur_row < num_rows; cur_row++) {
                cells[cur_row][cur_column] /= col_sum;
            }
        }
        print_it();
    }
    //
    // STEP 5: Scale rows to 1 (normalise the vectors)
    //
    fprintf(stderr, "Normalising the vectors");
    if (do_sqrt) {
        fprintf(stderr, " using sqrt (RMS)\n");
    } else {
        fprintf(stderr, " using linear\n");
    }
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
        row_sum = 0;            /* reset the accumulator for this row */
        for (cur_column = 0; cur_column < num_cols; cur_column++) {
            if (do_sqrt) {
                row_sum += cells[cur_row][cur_column] *
                           cells[cur_row][cur_column];
            } else {
                row_sum += cells[cur_row][cur_column];
            }
        }
        if (do_sqrt) {
            row_sum = sqrt(row_sum);
        }
        for (cur_column = 0; cur_column < num_cols; cur_column++) {
            cells[cur_row][cur_column] /= row_sum;
        }
    }
    print_it();
    /* Finished */
    for (cur_row = 0; cur_row < num_rows; cur_row++) {
        printf("%s ", word_strings[cur_row]);
        for (cur_column = 0; cur_column < num_cols; cur_column++) {
            printf("%.6f ", cells[cur_row][cur_column]);
        }
        printf("\n");
    }
    return 0;
}
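The scaling program reads the bigram file name, the matrix dimensions, and the five do_ flags from the command line (see the ALGORITHM comment at the top of the listing) and writes the scaled matrix to standard output. As a purely illustrative invocation (the program name, file name and sizes here are assumptions, not values taken from the thesis):

    ./scale bigrams.txt 30000 250 1 1 1 1 0 > bigrams.scaled

would add one to each cell, take logs, centre each row on zero, normalise each row with RMS, and skip column scaling.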
Glossary
Active edge: An edge in the chart that has not yet been completed because only some of
its constituents have been found.
Adjunct: A phrasal argument that is not syntactically required, for example yesterday in
John died yesterday.
Ambiguity: See ambiguous.
Ambiguous: A string of words that can be interpreted in two or more ways.
Arc: An edge in a chart or a graph.
Argument: A phrase associated with a head constituent, for example the ball in John kicked
the ball.
Backoff: Replacing a complex statistical lookup with a simpler lookup in order to increase
counts.
Backpropagation: A simple neural network architecture in which errors are backpropa-
gated from the output layer back through the network.
Bag: A set in which duplicates are permitted.
Base-NP: A term invented by Collins that refers to a non-recursive noun phrase. Collins
hypothesised that noun phrases which include other noun phrases have different
usage rules. For example, The man with the big hat is a noun phrase but not a
base noun phrase since it contains the constituent (base) noun phrase the man.
Beam: The data structure used to store the candidates in a beam search.
Beam search: A heuristic search in which the number of candidates being considered is
constrained.
Best edge: The edge with the highest probability.
Best first: A search strategy in which the most promising looking node is expanded next.
Bigram: The count of an event involving two items, for example two words co-occurring.
Bigram-statistics: Statistical analysis of bigrams.
Bit: The smallest complete unit of information. The term is used in information the-
ory to refer to the amount of information that is necessary to represent a binary
decision of equal probabilities (for example a coin toss). The term is also used in
Cascade neural networks to refer to a binary output unit.
Bits error: Along with index error, one of the victory criteria used by Cascade-correlation.
Measuring the error in bits means a large error is given the same penalty as a
small error, making the network better at generalising but less accurate.
BNC: The British National Corpus; a 100-million-word corpus of text, commonly used in
natural language processing.
Bottom-up: Starting with the words and building towards some high level structure, com-
pare top down.
Branch: For code: Produce two different versions based on the same initial (root) version,
most commonly an unstable version containing the new features and a stable
version that is known to work.
Candidate hidden unit: Units used by Cascade-correlation; the best one will be incorpo-
rated into the network as a hidden unit, and the rest will be discarded.
Candidate training phase: Part of Cascade Correlation’s learning, in which it trains the
candidate hidden units to be active when the network’s error is at its highest.
Cascade Correlation: A neural network architecture developed by Fahlman and Lebiere
(1990). Cascade is similar to backpropagation with the most visible difference
being much faster learning. It is described in Section 7.4.
Centered vector: A vector in which the sum (mean) of the elements is zero.
CFG: Context free grammar. A grammar that does not require any context outside
that explicitly included in the sequence of words.
Chart: A data structure that stores the set of all edges going from word a to word b.
Child: The sub-phrase which will be expanded by the addition of a parent. For instance,
a verb-phrase child could be expanded to form a sentence.
Collocation matrix: A two dimensional array of co-occurrence counts.
Complement: An argument to a phrase that is necessary for the phrase to be well formed.
For instance, John kissed is missing its complement.
Complete: A finished edge, that is, one that has all of its arguments and so is ready to be
used in other structures.
Conditional independence: Two events are conditionally independent given a third if, once the
third is known, each event carries no further information about the other. That is: a and b
are conditionally independent given c iff P(a, b|c) = P(a|c).P(b|c).
Conditional probability: The probability of an event occurring given that another event is
known to have occurred.
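In standard notation this is written P(a|b) and can be computed as P(a, b)/P(b)
whenever P(b) > 0; this is the notation used in the conditional independence entry above.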
Context: The linguistic information (usually words) surrounding the item being exam-
ined. For example, in a context of thievery, fence is likely to be interpreted differ-
ently.
Context-free grammar: A grammar made up of a finite number of rules. That is, every
nonterminal can always be expanded to other nonterminals or to a word, re-
gardless of the context the nonterminal occurs in.
Corpus: A body of text, particularly in terms of training data.
Cost: The distance between two nodes in a graph. For instance, a graph representing
the flight times between cities would represent the costs in terms of time.
Coverage: The proportion of a language that can be represented using a grammar. Wide-
coverage grammars require a large number of rules which tends to lead to over-
generalising.
Crossing brackets: A coarse measure of a parser's accuracy in which the parser is pe-
nalised every time its predicted bracketing crosses the correct answer. The brack-
eting will cross when the parser gets both the start and the end of a phrase wrong.
It is a useful metric because the WSJ's variable structure sometimes makes pre-
cision and recall inaccurate while a crossed bracket is always an error.
Cutoff: An explicit boundary where any phrase outside the boundary (i.e. too low prob-
ability) is discarded. (See also cutoff threshold.).
Cutoff threshold: In searching, the probability at which we say a given interpretation does
not look promising enough and discard it rather than expanding it further (see
cutoff).
Decomposed: Choosing a particular path for transforming a nonterminal, such as decid-
ing a sentence should be decomposed into a noun phrase followed by a verb
phrase instead of any other interpretation.
Dendrogram: A tree hierarchy, mainly used here in relation to words.
Density plot: A close relative of a scatter-plot where instead of plotting individual points,
the whole graph is coloured, and the colour is brighter where there are more
points nearby. This approach is extremely effective at showing trends where
there are so many points that outliers would otherwise mask the trend.
Dependency: A phrase that is part of another phrase. For instance the ball in John kicked the
ball.
Distance metric: A measure of the distance between two phrases, used because explicitly giving
the intermediate phrases would add too much context and reduce all counts to
near zero.
Distributed: A representation in which the meaning is not captured by a single symbol but
by the combination of a number of nominally independent items. For instance,
the word vectors are distributed because the values of any dimension in the
vector can be changed to give a new meaning.
Earley parsing: A simple top-down parsing algorithm in which the parser can either shift
the current input word, or reduce it to a nonterminal. The effect of these operations
is somewhat similar to an LR parser.
Edge: In parsing literature: A phrase, either complete or with some parts still unex-
panded or; in graph literature, an arc.
Eigen decomposition: The decomposition of a square matrix into eigenvalues and eigen-
vectors such that the eigenvalues times the eigenvectors gives the original ma-
trix.
Eigenvalue: The amount a particular eigenvector is scaled when transformed.
Eigenvector: A vector that when transformed by a matrix retains its direction but perhaps
not length.
Event: The representation of a single transformation under a given grammar model.
For example NP → Det Noun. The event contains all the information that the
grammar model stipulates (lexicalised, distance, etc.) and nothing else.
Event file: A file containing every single event in the treebank. For a statistical parser, this
renders the treebank redundant.
Exact match: The hardest measure of a parser’s accuracy in which the parser is penalised
unless the predicted parse is identical to the gold standard. A single tagging
error (for example NNS instead of NNPS) or extra internal structure (NP (ADJP
(JJ big)) (NN man)) vs (NP (JJ big) (NN man)) will cause the parser to score zero
on the current sentence.
Expand: Replace a phrase by one interpretation of the phrase in the grammar. For in-
stance in a simple context-free grammar, a sentence could be expanded to a noun
phrase followed by a verb phrase. Later, the noun phrase could be expanded to
a determiner followed by a noun.
Expected: The average outcome (the mean, not the mode).
Feature: Properties of grammatical entities, such as count, gender or formality.
Feature vector: My term to refer to orthogonal representations, such as word-space.
Feature words: The words that bigram counts are computed between, rather than the
words we are trying to represent. For instance with may make a good feature
word, since certain classes of words co-occur with it.
Fourgram: An n-gram with four parameters.
Fragment: A small well-behaved subset of the language. The term is also used to refer to
a sequence of words that cannot be parsed into a phrase.
Fringe: The nodes in the search tree which are about to be expanded. Nodes on the
fringe are the current candidates for being in the parse.
Genprob: The name of Collins’ function to compute the probability of any event.
Grammar: A set of rules for deciding if a given sentence is well formed in the language.
Hash key: The result of mapping the event into a simple linear sequence, usually the array
reference.
HBG: History Based Grammar, the grammar formalism used by Black et al.’s parser.
Head constituent: The key sub-phrase in the phrase.
Head nonterminal: The nonterminal category of the head constituent.
Head production: The generation of a phrase’s head constituent. Along with sibling pro-
duction this allows the generation of all parse trees.
Headword: See lexical head.
Heuristic search: Any form of graph search (including parsing) in which the search is
guided towards the goal by evaluating how good each state is and expanding
promising states.
Hidden Markov model: A statistical machine in which not only the transitions but also the
outputs are probabilistic. HMMs are frequently used in speech recognition as well as
POS tagging. To take POS tagging as an example, observable events (words)
can be used to predict the internal state (the POS tag). They differ from Markov
models in that the internal state is not directly observable.
Hierarchical: A data representation like a tree, so that nodes have children, siblings and a
parent.
HMM: See hidden Markov model.
Homograph: Two different words with the same spelling. The term is a more extreme
version of polysemous in that a word is polysemous if it may be used in different
ways (for example telephone a friend vs answer the telephone) but it is a homograph
if the meanings are unrelated (for example fence the goods vs fence with a foil).
HPSG: Head-driven Phrase Structure Grammar. Pollard and Sag's formalism in which
phrases contain linguistically useful information, most obviously the head word
of the phrase. Collins' probability model takes advantage of HPSG in deciding
what information to discard.
Inactive edge: An edge that has been completed and so is ready for use by other edges
rather than being expanded itself.
Incomplete: An edge that has not yet been completed and so is still looking for neighbour-
ing words or phrases.
Independent: Two events are independent if one occurring does not affect the probability of
the other occurring.
Index array: An array of elements referencing into another array. For instance a large array
of all words sorted alphabetically might have an index array with an element for
each possible starting value.
Index error: Along with bits error, one of the victory criteria used by Cascade-correlation.
When using index, getting an output significantly wrong is worse than getting
it just a little wrong. This encourages the network to accurately fit the training
data but makes overfitting easy.
Information Theory: A branch of statistics concerned with the amount of information that
is represented by a fact. Since events do not occur independently, the sequence
of events can usually be used to predict the next event to some extent. The
degree to which we cannot predict the next event determines the number of bits necessary
to encode its occurrence, although less efficient encodings will require more bits. It is
very useful when we wish to eliminate or at least make explicit any redundancy
in the representation.
Inside: The part of the parse that has already been fully built.
Inside probability: The probability of a set of operations occurring, regardless of external
context.
Interpolation: Many functions cannot be represented explicitly but only indirectly through
some sample of input/output pairs. Given such a sample over a certain range,
interpolation is estimating what the output would be for a different input that
is still within the range of sampled inputs, such as midway between two input
values.
Iterative clustering: Performing the clustering multiple times where the input for a run
is the output from the previous run. Useful if the initial clustering only forms
‘hints’ of the clusters that are present, or to merge clusters.
Join: The combination of two phrases to produce a larger phrase.
Kernel function: A mapping between one multi-dimensional space and another (with po-
tentially a different number of dimensions) intended to make more explicit some
property of the data. For example, the classic 'two spirals' problem looks very
complicated in a Cartesian space but is linearly separable in a polar space.
Key generation: The process of combining several parameters to generate a hash-key. For
instance, the left grammar hash table is accessed by providing a head and a
parent, which are combined by the key generation algorithm to form a single
array index.
Lexical head: The head word of a phrase, for example kicked in John kicked the ball.
Markov model: A statistical machine in which the system is in an observable state and
will determine the next state probabilistically. Usually, a history of recent states
is encoded into the representation of the current state.
Markovian assumption: The assumption that only a certain amount of history is needed for pre-
dicting the next state. While the assumption is often incorrect, it is usually close
enough to correct for practical purposes.
Maximum Likelihood Estimate: The MLE for a parameter(s) is the value for that parame-
ter that maximises the probability of observing the values which have occurred.
For instance, if we observe a coin giving heads 60% of the time, then the MLE
for P(heads) is 0.6.
MLE: See Maximum Likelihood Estimate.
Mutual information: The amount of information shared between two events x and y, mea-
sured in bits. If x and y are independent then it is zero, if they are identical then
it is the amount of information x contains.
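In the standard pointwise formulation (stated here for concreteness; the entry above does
not give a formula), the mutual information of two events x and y is
log2( P(x, y) / (P(x) P(y)) ), which is zero exactly when x and y are independent.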
N-gram: The count of an event involving n items, for example a word occurring with a
particular tag, nonterminal, etc.
Naive Bayes: A simple probabilistic classifier based on strong independence assumptions.
Neural network: An adaptive algorithm for approximating arbitrary functions.
NLP: Natural Language Processing; any research in computational linguistics.
NPB: A base (non-recursive) noun phrase.
Object: The part of the sentence that is acted on; the ball in John kicked the ball.
Optimistic: Any search heuristic in which the cost of reaching the goal is always under-
estimated. Optimistic heuristics are useful because the sum of the cost to the
current node plus the estimate to the goal is never more than the true cost to the
goal via the current node.
Order: The amount of context contained in the Markov model.
Output training phase: Part of Cascade Correlation’s learning, in which it trains the weights
to the output units.
Outside: The parts of the parse tree that have not been generated yet. Important because
some states could look good locally but be impossible to transform into a sen-
tence.
Outside probability: The probability that the current state will lead to a goal state.
Overfitting: Training a neural network until it very accurately reproduces the training
data, and as a result generalises poorly. It is generally best to stop training well
before the neural network reproduces the training data in order to maintain
smooth generalisations between training instances. Cascade is especially vul-
nerable to overfitting.
Parent: See parent nonterminal.
Parent constituent: The entire parent phrase.
Parent headword: The lexical head of the parent phrase. Due to the definition of head, this
will be the same as the lexical head of the head constituent.
Parent nonterminal: The nonterminal category of the parent constituent. i.e. the top non-
terminal in the tree.
PCA: See Principal component analysis.
PCFG: Probabilistic Context Free Grammar. Identical to a normal context-free grammar
except rules have probabilities assigned to them.
Penn treebank: The treebank of fifty thousand hand-parsed sentences from the WSJ devel-
oped at the University of Pennsylvania; it is also known as the WSJ. This corpus
is used by all current statistical parsers for training data. The size (or frequently
lack of it) of this corpus determines most of the design decisions in building a
parser, and will continue to do so until better unsupervised learning methods
are developed.
Perplexity: The amount of information needed to convey the HMM of the language. Mea-
sured in bits, the perplexity refers to how much information is necessary to
convey the next word in a given language model. It can be informally viewed as
the average number of words which can possibly occur next in a word sequence.
Phrase: A branch of a parse tree. Rather than just the nonterminal label, such as NP, the
phrase includes everything in the parse over the span of words.
Polysemous: A word which can be used with different (but semantically related) mean-
ings, including different POS tags.
POS: Part-Of-Speech. The tag given to denote the role of a word, for example kick is a
verb.
Precision: A measure of a parser’s accuracy defined as the percentage of phrases found by
the parser that are considered correct. This is distinct from recall in that a parser
which claims no input ever has any phrases will have perfect precision but zero
recall.
Primed: A word we would be unsurprised to see due to recent context.
Principal component: The most important information. More formally, if the data is repre-
sented by an n dimensional space then the principal component is a hyperplane
through the space which shows greatest variance.
Principal component analysis: A method of analysing multivariate data in order to ex-
press their variation in a set of orthogonal components, sorted by the amount of
variance they express.
Probabilistic grammar: Any grammar formalism in which the rules are associated with a
probability that the given rule applies.
Production: The expansion of a grammar rule.
Pseudo-event: A generated event which did not actually occur in the corpus but which is
treated as if it had occurred. Pseudo-events are used to increase counts and to
compensate for events not seen during training.
Recall: A measure of a parser’s accuracy defined as the percentage of phrases present
in the input which were found by the parser. This is distinct from precision in
that a parser which claims every possible word sequence as a phrase will have
perfect recall but nearly zero precision.
Reducing: In an Earley parser, noting that the top items in the stack exactly match a rule
in the grammar and replacing the items by the single item they match.
Regular expression: A formalism for representing simple languages. Regular expressions
are frequently used for complex pattern-matching or substitutions. While they
have only a fraction of the representational power that other grammars possess,
they can be used to solve a surprisingly large number of problems.
RMS: Root Mean Square; a standard statistical technique for combining a list of num-
bers to give a magnitude. Defined as x_rms = sqrt( (1/N) * Σ_{i=1}^{N} x_i^2 ).
Serial learning: Learning a series of items in a fixed order. In neural networks this usually
leads to an inability to reproduce early items from the series as their representa-
tions are overwritten with newer items.
Shifting: In an Earley parser, placing the current input word onto the stack (and removing
it from the input).
Sibling production: The generation of dependent siblings to the left and right of the head.
Singular value decomposition: A type of principal component analysis.
Skiplist: A data type similar to a linked list in which multiple next pointers allow faster
traversing.
Smoothing: The process of combining multiple probability estimates into a single esti-
mate.
SOM: Self Organising Map. A neural network architecture developed by Kohonen
(1982) that uses unsupervised learning to extract patterns in its training data.
Stagnate: The term Cascade-correlation uses to describe when a candidate unit is no longer
learning. At this point the best candidate unit's weights are locked and it is added to the
network. Stagnating is generally a good indication that the network is still learning, as
opposed to when the network has a timeout.
Statistical parsing: Using a statistical grammar to find the best parse for a sentence, or the
possible parses sorted by probability.
Stop: Collins’ term to declare that a phrase has been completed and should now be
used as a constituent in larger phrases rather than being expanded itself. It is
computed using a special dependency production in which the abstract ‘stop’
phrase is attached to the left and the right.
Sub-event: A specific event that is counted directly in the corpus, rather than one which is
backed off to an approximated count.
Subcategorisation list: The bag of phrases that a phrase needs to have as complements in
order to be complete.
Subject: The key ‘actor’ in the sentence; John in John kicked the ball.
Supervised learning: A learning method used in neural networks where explicit training
data is provided. The training data is in the form of input/output pairs.
Surprise: An information-theory term referring to how unexpected an event is. Surprise
is measured in bits and if an event has a probability p of occurring then the
surprise of it occurring is computed using the formula: − log(p).
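For example, under this formula an event with probability 1/8 carries a surprise of
− log2(1/8) = 3 bits.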
SVD: See singular value decomposition.
SVM: Support Vector Machine. An algorithm for classifying very high-dimensional
data.
TAG: Tree Adjoining Grammar. A lexicalised theory of syntax in which operations,
such as substitution, are applied to trees.
T/G: Tipster and Gutenberg. My corpus derived by concatenating the Tipster corpus
to Project Gutenberg.
Timeout: The term Cascade-correlation uses to describe when it stops the training process
because the network has taken too long to converge to a stable state and may be trapped
in a (near) infinite loop. This is generally an indication that the problem is too hard for
cascade.
Tokenisation: Breaking a sequence of letters into words.
TOP: Collins’ highest-level nonterminal structure, to make it easier to distinguish
from sentences that are embedded.
Top-down: Starting with a high-level structure such as sentence and attempting to expand
it into the low-level perceived events (typically words).
Tree: A data structure showing the nonterminal constituents in a particular parse of a
sentence, the output of a parser.
Treebank: An unordered set of (parse) trees.
Trigram: An n-gram in which an event consists of three terms.
Unary: The transformation of one item into another item. Most commonly used here to
refer to parent productions, where a nonterminal chooses a single parent.
Unigram: The simplest n-gram, an event containing exactly one term such as seeing a
particular word.
Unit vector: A vector centered on the origin with a length of one.
Victory error criterion: The point at which the neural network considers it has learned its
training data successfully.
Viterbi: The observation that we are typically only interested in the best interpretation
and can discard any search paths that we know cannot form the best interpreta-
tion (because another interpretation is locally more likely).
Well-formed sentences: A sentence for which the grammar will produce a valid parse; a
valid sentence.
Word space: Any representation of words in which words are ‘nearby’ when they are re-
lated, and ‘distant’ when they are not related.
WSJ: Wall Street Journal, See Penn treebank.
XBAR: A highly recursive grammar formalism. The name comes from a convention of
placing a bar over nonterminals to denote the end of recursion, with the X refer-
ring to the idea that the difference between nonterminals should be abstracted
away as a feature.
Zipf's law: An observation that word frequencies follow a power-law distribution. That is, n
times the number of words occurring n times is approximately constant.
Index
Abney et al. (1999), 210, 213
Allen (1995), 12, 213
Banerjee and Pedersen (2003), 131, 213
Bencini et al. (2002), 210, 213
Bengio and Bengio (2000), 119, 213
Bengio et al. (2003), 119, 213
Bengio (2003), 121, 213
Bies et al. (1995), 213, 221
Bikel (2004), 32, 96, 207, 213, 244
Bikel (2005), 54, 213
Black et al. (1992), 16, 17, 22, 27–29, 31, 46,
213, 279
Bod and Scha (1996), 5, 20, 27, 29–31, 46, 47,
214
Bod (1996), 48, 213
Booth and Thompson (1973), 16, 214
Brooks (1982), 90, 214
Brown et al. (1992), 110, 113, 214
Chapman (1992), 108, 214
Charniak et al. (1993), 182, 214
Charniak et al. (1996), 75, 76, 79, 100, 214
Cheeseman et al. (1990), 131, 214
Chen and Goodman (1996), 39, 214
Chen and Rosenfeld (2000), 39, 214
Chomsky (1965), 2, 214
Choueka and Lusignan (1985), 115, 214
Christ (1994), 114, 214
Collins (1996), 47–49, 58, 161, 214
Collins (1997), 5, 7, 48, 51, 57, 58, 94, 215
Collins (1999), 5, 20, 29, 46, 48, 51, 58, 95, 161,
164, 215
Copestake and Flickinger (2000), 1, 215
Curran (2004), 211, 215
Earley (1970), 12, 215
Elman (1990), 116, 215
Fahlman and Lebiere (1990), 166, 215, 276
Finch (1993), 68, 109, 121, 171, 215
Gale and Sampson (1995), 37, 215
Garfield and Wermter (2003), 167, 215
Garner (1995), 131, 215
Ginzburg and Sag (2000), 1, 215
Goodman (1996), 48, 215
Goodman (1998), 5, 43, 216
Goodman (2001), 136, 216
Haegeman (1991), 8, 216
Harman (1992), 128, 216
Hart (2005), 128, 216
Hastings (1970), 48, 216
Honkela et al. (1995), 121, 122, 216
Honkela (1997a), 121, 216
Honkela (1997b), 121, 216
Jelinek and Mercer (1980), 39, 216
Joachims (2001), 122, 216
Katz (1987), 39, 216
Klein and Manning (2001a), 153, 216
Klein and Manning (2001b), 43, 217
Klein and Manning (2002), 44, 209, 217
Klein and Manning (2003), 27, 31, 32, 46, 47,
52, 207, 217
Kohonen (1982), 217, 285
Kudo and Matsumoto (2001), 210, 217
Lakeland and Knott (2001), 74, 75, 217
Lakeland and Knott (2004), 67, 217
Lawrence et al. (1996), 212, 217
Lee (2004), 3, 217
Liddle (2002), 117, 217
Lin (1997), 114, 217
Lin (1998), 115, 210, 218
Li (1992), 4, 217
Magerman (1995), 25, 58, 71, 218
Magerman (1996), 5, 218
Manning and Schutze (1999), 45, 218
Marcus et al. (1993), 17, 218
Mayberry III and Miikkulainen (1999), 121,
218
McCallum (1996), 131, 218
Miikkulainen (1993), 117, 218
Miller (1995), 108, 218
Min and Wilson (1998), 210, 218
Ney et al. (1992), 64, 218
Plasmeijer (1998), 90, 218
Pollard and Sag (1986), 9, 10, 218
Powers (2001), 171, 219
Pugh (1989), 84, 85, 219
Rueckl et al. (1989), 190, 219
Schutze (1992), 125, 219
Schutze (1993), 106, 124, 219
Schutze (1995), 125, 126, 219
Schutze (1998), 125, 171, 219
Scha and Bod (2003), 48, 219
Smith (2002), 135, 219
Smrz and Rychly (2002), 113, 114, 219
Stuart et al. (2004), 165, 219
Ushioda (1996), 171, 219
Vapnik (1997), 122, 219
Viterbi (1967), 45, 219
Williams (1992), 91, 220
Wu and Zheng (2000), 39, 220
R Development Core Team (2004), 131, 219
active edge, 12
adjunct, 31, 53, 70, 71, 115, 176
ambiguity, 3, 11, 12, 14, 15, 20, 27, 44, 45, 61,
62, 74, 75, 79, 96, 119, 125, 145
ambiguous, 11, 74
arc, 12
argument, 25, 26, 31, 53, 72, 199
backoff, 7, 33–35, 38, 39, 47, 50, 55–57, 61, 65,
72, 75, 105, 107, 108, 110, 115, 121,
153, 154, 156, 157, 160, 205, 207, 211,
212
backpropagation, 276
bag, 53, 55, 60, 229
base-NP, 52
beam, 64, 83
beam search, 64, 65, 69, 83–85, 88, 89, 155,
209, 229, 233, 275
best edge, 82
best first, 44
bigram, 35, 276
bigram-statistics, 108
bit, 52, 109, 168, 170, 281, 282
bits error, 168
BNC, 77
bottom-up, 11, 40, 42, 51
branch, 91
candidate hidden unit, 166, 285, 286
candidate training phase, 166
Cascade Correlation, 165–167, 276, 281, 285,
286
centered vector, 138
CFG, 8, 9, 16
chart, 12
child, 60
collocation matrix, 124
complement, 10, 53, 54, 70, 71, 115, 176
complete, 12, 60, 62
conditional independence, 56
conditional probability, 21
context, 75
context-free grammar, 2, 8
corpus, 4
cost, 44
coverage, 2
crossing brackets, 20
cutoff, 84
cutoff threshold, 83
decomposed, 2
dendrogram, 109, 137
density plot, 187, 190
dependency, 54
distance metric, 32, 52, 57, 180
distributed, 119
Earley parsing, 7
edge, 12
active, 12
best, 82
complete, 12, 60, 62
inactive, 12
incomplete, 12, 60, 62
eigen decomposition, 135
eigenvalue, 135
eigenvector, 135
event, 20
event file, 54
exact match, 19, 96
expand, 59
expected, 37
feature, 9
feature vector, 119
feature words, 124
fourgram, 124, 133
fragment, 2
fringe, 44, 64
genprob, 72
grammar, 1, 10, 11, 14, 16, 22, 26, 28, 29, 31–
33, 40, 43–45, 47, 61
hash key, 72, 281
HBG, 16, 27–29
head
constituent, 26
headword, 9, 26, 70
nonterminal, 26
production, 25
heuristic search, 40, 43, 44, 84
hidden Markov model, 74, 76
hierarchical, 7
HMM, 76, 77, 79, 109, 119, 182, 280, 283
homograph, 115
HPSG, 10, 11, 17, 22, 25–27, 29, 31, 40, 51, 52,
54, 58, 59, 62, 115
inactive edge, 12
incomplete, 12, 60, 62
independent, 34
index array, 82
index error, 168
Information Theory, 109
inside, 64
inside probability, 41
interpolation, 34
iterative clustering, 143
join, 59
kernel function, 122
key generation, 73
lexical head, 9, 22
Markov model, 45
markovian assumption, 76
Maximum Likelihood Estimate, 37
MLE, 37–39
mutual information, 108–110
n-gram, 34, 152
Naive Bayes, 28
neural network, 116, 117, 119, 122, 165, 167,
285
NLP, 1, 27, 34, 108
NPB, 52, 54, 62
object, 2
optimistic, 44
order, 76
output training phase, 166
outside, 64
outside probability, 41–44
overfitting, 180, 181, 185, 187, 192, 281
parent, 54, 59
parent constituent, 26
parent headword, 26
parent nonterminal, 26
PCA, 124, 127, 131–134, 136–138, 175, 211
PCFG, 16, 22, 25, 26, 28, 29, 31, 48, 49, 54
perplexity, 136
phrase, 8
polysemous, 115
POS, 11, 32, 36, 74–77, 79, 81, 100, 101, 106,
110, 143, 145, 146, 170, 171, 173, 175,
178, 182, 184, 208, 211, 212, 222–225,
231, 280
precision, 19, 284
primed, 4
principal component, 135
Principal component analysis, 124
probabilistic grammar, 4
production, 20
pseudo-event, 156
recall, 19, 284
reducing, 278
regular expression, 130
RMS, 133, 134, 138–142, 146, 168, 171
serial learning, 191, 201
shifting, 278
sibling production, 25
singular value decomposition, 136
skiplist, 84, 85
smoothing, 33, 75
SOM, 121
stagnate, 166
statistical parsing, 1
stop, 62
sub-event, 72
subcategorisation list, 10, 53, 54
subject, 2
supervised learning, 166
surprise, 109, 140
SVD, 136, 143, 144, 148, 171, 190, 191
SVM, 122, 124
T/G, 128–130, 137
TAG, 16
timeout, 166, 285
tokenisation, 129, 130
TOP, 52
top-down, 11, 276
tree, 7
treebank, 17, 29
trigram, 35, 113, 119
unary, 54
unigram, 34
unit vector, 134–136, 138
victory error criterion, 166, 276, 281
Viterbi, 45, 46, 48, 82
well-formed sentences, 2, 7
word space, 135, 146
WSJ, 17–20, 27, 33, 49, 50, 52, 54, 68, 70, 72,
77–79, 94, 105, 106, 109, 110, 113, 114,
119, 121, 127–130, 132, 151, 153, 157,
158, 160, 161, 170, 171, 173, 196–198,
205, 207, 210
XBAR, 16
Zipf’s law, 4, 22, 25, 56, 102, 105, 119, 134, 140