
8/8/2019 Introduction to the Link Grammar Parser


An Introduction to the Link Grammar Parser

CONTENTS

1. The logic and notation of link grammars

1.1. The basic idea
1.2. Word rules
1.3. Global rules
1.4. Link grammar in relation to other systems

2. Using the parser

2.1. Running version 4.0
2.2. Using other dictionaries
2.3. The API

3. General features of the parser

3.1. Connector subscripts
3.2. Macros
3.3. Word files
3.4. Word subscripts
3.5. The cost system
3.6. Batch mode
3.7. Commands and variables

4. Special features of the dictionary

4.1. Capitalization
4.2. Hyphenated expressions
4.3. Number expressions
4.4. Unknown words
4.5. Punctuation and affix-files
4.6. The sorting of unknown strings
4.7. The wall(s)
4.8. Idioms

5. Coordinating conjunctions

5.1. The handling of conjunctions
5.2. Uses of conjunctions
5.3. Subscripts
5.4. Some problems

6. Post-processing

6.1. The logic of post-processing
6.2. The domain structure

duction http://www.link.cs.cmu.edu/link/dict/introduction.html

26 14/05/2010 18:18



6.3. Groups
6.4. Post-processing rules
6.5. Post-processing and conjunctions
6.6. Link-type matching in post-processing
6.7. The "!bad" command

7. Speed and robustness features

7.1. The null-link system
7.2. The link-length limit
7.3. The post-processing limit
7.4. The timer
7.5. Panic mode


1. The Logic and Notation of Link Grammars

1.1. THE BASIC IDEA. Think of words as blocks with connectors coming out. There are different types of connectors; connectors may also point to the right or to the left. A left-pointing connector connects with a right-pointing connector of the same type on another word. The two connectors together form a "link". Right-pointing connectors are labeled "+", left-pointing connectors are labeled "-". Words have rules about how their connectors can be connected up, that is, rules about what would constitute a valid use of that word. A valid sentence is one in which all the words present are used in a way which is valid according to their rules, and which also satisfies certain global rules.

1.2. WORD RULES. A simple dictionary entry would look like this:

blah: A+;

This means that if the word "blah" is used in a sentence, it must form an "A" link with another word; that is, there must be another word to the right of it with an "A-" connector. Otherwise the sentence is not valid. The expression following the colon is the "linking requirement" for the word.

A word may have more than one connector that has to be connected. This would be notated as

blah: A+ & B+;

A word may have a rule that either one of two (or one of several) connectors can be used, but exactly one must be used. In the dictionary, we notate this as

blah: A+ or B-;

This means that if the word can make either an "A" link to the right, or a "B" link to the left, its use in the sentence is valid; but it must make one or the other, and it cannot make both.

These rules can be combined. For example, consider the following notation:

blah: A+ or (B- & C+);

This means that the word must make either an "A" link to the right, or a "B" link to the left and a "C" link to the right. No other combination will be valid.



Such expressions can be nested without limit, such as

blah: (A+ or B-) & ((C- & A+ & (D- or E-)) or F+);

Some connectors are optional; this is notated with curly brackets. For example:

blah: A+ & {B+};

This means the word must make an "A" link to the right, and it can make a "B" link to the right but does not have to. Curly brackets can also be put around complex expressions, like

blah: (A+ or B+) & {C- & (D+ or E-)};

An equivalent way of writing an optional expression like "{X-}" is "(X- or ())". This can be useful, since it allows a cost to be put on the no-link option (see Section 3.5).

A word can also make an indefinite number of links of the same type to other words. For this, we use the "multi-connector" symbol "@". For instance, the word below could make any number of F links to words to the right (but is not required to make any).

blah: (A+ or B+) & {C- & (D+ or E-)} & {@F+};

(If a word has "@A+", with no curly brackets, it is required to make at least one A+ link to the right; any others are optional.)

The ordering of elements in the "connector expression" is important: it dictates the relative closeness of the words that are being connected to. The further to the left the connector name, the closer the connection must be. For example,

blah: A+ & B+;

This means that "blah" must make an "A" link to the right and a "B" link to the right, and the word it makes the "A" link with must be closer than the word it makes the "B" link with.

This only pertains, however, to connections in the same direction. For connectors pointing in opposite directions, the ordering is irrelevant. Therefore

blah: A+ & B-;

means exactly the same thing as

blah: B- & A+;

For that matter,

blah: A- & B+ & C+ & D-;

means exactly the same thing as

blah: B+ & C+ & A- & D-;

For "or" expressions, such as "A+ or B+", the ordering of the elements is irrelevant.

A dictionary entry thus consists of a word, followed by a colon, followed by a connector expression, followed by a semi-colon. The dictionary consists of a series of such entries. Any number of words can be put in a list, separated by spaces; they will then all possess the linking requirement that follows:

blah blee blay: A+;



A connector name must consist of one or more capital letters (any number may be used), followed by "+" or "-".

We should mention one concept here that plays an important role in the internal workings of the parser: the "disjunct". A disjunct is a set of connector types that constitutes a legal use of a word. The dictionary expression for any word can be represented as a set of disjuncts. If a word has the following expression:

blah: {C-} & (A+ or B+);

then it has the following four disjuncts:

C- A+
A+
C- B+
B+

These disjuncts represent all the legal uses of the word "blah". Using C- and A+ is a legal use of the word; using A+ and B+ is not. Disjuncts play an important role in the internal workings of the parser.
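The expansion of a connector expression into disjuncts can be sketched in Python. This is only an illustration of the idea, not the parser's internal algorithm; the tuple encoding of expressions and the function name `disjuncts` are assumptions made for this example, and multi-connectors ("@") and costs are left out.

```python
from itertools import product

# Hypothetical encoding of a connector expression (not the dictionary format):
#   "A+"                  a single connector
#   ("and", e1, e2, ...)  every sub-expression must be satisfied, in order
#   ("or", e1, e2, ...)   exactly one sub-expression is used
#   ("opt", e)            curly brackets: e may be used or skipped


def disjuncts(expr):
    """Return the list of disjuncts (tuples of connectors) for an expression."""
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    if op == "or":
        return [d for arg in args for d in disjuncts(arg)]
    if op == "opt":
        # {X} is equivalent to (X or ()): use X, or use nothing
        return disjuncts(args[0]) + [()]
    if op == "and":
        # Pick one disjunct from each part and concatenate them as written
        parts = [disjuncts(arg) for arg in args]
        return [sum(combo, ()) for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op!r}")


# blah: {C-} & (A+ or B+);
blah = ("and", ("opt", "C-"), ("or", "A+", "B+"))
result = disjuncts(blah)
for d in result:
    print(" ".join(d))
```

For the expression above this yields the same four disjuncts as listed in the text, although not necessarily in the same order.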

1.3. GLOBAL RULES. As well as these "word rules", which are specified in the dictionary, there are two other global rules which control how words can be connected.

First of all, links cannot cross. For example, the following way of connecting these four words (connecting "cat" to "dog" and "horse" to "fish") would be illegal. The parser simply will not find such linkages.

 +-------------+
 |      +------|------+
 |      |      |      |
cat   horse   dog   fish

This is the "crossing-links" (or "planarity") rule. Secondly, all the words in a sentence must be indirectly connected to each other. Therefore the following way of connecting these four words would be illegal (if it were the entire linkage).

 +------+      +------+
 |      |      |      |
cat   horse   dog   fish

This is the "connectivity" rule. A valid sentence is therefore one which can be linked up in a way that a) all the words are used in a way that satisfies their linking requirements; and b) the crossing-links and connectivity rules are not violated.
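The two global rules can be checked mechanically once a linkage is written as a list of links between word positions. The following Python sketch illustrates this under that assumption; the representation (pairs of zero-based word indices) and the function names are hypothetical and are not part of the parser.

```python
def links_cross(a, b):
    """True if links a and b (pairs of word positions) cross each other."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def is_planar(links):
    """Crossing-links rule: no two links may cross."""
    return not any(links_cross(a, b)
                   for n, a in enumerate(links) for b in links[n + 1:])


def is_connected(num_words, links):
    """Connectivity rule: every word is reachable from word 0 via links."""
    neighbours = {w: set() for w in range(num_words)}
    for i, j in links:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == num_words


# Words: 0=cat 1=horse 2=dog 3=fish
crossing = [(0, 2), (1, 3)]          # cat-dog and horse-fish
disconnected = [(0, 1), (2, 3)]      # cat-horse and dog-fish
chain = [(0, 1), (1, 2), (2, 3)]     # satisfies both rules

for links in (crossing, disconnected, chain):
    print(is_planar(links), is_connected(4, links))
```

The first two linkages correspond to the two illegal diagrams above; only the third passes both checks.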

1.4. LINK GRAMMAR IN RELATION TO OTHER SYSTEMS. The structure assigned to a sentence by a link grammar is rather unlike any other grammatical system that we know of (although it is certainly related to dependency grammar). Rather than thinking in terms of syntactic functions (like subject or object) or constituents (like "verb phrase"), one must think in terms of relationships between pairs of words. In the sentence below, for example, there is an "S" ("subject") relation between "dog" and "has"; a "PP" (past-participle) relationship between "has" and "gone"; and a "D" (determiner) relation between "the" and "dog". (Ignore the lower-case letters for the moment; they will be explained below.)

 +-----Ds------+
 |     +---A---+-Ss-+-PP-+
 |     |       |    |    |
the black.a  dog.n has  gone

Where possible, we try to give link-types names that have mnemonic significance in this way.



It may be seen, however, that parts of speech, syntactic functions, and constituents may be recovered from a link structure rather easily. For example, whatever word is on the left end of an "S" link is the subject of a clause (or the head word of the subject phrase); whatever is on the right end is the finite verb; whatever is on the left end of a D link is a determiner; etc. Moreover, all nouns, verbs, and adjectives in the dictionary are subscripted (as ".n", ".v", or ".a"--see section 3.4), so in these cases the syntactic category of the word is made explicit.

The constituent structure of sentences, while not absolutely explicit, is also quite "close to the surface" in linkage structures. This requires some explanation. Imagine a linkage as a graph through which paths can be traced, similar to a street map. Constituents can be defined as sets of words which can be reached from certain links, tracing in a certain direction. For example, a verb phrase is everything reachable from an "S" link, tracing to the right--that is, not tracing through the left end of the S link itself. In the sentence below, "have changed during the past 3 1/2 years" is a verb phrase, since all these words can be reached from an S link (between "addresses" and "have") going to the right. "Would be an administrative nightmare" is _not_ part of this verb phrase since these words cannot be reached from the S link in question, except by tracing through the left end of the link ("addresses"). For noun phrases there are several possibilities. Anything that can be reached from an O link tracing right is an NP (such as "an administrative nightmare"), as is anything reachable from a J link tracing right ("the two million customers whose addresses have changed during the past 3 1/2 years").

[Linkage diagram for the following sentence, shown here as the parser displays its words:]

///// a Commonwealth Edison spokesman.n said.v that tracking.g down.e the two million customers.n whose addresses.n have.v changed.v during the past.a 3 1/2 years.n would.v be.v an administrative.a nightmare.n .

With version 4.0, we have incorporated a system for deriving a traditional constituent representation of a sentence from a linkage. Click here for more information about this; you can experiment with this system at the online demo page.

2. Using the Parser

2.1. RUNNING VERSION 4.0. To run version 4.0 of the parser on a Unix platform, you must first construct your own copy of the directory "link-4.0" that is on our FTP site. You will find a "Makefile", a "README" file and a handful of subdirectories. Compile the program by typing "make" while in the parse directory.

To actually run the program along with the dictionary (and related files) that we provide, type "./parse". The parser will then give you a prompt:



linkparser>

Type in a sentence, and press RETURN. The parser will tell you whether that sentence has a valid linkage or linkages, given the dictionary it is using, and it will output the linkage that it finds, showing the words that are linked together and the type of link between them. The parser will then prompt you for another sentence. To exit the parser, type "quit".

For each sentence, the parser will do an exhaustive search for linkages; it will generate all valid linkages. It will begin by displaying the lowest-cost linkage it finds. Other linkages may then be seen, one at a time, by pressing RETURN. (The ordering of the output is determined by the cost system: see Section 3.5.) If the parser cannot find any complete linkages of a sentence, it may simply stop and say "No complete linkages found"; or it may try to parse the sentence, ignoring some of the words. This depends on whether "null-links" are in use; see section 7.1.

2.2. USING OTHER DICTIONARIES. The more general way of running the parser is by typing

./parse [dictionary] -pp [post-process file] -c [constituent-knowledge file] -a [affix-file]

where "dictionary" is the name of whatever dictionary you wish to use. See section 6 for a discussion of post-process files, and section 4.5 for a discussion of affix files. The constituent-knowledge file is the post-processing file used in deriving the constituents; click here for information about this. The default files (the ones used if none are specified) are 4.0.knowledge, 4.0.constituents, and 4.0.affix.

If you wish to use no post-process file, thus by-passing post-processing altogether, type "./parse [dictionary] -ppoff". You might wish to do this in using "tiny.dict", for example, since this dictionary is not designed to require any post-processing. You can specify no affix file by typing "-aoff", and no constituent file by typing "-coff".

2.3. THE API. We have developed an API for designing customized versions of the parser and incorporating them into applications. It is important to realize that version 4.0 of the parser is just one way of configuring the API. In the following sections, we discuss many features of version 4.0 of the parser. Some of these features can be modified by editing the dictionary, as we explain here. Other features can be modified by configuring the API differently. For more about the API, see our API documentation.

3. General Features of the Parser

3.1. CONNECTOR SUBSCRIPTS. In general, a connector may only link to another one with the same name, i.e., the same string of capital letters. However, there is another way of controlling how connectors may link to each other, using connector subscripts. A subscript is a lower-case letter following a connector-name, like "Ss+". An "Ss+" connector can connect with an unspecified "S-" connector, or an "Ss-" connector, but not with an "Sp-" connector.

Connector types may have multiple subscript characters, such as "Spa+". An "Spa+" can connect with an "S-", an "Sp-", or an "Spa-", but not with an "Ss-" or an "Ssa-" or an "Spb-".

An "*" subscript type is a "wildcard" that can connect with anything. Therefore, an "S*+" is exactly the same as an "S+". An "S*a+" can connect with an "S-", an "Ss-", an "Sp-", or an "Ssa-", but not with an "Ssb-".
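The matching behaviour described in this section can be summarized in a few lines of Python. This is a sketch inferred from the examples above, not code taken from the parser: it assumes that subscripts are compared position by position and that a missing subscript character behaves like "*". The function names are hypothetical.

```python
from itertools import zip_longest


def split_connector(connector):
    """Split e.g. "Spa+" into (name "S", subscript "pa", direction "+")."""
    body, direction = connector[:-1], connector[-1]
    n = 0
    while n < len(body) and body[n].isupper():
        n += 1
    return body[:n], body[n:], direction


def can_link(right_pointing, left_pointing):
    """True if a "+" connector and a "-" connector may form a link."""
    name1, sub1, dir1 = split_connector(right_pointing)
    name2, sub2, dir2 = split_connector(left_pointing)
    if (dir1, dir2) != ("+", "-") or name1 != name2:
        return False
    # Assumption: a missing subscript character acts like the "*" wildcard
    return all(a == b or "*" in (a, b)
               for a, b in zip_longest(sub1, sub2, fillvalue="*"))


print(can_link("Ss+", "S-"), can_link("Ss+", "Sp-"))
print(can_link("Spa+", "Sp-"), can_link("Spa+", "Spb-"))
print(can_link("S*a+", "Ssa-"), can_link("S*a+", "Ssb-"))
```

Each printed pair reproduces one allowed and one disallowed combination from the examples in the text.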

3.2. MACROS. It is possible to define a single symbol as a longer connector expression, and then use that symbol to refer to the longer expression in the dictionary. To do this, simply choose a name for the longer expression, and surround it with angle brackets (<>). Then treat it like a word in the dictionary; list the name, then a colon, then the connector expression that it should stand for. For example, we define "" in the dictionary as follows:

: (Ss+ & ) or SIs- or Js- or Os- or ({[Bsj+]} & Xd- & Xc+ & MX-);

We then use this symbol in many other actual word definitions.

We use many of these macros in the dictionary, to reduce redundancy; there are many connector expressions that are used over and over in longer expressions. Here are a few common ones:

: the "main" connectors for nouns, used to link them to the rest of the sentence (as subject, object, etc).

: the "sub" connectors for nouns, used to link them to modifiers like prepositional phrases and relative clauses.

: These macros are for verbs; they distinguish different forms of the same verb. That is, they contain connector types - like S-, PP-, etc. - that distinguish different forms of the same verb. is for singular verbs, for past participles, for forms which are both simple past and past participle, etc.

: These macros are for verb complements; they stand for different complement expressions. Some verbs can connect to a direct object, using O+; some can connect to an infinitive verb, using TO+; and so on.

3.3. WORD-FILES. The most basic way to write the dictionary is to list all the words in a particular category, followed by a colon, followed by their connector expression. There is another way, however. One can put all the words in a category in a file, choose a name for the file, and put that filename in the dictionary in place of the list of words. When listed in the dictionary, the filename must be preceded by a slash (/).

Here are the word files that are in use at the moment:

words.n.1     singular countable (i.e. not mass) nouns
words.n.2.s   plural nouns ending in "s"
words.n.2.x   plural nouns not ending in "s"
words.n.3     mass nouns
words.n.4     nouns that may be mass or countable
words.n.p     proper names that are also ordinary words when not capitalized (see Section 4.1 for explanation)
words.n.t     nouns that can be used as titles, like "president"

(In the following verb files, the final number indicates the verb form. ".1" is for infinitive-plural forms, ".2" is for singular forms, ".3" is for simple-past / past-participle forms, ".4" is for present participles, ".5" is for gerunds. On intransitive verbs, the present participle and gerund expression are combined into a single dictionary entry.)



words.v.1.(1-4)    intransitive verbs
words.v.1.p        special two-word passives ("lied_to", "paid_for")
words.v.2.(1-5)    optionally transitive verbs
words.v.4.(1-5)    transitive verbs
words.v.5.(1-4)    intransitive verbs that may form two-word verbs with particles like "up" and "out"
words.v.6.(1-5)    optionally transitive verbs that may form two-word verbs
words.v.8.(1-5)    transitive verbs that may form two-word verbs
words.v.10.(1-4)   verbs that may be used in quotation expressions, like "said" ("John is here, he said").
words.adj.1        ordinary adjectives, with no special complements
words.adj.2        ordinary comparative adjectives (e.g. "bigger")
words.adj.3        ordinary superlative adjectives (e.g. "biggest")
words.adv.1        ordinary manner adverbs ("quickly", "angrily")
words.adv.2        ordinary clausal adverbs ("fortunately")
words.adv.3        adverbs like "chemically"
words.y            common year numbers ("1990", etc.)
words.s            US state names and abbreviations

3.4. WORD SUBSCRIPTS. A single word can be given several different dictionary entries. To do this, the entries must be distinguished by giving the words different subscripts. Words may be followed by a subscript such as ".n". For example:

run.n: A+ or B+...
run.v: C+ or D+...

If a word is listed more than once with the same subscript, or if it is listed once with a subscript and once without, the parser will generate a warning message and will ignore one of the entries.

The parser starts at the right end of every string of characters. Any sequence of letters to the right of the right-most period in the string will be considered the subscript. (Periods at the end of a string are simply considered part of the string; see Section 4.5 below.)
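The rule for locating a subscript can be illustrated with a short Python function. This is a sketch of the rule as stated above, not the parser's own code; the function name is hypothetical, and the treatment of a string that begins with its only period is an assumption.

```python
def split_subscript(entry):
    """Split a dictionary string into (word, subscript) at the right-most period."""
    word, dot, tail = entry.rpartition(".")
    # A subscript is a run of letters after the right-most period.
    # Assumption: a string with nothing before the period has no subscript.
    if dot and word and tail.isalpha():
        return word, tail
    return entry, ""


for s in ("run.n", "run.v", "Mrs.", "dog", "3.5"):
    print(s, split_subscript(s))
```

A trailing period, as in "Mrs.", is left as part of the string, and a period followed by digits, as in "3.5", is not treated as introducing a subscript.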

In searching for linkages, the parser will consider each entry for the word as a different word, and will generate all linkages found for all entries. The subscript is shown in the display, thus indicating which entry the parser chose for a particular linkage.

The main word subscripts we use are ".n" for nouns, ".v" for verbs, and ".a" for adjectives. All nouns, verbs, and adjectives are subscripted in this way. Certain other subscripts are used only when needed to distinguish two forms of the same word: ".e" for adverb, ".p" for preposition, ".s" for singular, ".p" for plural, ".t" for title.

3.5. THE COST SYSTEM. We have a system for assigning a cost to a linkage. This allows the parser to express preferences among the linkages it finds. The cost system uses square brackets ("[" and "]"). If a connector, or a series of connectors, is surrounded by square brackets, it is assigned a cost. The amount of cost is equal to the number of square brackets on each side: [A+] will receive a cost of 1; [[A+]] will receive a cost of 2; etc. The parser uses this cost as a criterion for deciding which linkage to output first; it outputs them in order of cost (i.e., lowest cost first).

At the moment, connectors with a cost of 0, 1 or 2 are considered in normal parsing; connectors with a cost of 3 are considered only in "panic mode". See Section 7.5.

Given several linkages of the same cost level, the parser has certain heuristics for choosing the best parse, i.e., the one to output first. It prefers the linkage in which the total length of the links is lowest; and in sentences with conjunctions, it prefers a linkage where the lengths of the conjoined word-lists are similar (see section 5). This information is indicated in the cost vector shown above the linkage:

Unique linkage, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=1)

"DIS" is the connector cost or disjunct cost for the linkage (the "[]" system explained above); "AND" is the difference in length between and-list elements; and "LEN" is the total length of all links in the sentence (minus the number of words--since the total link length is never less than the number of words). "UNUSED" indicates the number of null-links; see section 7.1.
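When the parser's output is processed by another program, the cost vector line can be read into a dictionary. The Python sketch below assumes only the line format shown above; the function name `parse_cost_vector` is hypothetical and is not part of the parser or its API.

```python
import re


def parse_cost_vector(line):
    """Extract the NAME=value pairs from a "cost vector = (...)" line."""
    match = re.search(r"cost vector = \(([^)]*)\)", line)
    if match is None:
        return None
    return {name: int(value)
            for name, value in re.findall(r"(\w+)=(-?\d+)", match.group(1))}


line = "Unique linkage, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=1)"
costs = parse_cost_vector(line)
print(costs)
```

Lines that do not contain a cost vector yield `None`, so the function can be applied to every line of the output.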

3.6. BATCH-MODE. It is possible to make a file of sentences, and then run them through the parser all at once. Simply create a file, with one sentence on each line. When running the parser, type

parse [dictionary name] < [filename]

When inputting a file of sentences, it is sometimes useful to use another feature known as "batch-mode". In this mode the user indicates in the input file which sentences should be rejected, and which should be accepted; the parser then notifies the user when its judgments differ. To say that the parser "rejects" a sentence means that it does not find any linkages without null-links. (Batch-mode does not allow null-links; the "!batch" command automatically turns null-links off. See section 7.1 for an explanation of null-links.) To prepare a file for use in batch-mode, put "!batch" at the top of the file. Then, for each sentence that should be rejected, precede the sentence with "*" in the file. If it should be accepted, precede it with nothing. After processing a batch file, the parser will then print the number of errors in the file: i.e., the number of sentences on which its judgments differ from the judgments indicated by the symbols.

The symbol ":" can also be used at the beginning of a line. This means "linkage should be found with non-zero cost". If a linkage is found with zero cost, or no linkage is found, an error will be reported. (An absence of symbol means "linkage should be found, either with cost or without".)
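The line prefixes used in a batch file can be summarized with a small Python sketch. It only classifies the lines of a batch file according to the conventions described above and does not run the parser; the function name, the label strings, and the two example sentences are hypothetical.

```python
def expected_judgment(line):
    """Classify one line of a batch file by its leading symbol."""
    text = line.strip()
    if not text:
        return None
    if text.startswith("!"):
        return ("command", text)                       # e.g. "!batch", "!echo"
    if text.startswith("*"):
        return ("reject", text[1:].strip())            # no linkage expected
    if text.startswith(":"):
        return ("accept-with-cost", text[1:].strip())  # non-zero cost expected
    return ("accept", text)                            # any linkage expected


batch_file = """!batch
the dog ran
*the dog run
"""
judgments = [expected_judgment(l) for l in batch_file.splitlines()]
for j in judgments:
    print(j)
```

A file prepared this way would then be passed to the parser on standard input, as in the command shown earlier in this section.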

When running batch-mode, the parser will ordinarily output only the number of errors it makes. Thus it is primarily useful for checking sentences, to see if a particular dictionary produces the desired results (or to make sure that no earlier work has been broken by recent changes). If one begins the batch file with the command "!echo", the parser will also output the sentences as it parses them, as well as display information for the sentences on which its judgments disagree with the user's.

Our FTP directory contains a large batchfile entitled "4.0.batch". This file contains about 900 sentences which systematically test many features of the parser. Many sentences are incorrect, and are marked with "*". With version 4.0, this file should parse with 0 errors.

3.7. COMMANDS AND VARIABLES. It is possible to modify the running of the parser in various ways, while running it, by typing in certain commands. The basic commands can be seen by typing "!help". Others are listed under "!variables". Many of these are self-explanatory. For example, "!width" changes the width of the parser display. Other commands relate to speed and robustness features; see section 7.

A few commands deserve special mention. One useful command is "!![word]". This queries the parser for information about a particular word. The parser will list any entries of the word, with their word subscripts, the word-files in which they appear, if any, and the number of disjuncts on each word. (A disjunct is a combination of connectors which constitutes a legal use of the word.) Multiple entries of a word will be listed with their word subscripts.

The "!verbosity" command controls the amount of information that is displayed. With "!verbosity=1" (the default), information such as the following is shown:

linkparser> the quick brown fox jumped over the lazy dog
++++Time 0.04 seconds (0.04 total)
Found 2 linkages (2 had no P.P. violations)
  Linkage 1, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=18)

 +---------Ds---------+             +-------Js-------+
 |     +------A-------+             |    +----Ds-----+
 |     |       +---A--+--Ss--+--MVp-+    |    +--A---+
 |     |       |      |      |      |    |    |      |
the quick.a brown.a fox.n jumped.v over the lazy.a dog.n

Press RETURN for the next linkage.
linkparser>

With "!verbosity=0", no information is shown except for the graphic linkage display. With verbosity set at 2 or 3, information is shown about the individual stages of parsing the sentence. (Information is also shown about the constituent derivation process, if this is being done.) If one wants to suppress the graphic display as well, this can be done with the command "!graphics". (This can be useful if one wants to have only the constituent bracketing as output; in that case, type "!verbosity=0", "!graphics", and "!constituents=1" (or 2).)

4. Special Features of the Dictionary

4.1. CAPITALIZATION. The parser respects capitalization: that is, the use of upper- and lower-case letters. If a string is listed in the dictionary beginning with a capital letter, then a word that is inputted will only match it if it has the same capitalization. (The same with strings with capital letters in the middle, although this is probably of little use.) However, there are a few special cases here.

There is a general category in the dictionary called "CAPITALIZED_WORDS". This is the default category for words whose first character is capitalized. Any such word which is inputted which is not explicitly listed in another category will be assigned to this category. This is of course useful, since most capitalized words are names which are grammatically all the same.

A special situation occurs with words at the beginning of the sentence. If a sentence-initial word has an uncapitalized first letter, it is treated in the normal manner. If it is capitalized, the parser will first look to see if it is listed in the dictionary as either a capitalized word or an uncapitalized word. If not, it will then assign it to the generic "CAPITALIZED_WORDS" category. (If the word is listed both as a capitalized word and an uncapitalized one, the parser will try to use it in both ways. Because there are certain words which are also common names, like "Will" and "Rob", we have created a special category for them, so that when they are used sentence-initially, they will be recognized as possible names.)

The situation at the beginning of the sentence also applies after a colon. Sometimes, after a colon, the following word is capitalized as if it were the beginning of a sentence; the parser recognizes this. So, for example, the following sentence is accepted: "The problem is this: The dog ran."

4.2. HYPHENATED EXPRESSIONS. The dictionary also contains a special category called "HYPHENATED_WORDS". If a string contains a hyphen, and it is not listed in the dictionary, the parser will assign it to the category "HYPHENATED_WORDS". This is, again, useful, since hyphenated words are used somewhat "productively", and it would be very difficult to list them all.

4.3. NUMBER EXPRESSIONS. The dictionary contains a category "NUMBERS". Any numerical expression -- that is, a string consisting entirely of numerical characters -- will be assigned to this category unless it is explicitly listed elsewhere in the dictionary. (The string may also contain a period, i.e. a decimal point, or a comma, as in "3,287". It may also contain a colon; thus time expressions like "4:30" are treated as generic numbers.)

4.4. UNKNOWN WORDS. The dictionary also permits a feature known as "unknown words". A category can be defined using the string "UNKNOWN-WORD.x", where x is any subscript. If a word beginning with a lower-case letter is typed in that is not recognized, it will be assigned to that category. The word is then displayed with a question-mark in brackets, like "blah" below:

  +-----Wd----+
  |    +---D--+--Ss-+-Pp+
  |    |      |     |   |
///// the blah[?].n is here

Several different unknown word categories may be generated, labeled with different subscripts: for example, corresponding to nouns, verbs, adjectives, and adverbs. (These are the four categories we use, labeled .n, .v, .a, and .e, respectively.) The parser will search for all linkages that can be found using each entry. If it only finds a linkage for the "noun" category, then the output will show the unknown word labeled ".n": in effect, the parser is then guessing that the word is a noun.

Version 4.0 of the parser has a new feature for handling unknown words, known as "morpho-guessing". This is a system for guessing the syntactic category of an unknown word (that is, a word not explicitly listed in the dictionary) based on its spelling. Words that end in "-s" are assumed to be plural nouns or singular verbs; these are assigned to a category listed as "S-WORDS" in the dictionary. Similarly, words ending in "-ed" are assumed to be past-tense (or passive) verbs; those ending in "-ing", present participles; those ending in "-ly", adverbs. This greatly improves the ability of the parser to handle sentences containing multiple unknown words. Words that have been treated in this way are marked with a "[!]". Click here for more information about this new feature.

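The suffix-based guessing described above can be pictured with the following Python sketch. It is an illustration only, not the parser's implementation. Only "S-WORDS" is a category name given in the text; "ED-WORDS", "ING-WORDS" and "LY-WORDS" are hypothetical names patterned on it, and the function name is also an assumption.

```python
# Suffix -> category to which an unknown word is assigned.
# Only "S-WORDS" is named in the text; the other names are hypothetical.
SUFFIX_CATEGORIES = [
    ("ing", "ING-WORDS"),
    ("ed", "ED-WORDS"),
    ("ly", "LY-WORDS"),
    ("s", "S-WORDS"),
]


def morpho_guess(word):
    """Guess a category for an unknown word from its spelling, or None."""
    for suffix, category in SUFFIX_CATEGORIES:
        # Require at least one character before the suffix
        if word.endswith(suffix) and len(word) > len(suffix):
            return category
    return None


for w in ("blahs", "blahed", "blahing", "blahly", "blah"):
    print(w, morpho_guess(w))
```

A word that matches none of the suffixes would fall back on the generic unknown-word categories described earlier in this section.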
4.5. PUNCTUATION AND AFFIX-FILES. The parser is capable of handling a variety of punctuation symbols. There are two issues to be discussed here. One is the listing of symbols in the dictionary; the other is the way they are "read" by the parser when they are used in sentences.

Punctuation symbols can be listed in the dictionary just like words, and given ordinary linkage expressions. The same is true for strings containing multiple punctuation symbols or a mixture of letters and punctuation. The problem here is that certain punctuation symbols are also used as the "syntax" of the dictionary: colons, semi-colons, ampersands, etc. Our solution to this is as follows: when listing these special characters, or a string containing them, one must put them in quotation marks:

";": A+ or B-;
"+": C+ or D-;

(The special characters that must be treated this way are precisely those which are used in the dictionary in a "syntactic" way: "(", ")", "{", "}", "[", "]", "@", "%", "&", "*", "+", "-", "/", "".)

When punctuation symbols are used in sentences, they will be used in linkages according to the connector expressions listed in the dictionary, in the normal way. There is a difference, however. It may be noted that although many punctuation symbols are similar to words in the ways they are used, they are often not separated from preceding or following words by spaces. In order for these symbols to be recognized as separate units, then, they must be "stripped off": that is, a space must be inserted between the symbol and the adjacent word.

With version 4.0 we provide a new system for handling this "stripping off" process. Rather than being hard-wired into the system, it can now be easily modified and expanded by the user, using an "affix file" which is read in by the parser when the program is run. An affix file is similar in format to a link grammar dictionary, with lists of strings (they need not be individual symbols, but may be any length), followed by a connector type. Strings to be stripped off from the left are listed under the category "LSTRIP+"; those to be stripped off from the right are listed under the category "RSTRIP+". In the affix-file that we provide, for example, "$" is listed as a "LSTRIP+" symbol; "," is listed as a "RSTRIP+" symbol. (Click here for more information about affix files.)

In the default affix file we provide (4.0.affix), the "RSTRIP+" strings include these:

, . ! ? % ) : ; ' 's 're 'll 've 'd

The "LSTRIP+" strings include these:

$ (

Note that we also use the affix file for a few things besides punctuation. The "right-stripped" list includes strings like "'s", representing a possessive ("John's") or a contraction ("he's"); strings like "'re" (as in "you're") are also included, representing the fact that these should really be treated as independent words. A further point: the parser will perform this stripping process on a string repeatedly, if necessary, perhaps stripping off several characters. For example, it will convert the first sentence below into the second:

John, a professor (who got a raise of 5%), is here

John , a professor ( who got a raise of 5 % ) , is here

There is one difference between right-stripped and left-stripped characters. Right-stripped characters can also be used at the right end of a string defined in the dictionary; for example, one could actually define "Mrs." or "it's" as ordinary dictionary entries. Before stripping off right-stripped characters, the parser will make sure that the string is not present in the dictionary as it appears.
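The stripping behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser's own code: the function names, the longest-suffix-first order, and the rule that a string is never stripped down to nothing are all assumptions. The two lists are the strings quoted above from 4.0.affix.

```python
# Strings quoted above from the default affix file (4.0.affix).
RSTRIP = [",", ".", "!", "?", "%", ")", ":", ";", "'", "'s", "'re", "'ll", "'ve", "'d"]
LSTRIP = ["$", "("]


def strip_token(token, dictionary=frozenset()):
    """Split one space-delimited string into [left strings..., core, right strings...]."""
    left, right = [], []
    # Left stripping: peel LSTRIP strings off the front, repeatedly.
    stripped = True
    while stripped:
        stripped = False
        for s in LSTRIP:
            if token.startswith(s) and len(token) > len(s):
                left.append(s)
                token = token[len(s):]
                stripped = True
                break
    # Right stripping: stop as soon as the string is a dictionary entry as it stands.
    while token not in dictionary:
        for s in sorted(RSTRIP, key=len, reverse=True):  # assumption: longest suffix first
            if token.endswith(s) and len(token) > len(s):
                right.insert(0, s)
                token = token[:-len(s)]
                break
        else:
            break  # nothing left to strip
    return left + [token] + right


def strip_sentence(sentence, dictionary=frozenset()):
    """Apply strip_token to every space-delimited string and re-join with spaces."""
    return " ".join(p for t in sentence.split() for p in strip_token(t, dictionary))


print(strip_sentence("John, a professor (who got a raise of 5%), is here"))
```

Passing a dictionary that contains "Mrs." leaves that string intact, while an empty dictionary splits it into "Mrs" and ".".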

If punctuation symbols are used in the middle of strings, they will be treated just like any other symbols. (And if they are used in sentences in undefined strings, they will be treated as some kind of unknown string.) (Note that periods are special, however; if the string ends with a period followed by a letter, this will be interpreted as the word subscript, not part of the string. See section 3.4 above.)

One exceptional case is quotation marks. Quotation marks may not be defined in the dictionary, and they are simply ignored when they are used in sentences. This is sufficient to handle most uses of quotes; generally, the presence of quotes does not affect the well-formedness of sentences, and it often only subtly affects meaning. However, there are a few constructions, such as the last pair of sentences below, which seem to be correct only when quotes are included.

"John is leaving," she said
John is leaving, she said
I am meeting with my "advisor" today
I am meeting with my advisor today
She said, "John is leaving".
?She said, John is leaving.

We are unable to control such usages at the moment.

The percent sign ("%") is used for comments. If a % appears at the beginning of a sentence input to the parser, that sentence will simply be ignored. This is useful if one is inputting a file of sentences, and one wants to include some comments in the file. A % at the beginning of a line can also be used for putting comments in a dictionary.

4.6. THE SORTING OF UNKNOWN STRINGS. Notice that the parser must make decisions about how to handle strings of characters which are not explicitly listed in the dictionary. The ordering of the decisions is important. At the moment, the parser proceeds as follows:

1. If the string is the first string in a sentence, and its first letter is capitalized, then generate the lower-case-first-letter form of the string, and perform the following steps on both the capitalized and lower-case forms.
2. Strip off any "strippable" strings from the left end of the string (as described above). Perform subsequent steps on both the stripped string and the remainder.
3. Look up the string in the dictionary as is. If there is an exactly matching string in the dictionary, use that.
4. Strip off any "strippable" punctuation string from the right end of the string (as described above). Repeat step 3 on both the punctuation string and the remaining string. If there is more than one such string, repeat steps 3 and 4 repeatedly.
5. If the string begins with a capital letter, assign it to the category "CAPITALIZED-WORD".
6. If the string contains a hyphen, assign it to the category "HYPHENATED-WORDS".
7. If the string consists entirely of numbers, assign it to the category "NUMBERS".
8. If the string still has not been assigned, see if it matches any of the "guessed-word" categories (S-WORD, etc.); if so, assign it to this category.
9. If the string still has not been assigned, assign it to any "UNKNOWN-WORD" categories.
10. If none of these alternatives are available, the parser will say "the following words are not in the dictionary: [whatever]", and will give no other output for the sentence.

If at any point the parser finds more than one way of analyzing a string, it generates both of them, and tries parsing the sentence with both forms of the string. This might happen a) if there are multiple forms of the word in the dictionary (with different subscripts); b) if the string is capitalized and occurs at the beginning of the sentence, and both the capitalized and lower-case forms are listed in the dictionary; or c) if there is more than one UNKNOWN-WORD category.

Any of the generic word categories - UNKNOWN-WORD (any or all of them), S-WORD (or the other "guessed word" categories), CAPITALIZED-WORD, HYPHENATED-WORD, or NUMBERS - may be omitted from the dictionary. In this case, the parser will skip over the corresponding step in the procedure above. For example, if "HYPHENATED-WORD" is omitted from the dictionary, and an unknown hyphenated word is entered, it will be treated like an UNKNOWN-WORD. If no UNKNOWN-WORD categories are present, and a string is entered which fits none of the other categories, the parser will proceed to the final step above (the "not in the dictionary" message).
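The fall-through order of the later steps, and the effect of omitting a generic category, can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not the parser's implementation: capitalization at the start of a sentence and affix stripping are left out, the function name and the "DICTIONARY-ENTRY" label are hypothetical, and the test used for S-WORD (the string ends in "s") is only a guess at what that "guessed-word" category checks.

```python
def classify(token, dictionary, categories):
    """Return the label assigned to `token`, or None if the parser would reject it.

    `dictionary` is the set of ordinary entries; `categories` is the set of
    generic category names actually present in the dictionary.
    """
    if token in dictionary:
        return "DICTIONARY-ENTRY"  # hypothetical label for an exact match
    if token[:1].isupper() and "CAPITALIZED-WORD" in categories:
        return "CAPITALIZED-WORD"
    if "-" in token and "HYPHENATED-WORD" in categories:
        return "HYPHENATED-WORD"
    if token.isdigit() and "NUMBERS" in categories:
        return "NUMBERS"
    if token.endswith("s") and "S-WORD" in categories:  # assumed guessed-word test
        return "S-WORD"
    if "UNKNOWN-WORD" in categories:
        return "UNKNOWN-WORD"
    return None  # "the following words are not in the dictionary: ..."


ALL = {"CAPITALIZED-WORD", "HYPHENATED-WORD", "NUMBERS", "S-WORD", "UNKNOWN-WORD"}
print(classify("well-oiled", set(), ALL))               # hyphen rule applies
print(classify("well-oiled", set(), {"UNKNOWN-WORD"}))  # hyphen category omitted
```

With the hyphenated category omitted, the same string falls through to UNKNOWN-WORD, as described above; with no generic categories at all, the function returns None.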

4.7. THE WALL(S). It proved to be useful to imagine that there was a dummy word at the beginning of every sentence. We call this "the wall". The wall has a linking requirement like any other word; it is listed in the dictionary under "LEFT-WALL". If this entry is included in the dictionary, the wall will be automatically inserted at the beginning of every sentence. Because of the connectivity rule, it is then necessary for the wall to be linked to the rest of the sentence in order for the sentence to be valid.

There is also a "right-hand wall", which is similar to the original wall, at the other end of the sentence. This is only needed for certain punctuation phenomena. In most sentences, we use a special "RW" connector to simply connect the left-hand wall to the right-hand one. The right-wall's dictionary entry is "RIGHT-WALL". (Since the left-wall is much more important than the right-wall, we often refer to the left-wall simply as "the wall".)

In most sentences, the left-wall connects to the sentence with a "Wd" link, and the right-wall connects to the left-wall with "RW". When only these connectors on the walls are being used, they are not displayed in the linkage diagram. When other connectors on the walls are being used, instead or as well, the walls are shown. (For example, the left-wall is shown in questions and imperatives.) To make it so that the walls are _always_ shown, type "!walls".

Either wall can be deactivated by simply removing the "LEFT-WALL" or "RIGHT-WALL" entry from the dictionary.

4.8. IDIOMS. A string of words can be defined as a single dictionary entry. To do this, simply join the words together with underbars:

a_la_mode: A+ or B-;

Most idioms can be interpreted either as a single "idiom" or as a string of words (for example, "in question"). In this case, the parser will find all linkages with both interpretations.

In reading idiomatic strings from the dictionary, the parser breaks them up into individual words and assigns them "dummy" link-types which simply link the words of the idiom together in series. These link-types are assigned four-letter names of the form ID[X][Y], where X and Y are arbitrary letters.
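As a rough illustration of this expansion, the Python sketch below splits an idiom entry at its underbars and chains consecutive words with dummy link names. The helper names and the scheme for choosing the two letters (simply counting up from "AA") are assumptions made for the example; the text above only says that the letters are arbitrary.

```python
from itertools import product
from string import ascii_uppercase


def id_names():
    """Yield dummy link names IDAA, IDAB, ... (hypothetical numbering scheme)."""
    for x, y in product(ascii_uppercase, repeat=2):
        yield f"ID{x}{y}"


def expand_idiom(entry, names):
    """Break an idiom entry into words linked in series by dummy link-types."""
    words = entry.split("_")
    # One link between each pair of neighbouring words.
    return [(left, next(names), right) for left, right in zip(words, words[1:])]


print(expand_idiom("a_la_mode", id_names()))
```

A three-word idiom therefore yields two links in series, one per pair of adjacent words.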

Idioms cannot be given subscripts; if "a_la_mode.a" is included in the dictionary, this will not be accepted. However, an idiom can be listed in the dictionary more than once, without subscripts.

5. Coordinating Conjunctions

Coordination constructions do not fit naturally into the framework of link grammars. We have devised a method for automatically transforming the given link grammar into another one that captures the desired phenomena. (It involves internally generating special links for use with conjunctions, which we call "fat links".) This system is hard-wired in, and cannot easily be modified by the user. However, it has proven to be effective in handling the vast majority of uses of conjunctions. Our discussion will focus on the word "and", although the ideas apply to the use of "or", "but", "either-or", "neither-nor", "both-and", and "not only - but".

5.1. THE HANDLING OF CONJUNCTIONS. We begin by proposing a simple definition of the use of "and" within the framework of link grammars. Then we'll mention a few problems with the definition, and suggest an improvement. The second definition is the one used in our system. It has drawbacks, but on balance it has proven to be remarkably effective.

Given a sequence S of words containing the word "and", a "well-formed 'and' list" L is a contiguous subsequence of words satisfying these conditions:

1. There exists a way to divide L into components (called "elements" of the well-formed "and" list) such that each element is separated from its neighboring elements by either a comma or the word "and" (or both). (The comma and the "and" are not part of the element.) The last pair of elements must be separated by "and" (or a comma followed by "and"). For example, in "The dog, cat, and mouse ran", "dog", "cat", and "mouse" are the elements of the well-formed "and" list "dog, cat, and mouse".

2. Each of the sequences of words obtained by replacing L (in S) by one of the elements of L is a sentence of the link grammar.

3. There is a way of choosing a linkage for each of these sentences such that the set of links outside of the "and" list elements are exactly the same in all of the sentences, and the connectors joining the sentence with its "and" list element are the same. In other words, if we cut the links connecting the element to the rest of the sentence, remove that element from the sentence, and replace it by one of the other elements, then the cut links can be connected to the element so as to create a valid linkage.

The sequence S is grammatical if each instance of "and" in it is part of a well-formed "and" list.
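Conditions 1 and 2 can be made concrete with a short Python sketch. It only splits a candidate "and" list into elements and builds the sentences that condition 2 requires to be grammatical; it does not parse them, and it does not check that the last pair of elements is separated by "and". The function names and the already-tokenized input (commas as separate tokens) are assumptions of the example.

```python
def split_elements(and_list):
    """Split the tokens of a candidate "and" list at commas and "and"."""
    elements, current = [], []
    for tok in and_list:
        if tok in (",", "and"):
            if current:  # a comma directly followed by "and" closes only one element
                elements.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        elements.append(current)
    return elements


def sub_sentences(tokens, start, end):
    """Replace the list tokens[start:end] by each of its elements in turn."""
    return [" ".join(tokens[:start] + element + tokens[end:])
            for element in split_elements(tokens[start:end])]


tokens = "The dog , cat , and mouse ran".split()
print(split_elements(tokens[1:7]))
print(sub_sentences(tokens, 1, 7))
```

Each string returned by sub_sentences would then have to be accepted by the link grammar, with matching linkages outside the element, for the list to count as well-formed.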

For example, consider the sentence "We ate popcorn and watched movies on TV for three days." The phrase "ate popcorn and watched movies on TV" is a well-formed "and" list because it can be divided into elements "ate popcorn" and "watched movies on TV", which satisfy all of the conditions above. The following two linkages show this. Note that in both cases the element is attached to the rest of the sentence with an "S" to the left and an "MV" to the right.

              +-------------------MVp-------------------+-----Jp-----+
      +-Wd-+Sp+--Os--+                                  |     +--Dmc-+
      |    |  |      |                                  |     |      |
    ///// we ate popcorn.n                            for.p three days.n

                                  +---------MVp---------+-----Jp-----+
      +-Wd-+----------Sp----------+---Op---+--Mp-+Js+   |     +--Dmc-+
      |    |                      |        |     |  |   |     |      |
    ///// we                   watched movies.n on TV for.p three days.n

There is a major problem with this definition. It contains no requirement that the words of an element of an "and" list be connected to each other, nor be related in any way (other than being contiguous). This allows many clearly ungrammatical sentences to be accepted, and generates numerous spurious linkages of correct sentences. For example, it would imply that "I like the beer John and wine Harry drank" is a valid sentence.

We have two techniques to limit the set of sentences deemed grammatical by this rule. The first is to simply restrict the types of connectors that can connect the element of the "and" list to the rest of the sentence. The list of connectors allowed to do this is contained in the list "ANDABLE-CONNECTORS" in the dictionary. (If a connector type is included in this list, this means, in effect, that several of them may be joined to a connector of the opposite type. So, including "S-" on this list allows "John ran and skipped".) See Section 5.2 for further discussion of "andable connectors".

The second method is to restrict the definition of a well-formed "and" list. Say that a well-formed "and" list is a "strict and list" if it also satisfies the following condition: Each element must be connected to the rest of the sentence through exactly one of its words. (It may use many connectors.)

This is the system that we have implemented. This logic of dealing with conjunctions is reflected in the parser's output. A sentence with conjunctions is outputted showing the sentence split up into several sub-sentences:

                              +---J--+
      +--------S----------+-MV+  +-D-+
      |                   |   |  |   |
    John, Dave, and Fred ran in the park

                              +---J--+
           +---S----------+-MV+  +-D-+
           |              |   |  |   |
    John, Dave, and Fred ran in the park

                              +---J--+
                      +S--+-MV+  +-D-+
                      |   |   |  |   |
    John, Dave, and Fred ran in the park

Alternatively, the conjunctive linkages can be merged into one in the display. To use this mode, type "!union". For the above sentence, "!union" mode yields this:

      +----------Ss----------+
      |      +-------Ss------+    +---Js--+
      |      |          +-Ss-+-MVp+  +-Ds-+
      |      |          |    |    |  |    |
    John , Dave , and Fred ran.v in the park.n

In some cases, "!union" mode may result in "crossing" links - the one situation where this is possible:

     +------Ds------+
     |    +-------Ss------+
     +-Ds-+         +--Ss-+
     |    |         |     |
    the cat.n and dog.n ran.v

Of course, several conjunctions may occur in a sentence: "John and Fred ran and jumped". In such cases, sub-linkages will be generated for each combination of and-list elements: "John ran", "John jumped", "Fred ran", and "Fred jumped". Nested "and" structures are also allowed, like "The people and their sons and daughters were there". In sentences containing several conjunctions, a large number of sub-linkages may be generated for a single linkage. For this reason, it may be preferable to use the "!union" display.
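The growth in the number of sub-linkages is simply a Cartesian product over the and-lists. The Python sketch below enumerates the sub-sentences for the example above; it works on plain word lists (the function name and the input representation are assumptions of the example) and says nothing about how the parser builds the linkages themselves.

```python
from itertools import product


def sub_sentence_combinations(and_lists):
    """Pick one element from each and-list, in every possible way."""
    return [" ".join(choice) for choice in product(*and_lists)]


# "John and Fred ran and jumped": two and-lists with two elements each.
combos = sub_sentence_combinations([["John", "Fred"], ["ran", "jumped"]])
print(combos)
print(len(combos))
```

Two lists of two elements give four sub-sentences; three lists of three elements would already give twenty-seven, which is why the merged "!union" display can be easier to read.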

Conjunctions are a frequent source of ambiguity. For example, in the sentence "Several big cats and dogs with sharp teeth chased me", "several" may or may not apply to "dogs" (as a plural noun, "dogs" does not require a determiner); "big" may or may not apply to dogs; and "with sharp teeth" may or may not apply to cats. Linkages for all of these possibilities will of course be generated.

5.2. USES OF CONJUNCTIONS. The implementation of conjunctions is "hard-wired" in, and cannot be easily modified. However, it covers the vast majority of uses of coordinating conjunctions. First of all, it allows a wide variety of connector types to be used with conjunctions. As mentioned above, this is controlled by the connectors listed in the "ANDABLE-CONNECTORS" list in the dictionary. It can be seen that most common connector types (both "+" and "-" forms) are included on the list, permitting a variety of conjunctive expressions. Here are examples of some of the more commonly used "andable-connectors":

John and Fred ran (S+)
John ran and jumped (S-)
I saw Sue and Mary (O-)
I saw and greeted Sue (O+)
She left with John and Fred (J-)
The dog I saw and chased was black (B-, S-)
It was big and black (P-)
The dog and the cat I saw were black (B+, R+)
She arrived and left on Tuesday (MV-)
She did it quickly and efficiently (MV-)
The dog and cat ran (D-)
I told her that I was coming and that you would be late (TH-)
What did you tell her and what did she say (Wq-)
I left and she followed (Wd-)
Sue, a teacher and scholar, is here (MX-, Xc+, Xd-)

Some connector types are not included on the "andable" list; the corresponding conjunctive usages are therefore not permitted. Many of these usages are not exactly ungrammatical, but simply never occur, like PP+ ("?They have and we have gone"), G- ("?Fred Smith and Jones are here" [meaning "Fred Smith and Fred Jones are here"]), ND+ ("?She left three days or weeks ago"), and EA+ ("?She is very or somewhat competent"). Others truly seem ungrammatical, like Q+ ("*Would or could you go?").

As well as "and", the system also handles the conjunctions "or", "but", "either-or", "neither-nor", "both-and", and "not only - but":

     +--------------Sp-------------+
     +-----Sp----+--Os-+           +--Op---+
     |           |     |           |       |
    we          ate popcorn or  watched movies
    we either   ate popcorn or  watched movies
    we neither  ate popcorn nor watched movies
    we both     ate popcorn and watched movies
    we          ate popcorn but watched movies
    we not only ate popcorn but watched movies

All the words involved in conjunctive constructions--"and", "but", "or", "both", "nor", "either", "neither", "not", and "only"--must be included in the dictionary. (If such a word is removed, its conjunctive use will be disabled.) However, such words may also be given ordinary linkage expressions, and in fact are. These ordinary usages are considered along with their conjunctive usages. If no ordinary linkage expression is desired for a word, simply give it a linkage expression containing a dummy connector of some kind that will never be used, like "NO+".

A few usages of coordinating conjunctions are handled using ordinary link logic (this is why there are ordinary connector expressions for these words in the dictionary). There is some overlap between the special ("fat-link") handling of conjunctions and the ordinary handling, so that some sentences receive multiple parses. For example, ordinary clauses conjoined together will receive two parses: "John ran and Fred walked". See the entries in the Guide-To-Links on "W" and "CC" for discussion of these ordinary usages of conjunctions.

5.3. SUBSCRIPTS. How should subscripts (on the connector names) be dealt with? When two or more connectors with different subscripts are combined with "and", they may only connect to a connector that may connect to all of them. For example, consider the following dictionary:

a: Ds+;
the: D+;
those: Dm+;
cats dogs: Dm-;
cat dog: Ds-;

Among the determiners above only "the" can grammatically be allowed to modify the "and" list "cats and dog". This is because the only connector which matches Dm- and Ds- is D+, not Ds+ or Dm+. This is the solution we implement.
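The rule can be checked mechanically with a small Python sketch of subscript matching. It assumes the usual convention that connector names consist of upper-case letters followed by lower-case subscripts, that "*" and missing subscript positions match anything, and it ignores the "+"/"-" direction; the function names are hypothetical. It does not model the special treatment of "Ss+" connectors joined by "and", which is discussed separately in this section.

```python
import re


def split_connector(conn):
    """Split e.g. "Ds+" into its upper-case name "D" and its subscripts "s"."""
    m = re.fullmatch(r"([A-Z]+)([a-z*]*)[+-]?", conn)
    return m.group(1), m.group(2)


def matches(a, b):
    """True if two connector names are compatible, subscript by subscript."""
    name_a, sub_a = split_connector(a)
    name_b, sub_b = split_connector(b)
    if name_a != name_b:
        return False
    # zip stops at the shorter subscript string: missing positions are wild-cards.
    return all(x == y or "*" in (x, y) for x, y in zip(sub_a, sub_b))


determiners = {"a": "Ds+", "the": "D+", "those": "Dm+"}
and_list = ["Dm-", "Ds-"]  # "cats" and "dog"
allowed = [w for w, c in determiners.items() if all(matches(c, n) for n in and_list)]
print(allowed)
```

Only a determiner whose connector matches every connector in the "and" list survives the filter.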

There is an exception to be handled here, however. The system we've described so far would accept "the dog and cat runs", while rejecting "the dog and cat run". Both of these judgements are wrong because in English whenever two singular subjects are "anded" together they become plural. We have incorporated this exception: "Ss+" connectors, when "anded" together, may connect to an "Sp-", but not to an "Ss-".

5.4. SOME PROBLEMS. There are a few problems to be discussed. Some of these are handled by the current system; others are not. One problem is sentences like the following:

I gave Bob a doll and Mary a gun.
This is a problem Moscow created and failed to solve.

The former will be rejected since in "I gave Bob a doll", "gave" is linked to both "Bob" and to "doll". Thus, "Bob a doll" cannot be an element of a strict "and" list. In the second sentence, "Moscow" needs to connect to "failed" and "problem" must connect to "solve", so "failed to solve" cannot be an element of a strict "and" list. This phenomenon does occur (although rarely) in formal English, so we would like to solve it. The problem remains in our current system.

Another problem arises with embedded clauses. Consider the following linkage of the sentence "I think John and Dave ran".

    +-S-+--C--+-----S------+
    |   |     |            |
    I think John and Dave ran

                      +-S--+
                      |    |
    I think John and Dave ran

This linkage is a combination of the following two sentences: "I think John ran" and "Dave ran". This linkage should clearly be rejected. (Actually, this linkage would not be found anyway in the current version, but this is a simple demonstration of the problem.) Intuitively, the problem with this linkage is that the same "S" link (the one between "and" and "ran") is being used to mean something that "I think" ("John ran") and also something that is just a fact ("Dave ran"). We have devised a system for detecting such patterns, using post-processing (see section 6). As mentioned above, we handle conjunction sentences by expanding the original sentence into several subsentences. We then compute the domain structure of the resulting linkage of each sentence. Finally, the original linkage is deemed incorrect if a pair of links descending from the same link (e.g., the "S" links in the two sentences above) do not have the same domain ancestry (that is, are not contained in the same set of domains). Linkages which are considered incorrect in this way have the message "inconsistent domains" at the top of the display.

Another problem concerns the different kinds of conjunctions. Right now, our system does not distinguish between the various kinds of conjunctions allowed; any of them may be used with any "andable" connector. However, there appear to be different constraints on different conjunctions. This results in some false positives:

I saw John and Fred
*I saw John but Fred
The dog or cat ran
*The either dog or cat ran

A few other smaller problems should be mentioned. Sometimes adverbs are used with conjunctions:

He talked to Steve and, apparently, Fred
*He talked to, apparently, Fred

As the second sentence shows, it is the conjunction that makes the first adverbial use valid. We have no good way of handling this construction. Secondly, there are some special uses of punctuation with conjunctions. Sometimes, a comma is inserted both before and after the final element in an "and" list (ex. 1). And sometimes semi-colons may be used instead of commas, particularly when the and-list elements themselves contain commas (ex. 2).

1. John, and Steve, are coming
2. John; my advisor, Steve; and several other people are coming

Thus our system still needs some work in the area of conjunctions.

6. Post-Processing

6.1. THE LOGIC OF POST-PROCESSING. Besides conjunctions, there are certain phenomena in English which the parser is incapable of dealing with in its basic form. To solve these problems, we developed a post-processing system, based on a concept we call "domains". A domain is a subset of the links that make up a sentence. After a linkage has been found, the post-processing mechanism goes through the linkage and divides the sentence up into domains based on the kind of links that are present in the sentence. It then further divides the links into "groups": sets of links which share a particular domain membership. It then applies rules which may declare the linkage invalid based on the combinations of links present in a given group.

The information about the domains to be formed and the rules to be applied must be included in a "post-process knowledge file". The file we provide with version 4.0 is called "4.0.knowledge". This is the default post-process file, and will be used if no other file is specified. As explained earlier, a post-process file can be specified on the command line as follows:

parse [dictionary] -pp [post-process file]

If you type "-ppoff" rather than "-pp [post-process file]", then no post-process file will be used and no post-processing will be done.

6.2. THE DOMAIN STRUCTURE. A domain is started by a certain type of link; we call this the "root link" of the domain. The word on the left end of this link is the "root word". Different types of links start different types of domains. To understand how domains are formed, imagine a linkage as a kind of street map, through which paths can be traced. In most cases, the domain of a link consists of all the links in the sentence that can be reached from that link, without tracing through the root word. (Another way of saying this is: the domain contains everything reachable from the right end of the root link, without tracing through the root link itself.) For example, assume that the C link in the following sentence begins a domain. This domain will include the Sp link and the I, but not the Ss and the O.

         +--C---+
     +-Ss+O-+   +Sp(e)+I(e)+
     |   |  |   |     |    |
    He told me they would go

(The letters in parentheses indicate that the Sp and I link are in an e-type domain.) Very roughly speaking, domains tend to include the links to the right of that link. However, there are many exceptions. In the following sentence, the C link starts a domain; this includes the Ss and O links, but not the Xc, CO, or Sp, even though the Sp is entirely to the right of the C.

      +---------CO---------+
      +-------Xc--------+  |
      +-C-+Ss(s)+O(s)+  |  +Sp+
      |   |     |    |  |  |  |
    After he   saw  us  , we left

In the case below, the domain started by the C link actually extends back to the left of the root link, to contain the B link as well. This follows naturally from the way domains are defined. The B link can be reached from the right end of the C link, without tracing left through the C. In this case, note that the B link extends to the left of the root word.

     +---------Bsw(e)------+
     |   +---I---+         |
     |   +SI+    +-C--+S(e)+
     |   |  |    |    |    |
    Who do you think you  saw

If one wants a certain link type to start a domain, it must be included in the list "DOMAIN_STARTER_LINKS" in the file "post-process.knowledge". It must also be included in the following list ("STARTING_LINK_TYPE_TABLE"), along with the name of the domain-type that it starts.

As we saw above, it is possible for domains to be traced back to the left of the root word; and from there, they might be extended indefinitely. This tends to be prevented, however, by what we call "restricted links". If a link is restricted, this means that if a domain is being traced through it to the left, and if the link extends to the left of the root word of the domain (the left end of the root link), the domain will be traced no further through the restricted link. So, for example, Bsw is a restricted link. This means that, in the sentence above, the "e" domain started by the C will contain the Bsw; but if there are any links which can be traced from the left end of the Bsw (i.e., another link coming out of the word "who"), they will not be included in the domain.
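The "street map" picture of domain tracing amounts to a graph search, which the following Python sketch illustrates for ordinary domains. It is a simplified reading of the description above, not the code in post-process.c: links are written as hypothetical (name, left word, right word) triples with words numbered from 0, the root link itself is left out of the domain, tracing never continues through the root word, and a restricted link that reaches to the left of the root word is added but not followed further.

```python
from collections import deque


def trace_domain(links, root, restricted=()):
    """Return the names of the links in the ordinary domain started by `root`."""
    _, root_word, start = root
    domain, seen, queue = set(), {start}, deque([start])
    while queue:
        word = queue.popleft()
        for link in links:
            name, left, right = link
            if link == root or word not in (left, right):
                continue
            other = right if word == left else left
            domain.add(name)
            if other == root_word:
                continue  # never trace through the root word
            if name in restricted and other < root_word:
                continue  # restricted link: stop at its left end
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return domain


# "He told me they would go": words 0..5, the C link (told -> they) starts the domain.
he_told = [("Ss", 0, 1), ("O", 1, 2), ("C", 1, 3), ("Sp", 3, 4), ("I", 4, 5)]
print(sorted(trace_domain(he_told, ("C", 1, 3))))

# "Who do you think you saw": the domain of C reaches back to the restricted Bsw link.
who_do = [("Bsw", 0, 5), ("I", 1, 3), ("SI", 1, 2), ("C", 3, 4), ("S", 4, 5)]
print(sorted(trace_domain(who_do, ("C", 3, 4), restricted={"Bsw"})))
```

For the first sentence the search reaches only Sp and I; for the second it reaches S and Bsw but stops at "who", in line with the examples in this section.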

There are a few other complications in the way domains are generated.

1. The root link of a domain may or may not be included in the domain it starts. Root links will not be included, unless they are listed in a special list in post-process.c, "DOMAIN_CONTAINS_LINKS".

2. Some domain-types are "bounded domains". This means that they are not allowed to extend to the left of their root link at all (even with a "restricted link"). If they do extend in this way in a linkage, the linkage will be declared invalid. (See "C" in the Guide to Links for an explanation of how this is used.) Domain types in this category must be listed under "BOUNDED_RULES".

3. As well as ordinary domains, there are two special kinds. One kind is "urfl" domains. As well as including everything that can be reached from the right end of the root link, these domains include everything that can be reached from the left end of the root link, tracing to the right, underneath the root link (but not over it), hence the name -- "Under Root From Left". In the example below, the TOo link starts an "urfl" domain; as well as including the "I" link, as a normal domain would, this domain also includes the O.

        +---TOo--+
    +-S-+O(x)+   +I(x)+
    |   |    |   |    |
    I asked him to   go

The final kind of domain is "urfl-only". These include ONLY links that can be reached from the left of the root link, tracing to the right underneath the root link. In the case below, the Ss*g link starts an "urfl-only" domain.

       +-----Ss*g------+
       +---O(d)---+    |
       |     +D(d)+    +-O+
       |     |    |    |  |
    playing the piano is fun

"Urfl" and "urfl-only" domains are defined by the link-types that start them. Any link that is added to the "URFL_DOMAIN_STARTER_LINKS" list will start an "urfl" domain; any link that is added to the "URFL_ONLY_DOMAIN_STARTER_LINKS" list will start an "urfl-only" domain.

4. Finally, certain link types are simply ignored by post-processing. No domains may be traced through them, nor can they start domains. These are listed under "IGNORE_THESE_LINKS".

Domains may be nested; a link may therefore be in several domains at once. The domain membership of a given link can be shown in the following way:

      +----------------------RW-------------------+
      +--W-+-S+-P-+-MV-+-C+-S-+-C+-S+--P--+       |
      |    |  |   |    |  |   |  |  |     |       |
    ///// he got mad when I said I was leaving  \\\\\

                   /////    RW      RW     RIGHT-WALL
    (m)            /////    Wd      Wd     he
    (m)            he       Ss      S      got
    (m)            got      Pa      Pa     mad
    (m)            mad      MV      MVs    when
    (m)            when     Cs      C      I
    (m) (s)        I        Sp*i    S      said
    (m) (s)        said     Ce      C      I
    (m) (s) (e)    I        Sp*i    S*i    was
    (m) (s) (e)    was      Pg      Pg     leaving

To have the parser produce a "link list" of this kind, type "!links". (In the structure shown here, the domain structure is strictly hierarchical; every domain that is partially inside another is completely inside that domain. There is no a priori reason why domain structure should always be strictly hierarchical; but we believe that, given the current grammar, it will almost always be strictly hierarchical, except in a few very obscure cases.)

6.3. GROUPS. The domain structure is really a means to creating a more useful kind of structure. This is the "group". A "group" of links is the set of links that have the same domain membership. In the above example, then, the second "Sp*i" (between "I" and "was") and the "Pg" are part of the same group. The "Ce" and the first "Sp*i" are in another group. The Ce and Pg are not in the same group. Groups correspond roughly to subject-verb expressions - groups of links that are part of a clause, but not part of any dependent clauses within that clause. For example, in the above case, "He got mad" is one subject-verb expression; "I said ..." is another; "I was leaving" is a third.
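Forming groups from a link list is a matter of collecting links by their exact domain membership, as the Python sketch below shows for the link list above. The data layout (a tuple of domain names paired with a link name) and the function name are assumptions of the example, and each link is named here by the more specific of its two connector names.

```python
from collections import defaultdict

# (domain membership, link name) for "he got mad when I said I was leaving".
link_list = [
    ((), "RW"),
    (("m",), "Wd"), (("m",), "Ss"), (("m",), "Pa"), (("m",), "MVs"), (("m",), "Cs"),
    (("m", "s"), "Sp*i"), (("m", "s"), "Ce"),
    (("m", "s", "e"), "Sp*i"), (("m", "s", "e"), "Pg"),
]


def form_groups(entries):
    """Collect links that share exactly the same domain membership."""
    groups = defaultdict(list)
    for domains, link in entries:
        groups[domains].append(link)
    return dict(groups)


for domains, links in form_groups(link_list).items():
    print(domains, links)
```

The three non-empty memberships correspond to the three subject-verb expressions mentioned above.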

6.4. POST-PROCESSING RULES. The domain structure thus divides a sentence into groups of links. This then allows us to enforce constraints on the link-types that are in a subject-verb expression. This is useful in cases where there are constraints on the combinations of links that can be present in a clause, but the links may be separated (i.e., they may not all connect to the same word), making the constraints difficult to enforce using link logic. We do this by using certain kinds of rules. One is the "contains_one" rule. This says that if a group contains a link of a certain type, it must contain a link of another type. Another kind of rule is the "contains_none" rule. This says that if a group contains one kind of link, it may not contain any of a certain kind of link.

In each case, we have a "triggering" link-type: a link that triggers the rule, and enforces a certain constraint. We also have a "criterion" link-type: a link that defines the constraint (whether it is "must contain X" or "may not contain X"). The triggering link for a rule must be contained in a line of the form

X, Y Z, "Warning!",

where "X" is the triggering link-type, "Y" and "Z" are the criterion links for the rule, and "Warning!" is a message that will be outputted when the rule is violated. (This warning will only be outputted when the incorrect linkage itself is outputted, and this only happens in "!bad" mode. It will also only be outputted when the "!links" display is shown.)

If such a rule is listed under "CONTAINS_ONE_RULES", this means, "A group containing an X link must contain either a Y or a Z". (The group may contain more than one from the set: a Y and a Z, 2 Y's, etc.) If it is listed under "CONTAINS_NONE_RULES", this means "a group containing an X link may not contain either a Y or a Z".
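The two rule types can be paraphrased as a small Python sketch that checks one group against a set of rules. It is an illustration under stated assumptions: the function name and the (trigger, criteria, message) tuple layout are hypothetical, and link names are compared by plain equality rather than by the link-type matching that post-processing actually uses.

```python
def check_group(group, contains_one=(), contains_none=()):
    """Return the warning messages of all rules violated by one group of links.

    `group` is a list of link names; each rule is (trigger, criteria, message).
    """
    warnings = []
    for trigger, criteria, message in contains_one:
        # A triggered contains_one rule needs at least one criterion link present.
        if trigger in group and not any(c in group for c in criteria):
            warnings.append(message)
    for trigger, criteria, message in contains_none:
        # A triggered contains_none rule forbids every criterion link.
        if trigger in group and any(c in group for c in criteria):
            warnings.append(message)
    return warnings


rule = ("X", ["Y", "Z"], "Warning!")
print(check_group(["X", "Y"], contains_one=[rule]))   # satisfied
print(check_group(["X"], contains_one=[rule]))        # violated
print(check_group(["X", "Z"], contains_none=[rule]))  # violated
```

A group that does not contain the triggering link is never affected by the rule, whichever list the rule is in.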

One can also define a symbol as a set of links:

@Blah: Y Z

In this line, "@Blah" is defined as the set of links {Y, Z}. "@Blah" can then be used in actual rules. Given this definition, the rule below will have the same effect as the rule stated above:

X, @Blah, "Warning!",

where "Y" and "Z" are the criterion links for the rule.

A further kind of rule is used in post-processing, which does not relate to domain structure. If a certain link type is included in the list "MUST_FORM_A_CYCLE", this means that any links of that type may only be used in cycles. For example, "R" is on this list. The linkage below satisfies this rule in the case of "R", because "R" is used in a cycle here.

            +-------Ss------+
  +----Wd---+----Bs---+     |
  |    +-Ds-+--R-+Sp*i+     +-Pp-+
  |    |    |    |    |     |    |
///// the man.n I.p saw.v is.v here
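One way to test the cycle requirement is to ask whether the two ends of a link remain connected when that link is removed. The sketch below does this for the linkage above. The function name "link_in_cycle" and the (left word, right word, type) encoding are assumptions made for illustration, not the parser's internal representation.

```python
from collections import defaultdict

def link_in_cycle(links, target):
    """True if `target` lies on a cycle, i.e. its ends stay connected without it."""
    left, right, _ = target
    graph = defaultdict(set)
    removed = False
    for a, b, name in links:
        if not removed and (a, b, name) == target:
            removed = True  # leave the target link out of the graph
            continue
        graph[a].add(b)
        graph[b].add(a)
    # Depth-first search from one end of the removed link to the other.
    seen, stack = {left}, [left]
    while stack:
        node = stack.pop()
        if node == right:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False

# Word positions: 0=/////, 1=the, 2=man, 3=I, 4=saw, 5=is, 6=here
LINKS = [(0, 2, "Wd"), (1, 2, "Ds"), (2, 3, "R"), (3, 4, "Sp*i"),
         (2, 4, "Bs"), (2, 5, "Ss"), (5, 6, "Pp")]
```

Here the "R" link passes because "man" and "I" are still joined through "Bs" and "Sp*i"; a link such as "Ds" would fail the same test.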

Certain link-types (relating to punctuation) proved to disrupt the domain structure; it seemed preferable for them simply to be ignored by post-processing. Thus we created a list in post-process.knowledge, "IGNORE_THESE_LINKS". Links on this list are ignored in the creation of the domain structure (that is, domains are not traced through them), and also in the application of the post-processing rules. (They may be included in the "MUST_FORM_A_CYCLE" list, however, if desired.)

6.5. POST-PROCESSING AND CONJUNCTIONS. The interaction of post-processing and conjunctions is straightforward. As discussed earlier, any linkage with a conjunction can be regarded as several sub-linkages, and in fact is generally displayed that way. In post-processing, these sub-linkages are treated individually; domains are generated and rules are applied in each sub-linkage. A conjunctive linkage is only approved by post-processing if all of its sub-linkages are approved.

6.6. LINK-TYPE MATCHING IN POST-PROCESSING. Link-type matching in post-processing requires some explanation. In the dictionary, subscripts are used to create sub-categories of connector-types. "Ss+" will link with "Ss-" and "S-", but not "Sp-". The character "*" is used as a wild-card; it will match any character. An unsubscripted connector name, like "S+", can thus be regarded as equivalent to "S***...+".

The post-processor also requires a system for matching connector types. While the linkage stage is looking at link-types on connector expressions, post-processing is looking at the resulting link-types that are formed when a linkage is complete. (If an "Ss+" has linked together with an "S*a-", what the post-processor sees is "Ssa".) It then compares them to link-names listed in "post-process.c" (as domain-starting links, as triggering links for rules, as criterion links for rules, etc.). The link-type matching system used in post-processing is similar to the linkage-level one, but it is a little different. As mentioned above, from the dictionary's point of view, "*" is a wild-card character that matches any other character (see section 3.1). From the post-processor's point of view, however, "*" is just another character; an "S*" in a linkage does _not_ match an "Ss" in post-process.knowledge. Nor does an "S*" in post-process.knowledge match an "Ss" in a linkage. However, there is also a wild-card character in post-processing; this is "#". An "S#" in post-process.knowledge does match an "Ss" in a linkage, as well as an "S*".

As mentioned above, at the linkage stage, the parser treats any unsubscripted places in a link name as "*"s. For the post-processor, however, any empty subscript places in a link name (either in a linkage or in a post-processing rule) are treated as "#"s. In short: at both the linkage stage and the post-processing stage, empty subscript places are treated as wild-cards that match anything. For explicitly labeling a subscript place as a wild-card, however, one uses "*" at the linkage stage and "#" in post-processing.
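The difference between the two matching schemes can be sketched in a few lines of Python. This follows the description given in this section, not the parser's source code; the function names are hypothetical, and direction signs ("+" and "-") are assumed to have been stripped already.

```python
def split_name(name):
    """Split a link name such as "Ssa" into upper-case type and subscript."""
    head_len = 0
    while head_len < len(name) and name[head_len].isupper():
        head_len += 1
    return name[:head_len], name[head_len:]

def names_match(a, b, wildcard):
    """Compare two link names place by place, using the given wild-card."""
    head_a, sub_a = split_name(a)
    head_b, sub_b = split_name(b)
    if head_a != head_b:
        return False
    width = max(len(sub_a), len(sub_b))
    # Empty subscript places are padded with the stage's own wild-card.
    pairs = zip(sub_a.ljust(width, wildcard), sub_b.ljust(width, wildcard))
    return all(x == y or wildcard in (x, y) for x, y in pairs)

def linkage_match(connector_a, connector_b):
    """Linkage stage: "*" is the wild-card."""
    return names_match(connector_a, connector_b, "*")

def pp_match(rule_name, link_name):
    """Post-processing: "#" is the wild-card; "*" is an ordinary character."""
    return names_match(rule_name, link_name, "#")
```

With these definitions, pp_match("S#", "Ss") and pp_match("S#", "S*") both succeed, while pp_match("S*", "Ss") fails, as the text describes.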

The specific uses of post-processing are fully explained in the "Guide to Links". For example, see "SF: filler-it", "SI", and "MV: Comparatives".

6.7. THE "!BAD" COMMAND. For people editing the dictionary or post-processing, a useful display feature is "!bad". Normally the parser displays linkages in order of cost (lowest-cost ones first); additional linkages after the first one can be seen one at a time by typing RETURN. Only linkages passing post-processing are output. If the "!bad" variable is toggled, however, the parser will output _all_ of the linkages it finds, including those that fail post-processing. The invalid linkages will be output after the valid ones, one by one, as long as the user types RETURN. The parser will only output the linkages it finds at whatever stage it had reached when it found a valid linkage. For example, if it had reached null-link stage 2 before finding its first valid linkage, it will also output invalid linkages found at null-link stage 2 (see section 7.1 for an explanation of null-links). There is no way of seeing invalid linkages found at earlier stages.

7. Speed and Robustness Features

The original version of the parser, as described in previous sections, did an exhaustive search for all correct linkages; if none were found, it produced nothing. However, this meant that the parser was not "robust": it could not do anything useful with a sentence unless it could parse the entire thing. It was also quite slow. We have developed several remedies to these problems.

7.1. THE NULL-LINK SYSTEM. An important feature of the parser is the "null-link" system. This effectively allows robust parsing: that is, it allows the parser to assign some structure to a sentence even when it cannot fully interpret it. Basically, if the parser cannot parse a sentence normally (that is, if it cannot find any valid linkages), it tries ignoring one word in the sentence. It finds all the linkages it can, ignoring just one word (some linkages may ignore one word, some may ignore another). This is "null link stage 1". Failing that, it then attempts to find linkages ignoring 2 words. This is "null link stage 2". Failing that, it will continue to increment the number of null links until it finds some valid linkages; it will then output all the linkages found at this stage, and stop. There may be some cases where it cannot find a valid linkage unless it ignores _all_ the words in the sentence; in this case, it simply gets to "null link stage N" (where N is the number of words in the sentence), and then gives up.

In the graphic display, "null-linked" words are shown in brackets, with no links attached to them. In the sentence below, "gosh" and "," are null-linked:

            +--Dsu--+---Ss---+----O---+
            |       |        |        |
[gosh] [,] this sentence.n uses.v null-links

In null-link parsing, the connectivity requirement is suspended (see Section 1.3). This means that disconnected "islands" may form. However, each island represents one added null link. That is, if a sentence can be parsed as three disconnected islands (but with all the words otherwise connected with regular links), this linkage will be found at null link stage 2.
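One reading of this accounting, offered here as an assumption rather than as the parser's actual bookkeeping, is that a bracketed word counts as an island of its own, so that the null-link stage equals the number of islands minus one. The sketch below counts islands with a small union-find; the function name and the (left, right) link encoding are hypothetical.

```python
def null_link_count(num_words, links):
    """Number of islands minus one, treating an unlinked word as an island.

    links: (left word position, right word position) pairs.
    """
    parent = list(range(num_words))

    def find(i):
        # Follow parent pointers to the island's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        parent[find(a)] = find(b)
    islands = len({find(i) for i in range(num_words)})
    return islands - 1
```

For the example above (six tokens, with links only among "this", "sentence", "uses" and "null-links") this gives 2, and a sentence split into three fully linked islands also gives 2.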

Note that the null-link system respects post-processing. It keeps incrementing the number of null-links until it finds linkages that pass post-processing. For example, if linkages are found at null-link stage 0, but they all fail post-processing, the parser will decide that no valid linkage has been found and will proceed to null-link stage 1. However, if some of the links in a linkage are not indirectly connected to the left-wall (as may occur in some "islands"), they may not be contained in any post-processing domain. Such links will not be subject to "contains_one" and "contains_none" rules, which apply only to links that are in a domain.

The null-link system can be turned on or off by typing the command "!null". The default is that null-links are on. If null-links are turned off, then, when the parser is unable to find a complete linkage for a sentence, it will say "No complete linkages found", and prompt for the next sentence.
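The staged search described in this section can be summarized as a loop. In the sketch below, "find_linkages" and "passes_post_processing" are hypothetical callables standing in for the parser's search and for post-processing; the loop only illustrates the order in which stages are tried.

```python
def parse_with_null_links(words, find_linkages, passes_post_processing,
                          allow_nulls=True):
    """Return (stage, linkages) for the first stage with a valid linkage.

    find_linkages(words, n): all linkages using exactly n null links.
    passes_post_processing(linkage): True if the linkage is approved.
    allow_nulls: mirrors the "!null" toggle; False stops after stage 0.
    Returns (None, []) if no stage yields a valid linkage.
    """
    last_stage = len(words) if allow_nulls else 0
    for null_count in range(last_stage + 1):
        candidates = find_linkages(words, null_count)
        # A stage counts only if something survives post-processing.
        valid = [linkage for linkage in candidates
                 if passes_post_processing(linkage)]
        if valid:
            return null_count, valid
    return None, []
```

Because the filter is applied inside the loop, a stage whose linkages all fail post-processing is treated exactly like a stage that found nothing.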

7.2. THE LINK-LENGTH LIMIT. In studying the parser's performance on very long sentences (on which it was often very slow), we discovered that it was often considering extremely long links even for link-types which are generally very short, such as "D" (connecting determiners to nouns). For this reason, we installed a "link-length-limit": links are only allowed to be a certain length, in terms of the number of words from end to end. However, not all link-types are always short; some are sometimes quite long, such as "S" and "MV". Thus we include a list in the dictionary, "UNLIMITED-CONNECTORS", for link-types which are to be exempted from this limit. The limit for short link-types can be adjusted while running the parser using the command "!short=[length]". The default value for this is 6.

If you wish to disable the length limit, so that all link-types may be any length, remove the "UNLIMITED-CONNECTORS" list from the dictionary. If you want all link-types to be _short_, so that none are exempted from the length limit, you must include the "UNLIMITED-CONNECTORS" list in the dictionary, with some bogus connector such as "NO+", but otherwise empty.

The interaction between the length limit and conjunctions is complex. In sentences containing a conjunction, it seemed sensible to relax the length limit. For example, D links are very rarely 7 words long (as in sentence 1 below), but they might be this long in a sentence containing a conjunction (sentence 2).

 +-------------D-------------+
 |                           |
The big big big big big big dog ran     (rejected)
The big big dog and big big cat ran     (accepted)

The logical thing to do seemed to be to apply the length limit only to subsentences; that is, when the length of a link to one element in an "and"-list is calculated, the length of any intervening elements in the "and"-list is subtracted. We call this adjusted length the "effective length" of a link. Therefore sentence 1 above is rejected; sentence 2 is accepted. In null-link stages, however, this is not done; link length is simply equal to actual length. (For complex reasons, linkages with conjunctions which exceed the length limit may occasionally be accepted in null-link stages.)
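The effective-length calculation can be illustrated for the two sentences above. The sketch assumes that length is the difference between word positions and that only the words of intervening "and"-list elements are subtracted; the text does not say whether the conjunction itself is also subtracted. The function names and the (first, last) span encoding are hypothetical.

```python
SHORT_LIMIT = 6  # default value of "!short", as given in the text

def effective_length(left, right, skipped_elements=()):
    """Actual length minus the words of intervening "and"-list elements.

    left, right: word positions of the two ends of the link.
    skipped_elements: (first, last) positions of each intervening element.
    """
    actual = right - left
    skipped = sum(last - first + 1 for first, last in skipped_elements
                  if left < first and last < right)
    return actual - skipped

def within_limit(left, right, skipped_elements=(), limit=SHORT_LIMIT):
    """True if the link's effective length does not exceed the limit."""
    return effective_length(left, right, skipped_elements) <= limit
```

In sentence 1 the D link runs from position 0 to position 7 with nothing to subtract, so its length of 7 exceeds the limit. In sentence 2 the same span skips the element "big big dog" (positions 1 to 3), giving an effective length of 4 under these assumptions.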

7.3. THE POST-PROCESSING LIMIT. Since post-processing proved to be a major source of the slowness of the parser, we installed a "post-processing limit". This is simply a limit on the number of linkages that will be considered by post-processing. If the limit is set at 100 (this is the default), then only 100 linkages will be considered by post-processing, even if many more than that are generated; the others will just be discarded. This means, of course, that the "best" linkage (by the parser's heuristics, for example) may be discarded. However, the linkages to be considered by post-processing are selected randomly from all the generated ones, which means that at least one linkage is likely to be found which is fairly similar to the correct one. The number of linkages considered by post-processing can be controlled with the variable "!limit=[number]".
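A minimal sketch of this selection step, assuming uniform sampling without replacement (the text says only that the selection is random). The function name is hypothetical.

```python
import random

def select_for_post_processing(linkages, limit=100, rng=random):
    """Pick at most `limit` linkages at random; the rest are discarded."""
    linkages = list(linkages)
    if len(linkages) <= limit:
        return linkages  # nothing needs to be discarded
    return rng.sample(linkages, limit)
```

Sampling from the whole set, rather than taking the first 100 generated, is what makes it likely that some linkage close to the correct one survives.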

7.4. THE TIMER. The parser also has a timer. When the timer is set to a positive value, the parser will simply abort parsing of a sentence after that number of seconds has elapsed. (When the timer is set to -1, its default, it does nothing, and the parser will run indefinitely.) Set the timer while running the parser by typing "!timeout = [time value in seconds]".

7.5. PANIC MODE. Combining several of these features, we created a new mode of parsing called "panic mode", in which the parser can parse even very long sentences quickly, but with considerably reduced accuracy. Panic mode has the following features. 1) All link-types are considered "short", even those included in the UNLIMITED-CONNECTORS list. 2) Panic mode begins immediately with null-link parsing, skipping regular parsing (the main consequence of this is that the "effective length" of links is considered to be their actual length; see Section 7.2). 3) Whereas only connectors of cost 0, 1 or 2 are considered in regular parsing, all connectors are considered in panic mode. At the moment, the only connectors in the dictionary with cost greater than 2 (they have cost 3) are connectors on nouns which allow them to make no main connection at all:

dog man house: {@A-} & D- & (S+ or O- or J- or [[[()]]]);

In effect, this allows noun-phrases to be treated as complete entities, not connected to anything else. (Recall that in null-link parsing, the connectivity rule is suspended, so that disconnected "islands" may form.) In a typical "panic mode" linkage, then, various noun-phrases and clauses within the sentence will be fully parsed, but they will often not be connected to each other.

Panic mode is invoked whenever the timer runs out on regular parsing and no valid linkages have been found. For example, if the timer is set for 15 seconds, panic mode will begin after 15 seconds. Panic mode can be disabled (or turned back on again), while running the parser, by typing "!panic".
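The relation between regular parsing, the timer and panic mode can be sketched as follows. Everything here is illustrative: "ParseSettings", "run_parser" and the field names are hypothetical, starting panic mode at one null link is an assumption, and the text does not say whether the timer also applies to the panic-mode pass (the sketch reuses the same timeout).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParseSettings:
    all_links_short: bool      # ignore the UNLIMITED-CONNECTORS exemptions
    start_null_count: int      # 0 = try for a complete linkage first
    max_connector_cost: float  # highest connector cost considered

REGULAR = ParseSettings(all_links_short=False, start_null_count=0,
                        max_connector_cost=2)
PANIC = ParseSettings(all_links_short=True, start_null_count=1,
                      max_connector_cost=float("inf"))

def parse_sentence(words, run_parser, timeout=-1, panic_enabled=True):
    """Try regular parsing; fall back to panic mode only after a timeout.

    run_parser(words, settings, timeout) -> (linkages, timed_out)
    is a hypothetical stand-in for the parser itself.
    """
    linkages, timed_out = run_parser(words, REGULAR, timeout)
    if linkages or not timed_out or not panic_enabled:
        return "regular", linkages
    # Assumption: the panic pass runs under the same timeout.
    panic_linkages, _ = run_parser(words, PANIC, timeout)
    return "panic", panic_linkages
```

Note that panic mode is not entered when regular parsing simply finds nothing before the timer expires; in this sketch it is tied to the timeout, as the text states.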


Davy Temperley
Last modified: Mon Mar 22 09:25:46 EST 1999

