grammar engineering: coordination and macros metarulemacro interfacing finite-state morphology

Grammar Engineering:Grammar Engineering:Coordination and MacrosCoordination and MacrosMETARULEMACROMETARULEMACROInterfacing finite-state morphologyInterfacing finite-state morphology

Miriam Butt Miriam Butt (University of Konstanz)(University of Konstanz)and and Martin Forst (NetBase Solutions)Martin Forst (NetBase Solutions)

Colombo 2014Colombo 2014

CoordinationCoordination Every attribute can only have one value So what do we do with coordinated

constituents?

Example: gorillas sleep and eatVP --> { …

| VP: ! $ ^

CONJ

VP: ! $ ^

}.

Coordination (cont’d)Coordination (cont’d) Coordination can happen basically at any level

of the c-structure.

Example: the gorillas peel and eat the bananasV --> { …

| V: ! $ ^

CONJ

V: ! $ ^

}.

Coordination (cont’d)Coordination (cont’d)

Basically any category can be coordinated.

Example: the gorillas eat the bananas in the cage and in the garden

PP --> { …

| PP: ! $ ^

CONJ

PP: ! $ ^

}.

Coordination (cont’d)Coordination (cont’d)

How can we capture these generalizations?

Via regular-expression macros!

SC-COORD(CAT) = CAT: ! $ ^;

CONJ

CAT: ! $ ^.

PP --> { ...

| @(SC-COORD PP)

}.

Nominal coordinationNominal coordination NP, N, etc. coordination is special because the

NUM attribute should typically have the value pl even when the individual set members are in the sg.

Examples: Mary and the gorilla like bananas.

The boys and girls like bananas.

Nominal coordination (cont’d)Nominal coordination (cont’d)

NP-COORD(CAT) = CAT: ! $ ^;

CONJ: ^ = !

(^ NUM) = pl;

CAT: ! $ ^.

NP --> { ...

| @(NP-COORD NP)

}.

N --> { ...

| @(NP-COORD N)

}.

METARULEMACROMETARULEMACRO Macros are nice But can‘t we do better? After all, it‘s pretty tedious to go into almost all

rules and invoke either the SC-COORD or the NP-COORD macro

XLE has a special macro called the METARULEMACRO

Every rule goes through the METARULEMACRO unless specified otherwise

METARULEMACRO (cont’d)METARULEMACRO (cont’d) Takes three arguments: _CAT, _BASECAT,

and _RHS _CAT is the category on the left-hand side of

the rule _BASECAT is the same as _CAT unless you

are dealing with a complex-category rule _RHS is the right-hand side of the rule

METARULEMACRO (cont’d)METARULEMACRO (cont’d)

METARULEMACRO(_CAT _BASECAT _RHS)=

{ _RHS

| e: _CAT $ { N NP };

@(NP-COORD _CAT)

| e: _CAT ~$ { N NP };

@(SC-COORD _CAT)

}.

Interfacing finite-state transducersInterfacing finite-state transducers Maintaining a full-form lexicon is tedious Many lexicon entries look alike Is there a way to get the information about the

category of a word from somewhere, ideally along with information about morphosyntactic categories such as tense, mood, case, number, person, etc?

Finite-state morphologies!

Interfacing finite-state transducersInterfacing finite-state transducers Cascade of finite-state transducers used is

specified in MORPHOLOGY section At least two subsections:

– TOKENIZE– ANALYZE

By default, the transducers listed are used both for parsing and for generation

This behavior can be altered by prefixing the names of transducer files with P! or G!

TokenizationTokenization So far, only white spaces are considered as

token boundaries However, there are more kinds of token

boundaries in real-word text– Punctuation has to be split off the preceding token– Some white spaces should not be treated as token

boundaries, e.g. “Sri Lanka”– Upper-case letters at sentence beginnings should

optionally be lower-cased

A finite-state tokenizer takes care of these things

Finite-state morphologiesFinite-state morphologies Map surface forms to canonical form (lemma)

and series of morphological tags

Examples:

rode ride +Verb +PastTense +123P

rides ride +Verb +Pres +3sg

ride +Noun +Pl

children child +Noun +Pl

Interfacing Finite-state MorphologyInterfacing Finite-state Morphology Morphological tags need to be listed in the

lexicon– Sublexical lexicon entries look like regular lexicon

entries– Difference: morphcode xle instead of *

Lemmas with non-predictable subcategorization frames must be listed in the lexicon

Other lemmas can be dealt with by the -unknown entry

Lexicon entries for morphology Lexicon entries for morphology outputoutput

+Verb V-POS XLE .

+Pres TNS XLE @VPRES.

+3sg PERS XLE @S-AGR.

wait V-S XLE (^ PRED)= ‘wait<(^SUBJ)(^OBL)>’.

-unknown A-S XLE @(PRED %stem);

N-S XLE @(PRED %stem).

Interfacing Finite-state MorphologyInterfacing Finite-state Morphology

Morphology output needs to be parsed by sublexical rules– Look like regular rules– Have f-annotations like regular rules– Difference: Sublexical categories are marked with

the suffix _BASE

Interfacing Finite-state MorphologyInterfacing Finite-state Morphology

V --> V-S_BASE

V-POS_BASE

{ TNS_BASE

PERS_BASE

| ASP_BASE

}.

XLE Lookup ModelXLE Lookup Model Only one entry per headword per lexicon

section Same headword may be covered by an explicit

entry and by -unknown entry In order to allow this, we need to mark the

explicit entry with ; ETC

sleep V-S @(INTRANS sleep); ETC.

-unknown N-S @(PRED %stem).

grammar engineering: coordination and macros metarulemacro interfacing finite-state morphology

Documents