morphology and finite-state transducers by mathias creutz 31 october 2001 chapter 3, jurafsky &...
TRANSCRIPT
Morphology and Finite-State Transducers
by Mathias Creutz31 October 2001
Chapter 3, Jurafsky & Martin
Contents Morphology
morphemes, inflection and derivation, allomporphs
Morphological Parsing finite-state automata, two-level morphology
Finite-State Transducers rules, combination of FSTs, lexicon-free FSTs
Human Morphological Processing Exercise
Morphology Morphology is the study of the way
words are built up from smaller meaning-bearing units, morphemes. e.g. talo + ssa + ni + kin
Two broad classes of morphemes, stems and affixes: the stem is the ”main morpheme” of the
word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin
Affixes Affixes add ”additional” meanings. Concatenative morphology uses
the following types of affixes: prefixes, e.g. epä- in
epä+olennainen suffixes, e.g. –ssa in talo+ssa circumfixes, e.g. German ge- -t in
ge+sag+t ([have] said)
Non-concatenative Morphology In non-concatenative morphology the
stem morpheme is split up. The following types of affixes are used: infixes, e.g. Californian Jurok, sepolah (field),
se+ge+polah (fields) transfixes, e.g. Hebrew, l+a+m+a+d (he
studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught)
This type of non-concatenative morphology is called templatic or root-and-pattern morphology.
Inflection and Derivation There are two broad classes of
ways to form words from morphemes: inflection and derivation.
Inflection Inflection is the combination of a word stem
with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns. talo (singular), talo+t (plural)
Inflection is productive. talo, talo+t vs. auto, auto+t vs. metsä, metsä+t
The meaning of the resulting word is easily predictable.
Derivation Derivation is the combination of a word
stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. e.g. järki, järje+st+ää, järje+st+ö,
järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys
Not always productive. järki, järje+st+ää vs. metsä, metsä+st+ää vs.
talo, talo+st+aa?
Allomorphs A group of allomorphs make up
one morpheme class. An allomorph is a special variant of a morpheme. e.g. Finnish illative ending:
+<vowel_lengthening>n, +h<vowel>n, +seen, +siin talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin
e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en
Why Allomorphs? Phonological constraints
e.g. vowel harmony, talo+ssa vs. metsä+ssä Morphological paradigms
e.g. käsi, käde+n vs. kasi, kasi+n, Swedish leta, leta+de vs. heta, het+te
Irregularities e.g. cat, cat+s vs. goose, geese
Orthographic constraints, i.e. spelling rules e.g. cat, cat+s vs. city, citi+es
Morphological Parsing Parsing means taking an input and
producing some sort of structure for it. Morphological parsing means
breaking down a word form into its constituent morphemes. e.g. talossa talo +ssa
Mapping of a word form to its baseform is called stemming. e.g. talossa talo
Finite-State Morphological Parsing In order to build a parser we need the
following: a lexicon containing the stems and affixes, morphotactics, i.e. the model of
morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa,
a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s cities.
Finite-State Automaton for Inflection of English Verbs
q0
q1
q2
q3
irreg-past-verb-form
reg-verb-stem
reg-verb-stem
irreg-verb-stem
preterite (-ed)
past-participle (-ed)
3-singular (-s)
progressive (-ing)
Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’
q0
q1
q2
q3
s
g
ed
de
gn
i
s
i
n g
s
u
a n
klat
ta
lk
e st
es t
Two-Level Morphology Two-level morphology represents a word as a
correspondence between a lexical level, which represents a simple concatenation of morphemes making up a word, and the surface level, which represents the actual spelling of the final word.
s ni g +PROG+V
s ni g gni
Lexical
Surface
Finite-State Transducer A transducer maps between one set of symbols
and another; a finite state transducer does this via a finite automaton.
Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. ={a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. ={a:a, b:b, a:c, a:, :, ...}
In two-level morphology, we call pairs like a:a default pairs, and refer to them by a single symbol a.
An FST can be seen as a recognizer, generator, translator or a set relator.
Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’
q0 q3
+3SG:s
g
+PRET:e:d
+PSTPCP:e
+PROG:is
i
n g
s
i:u
i:a n
klat
ta
lk
e st
es t
n g
+V:
+V:
+V:
+V:
+PRET:
+PSTPCP:
:d
:g:n
Examples
Lexical form Surface form
talk +V talk
sing +V +3SG sings
test +V +PROG testing
talk +V +PRET talked
sing +V +PRET sang
talk +V +PSTPCP
talked
sing +V +PSTPCP
sung
Useful FST Operations Inversion: Switch input and output
labels. e.g. (T)={a:b, c:d} (inv(T))={b:a, d:c}
Intersection: Only sequences of pairs accepted by both transducerT1 and transducerT2 are accepted by transducer T1^T2.
Composition: The output of transducer T1 serves as input to T2. This is marked as T1ºT2 or T2(T1).
Spelling Rules and FSTs
Name Description of Rule ExampleConsonant doubling
1-letter consonant doubled before -ing/-ed
beg/begging
E deletion Silent e dropped before-ing and –ed
make/making
E insertion e added after –s, -z, -x, -ch, -sh before -s
watch/watches
Y replacement -y changes to –ie before -s, and to -i before -ed
try/tries
K insertion verbs ending with vowel + -c add -k
panic/panicked
Three levels Add an intermediate level between
the lexical and surface levels
ik s sesSurface
ik s #sIntermediate ^s
ik s +3SGLexical +Vs
FST for the E-insertion Rule
q0 q3 q4
q5
q1 q2
^:
:e
^:
^:
z, s, xz, s, x
z, s, x
s
#other
z, x#, other
#, other
#
other
s
#__^/ s
z
s
x
e
Combination of FSTs (1)
ik s sesSurface
ik s #sIntermediate ^s
ik s +3SGLexical +Vs
Lexicon-FST
Rule1-FST
RuleN-FST
...
Combination of FSTs (2)
ik s sesSurface
ik s #sIntermediate ^s
ik s +3SGLexical +Vs
Lexicon-FST
Rule1-FST
RuleN-FST
...Intersect
Combination of FSTs (3)
ik s sesSurface
ik s #s^s
ik s +3SGLexical +Vs
Lexicon-FST
Rule1-FST
RuleN-FST
...Intersect
Compose
Intermediate
Intersection and Composition For each state qi in transducer T1 and
state qj in transducer T2, create a new state qij. Intersection: For any pair a:b, if T1
transitions from qi to qn, and T2 transitions from qj to qm, T1^T2 transitions from qij to qnm.
Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1ºT2 transitions from qij to qnm with the pair a:c.
Lexicon-Free FSTs Used in information-retrieval E.g. the Porter algorithm, which is based
on a series of simple cascaded rewrite rules: ATIONAL ATE (relational relate) ING if stem contains vowel (motoring
motor) Errors occur:
organization organ, doing doe, university universe
Human Morphological Processing (1) How are multi-morphemic words represented in
the minds of human speakers? full-listing hypothesis vs. minimum
redundancy hypothesis Experiments:
Stanners et al. 1979: a word is recognized faster if it has been seen before (priming): lifting lift, burned burn, selective / select, i.e. different representations for inflection and derivation.
Marsen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government govern, department / depart
Human Morphological Processing (2)
Speech errors: Speakers mix up the order of words... e.g. if you break it, it’ll drop
... and also attach affixes to the wrong stems: e.g. it’s not only we who have screw
looses (for ”screws loose”) e.g. easy enoughly (for ”easily enough”)
Excercise (1/3) Your task is to create a finite-state transducer that can
analyze the following Finnish word forms:
Surface form Lexical form
talo talo +NOM
taloon talo +ILL
talomme talo +NOM +POS1PL
taloomme talo +ILL +POS1PL
metsä metsä +NOM
metsään metsä +ILL
metsämme metsä +NOM +POS1PL
metsäämme metsä +ILL +POS1PL
Exercise (2/3) The morphological tags have the following
meaning: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1st person plural.
Take a look at Fig 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create three separate finite-state transducers that you finally combine into one:
a) Create a transducer that operates between the intermediate and surface level. This transducer handles the vowel lengthening that is necessary for the illative form: talo +ILL talo|on vs. metsä +ILL metsä|än.
Excercise (3/3) b) Create a transducer that operates between the
intermediate and surface level. This transducer handles the deletion of n in front of a possessive ending: talo + mme talo|mme vs. talo|on + mme talo|o|mme.
c) Create a transducer that operates between the lexical and the intermediate level. This transducer maps morphological tags onto endings.
d) Combine all the transducers into one. Present your transducers as graphs or tables (cf.
Fig. 3.15 in Jurafsky & Martin)