Morphology and Finite- State Transducers by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin


Morphology and Finite-State Transducers

by Mathias Creutz

31 October 2001

Chapter 3, Jurafsky & Martin


Contents

- Morphology: morphemes, inflection and derivation, allomorphs
- Morphological Parsing: finite-state automata, two-level morphology
- Finite-State Transducers: rules, combination of FSTs, lexicon-free FSTs
- Human Morphological Processing
- Exercise


Morphology

- Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes, e.g. Finnish talo+ssa+ni+kin (”in my house, too”).
- There are two broad classes of morphemes, stems and affixes: the stem is the ”main morpheme” of the word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin.


Affixes

- Affixes add ”additional” meanings.
- Concatenative morphology uses the following types of affixes:
  - prefixes, e.g. epä- in epä+olennainen
  - suffixes, e.g. -ssa in talo+ssa
  - circumfixes, e.g. German ge- -t in ge+sag+t ([have] said)


Non-concatenative Morphology

- In non-concatenative morphology the stem morpheme is split up. The following types of affixes are used:
  - infixes, e.g. Yurok (California): sepolah (field), se+ge+polah (fields)
  - transfixes, e.g. Hebrew: l+a+m+a+d (he studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught)
- The transfix type of non-concatenative morphology is called templatic or root-and-pattern morphology.


Inflection and Derivation

- There are two broad classes of ways to form words from morphemes: inflection and derivation.


Inflection

- Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns: talo (singular), talo+t (plural).
- Inflection is productive: talo, talo+t vs. auto, auto+t vs. metsä, metsä+t.
- The meaning of the resulting word is easily predictable.


Derivation

- Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly, e.g. järki, järje+st+ää, järje+st+ö, järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys.
- Derivation is not always productive: järki, järje+st+ää vs. metsä, metsä+st+ää vs. talo, talo+st+aa?


Allomorphs

- A group of allomorphs makes up one morpheme class.
- An allomorph is a special variant of a morpheme.
  - e.g. Finnish illative ending: +<vowel_lengthening>n, +h<vowel>n, +seen, +siin, as in talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin
  - e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en


Why Allomorphs?

- Phonological constraints, e.g. vowel harmony: talo+ssa vs. metsä+ssä
- Morphological paradigms, e.g. käsi, käde+n vs. kasi, kasi+n; Swedish leta, leta+de vs. heta, het+te
- Irregularities, e.g. cat, cat+s vs. goose, geese
- Orthographic constraints, i.e. spelling rules, e.g. cat, cat+s vs. city, citi+es
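The vowel-harmony case can be illustrated with a few lines of Python. This is only a rough sketch under a simplifying assumption: a stem containing any of the back vowels a, o, u takes the back variant of the suffix, and every other stem takes the front variant. Compounds and loanwords are ignored, and the function name `harmonize` is made up for this illustration.

```python
BACK_VOWELS = set("aou")
# Maps the back vowels of a suffix onto their front counterparts.
FRONTING = str.maketrans("aou", "äöy")


def harmonize(stem: str, suffix: str) -> str:
    """Attach a suffix given in its back-vowel form, choosing the allomorph by vowel harmony."""
    if any(ch in BACK_VOWELS for ch in stem.lower()):
        return stem + suffix                  # back-vowel stem: keep -ssa
    return stem + suffix.translate(FRONTING)  # otherwise use the front variant -ssä


print(harmonize("talo", "ssa"))   # talossa
print(harmonize("metsä", "ssa"))  # metsässä
```

The two allomorphs -ssa and -ssä are stored as one morpheme here, and the surface variant is chosen by a rule, which is the idea the later slides develop with transducers.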


Morphological Parsing

- Parsing means taking an input and producing some sort of structure for it.
- Morphological parsing means breaking down a word form into its constituent morphemes, e.g. talossa → talo +ssa.
- Mapping a word form to its base form is called stemming, e.g. talossa → talo.


Finite-State Morphological Parsing

In order to build a parser we need the following:
- a lexicon containing the stems and affixes,
- morphotactics, i.e. the model of morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa,
- a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s → cities.
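A toy segmenter shows how the lexicon and the morphotactics work together. It is a sketch under stated assumptions, not a real Finnish analyzer: the lexicon holds only two stems and three endings, the morphotactics are reduced to a fixed slot order (case ending before possessive suffix), and no spelling or vowel-harmony rules are applied. All names (`STEMS`, `SLOTS`, `parse`) are invented for the example.

```python
# Lexicon: stems and affixes (a tiny, hand-picked sample).
STEMS = {"talo", "metsä"}

# Morphotactics: suffix slots in the only order that is allowed.
SLOTS = [
    ("ssa", "ssä"),  # slot 1: case ending (inessive only)
    ("ni",),         # slot 2: possessive suffix, 1st person singular
]


def parse(word):
    """Return the list of morphemes, or None if the word cannot be segmented."""
    for stem in STEMS:
        if not word.startswith(stem):
            continue
        rest = word[len(stem):]
        morphemes = [stem]
        for slot in SLOTS:                 # slots are tried strictly in order
            for ending in slot:
                if rest.startswith(ending):
                    morphemes.append(ending)
                    rest = rest[len(ending):]
                    break
        if rest == "":                     # everything was consumed: success
            return morphemes
    return None


print(parse("talossani"))  # ['talo', 'ssa', 'ni']
print(parse("talonissa"))  # None: possessive before case violates the ordering
```

Because the slots are visited in a fixed order, talo+ni+ssa is rejected even though both endings are in the lexicon.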


Finite-State Automaton for Inflection of English Verbs

[Figure: finite-state automaton with states q0, q1, q2 and q3. Arc labels: reg-verb-stem, irreg-verb-stem, irreg-past-verb-form, preterite (-ed), past-participle (-ed), 3-singular (-s), progressive (-ing)]
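An automaton over morpheme classes like this one can be written as a transition table. The sketch below is an assumption about how the arcs are laid out, following the usual textbook arrangement: regular stems lead to a state from which all four endings are possible, irregular stems only allow -s and -ing, and irregular past forms are accepted as they are. The exact arcs of the original figure are not recoverable from this transcript.

```python
# (state, morpheme class) -> next state; the arc layout is assumed, see above.
TRANSITIONS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-verb-stem"): "q2",
    ("q0", "irreg-past-verb-form"): "q3",
    ("q1", "preterite"): "q3",
    ("q1", "past-participle"): "q3",
    ("q1", "progressive"): "q3",
    ("q1", "3-singular"): "q3",
    ("q2", "progressive"): "q3",
    ("q2", "3-singular"): "q3",
}
ACCEPTING = {"q1", "q2", "q3"}  # a bare stem is a complete word form too


def accepts(classes):
    """Check whether a sequence of morpheme classes is a legal verb form."""
    state = "q0"
    for morpheme_class in classes:
        state = TRANSITIONS.get((state, morpheme_class))
        if state is None:          # no arc for this class: reject
            return False
    return state in ACCEPTING


print(accepts(["reg-verb-stem", "preterite"]))    # True  (talk + ed)
print(accepts(["irreg-verb-stem", "preterite"]))  # False (*sing + ed)
print(accepts(["irreg-past-verb-form"]))          # True  (sang)
```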


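Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’

[Figure: letter-level finite-state automaton with states q0, q1, q2 and q3. Arcs are labeled with single letters that spell out the stems talk, test and sing (with the vowel variants a and u) and the endings -s, -ed and -ing]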


Two-Level Morphology

Two-level morphology represents a word as a correspondence between a lexical level, which represents a simple concatenation of the morphemes making up a word, and the surface level, which represents the actual spelling of the final word.

Lexical:  s i n g +V +PROG
Surface:  s i n g i n g


Finite-State Transducer

- A transducer maps between one set of symbols and another; a finite-state transducer does this via a finite automaton.
- Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. Σ = {a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. Σ = {a:a, b:b, a:c, a:ε, ε:ε, ...}.
- In two-level morphology, we call pairs like a:a default pairs, and refer to them by a single symbol a.
- An FST can be seen as a recognizer, generator, translator or a set relator.


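Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’

[Figure: finite-state transducer from q0 to q3. Stem letters are default pairs, except i:a and i:u in the irregular forms of sing. Tag arcs: +V:ε, +3SG:s, +PRET:e followed by ε:d, +PSTPCP:e followed by ε:d, +PROG:i followed by ε:n and ε:g, and +PRET:ε and +PSTPCP:ε after the irregular stems]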


Examples

Lexical form        Surface form
talk +V             talk
sing +V +3SG        sings
test +V +PROG       testing
talk +V +PRET       talked
sing +V +PRET       sang
talk +V +PSTPCP     talked
sing +V +PSTPCP     sung
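The lexical/surface pairs in these examples can be reproduced with a small transducer written as a list of arcs. This is a minimal sketch, not the transducer of the slides: state names, the helper `add_path` and the choice to emit whole endings such as "ing" on a single arc are assumptions made to keep the code short. An empty string plays the role of ε.

```python
import itertools
from collections import defaultdict

ARCS = defaultdict(list)      # state -> [(lexical symbol, surface string, next state)]
FINAL = {"reg", "irreg", "end"}
_fresh = itertools.count()    # source of names for intermediate states


def add_path(start, pairs, end):
    """Add a chain of arcs from start to end, one arc per lexical:surface pair."""
    state = start
    for i, (lex, surf) in enumerate(pairs):
        nxt = end if i == len(pairs) - 1 else f"s{next(_fresh)}"
        ARCS[state].append((lex, surf, nxt))
        state = nxt


def default_pairs(word):
    return [(ch, ch) for ch in word]


for stem in ("talk", "test"):
    add_path("q0", default_pairs(stem) + [("+V", "")], "reg")
add_path("q0", default_pairs("sing") + [("+V", "")], "irreg")
add_path("q0", [("s", "s"), ("i", "a"), ("n", "n"), ("g", "g"), ("+V", ""), ("+PRET", "")], "end")
add_path("q0", [("s", "s"), ("i", "u"), ("n", "n"), ("g", "g"), ("+V", ""), ("+PSTPCP", "")], "end")
for tag, ending in (("+3SG", "s"), ("+PROG", "ing"), ("+PRET", "ed"), ("+PSTPCP", "ed")):
    ARCS["reg"].append((tag, ending, "end"))
for tag, ending in (("+3SG", "s"), ("+PROG", "ing")):
    ARCS["irreg"].append((tag, ending, "end"))


def generate(lexical, state="q0"):
    """Lexical symbols -> all surface strings (the FST used as a generator)."""
    if not lexical:
        return [""] if state in FINAL else []
    return [surf + rest
            for lex, surf, nxt in ARCS[state] if lex == lexical[0]
            for rest in generate(lexical[1:], nxt)]


def analyze(surface, state="q0"):
    """Surface string -> all lexical analyses (the same arcs read in the other direction)."""
    results = [[]] if surface == "" and state in FINAL else []
    for lex, surf, nxt in ARCS[state]:
        if surface.startswith(surf):
            results += [[lex] + rest for rest in analyze(surface[len(surf):], nxt)]
    return results


print(generate(list("sing") + ["+V", "+PRET"]))        # ['sang']
print([" ".join(a) for a in analyze("talked")])
```

Running `analyze("talked")` yields two analyses, one with +PRET and one with +PSTPCP, which mirrors the two table rows that share the surface form talked.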


Useful FST Operations

- Inversion: Switch input and output labels, e.g. Σ(T) = {a:b, c:d} → Σ(inv(T)) = {b:a, d:c}.
- Intersection: Only sequences of pairs accepted by both transducer T1 and transducer T2 are accepted by transducer T1^T2.
- Composition: The output of transducer T1 serves as input to T2. This is marked as T1∘T2 or T2(T1).
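One way to get a feeling for these three operations is to ignore states altogether and treat a transducer as the set of input:output pairs it relates. The sketch below does exactly that; it is a simplification for illustration (real transducers usually relate infinitely many strings), and the function names are chosen here, not taken from any library.

```python
def invert(t):
    """Switch input and output of every pair."""
    return {(b, a) for a, b in t}


def intersect(t1, t2):
    """Keep only the pairs that both relations accept."""
    return t1 & t2


def compose(t1, t2):
    """Feed the output of t1 into t2: a:b in t1 and b:c in t2 give a:c."""
    return {(a, c) for a, b in t1 for b2, c in t2 if b == b2}


T = {("a", "b"), ("c", "d")}
print(sorted(invert(T)))                                   # [('b', 'a'), ('d', 'c')]
print(sorted(intersect(T, {("a", "b"), ("a", "c")})))      # [('a', 'b')]
print(sorted(compose(T, {("b", "x"), ("d", "y")})))        # [('a', 'x'), ('c', 'y')]
```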


Spelling Rules and FSTs

Name                 Description of rule                                  Example
Consonant doubling   1-letter consonant doubled before -ing/-ed           beg/begging
E deletion           Silent e dropped before -ing and -ed                 make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s         watch/watches
Y replacement        -y changes to -ie before -s, and to -i before -ed    try/tries
K insertion          verbs ending with vowel + -c add -k                  panic/panicked


Three levels

Add an intermediate level between the lexical and surface levels.

Lexical:       k i s s +V +3SG
Intermediate:  k i s s ^ s #
Surface:       k i s s e s


FST for the E-insertion Rule

Rule: ε → e / {x, s, z} ^ __ s #

[Figure: finite-state transducer with states q0 to q5. Arc labels: ^:ε, ε:e, the symbol classes z, s, x, the single symbol s, the word boundary #, and other]
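The effect of the E-insertion rule on an intermediate-level string can be imitated with a regular-expression rewrite. This is a stand-in for the state machine in the figure, not a transcription of it: it assumes the intermediate form marks morpheme boundaries with ^ and the word end with #, and it covers only the contexts x, s and z. The function name is invented.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'kiss^s#' to its surface spelling."""
    # Insert e after a morpheme boundary that follows x, s or z and precedes a final s.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    # Boundary symbols are not part of the surface string.
    return with_e.replace("^", "").replace("#", "")


print(e_insertion("kiss^s#"))  # kisses
print(e_insertion("fox^s#"))   # foxes
print(e_insertion("talk^s#"))  # talks
```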


Combination of FSTs (1)

Lexical:       k i s s +V +3SG
                 (Lexicon-FST)
Intermediate:  k i s s ^ s #
                 (Rule1-FST ... RuleN-FST, run in parallel)
Surface:       k i s s e s


Combination of FSTs (2)

Lexical:       k i s s +V +3SG
                 (Lexicon-FST)
Intermediate:  k i s s ^ s #
                 (Rule1-FST ... RuleN-FST, intersected into one rule FST)
Surface:       k i s s e s


Combination of FSTs (3)

Lexical:       k i s s +V +3SG
                 (Lexicon-FST composed with the intersection of Rule1-FST ... RuleN-FST)
Surface:       k i s s e s

After composition the intermediate level (k i s s ^ s #) is no longer a separate step.


Intersection and Composition

- For each state qi in transducer T1 and state qj in transducer T2, create a new state qij.
- Intersection: For any pair a:b, if T1 transitions from qi to qn with a:b, and T2 transitions from qj to qm with the same pair a:b, then T1^T2 transitions from qij to qnm with a:b.
- Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1∘T2 transitions from qij to qnm with the pair a:c.
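The state-pair construction can be written down almost literally. The sketch below assumes a transducer is given as a set of arcs `(source, input, output, target)` and that there are no ε symbols, which need extra care in a full implementation. Start and final states are left out; the names `intersect_arcs` and `compose_arcs` are made up for this example.

```python
def intersect_arcs(arcs1, arcs2):
    """Arc qij --a:b--> qnm exists if both machines have an a:b arc from qi / qj."""
    return {
        ((qi, qj), a, b, (qn, qm))
        for (qi, a, b, qn) in arcs1
        for (qj, a2, b2, qm) in arcs2
        if (a, b) == (a2, b2)
    }


def compose_arcs(arcs1, arcs2):
    """Arc qij --a:c--> qnm exists if T1 has a:b from qi and T2 has b:c from qj."""
    return {
        ((qi, qj), a, c, (qn, qm))
        for (qi, a, b, qn) in arcs1
        for (qj, b2, c, qm) in arcs2
        if b == b2
    }


T1 = {(0, "a", "b", 1)}
T2 = {(0, "b", "c", 1)}
print(compose_arcs(T1, T2))    # {((0, 0), 'a', 'c', (1, 1))}
print(intersect_arcs(T1, T2))  # set(): the two machines share no pair
```

Pair states that are never reached from the pair of start states can be pruned afterwards; the sketch keeps them for simplicity.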


Lexicon-Free FSTs

- Used in information retrieval
- E.g. the Porter algorithm, which is based on a series of simple cascaded rewrite rules:
  - ATIONAL → ATE (relational → relate)
  - ING → ε if stem contains vowel (motoring → motor)
- Errors occur: organization → organ, doing → doe, university → universe
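The two rewrite rules quoted above are easy to try out in isolation. The fragment below is not the Porter algorithm, only these two rules applied without a lexicon; the real algorithm has many more rules and additional conditions on the stem. The function name is hypothetical.

```python
import re


def porter_fragment(word: str) -> str:
    """Apply two Porter-style suffix rules; return the word unchanged if neither fires."""
    if word.endswith("ational"):                 # ATIONAL -> ATE
        return word[: -len("ational")] + "ate"
    if word.endswith("ing"):                     # ING -> (nothing), if the stem has a vowel
        stem = word[:-3]
        if re.search(r"[aeiou]", stem):
            return stem
    return word


print(porter_fragment("relational"))  # relate
print(porter_fragment("motoring"))    # motor
print(porter_fragment("sing"))        # sing (the stem "s" contains no vowel)
```

The vowel condition is what keeps sing from being cut down to s, which shows why even lexicon-free rules need some context checks.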


Human Morphological Processing (1)

- How are multi-morphemic words represented in the minds of human speakers? Full-listing hypothesis vs. minimum redundancy hypothesis.
- Experiments:
  - Stanners et al. 1979: a word is recognized faster if it has been seen before (priming). lifting primes lift and burned primes burn, but selective does not prime select, i.e. there are different representations for inflection and derivation.
  - Marslen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government primes govern, but department does not prime depart.


Human Morphological Processing (2)

- Speech errors: Speakers mix up the order of words...
  - e.g. if you break it, it’ll drop
- ... and also attach affixes to the wrong stems:
  - e.g. it’s not only we who have screw looses (for ”screws loose”)
  - e.g. easy enoughly (for ”easily enough”)


Exercise (1/3)

Your task is to create a finite-state transducer that can analyze the following Finnish word forms:

Surface form    Lexical form
talo            talo +NOM
taloon          talo +ILL
talomme         talo +NOM +POS1PL
taloomme        talo +ILL +POS1PL
metsä           metsä +NOM
metsään         metsä +ILL
metsämme        metsä +NOM +POS1PL
metsäämme       metsä +ILL +POS1PL


Exercise (2/3)

The morphological tags have the following meanings: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1st person plural.

Take a look at Figs. 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create three separate finite-state transducers that you finally combine into one:

a) Create a transducer that operates between the intermediate and surface level. This transducer handles the vowel lengthening that is necessary for the illative form: talo +ILL → talo|on vs. metsä +ILL → metsä|än.


Exercise (3/3)

b) Create a transducer that operates between the intermediate and surface level. This transducer handles the deletion of n in front of a possessive ending: talo + mme → talo|mme vs. talo|on + mme → talo|o|mme.

c) Create a transducer that operates between the lexical and the intermediate level. This transducer maps morphological tags onto endings.

d) Combine all the transducers into one.

Present your transducers as graphs or tables (cf. Fig. 3.15 in Jurafsky & Martin).