John Tinsley
Morphological Analysis of Spanish Using Finite-State Transducers
ACL 4
NCLT Seminar Presentation, 7th June 2006
Introduction
What is this project about?
• Provide morphological information on Spanish strings
• Generate strings from morphological descriptions
What were my aims?
• A robust, fast application, easily integrated into other systems
• 80% token coverage on unrestricted text
• 100% coverage of Spanish morphology
Design Methodology
• Formalisation: discovery of Spanish morphological rules
• Implementation: coding of the morphological model with Xerox Finite-State Tools
• Evaluation: check for accuracy and well-formedness; assess language coverage
Formalisation
Spanish Morphology - Verbs
• Inflected for person, tense/mood, number
• Regular verbs: 3 regular conjugations identified by infinitive endings ‘-ar’, ‘-er’, and ‘-ir’
• Irregular verbs: 66 distinct irregularities, with varying degrees of irregularity
Spanish Morphology - Nouns
• Inflected for number, gender
• 7 types of noun: feminine, masculine, neutral, derivative, profession, number invariant, proper
• Irregularities all arise via pluralisation: accentuation, character alterations
Spanish Morphology - Adjectives
• Inflected for number, gender
• 4 types of adjective: neutral, derivative, profession, irregular
• Adverbs derived from adjectives by addition of the suffix ‘mente’
Implementation
Xerox Finite-State Tools - lexc
• Lexicon compiler
• Compiles ‘continuation classes’ into lexical transducers
Xerox Finite-State Tools - xfst
• Xerox finite-state tool
• Compiles regular expressions into networks
• Regular expression replace rules
[ String -> Replacement || left-context _ right-context ]
Xerox Finite-State Tool - example
• conocer - ‘to know’
• 1st person, pres. ind.: ‘conozco’
• Lexical transducer mappings (lexical:surface)
conoc:conoc   er+Verb:ε   +PresInd:^PresInd   +1P+Sg:o
Xerox Finite-State Tool - example cont…
Composed replace rule
[ c -> {zc} || _ ^PresInd ]
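• Triggered by the ^PresInd tag
• Makes the required changes, then the trigger is removed
Lexical: conocer+Verb+PresInd+1P+Sg
Surface (before the replace rule applies): conoc^PresIndo
Surface (after the rule and trigger removal): conozco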
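The conocer example can be traced by hand with a small Python simulation. This is only a sketch of the rule's effect on one string, not xfst itself: it assumes the trigger can be matched as the literal text `^PresInd`, and the function names are made up for illustration.

```python
import re

# Lower side of the lexical transducer for conocer+Verb+PresInd+1P+Sg,
# copied from the slide.
INTERMEDIATE = "conoc^PresIndo"


def apply_c_to_zc(form: str) -> str:
    """Mimic [ c -> {zc} || _ ^PresInd ]: rewrite c only directly before the trigger."""
    return re.sub(r"c(?=\^PresInd)", "zc", form)


def remove_trigger(form: str) -> str:
    """Delete the ^PresInd trigger once the rule has had its chance to fire."""
    return form.replace("^PresInd", "")


def surface(form: str) -> str:
    """Apply the replace rule first, then strip the trigger."""
    return remove_trigger(apply_c_to_zc(form))


if __name__ == "__main__":
    print(apply_c_to_zc(INTERMEDIATE))  # conozc^PresIndo
    print(surface(INTERMEDIATE))        # conozco
```

Only the stem-final c is rewritten, because it is the only c that stands immediately before the trigger.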
Verb Lexicon
• Coded in lexc
• Model has 3 regular paths
• 66 varieties of irregularity
• e.g. poder ‘to be able to’
LEXICON Irreg43
0:^UE^VSoue^PRET1^FR ErV ;

[ o -> {ue} || _ Consonant^<4 [ %^UE ?* [ [%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl] ] ] ]
Noun Lexicon
LEXICON NounFem ! Feminine Nouns
!STEM     !CONT. CLASS     ! GLOSS
acción    fIsNounEs ;      ! action

LEXICON fIsNounEs ! feminine pluralised with 'es'
+Noun:0   fNounPluralES ;

LEXICON fNounPluralES
+Sg+Fem:0           # ;
+Pl+Fem:^NZ^NOes    # ;

[z -> c || _ %^NZ]
[ó -> o || _ ?^<5 %^NO ]
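The effect of the two noun rules can be checked with a short Python simulation. It is a sketch under stated assumptions: the triggers `^NZ` and `^NO` are treated as single symbols (as multicharacter symbols would be in xfst), the helper names are hypothetical, and the stem `luz` is an illustrative example that does not appear on the slides.

```python
import re

TRIGGERS = ("^NZ", "^NO")


def tokenize(form):
    """Split a string into symbols, keeping each trigger as one symbol."""
    return re.findall(r"\^NZ|\^NO|.", form)


def z_to_c(symbols):
    """Mimic [z -> c || _ %^NZ]: z becomes c directly before ^NZ."""
    return [
        "c" if s == "z" and symbols[i + 1:i + 2] == ["^NZ"] else s
        for i, s in enumerate(symbols)
    ]


def drop_accent(symbols):
    """Mimic [ó -> o || _ ?^<5 %^NO]: fewer than five symbols may separate ó from ^NO."""
    out = list(symbols)
    for i, s in enumerate(symbols):
        if s == "ó" and "^NO" in symbols[i + 1:i + 6]:
            out[i] = "o"
    return out


def pluralise(stem):
    """Follow the +Pl+Fem path (adds ^NZ^NOes), apply both rules, strip triggers."""
    symbols = drop_accent(z_to_c(tokenize(stem + "^NZ^NOes")))
    return "".join(s for s in symbols if s not in TRIGGERS)


if __name__ == "__main__":
    print(pluralise("acción"))  # acciones
    print(pluralise("luz"))     # luces
```

Both triggers are attached to every plural on this path; each rule fires only when its own context is present, so a stem with neither z nor ó passes through unchanged apart from the added 'es'.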
Adjective Lexicon
• Same process as noun lexicon
• Uses the same replace rules
• One exception for adverbs
LEXICON nIsAdjS
+Adj:0                  nAdjPluralS ;
+Adj|+Adv:^AAOmente     # ;

[o -> a || _ %^NAO %^AAO {mente}]
Other Transducers
• Overgeneration Filter, e.g. llover ‘to rain’
~[ $[ {llov} ?* [ [%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
• Capitalisation
[ a (->) A || .#. _ ]
• Trigger Remover
[ %^IE -> 0 ]
• Execution script
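The overgeneration filter for llover can be approximated in Python to see which analyses it blocks. This is a sketch that assumes analyses are plain tag strings of the form shown in the conocer example; the regular expression is a hand translation of the xfst filter, not output from the Xerox tools.

```python
import re

# Hand translation of the filter: reject any string containing "llov"
# followed later by a 1st/2nd person tag with either number, or by 3rd person plural.
BLOCKED = re.compile(r"llov.*(?:\+(?:1P|2P)\+(?:Sg|Pl)|\+3P\+Pl)")


def keep(analyses):
    """Return only the analyses that the filter lets through."""
    return [a for a in analyses if not BLOCKED.search(a)]


if __name__ == "__main__":
    candidates = [
        "llover+Verb+PresInd+1P+Sg",
        "llover+Verb+PresInd+3P+Sg",
        "llover+Verb+PresInd+3P+Pl",
        "conocer+Verb+PresInd+1P+Sg",
    ]
    for analysis in keep(candidates):
        print(analysis)
```

Only the third person singular of llover survives, while other verbs are unaffected because the pattern requires the stem llov.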
Evaluation
Testing
• Accuracy: maintaining the integrity of existing rules (projection, subtraction)
• Well-formedness: ensuring tag order
Assessing Coverage
• Aim: 80% on unrestricted text
• Statistical predictions (Crystal 1997)
• Corpus compilation and processing: Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
• Phase 1: augmentation
• Phase 2: 81% coverage
• Final assessment: 84.15% coverage
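Token coverage of the kind reported above can be computed as the share of corpus tokens that receive at least one analysis. The sketch below assumes that definition, which the slides do not spell out; the `analyse` callback and the toy lexicon are hypothetical stand-ins for the real transducer lookup.

```python
def token_coverage(tokens, analyse):
    """Fraction of tokens for which analyse(token) returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return covered / len(tokens)


# Toy stand-in for the analyser: maps a surface form to its list of analyses.
TOY_ANALYSES = {
    "conozco": ["conocer+Verb+PresInd+1P+Sg"],
    "acciones": ["acción+Noun+Pl+Fem"],
    "aborda": ["abordar+Verb+PresInd+3P+Sg"],
}


def toy_analyse(token):
    """Look the token up in the toy table; unknown tokens get no analyses."""
    return TOY_ANALYSES.get(token, [])


if __name__ == "__main__":
    corpus = ["conozco", "acciones", "aborda", "xyz"]
    print(f"{token_coverage(corpus, toy_analyse):.2%}")  # 75.00%
```

Counting tokens rather than distinct word types means frequent words weigh more heavily in the figure.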
Further Details
Class # of forms
Nouns 547
Verbs 304
Adjectives 183
Other 378
• Generates approx. 44,000 unique morphological descriptions
• Evaluation corpus – 1.26 analyses per input token on average
Possible improvements
• Increase coverage (lexicon augmentation)
• Disambiguation using a POS tagger
• More derivational morphology
• Deal with different dialects of Spanish
References
(Beesley & Karttunen 2003) Beesley, K. and Karttunen, L., Finite State Morphology, CSLI Publications, United States, 2003.
(Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición, Editorial Claret, Barcelona, 2005
(Crystal 1997) Crystal, D., The Cambridge Encyclopedia of Language. (2nd. ed.) Cambridge University Press, 1997
Europarl - Europarl Parallel Corpus http://people.csail.mit.edu/koehn/publications/europarl/ - Last Accessed 19/05/2006
(Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.
(Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.
Real Academia Española – http://www.rae.es/ - Last Accessed 25/05/2006
LEXICON ArVerbs
!STEM     !CONT. CLASS     !GLOSS
abord     ArV ;            !to approach

LEXICON ArV
ar+Verb:0     ArConj ;

LEXICON ArConj
!TAGS                  !CONT.CLASS
+PresInd:^PresInd      ArPresInd ;
+PretInd:^PretInd      ArPretInd ;

LEXICON ArPresInd ! Present Indicative
+1P+Sg:o^1PSg      #;
+2P+Sg:as^2PSg     #;
+3P+Sg:a^3PSg      #;
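How continuation classes chain into lexical:surface pairs can be illustrated by walking the lexicons above in Python. This is a sketch of the idea, not of how lexc compiles a transducer: the dictionary layout and function names are invented for illustration, and the ArPretInd branch is left out because its entries are not shown on the slide.

```python
# Each lexicon maps to entries of (lexical side, surface side, continuation class).
# "#" marks the end of a word, as in lexc.
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [("+PresInd", "^PresInd", "ArPresInd")],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
}

TRIGGERS = ("^PresInd", "^1PSg", "^2PSg", "^3PSg")


def paths(lexicon="ArVerbs", lexical="", surface=""):
    """Yield every (lexical, surface) pair reachable from the given lexicon."""
    if lexicon == "#":
        yield lexical, surface
        return
    for lex, surf, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, lexical + lex, surface + surf)


def strip_triggers(form):
    """Remove trigger symbols, standing in for the trigger-remover transducer."""
    for trigger in TRIGGERS:
        form = form.replace(trigger, "")
    return form


if __name__ == "__main__":
    for lexical, surface in paths():
        print(lexical, "->", surface, "->", strip_triggers(surface))
```

For a regular verb such as abordar no replace rule fires, so removing the triggers is enough to reach the final forms abordo, abordas and aborda.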