John Tinsley: Morphological Analysis of Spanish Using Finite-State Transducers. ACL 4, NCLT Seminar Presentation, 7th June 2006



John Tinsley

Morphological Analysis of Spanish Using Finite-State Transducers

ACL 4

NCLT Seminar Presentation, 7th June 2006


Introduction

What is this project about?
• Provide morphological information on Spanish strings
• Generate strings from morphological descriptions

What were my aims?
• A robust, fast application that is easily integrated into other systems
• 80% token coverage on unrestricted text
• 100% coverage of Spanish morphology


Design Methodology

Formalisation
• Discovery of Spanish morphological rules

Implementation
• Coding of the morphological model with the Xerox Finite-State Tools

Evaluation
• Check for accuracy & well-formedness
• Assess language coverage


Formalisation


Spanish Morphology - Verbs

• Inflected for person, tense/mood, number
• Regular verbs
    • 3 regular conjugations identified by infinitive endings: ‘-ar’, ‘-er’, and ‘-ir’
• Irregular verbs
    • 66 distinct irregularities
    • Varying degrees of irregularity


Spanish Morphology - Nouns

• Inflected for number, gender
• 7 types of noun
    • Feminine, masculine, neutral, derivative, profession, number invariant, proper
• Irregularities
    • All arise via pluralisation
    • Accentuation, character alterations


Spanish Morphology - Adjectives

• Inflected for number, gender
• 4 types of adjective
    • Neutral, derivative, profession, irregular
• Adverbs derived from adjectives by addition of the suffix ‘mente’


Implementation


Xerox Finite-State Tools - lexc

• Lexicon compiler
• Compiles ‘continuation classes’ into lexical transducers


Xerox Finite-State Tools - xfst

• Xerox finite-state tool
• Compiles regular expressions into networks
• Regular expression replace rules

[ String -> Replacement || left-context _ right-context ]
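
As a rough illustration of what a context-restricted replace rule does, the following Python sketch emulates the pattern with the standard `re` module. It is not xfst: `replace_rule` is a hypothetical helper, the contexts are literal strings rather than regular expressions, and the result is a plain string function rather than a compiled transducer.

```python
import re


def replace_rule(target, replacement, left="", right=""):
    """Emulate [ target -> replacement || left _ right ] on plain strings.

    Hypothetical helper: all four arguments are literal strings, and an
    empty context means "no restriction on that side".
    """
    pattern = re.compile(
        (f"(?<={re.escape(left)})" if left else "")    # left context (lookbehind)
        + re.escape(target)                            # the string to rewrite
        + (f"(?={re.escape(right)})" if right else "") # right context (lookahead)
    )
    # A function replacement avoids backslash handling in the replacement text.
    return lambda text: pattern.sub(lambda _match: replacement, text)


# Rewrite "a" as "b" only between "x" and "y".
rule = replace_rule("a", "b", left="x", right="y")
print(rule("xay ay xa"))
```

Only the first "a" is rewritten here, because it is the only one with "x" on its left and "y" on its right.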


Xerox Finite-State Tool - example

• conocer - ‘to know’
• 1st person, pres. ind.: ‘conozco’
• Lexical transducer mappings

conoc:conoc    er+Verb:ε    +PresInd:^PresInd    +1P+Sg:o


Xerox Finite-State Tool - example cont…

Composed replace rule

[ c -> {zc} || _ %^PresInd ]

• Triggered by the ^PresInd tag
• Makes the required changes, then the trigger is removed

Lexical: conocer+Verb+PresInd+1P+Sg
Surface (before the rule applies): conoc^PresIndo
Surface (after the rule and trigger removal): conozco
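
The derivation of ‘conozco’ can be traced step by step. The sketch below is a Python emulation using `re`, not the composed xfst network. The function names are hypothetical, and the intermediate string is taken from the lexical transducer mappings given for conocer.

```python
import re

LEXICAL = "conocer+Verb+PresInd+1P+Sg"
# Lower side of the lexical transducer, before any replace rule applies.
INTERMEDIATE = "conoc^PresIndo"


def c_to_zc(form):
    """Emulate [ c -> {zc} || _ ^PresInd ]: only a c directly before the trigger changes."""
    return re.sub(r"c(?=\^PresInd)", "zc", form)


def remove_trigger(form):
    """Delete the ^PresInd trigger once the rule has fired."""
    return form.replace("^PresInd", "")


after_rule = c_to_zc(INTERMEDIATE)
surface = remove_trigger(after_rule)
print(LEXICAL, "->", INTERMEDIATE, "->", after_rule, "->", surface)
```

The first c of the stem is untouched because it is not adjacent to the trigger, so only the stem-final c becomes zc.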


Verb Lexicon

• Coded in lexc
• Model has 3 regular paths
• 66 varieties of irregularity
    • e.g. poder ‘to be able to’

LEXICON Irreg43
0:^UE^VSoue^PRET1^FR    ErV ;

[ o -> {ue} || _ Consonant^<4 [ %^UE ?* [ [ %^PresInd | %^PresSubj ] ?* [ %^1PSg | %^2PSg | %^3PSg | %^3PPl ] ] ] ]


Noun Lexicon

LEXICON NounFem          ! Feminine Nouns
!STEM     !CONT. CLASS   ! GLOSS
acción    fIsNounEs ;    ! action

LEXICON fIsNounEs        ! feminine pluralised with 'es'
+Noun:0   fNounPluralES ;

LEXICON fNounPluralES
+Sg+Fem:0           # ;
+Pl+Fem:^NZ^NOes    # ;

[z -> c || _ %^NZ]

[ó -> o || _ ?^<5 %^NO ]
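
To see how the two plural rules interact with the triggers, here is a small Python emulation (not xfst). It assumes that `^NZ` and `^NO` each count as one symbol, as multicharacter symbols do in xfst, and that the accented vowel is a single precomposed character. The helper names are hypothetical, and ‘luz’ is an illustrative stem that does not appear in the lexicon shown.

```python
import re

# Triggers are matched first so that each one is a single symbol.
TOKEN = re.compile(r"\^NZ|\^NO|.", re.DOTALL)
TRIGGERS = ("^NZ", "^NO")


def tokenise(form):
    """Split a form into symbols, keeping each trigger whole."""
    return TOKEN.findall(form)


def z_to_c(symbols):
    """Emulate [z -> c || _ %^NZ]: z changes only directly before ^NZ."""
    return ["c" if s == "z" and symbols[i + 1:i + 2] == ["^NZ"] else s
            for i, s in enumerate(symbols)]


def drop_accent(symbols):
    """Emulate [ó -> o || _ ?^<5 %^NO]: at most four symbols between ó and ^NO."""
    return ["o" if s == "ó" and "^NO" in symbols[i + 1:i + 6] else s
            for i, s in enumerate(symbols)]


def pluralise(stem):
    """Apply the lower side of +Pl+Fem:^NZ^NOes, both rules, then strip the triggers."""
    symbols = tokenise(stem + "^NZ^NOes")
    symbols = drop_accent(z_to_c(symbols))
    return "".join(s for s in symbols if s not in TRIGGERS)


print(pluralise("acción"), pluralise("luz"))
```

Each rule fires only when its own trigger is in range, so ‘acción’ loses its accent without any z/c change, while ‘luz’ changes z to c and has no accent to drop.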


Adjective Lexicon

• Same process as noun lexicon
• Uses the same replace rules
• One exception for adverbs

LEXICON nIsAdjS
+Adj:0                 nAdjPluralS ;
+Adj|+Adv:^AAOmente    # ;

[o -> a || _ %^NAO %^AAO {mente}]


Other Transducers

• Overgeneration filter, e.g. llover ‘to rain’

~[ $[ {llov} ?* [ [%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]

• Capitalisation

[ a (->) A || .#. _ ]

• Trigger remover

[ %^IE -> 0 ]

• Execution script
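
The three auxiliary transducers can be approximated in Python to show their intended effect. This is a sketch using `re` and string functions, not the xfst networks themselves. The function names and the sample analyses are hypothetical, with tags written in the +Verb+PresInd+1P+Sg order used earlier.

```python
import re

# Overgeneration filter: reject any analysis containing "llov" followed later by
# a 1st/2nd person tag plus a number tag, or by 3rd person plural.
BLOCKED = re.compile(r"llov.*(?:\+[12]P\+(?:Sg|Pl)|\+3P\+Pl)")


def passes_filter(analysis):
    """True if the analysis survives the llover overgeneration filter."""
    return BLOCKED.search(analysis) is None


def capitalisation_variants(form):
    """Emulate [ a (->) A || .#. _ ]: the rule is optional, so both spellings remain."""
    if form.startswith("a"):
        return {form, "A" + form[1:]}
    return {form}


def remove_trigger(form, trigger="^IE"):
    """Emulate [ %^IE -> 0 ]: delete the trigger symbol wherever it occurs."""
    return form.replace(trigger, "")


for analysis in ("llover+Verb+PresInd+3P+Sg", "llover+Verb+PresInd+1P+Sg"):
    print(analysis, passes_filter(analysis))
```

Because the capitalisation rule is optional, the analyser accepts a word both in lower case and capitalised at the start of a string, which is why the sketch returns a set rather than a single form.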


Evaluation


Testing

• Accuracy
    • Maintaining integrity of existing rules
    • Projection
    • Subtraction
• Well-formedness
    • Ensuring tag order


Assessing Coverage

• Aim – 80% on unrestricted text
    • Statistical predictions (Crystal 1997)
• Corpus compilation and processing
    • Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
• Phase 1 – augmentation
• Phase 2 – 81% coverage
• Final assessment – 84.15% coverage
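
Token coverage of the kind reported above is the share of corpus tokens that receive at least one analysis. The following Python sketch shows the arithmetic only. The function name, the toy lexicon and the tag names are hypothetical and are not drawn from the evaluation corpus.

```python
def token_coverage(tokens, analyse):
    """Percentage of tokens for which analyse() returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return 100.0 * covered / len(tokens)


# Hypothetical toy analyser: a dictionary lookup standing in for the transducer.
TOY_LEXICON = {
    "la": ["la+Det+Fem+Sg"],
    "casa": ["casa+Noun+Fem+Sg"],
}


def toy_analyse(token):
    return TOY_LEXICON.get(token, [])


sample = ["la", "casa", "xyz", "casa"]
print(token_coverage(sample, toy_analyse))
```

The count is per token rather than per distinct word form, so a frequent word that is covered raises the figure once for every occurrence.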


Further Details

Class         # of forms
Nouns         547
Verbs         304
Adjectives    183
Other         378

• Generates approx. 44,000 unique morphological descriptions

• Evaluation corpus – 1.26 analyses per input token on average
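
An average such as analyses per input token can be computed as the total number of analyses divided by the number of tokens. The Python sketch below assumes that every input token counts in the denominator, including tokens with no analysis. The slides do not state this, and the toy data and tag names are hypothetical.

```python
def mean_analyses_per_token(tokens, analyse):
    """Average number of analyses returned per input token."""
    if not tokens:
        return 0.0
    return sum(len(analyse(token)) for token in tokens) / len(tokens)


# Hypothetical toy analyser: "casa" is ambiguous between a noun and a verb form.
TOY_ANALYSES = {
    "la": ["la+Det+Fem+Sg"],
    "casa": ["casa+Noun+Fem+Sg", "casar+Verb+PresInd+3P+Sg"],
}


def toy_lookup(token):
    return TOY_ANALYSES.get(token, [])


print(mean_analyses_per_token(["la", "casa"], toy_lookup))
```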


Possible improvements

• Increase coverage
    • lexicon augmentation
• Disambiguation using POS tagger
• More derivational morphology
• Deal with different dialects of Spanish


References

(Beesley & Karttunen 2003) Beesley, K. and Karttunen, L., Finite State Morphology, CSLI Publications, United States, 2003.

(Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición, Editorial Claret, Barcelona, 2005

(Crystal 1997) Crystal, D., The Cambridge Encyclopedia of Language. (2nd. ed.) Cambridge University Press, 1997

Europarl - Europarl Parallel Corpus http://people.csail.mit.edu/koehn/publications/europarl/ - Last Accessed 19/05/2006

(Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.

(Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.

Real Academia Española – http://www.rae.es/ - Last Accessed 25/05/2006


Conclusions

Demonstration


LEXICON ArVerbs
!STEM    !CONT. CLASS    !GLOSS
abord    ArV ;           !to approach

LEXICON ArV
ar+Verb:0    ArConj ;

LEXICON ArConj
!TAGS                 !CONT.CLASS
+PresInd:^PresInd     ArPresInd ;
+PretInd:^PretInd     ArPretInd ;

LEXICON ArPresInd     ! Present Indicative
+1P+Sg:o^1PSg     # ;
+2P+Sg:as^2PSg    # ;
+3P+Sg:a^3PSg     # ;
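
The way continuation classes chain together can be traced with a toy Python model of this lexicon. It is an emulation, not lexc: the dictionary layout and function names are hypothetical, epsilon (`0`) is written as an empty string, and the `+PretInd` entry is left out because its continuation class `ArPretInd` is not shown.

```python
# Each entry: (lexical side, intermediate side, continuation class); "#" ends a word.
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [("+PresInd", "^PresInd", "ArPresInd")],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
}
# Explicit trigger list, since a trigger can be followed directly by letters.
TRIGGERS = ["^PresInd", "^1PSg", "^2PSg", "^3PSg"]


def paths(lexicon="ArVerbs", upper="", lower=""):
    """Yield (lexical, intermediate) pairs by following continuation classes."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, next_class in LEXICONS[lexicon]:
        yield from paths(next_class, upper + up, lower + low)


def remove_triggers(form):
    """Strip every trigger symbol, as the trigger remover would."""
    for trigger in TRIGGERS:
        form = form.replace(trigger, "")
    return form


SURFACE = {lexical: remove_triggers(mid) for lexical, mid in paths()}
for lexical, surface in SURFACE.items():
    print(lexical, "->", surface)
```

For a regular verb such as abordar no replace rule fires, so removing the triggers is enough to reach the surface form.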