
28/07/2005 Speech and NLP

Experiences with Indian Language Morphology

Monojit Choudhury, RS, CSE, IIT Kharagpur


When do we need MA/MS?

The alternative: store all words

Advantages:
- Less effort for NLP
- Less time for processing

Disadvantages:
- More words mean more space and more search time
- How to tackle unseen words?


Therefore, we need MA/MS when

The language is morphologically rich
- Large number of affixes
- Concatenation of affixes/compounding
- Example: Turkish, German, Sanskrit …

The language is morphologically productive
- Speakers/writers can coin new words by following morphological rules
- Example: German, Sanskrit …


A Problem to ponder

How do we decide whether a language is morphologically rich and/or productive?

Linguistically:
- Difficult (enumerate all morphological processes)
- Fuzzy/subjective

Can you suggest some formal technique?

Hint: Statistics


Vocabulary Growth

[Figure: vocabulary size V(N) plotted against corpus size N, with N up to 3,500,000 and V(N) up to 200,000. Labelled points: Bengali (3019565, 182848); Hindi (2967438, 121603).]
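
A vocabulary growth curve of this kind can be traced with a few lines of code. The following is a minimal sketch, assuming a whitespace-tokenized corpus; the function name `vocabulary_growth`, the `step` parameter and the toy token list are illustrative and not from the talk.

```python
def vocabulary_growth(tokens, step):
    """Return (N, V(N)) pairs, sampled every `step` tokens.

    N is the number of tokens read so far and V(N) is the number of
    distinct word forms among them.
    """
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve


# Toy corpus (an assumption for illustration; a real study would read
# millions of tokens from a corpus file).
toy_tokens = "a b a c b d a e".split()
toy_curve = vocabulary_growth(toy_tokens, step=2)

# Types per token at the points reported on the slide.
bengali_ratio = 182848 / 3019565
hindi_ratio = 121603 / 2967438
```

A curve that keeps climbing steeply as N grows suggests that new word forms keep appearing, which is one statistical symptom of rich or productive morphology.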


Another Estimate

How many different forms of a verb are there in
- English: 5
- Hindi: ~20 (without causation)
- Bengali: ~170 (without causation)
- Telugu: ~1000
- Sanskrit: ~51480 (with derivational affixes), ~3960 (otherwise)


Three basic concerns

While designing a morphological analyzer/generator one must consider

- Productivity of a rule
- Morphological paradigms
- Irregular morphology


Productivity of a Rule

Rule        Example             Productivity
VR + tA     jAtA, letA          *****
NR + ika    dainika, sAmAjika   **
Adj + imA   lAlimA, niilimA     X


Productive Rules for Bengali/Hindi

Inflectional Morphology
- Verb
- Noun
- Adjectives
- Pronouns

Derivational Morphology
- Compounding
- Prefixation
- Suffixation

Emphasizers in Bengali: i and o


Morphological paradigms

Classes of words that inflect similarly

Hindi noun roots take 4 inflections:

Singular, direct    laDakA     laDakii
Plural, direct      laDake     laDakiyA.N
Singular, oblique   laDake     laDakii
Plural, oblique     laDako.N   laDakiyo.N

How many paradigms for nouns?
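
A paradigm can be stored as an ending to strip plus one ending per inflection. The sketch below encodes only the two paradigms shown in the table above; the names `PARADIGMS` and `inflect` are hypothetical, and the plural oblique ending of the laDakA class is taken as o.N, parallel to laDakiyo.N.

```python
# Each paradigm: (ending stripped from the root, ending added per slot).
PARADIGMS = {
    "laDakA": ("A", {
        "sg_direct": "A", "pl_direct": "e",
        "sg_oblique": "e", "pl_oblique": "o.N",
    }),
    "laDakii": ("ii", {
        "sg_direct": "ii", "pl_direct": "iyA.N",
        "sg_oblique": "ii", "pl_oblique": "iyo.N",
    }),
}


def inflect(root, paradigm):
    """Generate the four forms of `root` according to `paradigm`."""
    ending, slots = PARADIGMS[paradigm]
    if not root.endswith(ending):
        raise ValueError(f"{root!r} does not fit paradigm {paradigm!r}")
    stem = root[: len(root) - len(ending)]
    return {slot: stem + suffix for slot, suffix in slots.items()}


boy_forms = inflect("laDakA", "laDakA")
river_forms = inflect("nadii", "laDakii")
```

The hard part is not the table itself but deciding which paradigm a given root belongs to.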


How to identify the paradigms?

Paradigms may be based on
- Syllable structure (e.g. laDakii, nadii, sakhii)
- Gender (e.g. dhobii vs. nadii)
- Semantics (e.g. lohA vs. dohA)

Which of these distinctions can be identified automatically? How?


Paradigms for Bengali Nouns

Bengali noun inflections:
- Classifier suffixes: TA, gulo, rA etc.
- Case markers: er, ke, der, te etc.
- Emphasizers: i, o

Paradigms are based on semantics
- Inanimate objects take TA, gulo
- Animate objects take rA, dera


Irregular Morphology

All languages feature irregular morphology
- English: ox – oxen, go – went
- Hindi: jAnA – gayA, karanA – kiyA
- Bengali: yAoYA – gela, AsA – ela

It is better to list them as exceptions and treat them separately.

Bengali has only 4 exceptional verbs; Hindi has 2.


So, we decided to

- Build MS/MA for Hindi & Bengali
- Cover only inflectional morphology
- Cover only verbs, nouns and adjectives

We also identified
- the morphological paradigms
- irregular verbs/nouns


Now we need to decide

- The list of possible affixes
- Their attributes
- Morphotactics

And then design/build
- The input/output specification
- The lexicon structure
- The FST structure
- Lexicon and FST search strategy


A Case Study: Bengali Verb Morphology

The information coded by affixes:

Finite forms
- Tense: past, present, future
- Aspect: simple, continuous, perfect, habitual
- Modality: order, request
- Person: 1st, 2nd normal (tumi), 2nd familiar (tui), 3rd (se), honorific 2nd and 3rd (Apani, tini)
- Polarity: positive/negative

Non-finite forms: e, te


Morphotactics

Root          Aspect           Tense         Person        +/-      Gloss
kar (to do)   eChi (perfect)   l (past)      Ama (1st)     Φ (+)    I had done
kar           Ch (cont.)       Φ (present)   i (1st)       Φ (+)    I'm doing
kar           Φ (simple)       b (future)    i (2nd fam)   Φ (+)    You'll do
kar           Φ (perfect)      Φ (pre/pst)   i (1st)       ni (-)   I haven't done / I'd not done


Morphotactics

Root + aspect + tense + person + emphasizer + polarity

Root + modality + person + emphasizer

Root + aspect1 + emphasizer + aspect2 + person + polarity


Verb Suffix Table

TAM / Person       1st          2nd, familiar   2nd, normal   2nd & 3rd formal   3rd
Ind, Pr, Simple    i            isa’            a             ena’               e
Ind, Pr, Cont.     chhi         chhisa’         chha          chhena’            chhe
Ind, Pr, Perfect   echhi        echhisa’        echha         echhena’           echhe
Ind, Pa, Simple    lAma’        li              le            lena’              la
Ind, Pa, Cont.     chhilAma’    chhili          chhile        chhilena’          chhila
Ind, Pa, Perfect   echhilAma’   echhili         echhile       echhilena’         echhila’
Ind, Future        ba           bi              be            bena’              be
Habitual Past      tAma’        tisa’           te            tena’              ta
Imperative         -            .h/                           una’               uka’
Neg, Perfect       ini          isa’ni          ani           ena’ni             eni
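
Because the suffix depends only on TAM and person, suffix selection can be a plain dictionary lookup. The sketch below copies three rows of the table; the names `PERSONS`, `SUFFIX_TABLE` and `select_suffix` are hypothetical, and the remaining rows would be added in the same way.

```python
PERSONS = ("1st", "2nd familiar", "2nd normal", "formal", "3rd")

# (TAM label) -> suffixes in the order of PERSONS, copied from the table.
_ROWS = {
    "Ind, Pr, Cont":    ("chhi", "chhisa’", "chha", "chhena’", "chhe"),
    "Ind, Pr, Perfect": ("echhi", "echhisa’", "echha", "echhena’", "echhe"),
    "Ind, Pa, Simple":  ("lAma’", "li", "le", "lena’", "la"),
}

# Flatten into a single (TAM, person) -> suffix mapping.
SUFFIX_TABLE = {
    (tam, person): suffix
    for tam, row in _ROWS.items()
    for person, suffix in zip(PERSONS, row)
}


def select_suffix(tam, person):
    """Constant-time lookup of the suffix for a TAM/person combination."""
    return SUFFIX_TABLE[(tam, person)]
```

For example, `select_suffix("Ind, Pa, Simple", "2nd familiar")` returns `li`. Orthographic changes at the root/suffix boundary are a separate step.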


Orthographic Changes

kar + eChilAm → kareChilAm
khA + eChilAm → kheYeChilAm
hAr + eChilAm → hereChilAm
karA + eChilAm → kariYeChilAm
tolA + eChilAm → tuliYeChilAm
khAoYA + eChilAm → khAiYeChilAm
de + eChilAm → diYeChilAm
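
These changes can be read as one stem rewrite per root class, applied before the suffix is attached. The sketch below is inferred only from the seven examples above and covers only the suffix eChilAm; the class labels, `STEM_RULES` and `attach` are hypothetical names, and the rules are not claimed to generalize beyond these examples.

```python
SUFFIX = "eChilAm"

# One stem-rewrite rule per example class (labels are illustrative).
STEM_RULES = {
    "consonant-final, vowel a": lambda r: r,                       # kar -> kar
    "consonant-final, vowel A": lambda r: r.replace("A", "e", 1),  # hAr -> her
    "A-final, single vowel":    lambda r: r[:-1] + "eY",           # khA -> kheY
    "A-final, vowel a":         lambda r: r[:-1] + "iY",           # karA -> kariY
    "A-final, vowel o":         lambda r: r[:-1].replace("o", "u", 1) + "iY",  # tolA -> tuliY
    "oYA-final":                lambda r: r[:-3] + "iY",           # khAoYA -> khAiY
    "e-final":                  lambda r: r[:-1] + "iY",           # de -> diY
}


def attach(root, root_class, suffix=SUFFIX):
    """Rewrite the root according to its class, then append the suffix."""
    return STEM_RULES[root_class](root) + suffix


# (root, class, surface form) triples taken from the examples above.
EXAMPLES = [
    ("kar", "consonant-final, vowel a", "kareChilAm"),
    ("khA", "A-final, single vowel", "kheYeChilAm"),
    ("hAr", "consonant-final, vowel A", "hereChilAm"),
    ("karA", "A-final, vowel a", "kariYeChilAm"),
    ("tolA", "A-final, vowel o", "tuliYeChilAm"),
    ("khAoYA", "oYA-final", "khAiYeChilAm"),
    ("de", "e-final", "diYeChilAm"),
]
```

Keying the rules on the class rather than on the individual root is what keeps the rule table small.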


Orthographic Classes (Paradigms?)

Rows: root vowel (V). Columns: root ending.

V      $                          a’                          A                                oYA
a      ha [haoYA] (to happen)     kara’ [karA] (to do)        karA [karAno] (do, causative)    saoYA [saoYAno] (undergo, causative)
A      khA [khAoYA] (to eat)      jAna’ [jAnA] (to know)      jAnA [jAnAno] (to inform)        khAoYA [khAoYAno] (to feed)
i      di [deoYA] (to give)       likha’ [lekhA] (to write)   ni~NrA [ni~NrAno]                --
e      --                         dekha’ [dekhA] (to see)     dekhA [dekhAno] (to show)        deoYA [deoYAno] (give, causative)
o      so [so;oYA] (to lie down)  tola’ [tolA] (to pick)      tolA [tolAno] (pick, causative)  so;oYA [so;oYAno] (lie, causative)
u/au   --                         --                          ghumA [ghumAno] (to sleep)       --


FSM for Recognizing Bengali Verb Class
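
A root's class can be read off in one left-to-right pass over its spelling. The sketch below is a simplified procedural stand-in, not the 26-state FSM of the talk: it assumes the class is the pair (first vowel, ending) as laid out in the orthographic class table, and the name `root_class` is hypothetical.

```python
VOWELS = "aAieou"


def root_class(root):
    """Return (vowel, ending) for a root in the slides' romanization."""
    root = root.replace("\u2019", "'")  # accept the typographic apostrophe
    # Position of the first vowel letter in the root.
    first = next(i for i, ch in enumerate(root) if ch in VOWELS)
    vowel = root[first]
    if vowel == "u" or root[first:first + 2] == "au":
        vowel = "u/au"
    if root.endswith("oYA"):
        ending = "oYA"
    elif root.endswith("a'"):
        ending = "a'"
    elif root.endswith("A") and first < len(root) - 1:
        ending = "A"   # final A that is not the root vowel itself
    else:
        ending = "$"   # the root ends right at its vowel
    return vowel, ending


# The 19 example roots from the class table.
TABLE_ROOTS = [
    "ha", "kara'", "karA", "saoYA",
    "khA", "jAna'", "jAnA", "khAoYA",
    "di", "likha'", "ni~NrA",
    "dekha'", "dekhA", "deoYA",
    "so", "tola'", "tolA", "so;oYA",
    "ghumA",
]
classes = {root: root_class(root) for root in TABLE_ROOTS}
```

Under these assumptions the 19 example roots fall into 19 distinct classes, one per filled cell of the table.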


A Morphological Generator: Abstract Level

[Diagram: inputs Root, TAM, Person, Polarity and Emph; components of the Morphological Generator are the Suffix Table (yielding the Suffix) and the Orthographic FST; output Surface Form.]


A Morphological Generator: Implementation

[Diagram: inputs Root, TAM, Person, Polarity and Emph; components of the Morphological Generator are the Root Class Recognizer, Orthographic Rules for each root class, Suffix Table, Emph Adder and Irregular Root Handler; output Surface Form.]


Implementation: More Facts

Memory Requirement
- Root Class Recognizer: FSM with 26 states
- Suffix Table: 56 suffixes (emphasizers not included)
- Orthographic Rule Tables: 19 × 56 = 1064 rules

Time Requirement
- Root Class Recognizer: scans the root once (r)
- Suffix Selection: just a table lookup (constant)
- Orthographic Rules: scans root + suffix once (r + s)
- Emphasizer Adder: constant time
- Total time: O(r + s)


Now we need to decide

- The list of possible affixes
- Their attributes
- Morphotactics

And then design/build
- The input/output specification
- The lexicon structure
- The FST structure
- Lexicon and FST search strategy


A Morphological Analyzer: Abstract Level

Trie: a tree data structure, also called a prefix tree (from Information Retrieval). Here it stores the suffixes, read from right to left.

Basic Notions:
- Note that Bengali verb morphology only has suffixes
- Scan a given word from right to left (backward)
- If the substring seen is a valid suffix, see if the remaining part of the input is a valid stem/root
- Take care of orthographic changes

We shall see that a trie is just another way to implement an FST, with some nice properties.


Trie: Construction

- Make a list of all valid suffixes: NULL, i, Chi, li, eChi, YeChi, lAma, elAma
- Construct the trie recursively by inserting each of the suffixes (right to left)
- Every state where a suffix ends is marked as a final state
- Every final state consists of
  - TAM, Person, Polarity information
  - Rewrite rules for generation of the root
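
The construction can be sketched as follows, using the eight suffixes listed above. This is a minimal illustration: it inserts iteratively rather than recursively, works character by character over the romanization, and stores only the suffix string on each final state where a real analyzer would keep the TAM/person/polarity attributes and rewrite rules. The names `TrieNode`, `build_suffix_trie`, `count_states` and `final_suffixes` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.suffix = None   # set only on final states


def build_suffix_trie(suffixes):
    """Insert every suffix right to left, marking final states."""
    root = TrieNode()
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, TrieNode())
        node.suffix = suffix  # the NULL suffix marks the start state itself
    return root


def count_states(node):
    """Number of states in the trie rooted at `node`."""
    return 1 + sum(count_states(child) for child in node.children.values())


def final_suffixes(node):
    """Collect the suffixes stored on all final states."""
    found = [] if node.suffix is None else [node.suffix]
    for child in node.children.values():
        found.extend(final_suffixes(child))
    return found


SUFFIXES = ["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"]  # "" is NULL
trie = build_suffix_trie(SUFFIXES)
```

Suffixes that share an ending, such as i, Chi and eChi, share a path from the start state, which is where the space saving comes from.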


Trie: Search

1. Reverse the input word
2. Traverse the trie starting from the root (start state)
3. At every final state, apply the orthographic rule to the rest of the string
4. Let r be the string obtained. Search for r in the root lexicon
5. If found, output the attributes
6. Continue the search
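
The search procedure can be sketched as below. It is self-contained and uses nested dictionaries for the trie; instead of physically reversing the word it indexes from the end, which is equivalent. The two-root lexicon, the single rewrite rule for YeChi (inferred from the khA example in the orthographic changes) and the names `build`, `analyze`, `LEXICON` and `REWRITE` are assumptions for illustration.

```python
FINAL = "$suffix"  # multi-character key, so it cannot clash with a letter


def build(suffixes):
    """Nested-dict trie over the suffixes, inserted right to left."""
    root = {}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[FINAL] = suffix
    return root


def analyze(word, trie, lexicon, rewrite):
    """Return every (root, suffix) split of `word` licensed by the lexicon."""
    analyses = []
    node = trie
    for consumed in range(len(word) + 1):
        if FINAL in node:
            rest = word[: len(word) - consumed]
            # Undo the orthographic change, if a rule exists for this suffix.
            candidate = rewrite.get(node[FINAL], lambda s: s)(rest)
            if candidate in lexicon:
                analyses.append((candidate, node[FINAL]))
        if consumed == len(word):
            break
        ch = word[len(word) - 1 - consumed]
        if ch not in node:
            break  # no longer suffix can match
        node = node[ch]
    return analyses


TRIE = build(["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"])
LEXICON = {"kar", "khA"}  # toy root lexicon
REWRITE = {"YeChi": lambda s: s[:-1] + "A" if s.endswith("e") else s}

result_kar = analyze("kareChi", TRIE, LEXICON, REWRITE)
result_kha = analyze("kheYeChi", TRIE, LEXICON, REWRITE)
```

Continuing past the first hit matters: a word can contain several valid suffixes, and only the lexicon check decides which split is a real analysis.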


Trie: Computational Issues

Time Complexity
- Searching the trie is linear in the input length
- Searching the lexicon can also be linear

Space Complexity
- In general linear in the number of affixes
- Can be reduced further by constructing a DAWG


Trie vs DAWG

Trie
- More space
- Linear search
- Easy to construct
- Easy to insert & delete
- Final states have unique attributes

DAWG
- Less space
- Linear search
- Exponential construction
- Difficult to delete and insert
- A final state can have ambiguous attributes


Morphological Analyzer: Implementation Details

- Size of trie: 300 states
- Size of root lexicon: 600 verb roots
- Paradigm information: not required
- Nouns, verbs and adjectives are analyzed separately
  - Tries can be merged, but with no significant gain
  - Root lexicons are also distinct
- Rule compilation


Summarizing

- Decide whether to go for MA/MS
- Identify the productive morphological processes and corresponding irregularities
- Identify the paradigms and morphological attributes
- Specify the morphotactics and affix list
- Gather a machine-readable root lexicon
- Choose an appropriate computational technique
- Design, implement and test
- A good interface for rule-editing is desirable
