28/07/2005 Speech and NLP Experiences with Indian Language Morphology Monojit Choudhury RS, CSE, IIT Kharagpur


28/07/2005 Speech and NLP

Experiences with Indian Language Morphology

Monojit Choudhury
RS, CSE, IIT Kharagpur

When do we need MA/MS?

Store all words

- Advantages:
  - Less effort for NLP
  - Less time for processing
- Disadvantages:
  - More words, so more space and more search time
  - How to tackle unseen words?

Therefore, we need MA/MS when

- The language is morphologically rich
  - Large number of affixes
  - Concatenation of affixes/compounding
  - Examples: Turkish, German, Sanskrit …
- The language is morphologically productive
  - Speakers/writers can coin new words by following morphological rules
  - Examples: German, Sanskrit …

A Problem to ponder

How do we decide whether a language is morphologically rich and/or productive?

- Linguistically
  - Difficult (enumerate all morphological processes)
  - Fuzzy/subjective
- Can you suggest some formal technique?
  - Hint: Statistics

Vocabulary Growth

[Figure: vocabulary size V(N) plotted against corpus size N, with axes running to V(N) = 200,000 and N = 3,500,000. End points of the two curves: Bengali (3019565, 182848); Hindi (2967438, 121603).]
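One way to make the "statistics" hint concrete is to measure how quickly the vocabulary grows as more of a corpus is read. The sketch below is only an illustration, assuming the corpus has already been tokenized into a list of word strings; the function name `vocabulary_growth` and the toy token list are hypothetical and do not come from the slides.

```python
def vocabulary_growth(tokens, step=1):
    """Return (N, V(N)) pairs: corpus size N and number of distinct words seen so far."""
    seen = set()
    curve = []
    n = 0
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    # Always record the final point, even if it does not fall on a step boundary.
    if n and n % step != 0:
        curve.append((n, len(seen)))
    return curve


# Toy input (hypothetical); a real experiment would read a multi-million-word corpus.
toy_tokens = "laDakA laDake laDako laDakA laDake laDakii".split()
print(vocabulary_growth(toy_tokens, step=2))
```

Under this reading, a curve that keeps climbing at a given corpus size (as the Bengali end point does relative to the Hindi one) hints at richer or more productive morphology, although tokenization and corpus composition also affect the numbers.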

Another Estimate

How many different forms of a verb are there in each language?

- English: 5
- Hindi: ~20 (without causation)
- Bengali: ~170 (without causation)
- Telugu: ~1000
- Sanskrit: ~51480 (with derivational affixes), ~3960 (otherwise)

Three basic concerns

While designing a morphological analyzer/generator, one must consider:

- Productivity of a rule
- Morphological paradigms
- Irregular morphology

Productivity of a Rule

Rule      | Example           | Productivity
VR + tA   | jAtA, letA        | *****
NR + ika  | dainika, sAmAjika | **
Adj + imA | lAlimA, niilimA   | X

Productive Rules for Bengali/Hindi

- Inflectional Morphology
  - Verb
  - Noun
  - Adjectives
  - Pronouns
- Derivational Morphology
  - Compounding
  - Prefixation
  - Suffixation
- Emphasizing in Bengali: i and o


Morphological paradigms

Classes of words that inflect similarly

Hindi noun roots take 4 inflections:

- Singular, direct: laDakA, laDakii
- Plural, direct: laDake, laDakiyA.N
- Singular, oblique: laDake, laDakii
- Plural, oblique: laDako, laDakiyo.N

How many paradigms for nouns?
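A paradigm can be stored as a small table of endings, so that every noun in the class is inflected by the same strip-and-add operation. The sketch below is a minimal illustration built only from the two example nouns on this slide; the names `PARADIGMS`, `SLOTS` and `inflect`, and the choice to name each paradigm after its example noun, are assumptions.

```python
# Each paradigm: (ending stripped from the singular direct form,
#                 ending added for each of the four slots).
PARADIGMS = {
    "laDakA": ("A", {
        "singular_direct": "A",
        "plural_direct": "e",
        "singular_oblique": "e",
        "plural_oblique": "o",
    }),
    "laDakii": ("ii", {
        "singular_direct": "ii",
        "plural_direct": "iyA.N",
        "singular_oblique": "ii",
        "plural_oblique": "iyo.N",
    }),
}

SLOTS = ("singular_direct", "plural_direct", "singular_oblique", "plural_oblique")


def inflect(noun, paradigm, slot):
    """Inflect `noun` (given in its singular direct form) using the named paradigm."""
    strip, endings = PARADIGMS[paradigm]
    if not noun.endswith(strip):
        raise ValueError(f"{noun!r} does not fit paradigm {paradigm!r}")
    return noun[: len(noun) - len(strip)] + endings[slot]


for slot in SLOTS:
    print(slot, inflect("laDakA", "laDakA", slot), inflect("laDakii", "laDakii", slot))
```

With this layout, the number of paradigms for nouns is simply the number of distinct ending tables needed to cover the lexicon.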

How to identify the paradigms?

Paradigms may be based on

- Syllable structure (e.g. laDakii, nadii, sakhii)
- Gender (e.g. dhobii vs. nadii)
- Semantics (e.g. lohA vs. dohA)

Which of these distinctions can be identified automatically? How?

Paradigms for Bengali Nouns

- Bengali noun inflections:
  - Classifier suffixes: TA, gulo, rA etc.
  - Case markers: er, ke, der, te etc.
  - Emphasizers: i, o
- Paradigms are based on semantics
  - Inanimate objects take TA, gulo
  - Animate objects take rA, dera


Irregular Morphology

- All languages feature irregular morphology
  - English: ox – oxen, go – went
  - Hindi: jAnA – gayA, karanA – kiyA
  - Bengali: yAoYA – gela, AsA – ela
- Better to list them as exceptions and treat them separately
- Bengali has only 4 exceptional verbs; Hindi has 2
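Listing exceptions separately usually means consulting a small exception table before any rule is applied. The sketch below illustrates this with the two Hindi verbs named above. The regular rule is a deliberately simplified assumption (it only handles infinitives ending in -anA, such as the assumed example likhanA), and the names `IRREGULAR_PERFECTIVE` and `perfective` are hypothetical.

```python
# Exception list taken from the Hindi examples on the slide.
IRREGULAR_PERFECTIVE = {
    "jAnA": "gayA",
    "karanA": "kiyA",
}


def perfective(infinitive):
    """Return the perfective form, consulting the exception list first."""
    if infinitive in IRREGULAR_PERFECTIVE:
        return IRREGULAR_PERFECTIVE[infinitive]
    # Simplified regular rule (assumption): drop -anA and add -A.
    if infinitive.endswith("anA"):
        return infinitive[:-3] + "A"
    raise ValueError(f"no rule for {infinitive!r} in this sketch")


print(perfective("likhanA"))  # regular rule
print(perfective("karanA"))   # exception list
```

Without the exception entry, the regular rule would turn karanA into karA, which is why the lookup has to come first.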

So, we decided to

- Build MS/MA for Hindi & Bengali
  - Cover only inflectional morphology
  - Cover only verbs, nouns and adjectives
- We also identified
  - The morphological paradigms
  - Irregular verbs/nouns

Now we need to decide

- The list of possible affixes
- Their attributes
- Morphotactics

And then design/build

- The input/output specification
- The lexicon structure
- The FST structure
- Lexicon and FST search strategy

A Case Study: Bengali Verb Morphology

The information coded by affixes:

- Finite forms
  - Tense: past, present, future
  - Aspect: simple, continuous, perfect, habitual
  - Modality: order, request
  - Person: 1st, 2nd normal (tumi), 2nd familiar (tui), 3rd (se), honorific 2nd and 3rd (Apani, tini)
  - Polarity: positive/negative
- Non-finite forms: e, te

Morphotactics

Root        | Aspect         | Tense       | Person      | +/-    | Gloss
kar (to do) | eChi (perfect) | l (past)    | Ama (1st)   | Φ (+)  | I had done
kar         | Ch (cont.)     | Φ (present) | i (1st)     | Φ (+)  | I'm doing
kar         | Φ (simple)     | b (future)  | i (2nd fam) | Φ (+)  | You'll do
kar         | Φ (perfect)    | Φ (pre/pst) | i (1st)     | ni (-) | I haven't done / I had not done

Morphotactics

- Root + aspect + tense + person + emphasizer + polarity
- Root + modality + person + emphasizer
- Root + aspect1 + emphasizer + aspect2 + person + polarity
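Each morphotactic pattern can be read as an ordered list of slots, where an empty slot (Φ) contributes nothing. The sketch below is a minimal illustration for the first pattern only, using morphemes from the earlier morphotactics table; the names `FINITE_TEMPLATE` and `concatenate` are hypothetical, and the result is the underlying string before any orthographic change is applied.

```python
# Slot order of the first pattern: root + aspect + tense + person + emphasizer + polarity.
FINITE_TEMPLATE = ("root", "aspect", "tense", "person", "emphasizer", "polarity")


def concatenate(template, morphemes):
    """Join the morphemes in template order; a missing slot counts as the null morpheme."""
    unknown = set(morphemes) - set(template)
    if unknown:
        raise ValueError(f"slots not allowed by this template: {sorted(unknown)}")
    return "".join(morphemes.get(slot, "") for slot in template)


# kar + eChi (perfect) + l (past) + Ama (1st): glossed "I had done" in the table.
print(concatenate(FINITE_TEMPLATE,
                  {"root": "kar", "aspect": "eChi", "tense": "l", "person": "Ama"}))
# kar + i (1st) + ni (negative): the negative perfect row of the table.
print(concatenate(FINITE_TEMPLATE,
                  {"root": "kar", "person": "i", "polarity": "ni"}))
```

Keeping the slot order in data rather than in code makes it easy to add the other two patterns as further tuples.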

Verb Suffix Table

TAM / Person     | 1st        | 2nd, familiar | 2nd, normal | 2nd & 3rd formal | 3rd
Ind, Pr, Simple  | i          | isa’          |             | ena’             | e
Ind, Pr, Cont    | chhi       | chhisa’       | chha        | chhena’          | chhe
Ind, Pr, Perfect | echhi      | echhisa’      | echha       | echhena’         | echhe
Ind, Pa, Simple  | lAma’      | li            | le          | lena’            | la
Ind, Pa, Cont    | chhilAma’  | chhili        | chhile      | chhilena’        | chhila
Ind, Pa, Perfect | echhilAma’ | echhili       | echhile     | echhilena’       | echhila’
Ind, Future      | ba         | bi            | be          | bena’            | be
Habitual Past    | tAma’      | tisa’         | te          | tena’            | ta
Imperative       | -          | .h/           |             | una’             | uka’
Neg, Perfect     | ini        | isa’ni        | ani         | ena’ni           | eni
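Since suffix selection is a plain table lookup, the table can be held as a mapping from a TAM label to one suffix per person. The sketch below is a minimal illustration that copies three complete rows from the table above; the key names, the `PERSONS` ordering and the function `suffix_for` are assumptions made for the example.

```python
# Column order of the suffix table.
PERSONS = ("1st", "2nd_familiar", "2nd_normal", "formal", "3rd")

# Three rows copied from the table; keys are hypothetical labels.
SUFFIX_TABLE = {
    "ind_past_simple": ("lAma’", "li", "le", "lena’", "la"),
    "ind_future": ("ba", "bi", "be", "bena’", "be"),
    "habitual_past": ("tAma’", "tisa’", "te", "tena’", "ta"),
}


def suffix_for(tam, person):
    """Look up the suffix for a TAM label and a person in constant time."""
    return SUFFIX_TABLE[tam][PERSONS.index(person)]


print(suffix_for("ind_future", "formal"))
print(suffix_for("ind_past_simple", "2nd_familiar"))
```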

Orthographic Changes

kar + eChilAm → kareChilAm
khA + eChilAm → kheYeChilAm
hAr + eChilAm → hereChilAm
karA + eChilAm → kariYeChilAm
tolA + eChilAm → tuliYeChilAm
khAoYA + eChilAm → khAiYeChilAm
de + eChilAm → diYeChilAm

Orthographic Classes (Paradigms?)

$ V  |                           | a’                         | A                             | oYA
a    | ha [haoYA] (to happen)    | kara’ [karA] (to do)       | karA [karAno] (do, causative) | saoYA [saoYAno] (undergo, causative)
A    | khA [khAoYA] (to eat)     | jAna’ [jAnA] (to know)     | jAnA [jAnAno] (to inform)     | khAoYA [khAoYAno] (to feed)
i    | di [deoYA] (to give)      | likha’ [lekhA] (to write)  | ni~NrA [ni~NrAno]             | --
e    | --                        | dekha’ [dekhA] (to see)    | dekhA [dekhAno] (to show)     | deoYA [deoYAno] (give, causative)
o    | so [so;oYA] (to lie down) | tola’ [tolA] (to pick)     | tolA [tolAno] (pick, causative) | so;oYA [so;oYAno] (lie, causative)
u/au | --                        | --                         | ghumA [ghumAno] (to sleep)    | --


FSM for Recognizing Bengali Verb Class

A Morphological Generator: Abstract Level

[Figure: block diagram of the Morphological Generator. Inputs: Root, TAM, Person, Polarity, Emph. Components: Suffix Table, Orthographic FST. Intermediate result: Suffix. Output: Surface Form.]

A Morphological Generator: Implementation

[Figure: block diagram of the implemented Morphological Generator. Inputs: Root, TAM, Person, Polarity, Emph. Components: Root Class Recognizer, Orthographic Rules for each root class, Suffix Table, Emph Adder, Irregular Root Handler. Output: Surface Form.]
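The components named on this slide can be chained as a short pipeline: check the irregular list, pick the suffix, recognize the root class, apply the class-specific orthographic rule, then add the emphasizer. The sketch below is a toy under stated assumptions, not the implementation described in the talk: it covers only two root classes and one suffix, taken from the kar and khA examples on the Orthographic Changes slide. The class names, the `(drop, glue)` rule format, the root form "yA" for yAoYA and the reading of "gela" as the 3rd-person simple past are all assumptions.

```python
VOWELS = set("aAiIuUeEoO")

# Irregular Root Handler (hypothetical entry): (root, TAM, person) -> surface form.
IRREGULAR = {("yA", "past_simple", "3rd"): "gela"}

# Suffix Table: one cell only, keyed by (TAM, person, polarity).
SUFFIX_TABLE = {("past_perfect", "1st", "+"): "eChilAm"}

# Orthographic Rules per root class: (class, suffix) -> (chars dropped from root, glue).
ORTHO_RULES = {
    ("closed_a", "eChilAm"): (0, ""),    # kar + eChilAm -> kareChilAm
    ("open_A", "eChilAm"): (1, "eY"),    # khA + eChilAm -> kheYeChilAm
}


def recognize_class(root):
    """Toy stand-in for the root class recognizer; handles two classes only."""
    vowels = [ch for ch in root if ch in VOWELS]
    if vowels == ["a"] and root[-1] not in VOWELS:
        return "closed_a"
    if vowels == ["A"] and root.endswith("A"):
        return "open_A"
    raise ValueError(f"root class of {root!r} is not covered by this sketch")


def generate(root, tam, person, polarity="+", emph=""):
    """Return the surface form for the given root and attributes."""
    irregular = IRREGULAR.get((root, tam, person))
    if irregular is not None and polarity == "+":
        surface = irregular
    else:
        suffix = SUFFIX_TABLE[(tam, person, polarity)]
        drop, glue = ORTHO_RULES[(recognize_class(root), suffix)]
        surface = root[: len(root) - drop] + glue + suffix
    return surface + emph  # Emph Adder: appends the emphasizer, if any


print(generate("kar", "past_perfect", "1st"))
print(generate("khA", "past_perfect", "1st"))
print(generate("yA", "past_simple", "3rd"))
```

Keying the orthographic rules by (class, suffix) mirrors the idea of one rule table per root class, at the cost of one entry per pair.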

Implementation: More Facts

- Memory requirement
  - Root Class Recognizer: FSM with 26 states
  - Suffix Table: 56 suffixes (emphasizers not included)
  - Orthographic Rule Tables: 19 × 56 = 1064 rules
- Time requirement (r and s are the lengths of the root and the suffix)
  - Root Class Recognizer: scans the root once, O(r)
  - Suffix selection: a table look-up (constant time)
  - Orthographic Rules: scan root + suffix once, O(r + s)
  - Emphasizer Adder: constant time
  - Total time: O(r + s)


A Morphological Analyzer: Abstract Level

- Trie: a data structure, also called a prefix tree (from Information Retrieval); here it stores the suffixes in reverse
- Basic notions:
  - Note that Bengali verb morphology only has suffixes
  - Scan a given word from right to left (backward)
  - If the substring seen is a valid suffix, see if the remaining part of the input is a valid stem/root
  - Take care of orthographic changes
- We shall see that a trie is just another way to implement an FST, with some nice properties
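Before any trie is involved, the basic notion of scanning from the right can be written as a brute-force loop over every split point. The sketch below is a minimal illustration that ignores orthographic changes; the suffix list is the one used in the trie construction example of this talk, while the one-word root lexicon and the name `analyze_naive` are assumptions.

```python
# "" stands for the NULL suffix.
SUFFIXES = {"", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"}

# Tiny root lexicon, assumed for the example.
ROOTS = {"kar"}


def analyze_naive(word, suffixes=SUFFIXES, roots=ROOTS):
    """Try every split, shortest suffix first; keep splits with a valid suffix and root."""
    analyses = []
    for cut in range(len(word), -1, -1):
        stem, suffix = word[:cut], word[cut:]
        if suffix in suffixes and stem in roots:
            analyses.append((stem, suffix))
    return analyses


print(analyze_naive("karChi"))
print(analyze_naive("kareChi"))
```

This version re-examines the whole suffix string at every split; storing the suffixes in a trie avoids that by reading each character of the word once.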

Trie: Construction

- Make a list of all valid suffixes: NULL, i, Chi, li, eChi, YeChi, lAma, elAma
- Construct the trie recursively by inserting each of the suffixes (right to left)
- Every state where a suffix ends is marked as a final state
- Every final state consists of
  - TAM, Person, Polarity information
  - Rewrite rules for generation of the root
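The construction steps above can be sketched as follows, using the eight suffixes from the slide. This is a minimal illustration: it inserts iteratively rather than recursively, and the payload stored at each final state is just the suffix string, a placeholder for the TAM/Person/Polarity information and rewrite rules. The names `TrieNode`, `build_suffix_trie`, `count_states` and `count_final_states` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.final = False   # True if some suffix ends at this state
        self.info = None     # placeholder for attributes and rewrite rules


def build_suffix_trie(suffixes):
    """Insert each suffix right to left, marking the state where it ends as final."""
    root = TrieNode()
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, TrieNode())
        node.final = True
        node.info = suffix
    return root


def count_states(node):
    """Number of states in the trie, including the start state."""
    return 1 + sum(count_states(child) for child in node.children.values())


def count_final_states(node):
    return int(node.final) + sum(count_final_states(c) for c in node.children.values())


# "" stands for the NULL suffix, which makes the start state final.
SUFFIXES = ["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"]
trie = build_suffix_trie(SUFFIXES)
print(count_states(trie), count_final_states(trie))
```

Because suffixes such as Chi, eChi and YeChi share their right-hand characters, they share states in the reversed trie, which is where the space saving comes from.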

Trie: Search

- Reverse the input word
- Traverse the trie starting from the root (start state)
- At every final state, apply the orthographic rule to the rest of the string
- Let r be the string obtained. Search for r in the root lexicon
- If found, output the attributes
- Continue the search
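The search procedure can be sketched end to end as below. This is an illustration under stated assumptions, not the implementation from the talk: each final state stores an attribute label plus a rewrite rule written as a `(strip, add)` pair, the rule for YeChi (turn a stem-final e back into A) is inferred by analogy with khA + eChilAm → kheYeChilAm, and the attribute labels, the two-word root lexicon and all function names are hypothetical.

```python
def make_node():
    return {"children": {}, "final": None}


def build_trie(entries):
    """entries: suffix -> (attributes, (strip, add)); suffixes are inserted right to left."""
    root = make_node()
    for suffix, payload in entries.items():
        node = root
        for ch in reversed(suffix):
            node = node["children"].setdefault(ch, make_node())
        node["final"] = payload
    return root


def analyze(word, trie, root_lexicon):
    """Walk the reversed word through the trie and collect (root, attributes) pairs."""
    analyses = []
    node = trie
    reversed_word = word[::-1]
    for consumed in range(len(word) + 1):
        if node["final"] is not None:
            attributes, (strip, add) = node["final"]
            rest = word[: len(word) - consumed]
            # Apply the rewrite rule to the rest of the string, then check the lexicon.
            if rest.endswith(strip):
                candidate = rest[: len(rest) - len(strip)] + add
                if candidate in root_lexicon:
                    analyses.append((candidate, attributes))
        if consumed == len(word):
            break
        node = node["children"].get(reversed_word[consumed])
        if node is None:
            break  # no longer suffix matches; the search ends here
    return analyses


ENTRIES = {
    "": ("NULL suffix", ("", "")),
    "i": ("Pr Simple, 1st", ("", "")),
    "Chi": ("Pr Cont, 1st", ("", "")),
    "eChi": ("Pr Perfect, 1st", ("", "")),
    "YeChi": ("Pr Perfect, 1st", ("e", "A")),  # assumed rule: stem-final e -> A
}
ROOTS = {"kar", "khA"}
TRIE = build_trie(ENTRIES)

for form in ("karChi", "kareChi", "kheYeChi"):
    print(form, analyze(form, TRIE, ROOTS))
```

The loop does not stop at the first hit, which corresponds to the "continue the search" step: a longer suffix further along the same path may yield another valid analysis.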

Trie: Computational Issues

- Time complexity
  - Searching the trie is linear in the input length
  - Searching the lexicon can also be linear
- Space complexity
  - In general, linear in the number of affixes
  - Can be reduced further by constructing a DAWG

Trie vs DAWG

Trie

- More space
- Linear search
- Easy to construct
- Easy to insert & delete
- Final states have unique attributes

DAWG

- Less space
- Linear search
- Exponential construction
- Difficult to delete and insert
- A final state can have ambiguous attributes

Morphological Analyzer: Implementation Details

- Size of trie: 300 states
- Size of root lexicon: 600 verb roots
- Paradigm information: not required
- Nouns, verbs and adjectives are analyzed separately
  - Tries can be merged, but with no significant gain
  - Root lexicons are also distinct
- Rule compilation

Summarizing

- Decide whether to go for MA/MS
- Identify the productive morphological processes and corresponding irregularities
- Identify the paradigms and morphological attributes
- Specify the morphotactics and the affix list
- Gather a machine-readable root lexicon
- Choose an appropriate computational technique
- Design, implement and test
- A good interface for rule-editing is desirable