amamorph: finite state morphological analyzer for amazighe2ceisic, royal institute of amazigh...

20
CIT. Journal of Computing and Information Technology, Vol. 24, No. 1, March 2016, 91–110 doi: 10.20532 /cit.2016.1002478 91 AmAMorph: Finite State Morphological Analyzer for Amazighe Fatima Zahra Nejme 1 , Siham Boulaknadel 2 and Driss Aboutajdine 1 1 LRIT, associated unit to the CNRST-URAC n 29, Mohammed V University, Faculty of Science, Rabat, Morocco 2 CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana- lyzer for Amazighe language using a system based on the NooJ linguistic development environment. The paper begins with the development of Amazighe lexicons with large coverage formalization. The built electronic lexicons, named ‘NAmLex’, ‘VAmLex’ and ‘PAmLex’ which stand for ‘Noun Amazighe Lexi- con’, ‘Verb Amazighe Lexicon’ and ‘Particles Amazighe Lexicon’, link inflectional, morphological, and syntactic- semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over inflected forms. To our knowledge, AmAMorph is the first morphological analyzer for Amazighe. It identifies the component mor- phemes of the forms using large coverage morphological grammars. Along with the description of how the analyzer is implemented, this paper gives an evaluation of the analyzer. ACM CCS (2012) Classication: Computing method- ologies Artificial intelligence Natural language processing Phonology / morphology Keywords: Amazighe language, Natural Language Pro- cessing, NooJ, Lexical Analysis, Inflectional Morphol- ogy, Derivational Morphology, Unknown words, Recog- nition Systems 1. Introduction The Moroccan Amazighe language is consid- ered as a prominent constituent of the Moroc- can culture due to its richness and originality. However, it has been long discarded and oth- erwise neglected as a source of cultural enrich- ment. Nevertheless, due to the creation of the Royal Institute of Amazighe Culture (IRCAM) (see Appendix J), this language has been intro- duced in the public domain including adminis- tration, media and the educational system. It has enjoyed its proper coding in the Unicode Standard [1][2], an official spelling [3], appro- priate standards for keyboard realization and linguistic structures that are being developed with a phased approach [4][5]. This process was initiated by the standardization, vocabular- ies construction [6][7][8][9], spelling standard- ization [3] and development of grammar rules [4]. However, all these stages are not sufficient for less-resourced languages as Amazighe to join the group of well-resourced ones in terms of language technologies. In this context, many scientific studies have been undertaken to im- prove the current situation. These studies can be divided into two categories: (1) Computational resources: this can be sub- divided into three subdomains: Tifinagh promotion works [10][11], Optical character recognition [12][13][14], Amazighe corpora [15][16]. (2) NLP tools: overall, NLP tools for Amazighe have been limited and carried out on light stem- mer [17], tagging assistance tool [18], Search engine [19], Concordancer [20] and verb Con- jugator [21]. Nevertheless, for the morphologi- cal analysis, which presents a core part of NLP applications, there are only a few publications that explicitly discuss the attempts at develop- ing full-fledged morphological analyzers for the Amazighe language [22][23][24][25][26][27]. To this end, this paper presents the continua- tion of our previous efforts, which are restricted

Upload: others

Post on 10-Mar-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

CIT. Journal of Computing and Information Technology, Vol. 24, No. 1, March 2016, 91–110doi: 10.20532 /cit.2016.1002478

91

AmAMorph: Finite StateMorphological Analyzer for Amazighe

Fatima Zahra Nejme1, Siham Boulaknadel2 and Driss Aboutajdine1

1LRIT, associated unit to the CNRST-URAC n◦29, Mohammed V University, Faculty of Science, Rabat, Morocco2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco

This paper presents AmAMorph, a morphological ana-lyzer for Amazighe language using a system based onthe NooJ linguistic development environment.

The paper begins with the development of Amazighelexicons with large coverage formalization. The builtelectronic lexicons, named ‘NAmLex’, ‘VAmLex’ and‘PAmLex’ which stand for ‘Noun Amazighe Lexi-con’, ‘Verb Amazighe Lexicon’ and ‘Particles AmazigheLexicon’, link inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automatedinflectional and derivational routines are applied to eachlemma producing over inflected forms.

To our knowledge, AmAMorph is the first morphologicalanalyzer for Amazighe. It identifies the component mor-phemes of the forms using large coverage morphologicalgrammars. Along with the description of how theanalyzer is implemented, this paper gives an evaluationof the analyzer.

ACM CCS (2012) Classification: Computing method-ologies → Artificial intelligence → Natural languageprocessing → Phonology / morphology

Keywords: Amazighe language, Natural Language Pro-cessing, NooJ, Lexical Analysis, Inflectional Morphol-ogy, Derivational Morphology, Unknown words, Recog-nition Systems

1. Introduction

The Moroccan Amazighe language is consid-ered as a prominent constituent of the Moroc-can culture due to its richness and originality.However, it has been long discarded and oth-erwise neglected as a source of cultural enrich-ment. Nevertheless, due to the creation of theRoyal Institute of Amazighe Culture (IRCAM)(see Appendix J), this language has been intro-duced in the public domain including adminis-tration, media and the educational system. It

has enjoyed its proper coding in the UnicodeStandard [1][2], an official spelling [3], appro-priate standards for keyboard realization andlinguistic structures that are being developedwith a phased approach [4][5]. This processwas initiated by the standardization, vocabular-ies construction [6][7][8][9], spelling standard-ization [3] and development of grammar rules[4].

However, all these stages are not sufficient forless-resourced languages as Amazighe to jointhe group of well-resourced ones in terms oflanguage technologies. In this context, manyscientific studies have been undertaken to im-prove the current situation. These studies canbe divided into two categories:

(1) Computational resources: this can be sub-divided into three subdomains:

• Tifinagh promotion works [10][11],

• Optical character recognition [12][13][14],

• Amazighe corpora [15][16].

(2)NLP tools: overall, NLP tools for Amazighehave been limited and carried out on light stem-mer [17], tagging assistance tool [18], Searchengine [19], Concordancer [20] and verb Con-jugator [21]. Nevertheless, for the morphologi-cal analysis, which presents a core part of NLPapplications, there are only a few publicationsthat explicitly discuss the attempts at develop-ing full-fledgedmorphological analyzers for theAmazighe language [22][23][24][25][26][27].

To this end, this paper presents the continua-tion of our previous efforts, which are restricted

Page 2: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

92 F. Z. Nejme et al.

to the noun processing and describes a step to-ward the development of full-fledged morpho-logical analyzer tool, using finite state technol-ogy within the linguistic development environ-mentNooJ, whichwill be used as input to higherlevels of linguistic analysis such as syntacticparsing.

The remainder of this paper is divided into fivemain sections: the first presents a brief overviewof the Moroccan Amazighe language particu-larities. The second section exposes and dis-cusses some of Amazighe NLP challenges. Thethird presents our morphological analyzer sys-tem. The fourth section gives an overview ofevaluation results, and the last section tries todraw some conclusions and suggests some fu-ture directions for our approach.

2. Moroccan Amazighe Language

2.1. Historical Background

The Amazighe language, also called Berberor Tamazight ( [tamazi t]), be-longs to the Hamito-Semitic languages [28][29].It is currently spoken in a dozen countries rang-ing from Morocco, with 50% of the overall pop-ulation [30], to Egypt, passing through Algeriawith 25%, Tunisia, Mauritania, Libya, Nigerand the Mali [31].

In Morocco, we distinguish between three ma-jor Amazighe dialects: Tarifit in the North ofMorocco, Tamazight in the center and South-East, and Tashelhit in the South-West and theHigh Atlas.

Today, the situation of the Amazighe languageis at a pivotal point. It holds official status. Itsmorphology as lexical standardization processis still underway. At present, it represents themodel taught in schools and used by the mediaand an official papers published in Morocco.

2.2. Moroccan Standard AmazigheLanguage Particularities

In order to deal with the linguistic problemof Amazighe language due to the fact of us-ing the different dialects (Tamazight, Tachelhitand Tarifit), IRCAM has engaged to achieve astandardization process for Amazighe language

[32], which is not based on any of the dialectsand aims to unify the three regional dialects intoa standard national Amazighe language in orderto introduce it in the school system, as well as inthe media. This standardization process affectsdifferent levels:

− Phonology: adopt a standard script;

− Vocabulary: adopt a common lexicon;

− Morphology, syntax and spelling: apply thesame spelling rules, the same morphologicaland syntactical instructions, and the sameneologisms forms.

Thereafter, we give a brief overview of the stan-dard Amazighe characteristics which include:the graphical writing system (see Appendix F),Tifinaghe encoding and morphology properties.

2.2.1. Writing System

Like any language that passes through oral towritten mode, the Amazighe language has beenin need of a graphic system.

In Morocco, the choice ultimately fell on Ti-finaghe for technical, historical and symbolicreasons. Since the Royal declaration on Febru-ary, 11th, 2003, Tifinaghe has become the of-ficial graphic system for writing Amazighe.Thus, IRCAM has developed an alphabet sys-tem called Tifinaghe-IRCAM. This alphabet isbased on a graphic system towards phonologi-cal tendency. This system does not retain all thephonetic realizations produced, but only thosethat are functional [5]. It is written from left toright and contains 33 graphemes which corre-spond to:

• 27 consonants including: the labials ( ),dentals ( ), the alveolars( ), the palatals ( ), the velar( ), the labiovelars ( ), the uvulars( ), the pharyngeals ( ) and the la-ryngeal ( );

• 2 semi-consonants: and ;

• 4 vowels: three full vowels and neu-tral vowel (or schwa) which has a ratherspecial status in Amazighe phonology.

Page 3: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 93

2.2.2. Tifinaghe Encoding

Since the adoption of Tifinaghe as an officialscript in Morocco for the Amazighe language,the Tifinaghe encoding has become a necessity.To this end, considerable efforts have been in-vested by IRCAM.

Therefore, the Tifinaghe encoding is composedof four Tifinaghe character subsets: the basicset of IRCAM used to arrange the orthogra-phy of different Moroccan Amazighe dialects,the extended IRCAM set used for historical andscientific use, The Neo-Tifinaghe letters, andmodern Touareg letters. The first two subsetsconstitute the sets of characters chosen by IR-CAM [1].

2.2.3. Amazighe Morphology in Brief

Amazighe is a morphologically rich language.It is highly inflected and also shows derivationto a high degree. The main morpho-syntacticcategories in Amazighe language are: nouns,verbs pronouns and function words which in-clude adverbs, prepositions, etc. [4][5]. Thefollowing section aims to provide a brief sys-tematic description of Amazighe morphology.

1) Noun Characteristics

In Amazighe language, noun is a lexical unit,composed of a root and a pattern (see AppendixH) [4]. It can appear in several forms: sim-ple form ( [argaz] “the man”), compoundform ( [bu tmzrayt] “chronicler”),or derived one ( [amsawad. ] “the com-munication”).

1a) Simple Noun

The Amazighe simple noun consists of a sin-gle word occurring between two blank spaces.There are two major subclasses of Amazighesimple nouns: proper nouns and common ones.This latter constitute the most frequent type.

• Proper noun: the name of a person[hnnu], place [azru] or date. Such nounsare not inflected.

• Common noun: can be either abstract orconcrete. The latter can be either animate( [tamt.t.ud. t.] “woman”) or inani-mate ( [tigmmi] “house”).

• Adjective: the adjective is a syntactic under-class of the noun; it shares with the nom-inal all these combinational and functionalcharacteristics: the gender, the number andthe state markers, and the complete sentencepredicate. As a specific function, they deter-mine the nominal with which they agree ingender and number: [as.bh. an] “beau-tiful”.

1b) Derived Noun

In Amazighe language, we can distinguish be-tween two types of derivation processes: (1)nominal derivation based on noun and (2) nom-inal derivation based on verb.

• Nominal derivation based on noun: in thiscase, the derived nouns are obtained by theprefixation of the morphemes [ams],[ans], or [am] to the nominal basis:[awal] “word”→ [amawal] “lexicon”.

• Nominal derivation based on verb: froma verbal root, deverbal nouns are built bythe prefixation or suffixation of a derivationmorpheme associated with some changes.Thus, four types of nouns are made: the ac-tion noun, agent noun, instrument noun andadjective.

Action Noun. The action noun is formedwith prefixation associated with some changes.The main processes of this derivation are: (1)for simple verbs we have: (i) prefixation ofthe vowel [a], [u] or [i] ( [h. rrz]“keep” → [ah. rrz] “the fact to supervisehis wife (husband) or supervising his wife bysomeone”) or (ii) prefixation and affixation ofone of the feminine morpheme: ,

, orassociated with an initial or final vowel

variation and a consonant gemination/degimina-tion for some nouns ( [gnu] “sew” →[tigni] “couture”) and (2) for borrowed ones,we have: (i) prefixation of [l]: [h.mu] “behot”→ [lh.mu] “the heat” or (ii) affixationof vowel [a] sometimes associated with somechanges ( [dhn] “paint” → [adhan]).

Agent Noun. This kind of noun is derived froman action verb by the prefixation of one of thefollowing morphemes for simple verbs: [a],[am]/ [an], [im] or [i] associated with aninfixation of the [a] ( [azl] “send” →[amazal] “messenger”). For borrowed ones, thederivation process consists of an affixation of

Page 4: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

94 F. Z. Nejme et al.

the [a] sometimes associated with a gemina-tion of radical consonant and a vowel infixation.

Instrument Noun. This type of noun is ex-tremely scarce, since it is formed from a sim-ple verb by the prefixation of the morphemes

[a]/ [as] associated with vowel or conso-nant changes: [rgl] “close”→ [asrgl]“lid”.

Adjective. The adjective is generally derivedfromquality verbs/stative verbs. The derivationprocesses of this type of nouns are: (1) prefix-ation of [a] associated with a vowel alterna-tion, (2) prefixation of [am]/ [an] followedsometimes by intra or post-radical changes, (3)prefixation of [i] with intraradical change or(4) prefixation of [u] sometimes associatedwith infixation of the vowel [i]: [qmr]“be narrow” → [uqmir] “close”.

1c) Compound Noun

Although it is less productive than derivation,the field of composition is not absent inAmazig-he language. A compound noun is a noun that ismade of two or more words. The most frequentcomposition models in Amazighe are presentedin the following table (cf. Table 1) [33].

Whether simple, compound or derived, the nounvaries in gender (masculine or feminine), num-ber (singular or plural), and state (free or con-structed) [4].

Gender. The Amazighe noun is characterizedby one grammatical gender: masculine or fem-inine.

• The masculine noun: begins with one of theinitial vowels: [a], [i] or [u]. However,there are some exceptions as: [imma]“(my) mother”.

• The feminine noun: is marked with the cir-cumfix . However, there are someexceptions such as nouns which have onlythe initial or the final [t] of feminine mor-pheme: [tadla] “the sheaf”,[r.r.muyt] “the tiredness”.

Number. The noun, masculine or feminine, hassingular and plural. The latter has four forms.

• The external plural: it is formed by an al-ternation of the first / [a/i] associatedwith a suffixation of [n] or one of its vari-ants ( [in], [an], [ayn], [wn],[awn], [wan], [win], [tn], [yin]):

[axxam] “house” → [ixxamn]“houses”.

• The broken plural: involves a change of theinternal vowels: [aba us] “monkey”→ [ibu as] “monkeys”.

• The mixed plural: is formed by vowels’change associated sometimes with the suf-fixation of [n]: [izikr] “the rope” →

[izakarn] “the ropes”.

• The plural in [id]: this kind of pluralis obtained by the noun prefixation with[id]. It is applied to a set of nouns includ-

Table 1. Description of most frequent composition models in Amazighe language.

Composition Models Examples

Noun + Noun (juxtaposition) [baba rbbi] “god”

Noun + [n] + Noun (genitive) [agru n lbur] “toad”

Noun + Adjective (juxtaposition) [aman d. rnin] “dew”

Verb + Noun (juxtaposition) [slm aggwrn] “butterfly”

Verb + Verb (juxtaposition) [bbi zdi] “bagatelle”

Verb + Adverb (juxtaposition) [jud mlih. ] “sycophancy”

Preposition + Noun (juxtaposition) [ddu wakal]“the grave”

Moneme ( [ayt], [bu], [bn], [u],[gar], [war], [bla], [tar],

[m], [ult], [bnt], [ist]) +Noun (affixal)

[bu th. anut] “shopkeeper”

Page 5: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 95

ing: nouns with an initial consonant, propernouns, kinship nouns, compound nouns, nu-merals, as well as borrowed ones:[bu tmzrayt] “chronicler”→[id bu tmzrayt] “chroniclers”.

State. We distinguish between two states: thefree state and the constructed one.

• The free state is unmarked. The noun is infree state if it is a single word isolated fromany syntactic context, a direct object, or acomplement of the predictive particle [d].

• The constructed state is marked in the fol-lowing contexts: (1) when the noun followsthe verb in a sentence and it is the subjectof the verb, (2) when the noun follows apreposition, and (3) after some numerals. Itinvolves a variation of the initial vowel. Incase of masculine nouns, it takes one of thefollowing forms: initial vowel alternation[a]/ [u]; adding of the consonant [w] or

[y]. For the feminine nouns, it consists ofdropping or maintaining the initial vowel.

2) Verb

The verb occurs in two forms: simple and de-rived one. The simple verb is formed throughthe amalgamation of a root and a pattern, calledinterdigitation ( [sgl] “backfill”, resultingfrom the intersection of the form – [gl] “col-lapse” – and the pattern [sC]); while the de-rived one is obtained by the combination of sim-ple verb with one of the following derivationalmorphemes: [s/ss] indicating the factitiveform, [tt] marking the passive form, and[m/mm] designating the reciprocal one. Theprefixation of these morphemes can lead to thedeletion of the initial vowel/consonant accom-

panied with some change of the simple form ofthe verb.

The verb, whether simple or derived, inflectsin three moods (indicative, imperative and par-ticipial), where in each mood the same personalmarkers are used (cf. Table 2). It has fouraspects (aorist, perfective, negative perfectiveand imperfective) that are marked with vocalicalternations, prefixation or consonant gemina-tion/degimination. The indicative and the par-ticipial moods occur on the four aspects, whilethe imperative mood has two forms, simple andintensive, that are based on the aorist and theimperfective aspects [4] respectively.

According to a study done by IRCAM [34],verbs are arranged into thirty one conjugationclasses, according to the aorist/perfective andthe aorist/imperfective conjugation oppositions.

The last class includes irregular verbs. It shouldbe noted that, based on these classification crite-ria, the Amazighe verb and its derived forms donot necessarily belong to the same class, sincethey may not use the same morphotactic rulesto be conjugated. Furthermore, to deal withregional varieties, a verb can belong to manyclasses.

3) Pronouns

A pronoun refers to any element that could re-place a noun or nominal group. It may rep-resent a nominal group already employed, ordesignate a person participating in the commu-nication. The paradigm of pronouns includes:personal pronouns ( [nkk] “me”); the pos-sessives ( [winu] “mine”); the demonstra-tives ( [wad] “that”); the interrogatives ([ma] “which”) and the indefinite ones ( [kra]“something”).

Table 2. Personal markers for the indicative, imperative and participial moods.

Indicative mood Imperative mood(simple and intensive) Participial mood

Masc. Fem. Masc. Fem. Masc./ Fem.

Singular 1st pers.

2nd pers. 2nd pers. 2nd pers.

3rd pers.

Plural 1st pers.

2nd pers. 2nd pers. 2nd pers.

3rd pers.

Page 6: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

96 F. Z. Nejme et al.

Generally, some of these pronouns are inflected,such as the possessive and demonstrative ones:

[wad] “this one” → [wid] “these ones”.

4) Function words

Function words are a set of Amazighe wordsthat is assignable neither to a noun nor to averb. It consists of several elements, namely:

4a) Prepositions

A preposition is a closed paradigm that com-bines simple and complex forms that expressvarious semantic values. For the first form, itconsists of several formats: [n], [i], [s],[g], [di], [zg], [xh], [gr], [al]/[ar], [bla], [ r], [dar], [agd],[d]. For the second one, it is composed of twoor three prepositions which may be used adver-bially: [izdar n] “below”.

4b) Adverbs

Adverbs are the elements that change the mean-ing of a verb. Generally, they are classified ac-cording to their semantics. Thus, we distinguishadverbs of place ( [da] “here”), time ([azkka] “tomorrow”), quality ( [drus] “lit-tle”) and manner ( [mlih. ] “well”).

Besides these elements, there also exist: as-pectual particles ( [rad] “that”), orientationparticles ( [nn] “there”), negative particles ([ur] “not”) and conjunctions ( [minzi] “be-cause”).

3. Amazighe NLP Challenges

Despite its position as second official languagein Morocco, Amazighe language still lacks instudies from the computational point of view.This situation is due to its complex and chal-lenging pre-processing tasks. Thus, we can de-scribe three main difficulties which need to betaken into account.

3.1. Amazighe Script

The Amazighe writing system poses three maindifficulties:

• Dialectal variations: in linguistic terms, theAmazighe language is characterized by the

proliferation of dialects (Tamazight, Tashel-hit and Tarifit) due to historical, geographi-cal and sociolinguistic factors. Each of thesedialects has its own specificity with a set ofsub-dialects. Dialects are still present in aset of resources despite of the standardiza-tion process.

• Scripts: Amazighe is among the languageshaving different scripts, namely: Latin, Ara-bic andTifinaghe. Thus, it requires a translit-erator to convert all the scripts into the stan-dard “Tifinaghe-Unicode” form. However,this process is confronted with spelling vari-ation related to regional varieties [ta-fuct] “Sun” – [tafukt] “Sun”.

• Spelling: the Amazighe language has re-mained essentially an oral language for along time. Therefore, the Amazighe textdoes not respect the standard writing con-vention.

3.2. Amazighe Corpora

Even with the attention paid to the Amazighelanguage in the recent years, there are still somegaps in the corpora which present a very valu-able resource for NLP tasks. The Amazighelacks such resources, and most of those existingare in paper format which needs to be digitized.To this end, a set of studies has been undertakento build Amazighe corpora in a progressive wayuntil reaching a large-scale corpus [16].

3.3. Amazighe Morphology

The third and the last reason, as cited in this pa-per, for Amazighe processing complexity is itsrich and complex morphology. Bellow we de-scribe three main difficulties in Amazighe mor-phology.

Part of speech ambiguity. One of the Amazi-ghe NLP challenges is ambiguity; the same sur-face form might have different annotations de-pending on how it has been used in the sentence.To give some examples of different categories,we present the following examples:

• Some stop words such as [d] might func-tion as a preposition, a coordination con-junction, a predicate particle or an orienta-tion particle. For instance, in the sentences

Page 7: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 97

below, the word “d” might be: a coordina-tion conjunction:

[tamazi t d tiknulujiyin timaynutin]“Amazighe and new technologies” → =and; a preposition: [iman dubrid] “he went with the road” → = with;a predication particle: [d argaz] “heis a man” → = he is; or an orientation par-ticle: [asi d tikinttamjahdit] “bring to here large bowl”;

• [illi] may have many meanings; as averb in negative perfective, it means “do notexist” when used after a negative particle,while as a noun, it refers to a kinship nounmeaning “my daughter”.

Inflectional and derivational prosess. Asmentioned earlier in this paper (cf. 2.2.3), theAmazighe language is highly inflected and alsoshows derivation to a high degree. The firstprocess presents more difficulties especially fornoun inflection. As the example for the plural,there are four different forms based primarily onboth prefix and suffix concatenations, but thereare no specific rules which determine the use ofone or the other form to inflect a noun. Fur-thermore, the base form itself can be modifiedin different paradigms such as the derivationalone, where in case of the presence of geminatedletter in the base form, the latter will be alteredin the derivational form ( [qqim] “make sit”→ [s im] “sit”).

Contractions. The third reason for processingcomplexity is contractions. For example, thekinship term [baba] “father” followed bya pronoun [nns] “his”, becomes the singleword [babas] “his father” [4][3].

4. Amazighe Morphological Analyzer:Development and Evaluation

Morphological analysis is one of important tasksin NLP which presents a core part of severalNLP applications, providing information aboutthe possible part of speech and other morpho-syntactic features of words and tokens as theyappear in the context. It is a basic step to varioushigher levels applications including text mining,information retrieval,machine translation, auto-matic summarization, and the Amazighe learn-ing systems.

Over the last few years, AmNLP has gained in-creasing importance, and several state of the artsystems have been developed for a wide rangeof applications. These applications had to dealwith several complex problems pertinent to thenature and structure of the Amazighe language.The lack of available resources and their lim-itations have motivated many scholars to fol-low the rule based approach and rely on hand-constructed linguistic rules in developing theirtools, systems, and resources. Furthermore, ba-sic tools and applications such as morphologicalanalyzer system also needs to be developed.

As Amazighe is a morphologically rich lan-guage due to its high inflectional and deriva-tional nature, morphological processing playsa key role in developing Amazighe NLP sys-tems and applications. The basic principle ofmorphological analysis is to breakdown an in-flected form into a root and a set of features(lexical category and morpho-syntactic proper-ties). Therefore, the effort spent on creating areliable, efficient, and flexible Amazighe mor-phological analyzer is justified by its reuse inmany of these applications.

In this paper, we will present our approach tobuild a highly flexible Amazighe morphologi-cal analyzer. Our morphological analyzer willidentify the component morphemes of the in-put word (nouns, verbs, pronouns or functionwords) on the texts, and also label them withsufficient information to be useful for the otherNLP tasks.

4.1. Computational Approaches toMorphology: Brief State-of-the-Art

During the last 25 years the Finite-State Tech-nology (FST) has had a great impact on a varietyof NLP applications, as well as in industrial andacademic Language Engineering.

Finite State Morphology (FSM) aims at han-dling morphology within the computational po-wer of finite state automata. This approach isespecially attractive in dealing with human lan-guage morphologies; among these are the abil-ity to handle concatenative and non-concatenati-ve morphotactics, the high speed and efficiencyin handling large automata of lexiconswith theirderivations, and inflections that can run intomil-lions of paths, modularity of the design, due

Page 8: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

98 F. Z. Nejme et al.

to the closure properties of regular languagesand relations; the compact representation that isachieved through minimization; and reversibil-ity, resulting from the declarative nature of suchdevices. The adequacy of this technology forSemitic languages has frequently been chal-lenged.

Although there is now a variety of Finite statetools, the development or adaptation of exist-ing tools to facilitate the creation, annotation,and description of Amazighe corpora remains amajor challenge. For morphological analysis,several tools and frameworks have been used.Well known tools include:

• Two-Level Morphology: it depends heav-ily on finite state methods, which are wellknown and are often described as elegant[35]. Two-level morphology was “the firstgeneral model in the history of computa-tional linguistics for the analysis and genera-tion of morphologically complex languages”[36]. Developed by Koskenniemi [37], thistechnology facilitates the specification ofrules that relate pairs of surface strings thro-ugh systematic rules. Such rules, however,do not specify how one string is to be derivedfrom another; rather, they specify mutualconstraints on those strings. Furthermore,rules do not apply sequentially. Instead, aset of rules, each of which constrains a par-ticular string-pair correspondence, is appliedin parallel, such that all the constraints musthold simultaneously. In practice, one of thestrings in a pair would be a surface realiza-tion, while the other would be an underlyingform. One of the greatest advantages of two-level morphology is that rules are entirelydeclarative: indeed, the original formulationof [37] allows for both analysis and gener-ation within the same grammar. The for-malism was later implemented as part of theXerox tools; two-level rules are compiled tofinite-state transducers, which indeed allowfor both analysis and generation.

• XFST: stands for Xerox Finite-State Tool,one of the most sophisticated tools for con-structing finite state language processing ap-plications [38], developed at the XRCE byKenneth R. Beesley and Lauri Karttunen[35]. It is “based on solid and innova-tive finite-state technology”, designed formulti-purpose use with explicit support for

automata theoretical research. The XFSTtoolkit provides powerful and elegant lin-guistic descriptions and treatment of irreg-ularities via different operators, at a highlevel of abstraction, such as restriction, re-placement, and left-to-right longest matchreplacement.

• SFST: Stuttgart Finite State Transducer is anon-commercial, open-source, freely-availa-ble set of finite state tools that support theanalysis, generation, and processing of finite-state automata and transducers. SFST toolswere primarily developed to provide finitestate technology for building two-level mor-phological analyzers. The tools run underLinux are command-line oriented, and notso easy to use, especially since they couldbe better documented.

Although these finite state technologies presenta number of advantages, they present also sev-eral disadvantages. The most common one be-ing that they provide a single formalism (power-ful grammar) supposed to be used to describe alllinguistic phenomena. Unlike NooJ, which is alinguistic development environment, it provideslinguists with one or more formal tools specif-ically designed to facilitate the description ofeach linguistic phenomenon, as well as parsingtools designed to be as computationally efficientas possible. Furthermore, it allows linguists tocombine in one unified framework Finite-Statedescriptions such as in XFST (see [39] and [40]for more details). This fact makes NooJ an idealtool to parse complex phenomena that involvethose across all levels of linguistic phenomena,and allows NooJ’s parsers to be extremely effi-cient compared with other NLP parsers.

4.2. NooJ: Linguistic DevelopmentalFramework

NooJ, released in 2002 by Max Silberztein [41][42][43], is a freeware language-engineering de-velopment environment,which runs on differentoperating systems such as Windows, Linux, So-laris and Mac OSX, and provides a set of toolsandmethodologies for formalizing and develop-ing a set of NLP applications. It presents a pack-age of finite state tools that integrates a broadspectrum of computational technology from fi-nite state automata to augmented/recursive tran-sition networks. This package allows building

Page 9: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 99

and managing a large coverage of electronicdictionaries and formal grammars in order toformalize the different linguistic phenomenasuch as: spelling, morphology (inflectional andderivational), vocabulary (simple words, com-pound words and frozen expressions), syntax(local, structural and transformational), disam-biguation, semantics and ontology, and whichcan be applied to treat texts and large corpora.For each of these levels, NooJ provides linguistswith one or more formal frameworks specifi-cally designed to facilitate the description ofeach phenomenon, as well as parsing, devel-opment and debugging tools designed to beas computationally efficient as possible, fromFinite-State to Turing machines. This approachdistinguishes NooJ from other computationallinguistic frameworks providing a unique for-malism that is supposed to cover all linguisticphenomena. As of today, NooJ can process 23languages [44] as Croatian language [45], Ar-menien [46], including Arabic [47] as a Semiticlanguage like Amazighe.

Given the situation of Amazighe as less re-sourced language, we recognized the neces-sity of developing an Amazighe component forNooJ platform, according to its robustness andsimplicity, with the aim to enable its automaticprocessing and its integration in the field ofInformation and Communication Technology.One of the important and useful features ofNooJ, regarding Amazighe as morphologicallyrich language, is its simple description of mor-phological phenomena and efficient morpho-logical processing. This component would al-low us to process and take advantage of thisreadily available data. The use of this technol-ogy was extremely attractive and allows gener-ating and analyzing several thousands of wordsper second.

The NooJ lexical module that will be usedthroughout this paper relies on operators per-forming transformations inside strings, andmor-phological graphs describing grammatical rulesfor morphological analysis. Generally, trans-formations inside strings are based on the useof some generic predefined commands such as:

• 〈LW〉 : position the cursor (I) at the begin-ning of the form,

• 〈RW〉 : position the cursor (I) at the end ofthe form,

• 〈R〉 : keyboard Right arrow,

• 〈L〉 : keyboard Left arrow,

• 〈B〉 : delete the last character,

• 〈S〉 : delete the current character,

• Etc. (see Appendix B).

Our main agenda is to build the morphologi-cal analyzer as the combination of several finitestate transducers using the NooJ framework. Inorder to do this, and given that the linguisticresources required by the morphological ana-lyzer include a lexicon and inflection rules forall paradigms, we started by building a morpho-logical Amazighe lexicon.

4.3. Building Amazighe Lexicons:NAmLex, VAmLex and PAmLex

The most basic and yet most needed step inmorphological analysis of any language is thedevelopment of morphological lexicon. To thisend we have created, in the first step and us-ing the NooJ robust dictionary module, threeAmazighe lexicons named:

• NAmLex, which stands for Noun AmazigheLexicon and contains a list of simple, com-pound and derived nouns which are repre-sented as a singular form,

• VAmLex, which stands for Verb AmazigheLexicon and contains a list of simple, andderived verbs which are represented as a sec-ond person, singular, masculine, imperativemode, and

• PAmLex, which stands for Particles Ama-zighe Lexicon and contains a list of pronounsand function words.

In order to do this, manual collection of theterms is carried out following the three stepsdescribed below:

1) We developed a list of nouns (simple, de-rived and compound ones) from a set of re-sources such as Taifi dictionary [48], Amazi-ghe vocabulary [6], Amazighe media vocab-ulary [7], Amazighe grammatical vocabu-lary [8], new grammar of Amazighe [4] andschool lexicon [49].

2) We collected a list of verbs from AmazigheManual Conjugation [34].

Page 10: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

100 F. Z. Nejme et al.

3) We built a lexicon of pronouns and functionwords from Amazighe grammar [4].

Each lexical entry presents the following de-tails: the lemmas, lexical category, semanticfeature and its translation in French and Arabiclanguages.

Eventually, to complete the morphological con-cept of our lexicon, in the second step, we for-malized the inflectional and derivational mor-phology in order to generate from each entry itsinflectional and derivational information.

4.4. Amazighe Lexicon Formalization:Inflectional and DerivationalMorphology

Both inflection and derivation are standard pro-cesses for the Amazighe and they generate largeamounts of word forms. Thus, the purposeof this section is the Amazighe categories (i.e.noun, verb, pronouns and function words) for-malization. This study presents the implemen-tation of inflectional and derivational rules thatwill allow the generation of inflected formsfrom each entry.

4.4.1. Inflectional Morphology

To formalize the inflectional rules, we have re-lied on the works of [4][50][34][33][51], as wellas on heuristic study of our lexicons. There-fore, through graphs integrated in the linguisticdevelopment platform NooJ, we have createda set of hand-encoded graphs: 224 inflectionalparadigms for nouns (descriptions of noun in-flectional paradigms are explained in detail in[22]), 484 inflectional paradigms for verbs and8 of them for pronouns [23].

By these inflectional descriptionswe refer to theset of all possible transformations which allowus to obtain, from a lexical entry, all inflectedforms. On average, there are 8 inflected formsper noun entry, 116 inflected forms per verbentry, and 2 to 8 ones per inflected pronouns.

These rules are invoked by the property “+ FLX=” that each inflective word has within its lex-icon description.

4.4.2. Derivational Morphology

The morphological derivation of new words isconsidered to be a higher degree of Amazighemorphological process, in the level below the in-flection. Therefore, we have implemented a hi-erarchical system of morphological paradigmsas a tool able to capture different levels of theAmazighe derivational morphology. Our hier-archical system includes three derivation levels:(1) nominal derivation based on noun, (2) nom-inal derivation based on verb, and (3) verbalderivations.

1) Nominal derivation based on noun

According to the rules described before (cf.2.2.3.1.b) and on heuristic study of our noun lex-icon, we have formalized noun derivation withabout 21 rules, generally based on prefixationand suffixation of the nouns.

2) Nominal derivation based on verb

Deverbal nouns are formed by prefixation orsuffixation of a derivation morpheme accompa-nied by intraradical changes. Thus, four typesof nouns are made: action noun, agent noun,instrument noun and noun of quality. Eachof these branches (nominal pattern) is formedbased on verbal patterns [24].

2a) Action nouns

Based on the work of [52] we have formalizedthe derivation of action nouns based on (1) thepattern and on (2) the verbal root(monoliteral(see Appendix A), biliteral, etc. with or withoutinitial tension). Wehave raised, in total, 66 rulesof which 5 rules are for the monoliteral, 22 forbiliteral, 23 for triliteral, and 9 for quadriliteralroots. In the following (cf. Table 3) we showan example of derivation from monoliteral root(see Appendix C).

Table 3. Example of the action nouns derivation.

Verbal Patterns Exampleclass Verbal

patternNominalpattern

Monoliteral Cu i Cu

[ttu]“forget” →

[ittu]“oversight”.

Page 11: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 101

Amazighe, as well as other languages, has bor-rowed words. These words are strongly bor-rowed from Arabic dialect (Moroccan Arabic:Darija). To this end, we have formalized thederivation of action nouns based on this kindof verbs. Given that the triliteral roots are themost productive for borrowed verbs, we haveformalized, relied on the works of [52], 7 rulesfor this morphological type. We show in the fol-lowing (cf. Table 4) an example of derivationfrom triliteral root.

Table 4. Example of the action nouns derivation basedon borrowed verbs.

Verbal Patterns Exampleclass Verbal

patternNominalpattern

Triliteral CaCC aCaCC

[h. asb]“count” →

[ah. asb]“counting”.

2b) Agent nouns

Similarly to action nouns and based on the workof [53], we formalized the derivation of agentnouns based on (1) the pattern and on (2) theverbal roots: biliteral and triliteral. We haveraised, in total, 9 rules of which 4 rules are forthe biliteral and 5 for triliteral roots. In the fol-lowing (cf. Table 5) we show an example ofderivation from biliteral root.

Table 5. Example of the agent nouns derivation.

Verbal Patterns Exampleclass Verbal

patternNominalpattern

Biliteral CC amCaC

[ny]“mount” →

[amnay]“rider”.

2c) Instrument nouns

For this nominal type, there are no regular pat-terns. Based on a list of verbs, we have formal-ized a set of cases for biliteral and triliteral roots.Thus, we have raised, in total, 8 rules of which3 rules are for the biliteral and 5 for triliteralroots. In the following (cf. Table 6) we showan example of derivation from triliteral root.

2d) Nouns of quality

Table 6. Example of the instrument nouns derivation.

Verbal Patterns Exampleclass Verbal

patternNominalpattern

Triliteral CCC asCCC

[rgl]“close” →

[asrgl]“lid”.

To formalize the nouns of quality, and basedon the work of [53], we have raised 14 rules ofwhich 4 rules are for the biliteral, 9 for trilit-eral and 1 rule for quadriliteral roots. In thefollowing (cf. Table 7) we show an example ofderivation from biliteral root.

Table 7. Example of the nouns of quality derivation.

Verbal Patterns Exampleclass Verbal

patternNominalpattern

Biliteral CC amCCu

[rz. ]“break” →

[amrz.u]“breaker”.

Then, like the simple form, each derived nounis associatedwith the corresponding inflectionalparadigm (cf. Figure 1) in order to recognize allthe corresponding inflected terms. The lattermay be the same as the basic noun and may alsobe different.

Figure 1. Processing of derived noun.

3) Verbal derivations

According to the rules described in (cf. 2.2.3.2),we have formalized verb derivation within 50derivational paradigms for the totality of theverbs ( [kks] “remove” → [myuk-kas] “remove mutually”).

Page 12: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

102 F. Z. Nejme et al.

All the descriptions inside the derivational gram-mars are given in the form of graphs. Theserules are invoked by the property “+ DRV =”that each derivative word has within its lexicondescription.

To give an overview of all these morphologicaltransformations (inflectional and derivationalones), we take as an example the verb [mun]“to accompany”.

ⵎⵓⵏ,V+Simple+C2+FLX=T2_16+DRV=Forme1_s: T8_10+DRV=NAC_Pf1:PES:Ta_cT1 +FR=accompagner+AR= شاور+استشار +صاحب

This example illustrates two processes:

The inflectional process. Inflectional gram-mar is looking for the paradigm named T2 16(+ FLX = T2 16 (see Appendix G) in orderto generate all forms. Among the 116 inflec-tional transformations which are described inthe flexional paradigm "T2 16", here is one:

ⵖ<LW>ⵜⵜ/ Inacc+1+m+s

This NooJ paradigm, written in NooJ graphiceditors, illustrates the imperfective form. Thefirst part of this paradigm describes a changeon the word (e.g. 〈LW〉 / – position the cursor(I) at the beginning of the form), while the sec-ond part describes features that the newly madeword is given (e.g. /Inacc + 1 + m + s – wordis added description that it is in imperfectiveform (Inacc), first person (1), masculine (m)and singular (s)). The meaning of the transfor-mation is to insert the consonant [ ] at the endof the form and [tt] into the head of it. Theseoperations, applied in succession, generate theform: [tmun ] “I accompanied”.

The derivation process. Derivational grammaris looking for the paradigm Forme1 s (+ DRV= Forme1 s). In this paradigm we have thefollowing derivational transformations:

Like the inflectional transformation, the deriva-tional one is written in NooJ graphic editors

and illustrates the factitive derivation. The firstpart of this transformation describes a changeon the word, while the second part describesfeatures that the newly made word is given (e.g./V + Derive + Factitive + Pref – word isadded description that it is in derived form (+Derive), in factitive type (+ Factitive) formedby the prefixation of the morpheme [s] (+Pref )). The meaning of this transformationis to insert the consonant [s] at the begin-ning of the form. This operation generates theform: [smun] “gather”. Given that thisderived form has a different inflection in theimperfective mood (imperfective = aorist), wehave associated the correspondent paradigm in(:T8 10).

The deverbal noun derivation and its flex-ion. Moreover, this verb presents also nom-inal derivation. The derivational grammar islooking for the paradigm NAC Pf1 (+ DRV= NAC Pf1: PES:Ta cT1). The + DRV fea-ture first invokes the derivation pattern namedNAC Pf1 after which the newly derived word isinflected as the paradigm named PES:Ta cT1.For the first paradigm “NAC Pf1” we have thefollowing derivational transformations:

NAC_Pf1= ⵜ<LW>ⵜⴰ/N+Dérivé+Nom_Action

This NooJ transformation consists in the inser-tion of the consonant [t] at the end of the form,and [ta] at the beginning. After the deriva-tion, the newly derived noun ( [tamunt]“union”) is marked as derived (+ Derive) noun(N)with the type: action noun (+Nom Action).The newword ( ) is then inflected using theinflectional pattern name that follows the nameof the derivational one (:PES:Ta cT1). Amongthe 8 inflectional transformations described inthis inflectional paradigm, we can cite the fol-lowing model which allows to obtain the pluralform in constructed state:

<B>ⵉⵏ<LW><R2><B> /EA+f+p

The meaning of this transformation is to delete:(1) the last [t] and (2) the vowel after the firstone, and then insert the variant [in] at theend of the form. These operations, applied insuccession, generate the form: [tmunin].

Page 13: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 103

Since Amazighe language is highly inflectional,the number of word forms in the compiled lexi-cons is much larger than the number of lemmasthat are in the main lexicons. We present belowan overview of the final lexicons:

Table 8. Lexical entries.

Grammatical Formscategories Simple

formsInflectional and

derivational forms

Nouns 15 325 105258

Verbs 4109 250214

Pronouns +Function words 1642 1878

Total 21076 357350

4.5. Evaluation of Amazighe Lexiconsand Lexical Rules

The test of the lexical coverage of our Amazighelexicons is evaluated on lexical analysis of ourmanually-constructed corpus. This corpus is acollection of school texts and contains 96 031tokens. Automatic evaluation of our Amazighetransducer is a challenge due to the lack of stan-dard annotated corpora.

The main objective from this analysis is to eval-uate how many words in our texts can be recog-nized and annotated after applying our compiledlexicons. It consists of testing membership ofeach word of the text to our lexicons and thenlabeling them with the corresponding informa-tion. To better illustrate our analysis, we showin the following figure (cf. Figure 2) an exampleanalysis of the noun [timunin] “unions,associations”:

Figure 2. Schematic illustration of derived nounanalysis (see Appendix D and I).

• is the plural form of the action noun[tamunt] derived from the verb

[mun] “accompany” (cf. Figure 3) (For fur-ther guidance see the example cited in theprevious section of the verb [mun]),

• is the plural form of the femininenoun [tamunt] “union, association”,

• is the plural form of the femininenoun [tamunt] derived from the mas-culine one [amun] “community” (cf.Figure 3).

Figure 3. A template describing a feminine nounderived from a masculine one and from a verb.

Lexical analysis of our corpora shows the resultspresented below:

Table 9. Result of our lexical analysis.

ResultsNumber of

recognized formsNumber of

unrecognized forms

Lexicons Number % Number %

88 897 92.57% 7134 7.42%

Lexical analysis of these corpora shows cover-age of about 93% by our lexical and morpho-logical resources.

However, despite the efforts on improving acomplete lexicon, unknownwords typically rep-resent 7.42% of the words in our corpus that arenot contained in our lexicons. Therefore, wepropose a new perspective to handle the out-of-vocabularywords that are considered important,using rules and patterns. This approach presentsan alternative to investing manual effort in im-proving the lexicon.

Page 14: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

104 F. Z. Nejme et al.

4.6. The Unknown Words Processing:Recognition Systems (RS)

Unknownwords are one of the key factorswhichdrastically impact the analysis quality. In orderto increase the robustness of our analyzer, wehave undertaken to propose a new method forhandling such unavoidable lack of information.To this end, there have been two possibilities:the first way is to attempt to construct a com-plete lexicon, and the second way is to attemptto analyze the word by discovering its part ofspeech and feature information, and storing thatinformation in the lexicon.

Given that the second approach is less expen-sive, in this paper we describe a morphologicalanalysis method based on morphological andsyntactic grammars. This method uses a modelthat cannot only consult the lexicons, but alsoestimates how likely it is that a word can bea noun or verb according to the information athand.

The following sections will cover the various

concepts used by our recognition systems (RS)to help in the processing of unknown words.First, the architecture system will be discussed.Second, the morphological RS are detailed.Finally, an evaluation of our approach is de-scribed.

4.6.1. System Architecture

Our morphological analysis involves 4 steps(see Figure 4).

For convenience, we split the field of our mor-phology into three different areas: morpho-logical generation, morphological reconstruc-tion and morphological recognition. Morpho-logical generation involves creating lexical en-tries based on the inflectional rules (cf. 4.5.1).Morphological reconstruction produces the newword through derivational rules (cf. 4.5.2). The-se two morphological areas are applied directly,after the application of the NooJ segmentationsystem to the corpus (step 1), then we identify

Figure 4. Amazighe Morphological Analyzer Architecture.

Page 15: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 105

the component morphemes of the input words,matching them with the lexicons (step 2) andlabeling (step 3) the known words with suffi-cient information to be useful for the tasks athand. And finally, if the input word does notbelong to our lexicons (step 4), we apply themorphological recognition systems which useknowledge about affixes and other features ofthe word in order to provide the possible an-notation, without using any direct informationabout the words stem. The following sectiondescribes the methods employed by our recog-nition systems.

4.6.2. Morphological Recognition:Recognition Systems (RS)

No matter how big the applied lexicon is, therewill always remain unknown words. Moreover,natural language is dynamic and it is impos-sible to compile huge dictionaries which willcontain all the words that could appear in real-life texts: new words are constantly added to alanguage; other words get less frequent or aredropped out, while some of the existing ones getlost or change, etc. To remedy to this situation,we introduce a system analysis without lexiconin order to increase the robustness of our ana-lyzer. This system will empirically investigatehow well syntactic and morphological rules cananalyze sentences containing unknown words.

Given that the unknown word can be a noun or averb, we proposed a two-level mechanism. Thefirst level is developed for the nouns (in simple,derived and compound forms) and the secondone is developed for the verbs (in simple andderived forms). For these two levels, we arebased on syntactic knowledge.

1) Noun Recognition System

Our goal was the design and implementationof a system for identification and distribution ofpossible tags that can serve as the analysis of theunknown Amazighe words from texts. To thisend, we have developed a set of five morpholog-ical and syntactic grammars presented below:

• Syntactic grammar for simple nouns recog-nition: according to a set of rules that wehave developed, this grammar is used to ex-tract the morphological information for theunrecognized nouns. These rules are based,in the first step, on the morphemes whichpresent the gender, number and state marks,and, in the second step, it is based on theprevious word (e.g. a word directly follow-ing the predicative particle [d] is typicallya noun in free state form).

• Syntactic grammar for derived nouns recog-nition: following the same method as sim-ple nouns, we have developed a set of rulesbased on the previousword and on the deriva-tion patterns (suffixes, prefixes and infixes).

• Syntactic grammar for compound nouns re-cognition: we have developed, in the firststep, two syntactic grammars allowing rec-ognizing a compound noun in the most fre-quent composition models (cf. 2.2.3.1.c).Many compound nouns, which are formedon the moneme ( , , etc.) + Noun(cf. 2.2.3.1.c), present a challengewhen theycome attached (e.g. [buhyyuf] “thefamine”). Thus, as a second step, we havedeveloped a morphological grammar to han-dle these words. To illustrate our propositionwe show an example in Figure 5 (cf. Figure5). In our graphical presentations, bracketsrefer to variables, dollar sign ($) is used for

Figure 5. Example of compound noun recognition.

Page 16: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

106 F. Z. Nejme et al.

the variable names ($Var1), etc. These andthe remaining symbols are explained inmoredetail in [42].

This grammar decomposes the noun in mor-pheme [bu] +word (〈L〉 ). Given that anyword following these morphemes is in con-structed state, we combine this latter (〈L〉 )with constructed state prefix and then checkif the word belongs to the lexicons (requiredby the lexical constraint 〈 #$Var2=:N〉 , theword must be a noun). If such a word ex-ists, it imports all relevant morphologicalinformation from the lexicon (required bythe lexical constraint 〈 $1L,N$1S$1F〉 , theN correspond to Noun). Thus, if the lex-icon contains the word [uzggwa ]“red”, the grammar will annotate the word

[buzggwa ] “measles” as follows(cf. Figure 6).

Figure 6. Example of compound noun recognition.

The noun will be annotated as a noun (N),in compound form (Classe=Compose) inthe Affixal model (TypeComposition = Af-fixale).

Then, an accurate annotation will be ex-tracted: [bu] will be annotated as mor-pheme in singular masculine form (MN +m + s), and the information of the noun cor-responding to the form [uzggwa ]will be extracted from the lexicon.

• Morphological grammar for kinship nouns(see Appendix E): this grammar recognizeskinship nouns which form one entity withaffixal pronouns (e.g. [babas] “his fa-ther”). To fully illustrate our proposition,we present below (cf. Figure 7) an exampleof kinship nouns recognition system.

This morphological grammar recognizes theword forms like [babak] “your fa-ther”, [immatn ] “our mother”, etc.Each of these forms will be computed fromthe root formed by letter succession (〈L〉 )stored in the variable Pref ($Pref) whichneeds to be listed as a kinship noun in ourlexicon (required by the lexical constraint〈 $Pref=:N + Parente〉 ). If such a word ex-ists, we identify the second variable whichcorrespond to affixal pronouns ( [k], [m],

[s] etc.) and then produce the annotations.We use the special tag 〈 INFO〉 , associatedto the affixal pronouns to be concatenated atthe end of the annotation (e.g. , N +Parente + ).

2) Verb Recognition System

Following the same method of the recognitionsystems for nouns, we have developed two syn-tactic grammars presented below:

• Syntactic grammar for simple verbs recog-nition: we proposed a set of rules used toextract the morphological information asso-ciated to the unrecognized verb (the conjuga-tion aspects, the gender (masculine or femi-nine), the number (singular, plural), and theperson (first, second or third)).

Figure 7. Example of recognition system for kinship nouns.

Page 17: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 107

• Syntactic grammar for derived verbs recog-nition: we have developed a set of rulesbased on the derivation patterns (cf. 4.5.2)allowing testing on an input word whether itis a derived verb or not.

4.7. Experiments

The main goal of this evaluation is to prove theflexibility of our approach and to prove that itcan satisfy the morphological analysis of themost unknown words remaining after the lexi-cal analysis. The experiment reported here isvery simple: it consists of parsing the techni-cal corpus before and after integration of therecognition systems in the preprocessing com-ponents, and comparing the results obtained. Todo this, we use the corpus of school texts cited atthe lexicons evaluation (cf. 4.6). The results offull analysis can be seen in the following table(cf. Table 10).

Table 10. Amazighe morphological analyzer results.

Results

Number ofrecognized

forms

Number ofunrecognized

forms

Number % Number %

Lexicons 88 897 92.57% 7134 7.42%

Lexicons + RS 95 753 99.71% 278 0.29%

After the application of our resources, we haveundertaken a manual analysis of the outputs toevaluate the performance of our rules. The eval-uation results shown in Table (10) indicate thestrong impact of using RS component in wordanalysis. The unrecognized forms present a ra-tio of 0.29%.

The unanalyzed words are mostly classified inthree sets:

• Named entities: foreign proper nouns ofperson such as [muh.and] “Muhand”,cities such as [misr] “Egypt” and num-bers such as [kr.ad. ] “four”.

• Spelling mistakes: The most common mis-take is replacing the prefix of the constructedstate such as the form [wad. il] “grapes”where the noun is written [wwad. il]according to its pronunciation.

• Foreign words (nouns, verbs, pronouns andfunction words): Amazighe has imported

foreign words from Arabic ( [bzzaf]“a lot”).

5. Conclusion

Morphological analysis caters to the needs ofvariety of applications like machine transla-tion, information retrieval and spell-checking.Therefore, in this paper we have proposed amodule of Amazighe morphological analyzerwhich can recognize lexical units from texts andalso label them with sufficient information to beuseful for the other NLP tasks. The finite-stateapproach has proven to be very successful inthe consistent description of the Amazighe mor-phology, consisting of a network of lexicons andRS.

In the near future we plan to pursue a set ofpaths, such as: (1) implementing a module ofnamed entity recognition using the NooJ syn-tactic module, (2) developing morphologicalgrammars to correct the spelling mistakes withdisambiguation problems, and (3) enlarge ourcorpus with the number of inputs of our lexi-cons.

References

[1] P. Andries, “Unicode 5.0 en pratique”, Codage descaracteres et internationalisation des logiciels etdes documents. Dunod, France, Collection InfoPro,2008.

[2] L. Zenkouar, “Normes des technologies de l’in-formation pour l’ancrage de l’ecriture Amazighe”,Etudes et documents berberes, vol. 27, pp. 159–172,2008.

[3] M. Ameur et al., Graphie et orthographe del’Amazighe. Rabat, Maroco: IRCAM, 2006.

[4] F. Boukhris et al., La nouvelle grammaire del’Amazighe. Rabat, Maroc: IRCAM, 2008.

[5] M. Ameur et al., Initiation a la langue Amazighe.Rabat, Maroco: IRCAM. 2004.

[6] M. Ameur et al., “Vocabulaire de la langueAmazighe (Francais-Amazighe)”, Lexiques, no. 1,IRCAM, Rabat, Maroco, 2006.

[7] M. Ameur et al., “Vocabulaire desmedias (Francais-Amazighe-Anglais-Arabe)”, Lexiques no. 3, IR-CAM, Rabat, Maroco, 2009.

[8] M. Ameur et al., “Vocabulaire grammatical”, Lex-iques, no. 5, IRCAM, Rabat, Maroco, 2009.

Page 18: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

108 F. Z. Nejme et al.

[9] S. Kamel, Lexique Amazighe de geologie. Rabat,Maroco: IRCAM, 2006.

[10] IRCAM. (2003). Conception et mise au pointdes polices tifinaghe. Centre des Etudes Informa-tiques, Systemes d’Information et Communication,plan d’action. [Online]. Available: http://www.ircam.ma/fr/index.php?soc=telec&rd=1

[11] IRCAM. (2004). Polices et Claviers UNICODE.Centre des Etudes Informatiques, Systemes d’In-formation et Communication. [Online]. Available:http://www.ircam.ma/fr/index.php?soc=telec&rd=3

[12] M. Amrouch et al., “Handwritten Amazighe Char-acter Recognition Based On Hidden Markov Mod-els”, International Journal on Graphics, Vision andImage Processing, vol. 10, no. 5, pp. 11–18, 2010.

[13] M. Fakir et al., “Skeletonization methods evaluationfor the recognition of printed tifinaghe characters”,In Proceedings of the 1er Symposium Interna-tional sur le Traitement Automatique de la CultureAmazighe, Agadir, Morocco, 2009, pp. 33–47.

[14] Y. Es Saady et al., “Printed Amazighe CharacterRecognition by a Syntactic Approach Using FiniteAutomata”, International Journal on Graphics, Vi-sion and Image Processing, vol. 10 no. 2, pp. 1–8,2010.

[15] M. Outahajala et al., “Using Confidence And In-formativeness Criteria To Improve POS TaggingIn Amazigh”, Journal of Intelligence and FuzzySystems, vol. 28, no. 3, pp. 1319–1330, 2015.http://dx.doi.org/10.3233/IFS-141417

[16] S. Boulaknadel and F. Ataa Allah, “Building astandard Amazighe corpus”, In Proceedings ofthe International Conference on Intelligent HumanComputer Interaction, Prague, Tchec, 2011.

[17] F. Ataa Allah and S. Boulaknadel, “Pseudoracini-sation de la langue Amazighe”, In Proceeding ofTraitement Automatique des Langues Naturelles,Montreal, Canada, 2010.

[18] F. Ataa Allah and H. Jaa, “Etiquetage morphosyn-taxique: Outil d’assistance dedie a la langueA-mazighe”, In Proceedings of the 1er Symposiuminternational sur le traitement automatique de laculture Amazighe, Agadir, Moroco, pp. 110–119,2009.

[19] F. Ataa Allah and S. Boulaknadel, “AmazigheSearch Engine: Tifinaghe Character Based Ap-proach”, In Proceeding of the International Confer-ence on Information and Knowledge Engineering,Las Vegas, Nevada, USA, 2010, pp. 255–259.

[20] S. Boulaknadel and F. Ataa Allah, “OnlineAmazighe Concordancer”, In Proceedings of theInternational Symposiumon ImageVideoCommuni-cations and Mobile Networks, Rabat, Maroco, 2010.http://dx.doi.org/10.1109/isvc.2010.5656272

[21] F. Ataa Allah and S. Boulaknadel, “Amazigh VerbConjugator”, In Proceedings of the 9th edition ofthe Language Resources and Evaluation Confer-ence, Reykjavik, Iceland, 2014.

[22] F. Nejme et al., “Analyse Automatique de laMorphologie Nominale Amazighe”, Actes de laconference du Traitement Automatique du LangageNaturel (TALN), Les Sables d’Olonne, France, vol.1, Taln, 2013.

[23] F. Nejme et al., “Finite State Morphology forAmazighe Language”, In Proceedings of the Inter-national Conference on Intelligent Text Processingand Computational Linguistics (CICLing), Samos,Greece, 2013. http://dx.doi.org/10.1007/978-3-642-37247-616

[24] F. Nejme et al., “Toward a noun morphologicalanalyser of standard Amazinge”, In Proceedings ofthe InternationalConferenceon Systems and Appli-cations (AICCSA), Fes, Maroco, 2013. http://dx.doi.org/10.1109/AICCSA.2013.6616457

[25] F. Nejme et al., “Toward an amazigh language pro-cessing”, In Proceedings of the 3rd Workshop onSouth and Southeast Asian Natural Language Pro-cessing (SANLP), COLING, December, Mumbai2012, pp. 173–180.

[26] F. Nejme and S. Boulaknadel, “Formalisation del’Amazighe standard avec NooJ”, Actes de laconference JEP-TALN-RECITAL, Grenoble, Fran-ce, 2012.

[27] F. Nejme et al., “Vers un dictionnaire electroniquede l’Amazighe”, Actes de la Conference Interna-tionale sur les Technologies d’Information et deCommunication pour l’AMazighe (TICAM), Rabat,Maroco, 2012.

[28] J. Greenberg, The Languages of Africa. The Hague,1966.

[29] O. Ouakrim, “Fonetica y fonologıa del Bereber”,Survey at the University of Autonoma de Barcelona,1995.

[30] A. Boukous, “Societe, langues et cultures auMaroc:Enjeux symboliques”, Casablanca, Najah El Jadida,1995.

[31] S. Chaker, “Le berbere”, Les langues de France(sous la direction de Bernard Cerquiglini), Paris,PUF, pp. 215–227, 2003.

[32] M. Ameur and A. Boumalk, “Standardisation del’Amazighe. Actes du seminaire organise par leCentre de l’Amenagement Linguistique”, Publica-tion de l’Institut Royal de la Culture Amazighe,Rabat, Maroco, 2004.

[33] H. Moujahid, “La classe du Nom dans un parlerde la langue tamazighete, le tachelhiyt d’Ighrem(Souss-Maroc)”, These de 3eme cycle, Paris V,Universite Rene Descartes, 1981.

[34] R. Laabdelaoui et al., Manuel de conjugaison del’Amazighe, IRCAM, Rabat, Morocco, 2012.

[35] K. Beesley andL. Karttunen, “Finite State Morphol-ogy: CSLI Studies in Computational Linguistics”,Stanford University, CA: CSLI Publications, 2003.

Page 19: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

AmAMorph: Finite State Morphological Analyzer for Amazighe 109

[36] L. Karttunen and K. R. Beesley, “A short historyof two–level morphology”, In: Talk given at theESSLLI workshop on finite state methods in naturallanguage processing, 2001.

[37] K. Koskenniemi, “Two-levelmorphology: a generalcomputationalmodel forword-formrecognition andproduction”, The Department of General Linguis-tics, University of Helsinki, 1983.

[38] L. Karttunen et al., “Regular Expressions for Lan-guage Engineering”, Journal of Natural LanguageEngineering, vol. 2, no. 4, pp. 307–330, 1997.

[39] A. Donabedian et al., “EDS, NooJ ComputationalDevices”, In Formalising Natural Languages withNooJ, Cambridge Scholars, Cambridge, 2013.

[40] M. Silberztein, “Open Source Multiplatform NooJfor NLP: CESAR Project”, In Proceedings of COL-ING 2012. Demonstration Papers, Mumbai, Decem-ber, 2012, pp. 401–408.

[41] M. Silberztein, “NooJ: The Lexical Module In NooJpour le Traitement Automatique des Langues”, S.Koeva, D. Maurel, M. Silberztein EDS, Cahiers dela MSH Ledoux, Presses Universitaires de Franche-Comte. 2005.

[42] M. Silberztein, (2006). NooJ Manual. Available:http://www.nooj4nlp.net

[43] M. Silberztein, “An Alternative Approach to Tag-ging”, NLDB 2007, pp. 1–11, 2007. http://dx.doi.org/10.1007/978-3-540-73351-5 1

[44] M. Silberztein, (2015, March, 17), NOAJ:A Linguistic Development Environment [Online].Available: http://www.nooj-association.org/index.php?option=com nooj&controller=module&task=display module fo&Itemid=529

[45] A. Donabedian and N. Boyacioglu, “La lemmatisa-tion de l’armenien occidental avec Nooj S. Koeva,D. Maurel, M. Silberztein, Formaliser les languesavec l’ordinateur, de INTEX ‘a NooJ” PressesUniversitaires de Franche-Comte, pp. 55–75, 2007.

[46] K. Vuckovic et al., “Croatian Language Resourcesfor NooJ”, CIT. Journal of Computing and Informa-tion Technology, vol. 18, no. 4, pp. 295–30, 2010.http://dx.doi.org/10.2498/cit.1001914

[47] S. Mesfar, “Analyse Morpho-syntaxique Automa-tique et Reconnaissance Des Entites NommeesEn Arabe Standard”, Thesis, Graduate School-Languages, Paris, France, 2008.

[48] M. Taifi, Dictionnaire tamazight-francUƒais: par-lers du Maroc central Paris. L’Harmattan-Awal,1991.

[49] F. Agnaou et al., Lexique Scolaire. IRCAM, Rabat,Maroco, 2011.

[50] L. Oulhaj, Grammaire du Tamazight. ImprimerieNajah El Jadida, 2000.

[51] A. Berkai, “Lexique de la linguistique Francais-Anglais Berbere: precede d’un essai de typologiedes procedes neologiques”, 2007.

[52] A. Lhousni, “Derivation des nominaux enTamazight”, Memoire de Licence, Faculte des Let-tres et des Sciences Humaines, Universite SidiMohamed Ben Abdellah, Maroco, 1990.

[53] M. Azougarh, “Lexique berbere: structures et signi-fication”, These, Faculte des Lettres et des SciencesHumaines, Oujda, 1992.

Appendix

Appendix AMonoliteral root represents one consonant like

[ru] “to cry”, Biliteral root stands for twoconsonants like [agl] “suspend”, etc.

Appendix BArguments for commands 〈B〉 , 〈L〉 , 〈R〉 , 〈S〉 :xx number: repeat xx times.

Appendix CC is used when a consonant is reduplicated.

Appendix DFigure 2: In the aim to census the maximumrules to recognize and generate all possible in-flected forms of each noun, we have made astudy based on works of [7] [16] [37] [13] [1]and on heuristic study of our lexicons. Thus,for each pattern, we have extracted the mostrelevant rules (e.g. for the form [ta. . .Ct]the rule will be an alternation of the first vowelaccompanied by suffixing of [in] and remov-ing the last [t]). Our study includes also theexceptions.

Appendix EKinship noun: the noun designing the relation-ship between members of the same family.

Appendix FWriting System: Scripts.

Appendix G+ FLX introduces the functionality which de-scribes all the potential forms from a lemma.

Appendix HA root is a sequence of one or many consonantsand the pattern is a template of vowels (V) andconsonants.

Appendix IGenre = Gender, Nb = Number, EL = FreeState and EA = Constructed one.

Page 20: AmAMorph: Finite State Morphological Analyzer for Amazighe2CEISIC, Royal Institute of Amazigh Culture, Rabat, Morocco This paper presents AmAMorph, a morphological ana-lyzer for Amazighe

110 F. Z. Nejme et al.

Appendix JIRCAM: an institution created in 2001 to pre-serve, promote and endorse Amazighe culturein all its dimensions.

Received: September, 2014Revised: July, 2015

Accepted: January, 2016

Contact addresses:

Fatima Zahra NejmeLRIT

Mohammed V UniversityFaculty of Science

RabatMorocco

e-mail: [email protected]

Siham BoulaknadelCEISIC

Royal Institute of Amazigh CultureRabat

Moroccoe-mail: [email protected]

Driss AboutajdineLRIT

Mohammed V UniversityFaculty of Science

RabatMorocco

e-mail: [email protected]

FATIMA ZAHRA NEJME is a PhD student in Engineering Sciences atMohammed V-Agdal University, Rabat, Morocco. She received theMaster’s degree from the same University in Computer Sciences andTelecommunications in 2011. Her research interests focus on the devel-opment of natural language processing tools for Amazighe language.

SIHAM BOULAKNADEL is researcher at Royal Institute of AmazigheCulture, Morocco. She obtained her PhD in Computer Sciences fromthe Faculty of Science of Rabat in collaboration with the Universityof Nantes in 2008. Her research interests focus on natural languageprocessing, information retrieval, artificial intelligence, and e-learning.She is currently involved in several national projects dealing with de-veloping linguistics resources and natural language processing tools forless resourced languages, particularly the Amazighe language. She hasalso largely contributed to the supervision of young researchers in var-ious topics, especially in developing numerical learning resources ofthe Amazighe language. In addition, she is the author or co-author ofnumerous national and international publications.

DRISS ABOUTAJDINE received the PhD degrees in Signal Processingfrom the Mohammed V-Agdal University, Rabat, Morocco, in 1985.He joined Mohammed V-Agdal University in 1978, first as an assistantprofessor and since 1990, he is professor at the Faculty of Science,heading the LRIT laboratory. Actually, he is the national coordinatorof the National Information Technology Network of Excellence. Over30 years, he has developed research activities covering various topics ofsignal and image processing, wireless communication, pattern recogni-tion and natural language processing which allowed him to publish over300 journal papers and conference communications. He was electedmember of the Hassan II Moroccan Academy of Science and Tech-nology in May 2006 and he was Vice President in charge of research,cooperation and partnership of the Mohammed V Souissi Universityfrom September 2008 to December 2010. Actually, he is the Directorof the National Center for Scientific and Technical Research.