
Towards including prosody in a text-to-speech system for modern standard Arabic


Computer Speech and Language 22 (2008) 84–103

www.elsevier.com/locate/csl



Allan Ramsay a,*, Hanady Mansour b

a School of Computer Science, University of Manchester, Manchester, United Kingdom
b University of Alexandria, Egypt

Received 13 March 2006; received in revised form 22 June 2007; accepted 22 June 2007
Available online 6 August 2007

Abstract

Most attempts to provide text-to-speech for modern standard Arabic (MSA) have concentrated on solving the problem of diacritic assignment (i.e. of recovering phonetically relevant information, such as choice of short vowels, which is not explicitly provided in the surface form of MSA). This is clearly a crucial issue: you can hardly produce intelligible spoken output if you do not know what the vowels are.

We describe an approach to the task of generating speech from MSA text which not only solves this initial problem, but also provides the information required for imposing an appropriate intonation contour.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Text to speech; Modern standard Arabic; Prosody

1. Text-to-speech for modern standard Arabic: intelligible vs. natural

The goal of the work reported here is to produce a system for text-to-speech (TTS) for modern standard Arabic (MSA) which is both intelligible and natural:

Intelligible: speech is intelligible if a human native speaker can discern the words that were spoken. Clearly any useful TTS system must produce intelligible speech: if a hearer cannot work out what was said, the system can hardly be said to be producing speech at all.

The major tasks in producing intelligible speech are (i) to produce a phonetic transcription of the text, and (ii) to use this to drive a synthesiser. The first of these tasks is particularly problematic for MSA because the written form omits a great deal of phonetically relevant material. In particular, short vowels are not written, but there are other distinctions (e.g. the doubling of consonants in some contexts) which also make a difference to the spoken form, but which are not written. Note that this information is provided in classical

0885-2308/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2007.06.004

* Corresponding author. Tel.: +44 0 161 306 3108.
E-mail address: [email protected] (A. Ramsay).


Fig. 1. Pitch contour specification for (1) (Y-axis is the pitch value in Hz as specified for input to MBROLA).


Arabic, in the form of marks known as ‘diacritics’. The task of retrieving this information is thus often referred to as ‘diacritisation’: it is important to note that this process includes noting reduplicated consonants as well as determining what short vowel, if any, separates two consonants in a root.

Natural: speech will sound natural if it has an appropriate pitch contour. Arabic uses pitch for a variety of purposes (de Jong and Zawaydeh, 1999):

• stressed syllables typically have locally maximal pitch. Given that we are using MBROLA (Dutoit et al., 1996; Dutoit, 1997), which allows control over pitch but not intensity, as the actual synthesiser, it is important for us to assign local pitch appropriately in order to make the stress pattern clear.

• ‘phonetic phrases’ have individual pitch contours. The boundaries of phonetic phrases often mark places where some item is in a non-canonical position (which in turn indicates that the item in question is in focus).

• sentences have prosodic contours that indicate whether they are statements, questions or commands.

It therefore seems likely that attempting to specify the pitch contour for Arabic will make the output more natural. At the very least, marking which syllables are stressed will make it easier for the hearer to identify the words that were spoken; anything further that we can do to indicate focus will help the hearer follow the information flow.

The work described below attempts to use linguistic information to achieve both these goals. We use a range of linguistic constraints to help us determine which lexical items, and which form of those lexical items, are present in a given text; and we then use information about phrase order to decide on the appropriate pitch contour. The end result is to produce a phonetic transcription such as that shown in Fig. 1 from an input text such as (1).

(1)


In order to achieve this, we exploit a wide range of information: facts about lexical and inflectional forms, syntactic constraints, semantic constraints, phonological relations, and relationships between information structure and prosody. The lexical, morphological, syntactic and semantic constraints are very tightly integrated, using dynamic, or ‘just-in-time’, constraints to exploit information at exactly the point when it becomes available. We will describe the various levels independently below, before showing how using just-in-time constraints enables us to deploy this knowledge very efficiently. Generation of the phonetic transcription is carried out by a series of finite-state transducers (FSTs) on the basis of information supplied by the prior linguistic analysis.

2. Structural analysis

The first stage, then, is to carry out a structural analysis. This has two functions: it enables us to infer the missing phonetic items (short vowels and reduplicated consonants), and it provides information about the

1 We write standard MSA forms like , with the transliteration in italics, and fully diacriticised forms like , with the transliteration in bold. The transliterations are in the ZDMG format, as provided by Klaus Lagally's ArabTeX system.


boundaries between phonetic phrases. This structural analysis involves lexical description (Section 2.1), morphological analysis (Section 2.2), syntax (Section 2.3) and some fairly simple semantic filtering (Section 2.4). Following the outline of each level, we will discuss the techniques used for delaying constraint application until the relevant information is available (Section 2.5).

2.1. Lexical description

There has been considerable debate about the most appropriate level for describing Arabic words. There are two key issues: should you store words that have a common base, but which are syntactically and/or semantically distinct, together; and for any given item, should you store all the inflected forms, or should you store the ‘root’ (the key consonants) plus enough information to derive the inflected forms (including their diacritics)?

2.1.1. Derivational morphology

Arabic words very largely comprise semantically related variants of a common core. To take a standard example, there is a large group of words dealing with various aspects of writing: (kutub), (kataba), (kutiba), (kattaba), (mukattib), (mukattab), (muktab), (muktib), (maktab), etc.

Each of these items has the triconsonantal root (ktb) at its heart. They all have related meanings – something to do with writing. It is, however, unclear whether you should store them separately in the lexicon, or whether they should be stored as variations on a single lexical entry.

At first sight, it seems obvious that the lexicon should contain a single entry, with the derived forms being obtained by the addition of derivational affixes. Unfortunately, the pattern of derivational affixes that a given root can take is not predictable, and the semantic consequences of adding such an affix to a root are not fixed. It is thus necessary to list all the affixes that a root can combine with explicitly, and to specify the interpretation of each derived form. Storing the root together with the information about what derivational affixes it accepts seems, then, to provide little extra conceptual clarity: is it more helpful to say that (mudarris) is a word meaning ‘teacher’ or to say that the root (drs) can accept the derivational prefix (mu) with the diacritic pattern (a + i) to produce a form that means ‘teacher’?

Conceptually there seems to be little difference between (i) listing all the derived forms together and (ii) having a lexical entry for the root and listing all the derivational affixes that can apply to it, together with the diacritics that are associated with each affix. In either case, you have to list all the forms. Computationally, however, there are considerable advantages to the second option. In many cases, there are numerous forms which look the same when written without diacritics.2 If lexical lookup returns all the possibilities, then subsequent processing steps will be swamped: the complexity of any parsing algorithm will have a term of order ∑_{i=1}^{N} a_i, where a_i is the number of alternative readings of the i-th word. If we can delay making a choice between the various interpretations of a given surface form, we will minimise the effects of this term. We therefore group together all the derivational affixes that will produce the same surface form for at least some choices of diacritics, using the techniques described in Section 2.5 to explore the different options when they arise. Section 2.5 also presents different choices of when to explore these options.

2.2. Morphology

We follow fairly standard practice in assuming that the morphology of an open class Arabic word is determined by a set of constraints between various levels (McCarthy and Prince, 1990; McCarthy, 1993; Kiraz, 2001). Firstly, we assume that an open class word consists of a stem plus a number of derivational and inflectional affixes. Secondly, we assume that the form of the stem is determined by a process of interweaving a consonantal root, usually consisting of three consonants; a diacritic pattern, specifying what goes in between the

2 After all, if this were not the case then diacritisation for MSA would be as easy as retrieving the vowels in an English phrase like ‘Hv U sent any txt msgs today?’


Fig. 2. Morphology of ‘ktb’ (class i).


consonants (the entries in this pattern could be long vowels, they could be short vowels, or they could be empty); and a consonantal pattern which determines whether the root consonants should be duplicated.3

The interactions between the various components are fairly intricate. The form of the present prefix, for instance, depends in part on the specific stem to which it is being attached, and in part on whether or not the verb is active or passive. The two active derived verb stems (ktub) and (kattab), for instance, take (ya) and (yu) as their third singular present prefixes, whereas the passive forms of the same verbs (ktab) and (kattib) both take (yu). This information is shared between the stem and the prefix: the stem knows what the vowel in the present prefix would be if one was present, and the prefix knows where the vowel would go. Similarly the various case markers know what their underlying form is if the noun to which they are attached is definite or indefinite; whether the noun is in fact definite depends partly on whether it has the definite article attached but also on whether it is the head of a construct NP.

We capture the combinatory properties of Arabic morphemes by using a categorial description of individual morphemes. Fig. 2 shows morphological properties of the root for one of the senses of (ktb).

The key points here are that this item has an affix list, similar to the subcat list of a grammatical framework like HPSG, which specifies that it needs a derivational suffix (the fact that what is needed is a suffix is marked by dir(+after, -before), the fact this item is a derivational affix is marked by affix(*deriv)); and that its diacritics will be (0 + u) if it is used as an active present form, (a + a) if it is active past, (u + i) for passive past and (a + 0) for passive present. Which of these gets chosen depends on the specific tense markers that get added and on whether the verb is used in an active or a passive sentence. Note that although we need to know whether the verb is being used actively or passively in order to determine the diacritics, there is nothing in the written form which will help us. We cannot know until we see the overall structure of the sentence in which it occurs. The easiest way to handle this is by using a just-in-time constraint which is triggered when we find out what the voice is.

Morphemes are combined by ticking items off the affix list, looking left or right as specified by the entry on the affix list, just as in categorial grammar. We allow affixes themselves to have non-empty affix lists, so that when, for instance, we add a verbal derivational affix to the root in Fig. 2 the resulting item inherits a list of inflectional affixes from the derivational one. This provides us with considerable flexibility: a single root may give rise to a number of nominal and verbal forms, where the nominal forms may require gender, number and case markers and the verbal forms may require tense, agreement and mood markers. By allowing successive affixes to specify what is needed next we can let a single root combine with a variety of different sets of affixes. We are thus using a categorial framework to describe those aspects of the internal structure of Arabic words that Beesley (1996) and Kiraz (2001) suggest may be treated using a context-free grammar (note that any context-free grammar can be systematically transformed into a categorial grammar). We are, however, using the extended categorial rules of Ades and Steedman (1982) which support a strictly left-to-right approach to

3 We use ‘root’ to refer to the consonantal cluster that underlies a set of derived verbal and nominal forms, and ‘stem’ to refer to a specific instantiation of such a cluster with a set of interconsonantal vowels and a consonant pattern.


parsing; this enables us to merge the structural analysis with a set of ‘spelling rules’, embodied as a finite-state transducer, which are used for spotting places where a diacritic gap should be filled either by an explicit long vowel or an unwritten short one and for a number of simple surface transformations such as the replacement of a surface (w) by an underlying (ū) in certain contexts.
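The affix-list mechanism described above can be sketched as follows (a simplified Python illustration, not the authors' implementation; only suffixation is shown, and the forms and feature names are invented):

```python
# Combining morphemes by ticking entries off an affix list, categorial-style.
# An affix may itself carry a non-empty affix list, which the result inherits.

def combine(base, affix):
    """Tick the first entry off `base`'s affix list if `affix` satisfies it;
    the result inherits any further affixes the suffix itself demands."""
    if not base["affixes"]:
        return None                      # base is already saturated
    expected = base["affixes"][0]
    if expected["kind"] != affix["kind"]:
        return None                      # wrong kind of affix
    return {
        "form": base["form"] + affix["form"],
        "affixes": affix["affixes"] + base["affixes"][1:],
    }

# A root that wants a derivational suffix, which in turn wants a tense marker:
root = {"form": "ktb", "affixes": [{"kind": "deriv"}]}
deriv = {"form": "+D", "kind": "deriv", "affixes": [{"kind": "tense"}]}
tense = {"form": "+T", "kind": "tense", "affixes": []}

stem = combine(root, deriv)   # inherits the demand for a tense marker
word = combine(stem, tense)   # fully saturated
```

The inheritance step in `combine` is what lets a single root combine with different sets of affixes: each affix specifies what is needed next.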

The written form of the word may fail to specify the underlying form of a particular affix. Several nominal stems, for example, include a derivational prefix which is written (m). This prefix, however, can have either (mu) or (ma) as its underlying form, and it can pick out a range of different diacritic and consonant patterns. We therefore include a single surface form for this set of affixes, together with a set of just-in-time constraints that are applied when the affix is combined with the root. At this point in the processing we know what the surface forms of the diacritics were and we know which variants of the affix this root accepts. The constraints check that the root has a sense which is compatible with the various surface markers for diacritics, and insert short vowels and spaces as required for diacritics which had no written realisation. By applying the constraints at this point we can treat the prefix as an instance of a single underspecified item during the initial left-to-right traversal, and we can verify which forms make sense exactly at the point when we have the required information.

Other affixes cannot be settled so early. The forms of case, tense, agreement and mood markers cannot be determined just by looking at a word in isolation. We may need to know a considerable amount about the surrounding syntactic and semantic context before we can decide the underlying form of such markers, particularly in cases where there is no surface marker at all. So be it. In such cases the trigger for specifying the underlying form is the provision of the necessary information. For the mood marker for third singular present tense forms, for instance, we include the rule in Fig. 3.

We have given this rule using an extended form of Prolog, but it should be fairly clear what it says: when you know the mood of X, then if it is indicative the underlying form is (u), if it is subjunctive then the underlying form is (a), and if it's jussive then the underlying form is (0). The key point is that we have no idea when the facts about a given verb's mood will become clear. This constraint simply sits there waiting until the information does become available, and fills in the underlying form at exactly that point. Similar rules are used to delay fixing nominal case markers until the case and definiteness of the word is clear (see the discussion of construct NPs below) and for a range of other similar markers.
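The behaviour of such a just-in-time constraint can be mimicked in Python (our sketch, not the system's Prolog; the class and method names are invented for the example, but the marker values (u), (a) and (0) are those given above):

```python
# A delayed constraint: installed as soon as the verb is encountered,
# but only fired at the moment the mood becomes known.

MOOD_MARKER = {"indicative": "u", "subjunctive": "a", "jussive": "0"}

class Verb:
    def __init__(self):
        self.mood = None
        self.marker = None
        self._pending = []          # constraints waiting for information

    def when_mood_known(self, constraint):
        if self.mood is not None:
            constraint(self)        # information already there: fire now
        else:
            self._pending.append(constraint)

    def set_mood(self, mood):
        self.mood = mood
        for constraint in self._pending:
            constraint(self)        # trigger the delayed constraints
        self._pending = []

def mood_marker_rule(verb):
    verb.marker = MOOD_MARKER[verb.mood]

v = Verb()
v.when_mood_known(mood_marker_rule)  # installed early; marker still unknown
v.set_mood("subjunctive")            # syntax finally fixes the mood
```

The constraint is symmetrical with respect to when the information arrives: if the mood is already known when the rule is installed, it fires immediately; otherwise it waits.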

2.3. Syntax

We have argued that leaving the description of lexical items incomplete is a good move because it cuts down on the complexity of the parsing process. If we are to benefit from this, then, we have to have a parser.

The general outline of our approach to parsing Arabic is as follows:

(1) We assume a general HPSG-like framework in which each lexical item contains a subcat list which describes the set of arguments it requires in order to be saturated, and a target and result, which describe the things it can modify and what the resulting item is like. These are exploited by two pairs of rules, as shown in Fig. 4.

The first rule in Fig. 4 covers the same ground as the standard X==>X/Y,Y of categorial grammar, or as the subcat principle in HPSG. It says that if you have an item which expects to be followed by something of a certain kind, and there is something of that kind immediately after it, then they can be combined (there is obviously also a corresponding rule for cases where the argument is expected to precede the head). The second rule covers modifiers and specifiers, where the modifier can copy all the information from the target to the result (as would happen with a categorial description of an adjective as being

Fig. 3. Just-in-time constraint for mood marking.


Fig. 4. Rules of combination.


of type N/N) or change some of it, as in the description of an English determiner as being of type NP/N. It, and its conjugate for modifiers that follow their targets, corresponds to schema V from Pollard and Sag (1988).

(2) The rules in Fig. 4 say where things ought to appear, where they would appear if everything was in its canonical position. We allow things to appear in positions other than their expected locations, subject to a set of constraints on local subtrees. We exploit a chart parsing algorithm with an indexing regime which allows us to combine non-adjacent items at comparatively little extra cost (Ramsay, 1999), and we control the potential explosion of local fragments with a set of ‘filters’. This part of the treatment is thus reminiscent of government-and-binding's ‘move-α’ (Chomsky, 1981).

(3) We make use of a distinction between the ‘internal’ and ‘external’ views of an object. The notion is familiar from things like English gerunds, where a nominal gerund is a word that looks like a present participle verb (internal view) but is used in places where a noun would be expected (external view) (see (2)(a)), whereas a verbal gerund is a phrase that looks like a present participle VP but is used in places where an NP would be expected (2)(b) (Malouf, 1996; Sadler, 1996).

(2) a. Sheriff John Brown always hated me for the killing of a deputy.
(nominal gerund, ‘killing’ looks like a verb but behaves like a noun)

b. He concluded the banquet by eating the owl.
(verbal gerund, ‘eating the owl’ looks like a VP but behaves like an NP)

We extend Nerbonne et al.'s (1994) use of lexical rules to include ‘post-lexical rules’, which describe cases where something which has a specified set of internal properties can be used in contexts requiring something with a different set of external properties.

The two main uses of this in our treatment of Arabic relate to nominal sentences and construct NPs.
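For concreteness, the two rule schemata of Fig. 4 can be approximated as follows (our Python sketch; the feature structures are drastically simplified and the category names are ours):

```python
# Toy versions of the two schemata: ticking an argument off a subcat list,
# and mapping a modifier's target onto its result.

def apply_arg(head, arg):
    """Head followed by an expected argument: consume one subcat entry."""
    if head["subcat"] and head["subcat"][0] == arg["cat"]:
        return {"cat": head["cat"], "subcat": head["subcat"][1:]}
    return None

def apply_mod(mod, target):
    """Modifier applied to its target: an N/N modifier copies the target's
    category, an NP/N specifier changes it."""
    if mod["target"] == target["cat"]:
        return {"cat": mod["result"], "subcat": target["subcat"]}
    return None

verb = {"cat": "s", "subcat": ["np", "np"]}   # transitive: needs two NPs
np = {"cat": "np", "subcat": []}
noun = {"cat": "n", "subcat": []}
adj = {"target": "n", "result": "n"}          # N/N: copies the target
det = {"target": "n", "result": "np"}         # NP/N: changes it

vp = apply_arg(verb, np)                      # one argument ticked off
```

A mirror-image of `apply_arg` for arguments that precede the head, and a conjugate of `apply_mod` for modifiers that follow their targets, would complete the two pairs of rules.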

2.3.1. Nominal sentences

Arabic, like a number of other languages, allows for sentences which consist of an NP and a predication (e.g. another NP, an adjective, a PP, or predicative VP) (Fehri, 1993; Abdul-Raof, 1998). These ‘nominal sentences’, which also resemble English ‘small clauses’, can most easily be described by using a post-lexical rule which says that an NP can be seen as a sentence missing a predication. Fig. 5 shows the basic rule.

This says that if you have an NP (a saturated +specified nominal) then you can see it as an unsaturated S which needs a +predicative item and which has the NP as its subject.

This rule covers the basic facts for a number of languages, including English small clauses. For Arabic, however, we have to supplement the basic rule with some rather complex ordering constraints. Roughly speaking, the situation is as follows:


Fig. 5. Nominal sentences.

Fig. 6. Case marking in a construct NP.


• the case of the subject NP is governed by the external syntactic context,
• if the subject NP is indefinite then the order of the constituents must be reversed and the predication must in fact be a PP. This is again a constraint which can only be checked when the properties of the NP are known, and hence is included in the general rule as a just-in-time constraint.

2.3.2. Construct NPs

Arabic allows NPs to function as possessive determiners, so that denotes ‘the school's authors’.4 The basic facts here are again fairly straightforward: any genitive NP can function as the satellite in a construct NP. As in other languages, genitive marking does not always denote literal possession, and the semantic relation between the satellite and the head may be quite subtle, but the core of the structural rule is as given.

As with nominal sentences, the basic rule is embellished with a number of rather delicate caveats. The key problems relate to the case marking on the head noun. This has to be nominative marked, no matter what the function of the whole NP in the wider sentence, and the nominative marker that is assigned to it has to be the form which is appropriate for definite nouns even if there is no definite article. The analysis of (3) in Fig. 6 shows the assignment of the definite nominative marker to despite the lack of a definite article: is the head noun of an NP which is definite by virtue of being a construct NP, and hence the case marker has to be the definite form (Mohammed, 2000).

(3)


This example shows that we really have no chance of assigning case markers until we see the wider syntactic context. We do not know what case some noun has until we see the context, and even if we did know what the case was we would not know what the marker should look like until we saw the context. Again, use of a just-in-time constraint allows us to delay the decision until we have the required information.

2.3.3. Clausal complements

Case marking, then, is something which is generally unwritten, affects the phonetic form, and cannot be determined until you have identified the local syntactic context. The role of the syntactic context becomes even clearer when we consider clausal complements.

4 As noted above, and , like many surface forms in Arabic, have multiple interpretations. This is, after all, the reason why we have a problem in the first place. The discussion in Section 2.4 suggests some possible solutions to this problem.


In free-standing main clauses, the subject of an Arabic sentence is nominative. In various other contexts, however, it may not be. In particular, different complementisers and ‘mood particles’ place different constraints on the form of the embedded clause:

• requires a subject initial indicative clause with an accusative subject, where the clause must be a verbal clause.
• requires a verb initial subjunctive clause with a nominative subject.
• requires a subject initial indicative clause with an accusative subject, but this time the clause may be either nominal or verbal.
• requires a verb initial subjunctive clause.

Other mood particles place similar constraints on the following clause.

It can happen, then, that the surface form does not tell us which complementiser was written, nor which version of the verb. In these cases it is the embedding verb which makes the choice. We condense the two forms of into a single item which can manifest itself either as or , and which can simultaneously provide the information required to make the verb fix its own underlying form. As soon as the embedding verb says which version of the complementiser it wants, the relevant phonetic details become clear, but until then we do not carry around multiple local analyses. The analyses of (4a) and (4b) in Fig. 7 illustrate this phenomenon: requires a complement headed by , so the form of is constrained to be . But if the form of the complement is then the verb must be indicative, so we can fix the right form of the mood marker. on the other hand requires the version of which has as its underlying form, and this in turn requires a subjunctive form of the verb. We use just-in-time constraints to delay the decisions about the form of , the mood markers and the case of the subject of the embedded clause.

(4)
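The underspecified treatment of the complementiser can be sketched like this (our Python illustration; the labels 'anna' and 'an' merely stand in for the two Arabic surface forms, and the attribute names are ours):

```python
# One underspecified complementiser item is carried through the parse;
# when the embedding verb selects a variant, the clause's mood, order and
# subject case all follow at once, just-in-time.

COMP_VARIANTS = {
    "anna": {"order": "subject-initial", "mood": "indicative",
             "subject_case": "accusative"},
    "an":   {"order": "verb-initial", "mood": "subjunctive",
             "subject_case": "nominative"},
}

class Complementiser:
    def __init__(self):
        self.variant = None      # underspecified: no parallel analyses kept
        self.clause = {}

    def select(self, variant):
        """Called when the embedding verb says which variant it wants."""
        self.variant = variant
        self.clause = dict(COMP_VARIANTS[variant])

comp = Complementiser()          # a single item, not two local analyses
comp.select("an")                # the embedding verb makes the choice
```

Until `select` is called, the parser carries a single item; the moment the embedding verb states its requirement, the mood marker and the subject's case become determinate.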

2.3.4. Clitic pronouns

We deal with clitic pronouns and prepositions by assuming that they should be seen as making compound words, where each part of the word should be dealt with as a separate lexical item. In other words, we assume

Fig. 7. Different complementisers impose different constraints.


that is really two words, the preposition (l) and the noun , which happen to be written without an intervening space, and likewise that is two words (the verb and the pronoun ) which happen to be written without a space. We treat the clitic pronouns in (1), for instance, by assuming they are in fact free standing pronouns which are written next to the preceding words without any intervening space. In particular we assume that the one attached to the complementiser is the subject of , rather than being governed directly by the complementiser. This provides a uniform treatment of cases where there is a clitic pronoun attached to the complementiser but no visible subject and cases where there is no such pronoun but there is an explicit subject, and it also explains why such pronouns have to agree with the following verb (see Fig. 8).

(5)

(6)

(he thought that he marked on it: (he thought that he made a mark on it))

Note the zero subject of . The fine-grained agreement markers on Arabic verbs mean that there is generally enough information to determine the subject even if there is no explicit pronoun. We allow for this by permitting ‘invisible’ NPs as subjects when required. This clearly further increases the scope for multiple readings: if we see a sequence of an NP followed by a verb which may be either intransitive or transitive, it can be very hard to tell whether the NP is the subject of an intransitive reading, or the object of a transitive reading where the subject is invisible. Sometimes there will be an agreement mismatch between the NP and the verb, in which case clearly the NP cannot be the subject, and sometimes there will be a visible case marker, which again will decide the matter, but very often it is impossible to tell. In Fig. 9, for instance, we obtain four readings of (6), arising from the fact that has two readings as a transitive verb, each of which gives rise to either a passive reading of (6) with as subject or an active reading with a zero subject and as object.

2.4. Selection restrictions

Looking for globally consistent syntactic analyses, then, helps eliminate some possible interpretations, and at the same time fills in some bits of phonetically relevant information. We are still, however, often left with multiple competing interpretations, as in Fig. 10.

Fig. 8. Clitic pronouns (0.28 s).

Fig. 9. Transitive readings with zero subjects vs. passive readings.


(7)

Fig. 10. Multiple interpretations of (7).


One way to choose between these is to see which one makes most sense. We take a fairly simple-minded approach to this. Nouns are assigned a position in a coarse taxonomic hierarchy, and verbs use this hierarchy to specify constraints on their arguments. The transitive class (i) entry for , for instance, specifies its subcat list as [agent:living, object:nonliving], where the first element in the pair specifies the thematic role of this argument and the second specifies what kind of thing should fill it. Then when a filler is suggested for the given role, the constraint is checked.
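A minimal sketch of such a check (ours, in Python; the taxonomy and word types are illustrative, not the system's hierarchy):

```python
# Checking a subcat list like [agent:living, object:nonliving] against a
# coarse taxonomic hierarchy.

ISA = {   # child -> parent in a toy hierarchy
    "human": "animal", "animal": "living", "plant": "living",
    "book": "artifact", "artifact": "nonliving",
}

def is_a(category, ancestor):
    """Walk up the hierarchy looking for the required ancestor."""
    while category is not None:
        if category == ancestor:
            return True
        category = ISA.get(category)
    return False

def check_fillers(subcat, fillers):
    """subcat: (role, required type) pairs; fillers: role -> type of the
    suggested filler. Every constraint must be satisfied."""
    return all(is_a(fillers[role], required) for role, required in subcat)

subcat = [("agent", "living"), ("object", "nonliving")]
```

A human agent writing a book passes the check, while swapping the fillers fails it; the walk up the hierarchy is what makes the test cheap.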

There are a number of well-known problems with using selection restrictions of this kind for disambiguation. Firstly, it is extremely difficult to assign appropriate categories. Secondly, once you have done so, you inevitably rule out non-standard (e.g. metaphorical) uses of words. The advantage is that they can be checked extremely quickly. Fall (1990) and Aït-Kaci et al. (1989), for instance, suggest an encoding of simple type hierarchies for which incompatibility can be checked with a couple of operations on bit-strings, and description logics (Ohlbach and Koehler, 1997; Baader and Sattler, 2001) allow rather more expressive power whilst still supporting very efficient inference over types. We choose to employ a logic which allows us to use a type lattice, rather than a simple type hierarchy, as in Fig. 11.
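The bit-string idea can be illustrated as follows: give each type a code with one bit per leaf type below it, so that two types are compatible exactly when the AND of their codes is non-zero. This is a simplified sketch in the style suggested by Fall (1990) and Aït-Kaci et al. (1989), not a reproduction of their full encoding:

```python
# Hierarchy: thing > {living > {animal, plant}, nonliving}.
# Leaves get one bit each; a type's code is the OR of its leaves' bits.
ANIMAL, PLANT, NONLIVING = 0b001, 0b010, 0b100
CODES = {"animal": ANIMAL, "plant": PLANT, "nonliving": NONLIVING,
         "living": ANIMAL | PLANT,
         "thing": ANIMAL | PLANT | NONLIVING}

def compatible(t1, t2):
    """Two types unify iff they share at least one leaf: a single AND."""
    return (CODES[t1] & CODES[t2]) != 0

print(compatible("animal", "living"))     # True: animal is a kind of living
print(compatible("animal", "nonliving"))  # False: disjoint branches
```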

Note that we distinguish between partitions, such as the division between animal and plant, and simple subset relations (e.g. the fact that anything which has a gender is a living organism), with (...) being used to mark partitions and [...] to mark possibly overlapping subsets. Thus (living[(animal(...), plant(...)), gendered(...), age(...)]) says that animal, plant, gendered and age are all subsets of living, but that animal and plant are mutually exclusive. The upper case letters mark reentrant points, so that if something is described as a man then we know that it is human (and hence a mammal, and hence ...) and male and adult.

Fig. 11. Simple type lattice.



We then express constraints in terms of conjunctions of positive and negative features, so that a male child, for instance, might be described as male & ¬adult. The complexity of comparing two such descriptions is O((M + N) × K), where M and N are the numbers of conjuncts in the two descriptions and K is the maximum distance from any term in either description to a maximal element in the lattice. This is slower than using the simple bit-string encoding, and faster than using a description logic, with an expressive power that is correspondingly somewhere in between.
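A sketch of this kind of check, under our own simplified representation of the lattice in Fig. 11 (the table of supertypes below is illustrative and abbreviated):

```python
# Toy lattice: each term maps to its immediate supertypes (a lattice, so
# a term may have several, e.g. the reentrant 'man').
SUPER = {"man": {"human", "male", "adult"}, "human": {"mammal"},
         "mammal": {"animal"}, "animal": {"living"},
         "male": {"gendered"}, "female": {"gendered"},
         "adult": {"age"}, "child": {"age"},
         "gendered": {"living"}, "age": {"living"}, "living": set()}

def up(term):
    """All terms reachable upwards from term, including itself."""
    seen, todo = set(), [term]
    while todo:
        t = todo.pop()
        if t not in seen:
            seen.add(t)
            todo.extend(SUPER.get(t, ()))
    return seen

def satisfies(term, description):
    """description: a list of (term, positive?) conjuncts, so that
    'male & not adult' is [("male", True), ("adult", False)]."""
    ancestors = up(term)
    for d, positive in description:
        if positive != (d in ancestors):
            return False
    return True

# 'man' reaches male AND adult, so it fails "male & not adult":
print(satisfies("man", [("male", True), ("adult", False)]))  # False
```

The cost per conjunct is bounded by the length of the upward walk, matching the O((M + N) × K) figure given above.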

We make the constraints on verb arguments as general as possible, and we make the classification of nouns as specific as possible. All we are using the taxonomy for here is to help us choose between readings of a single verb. The object of , for instance, must be something you can read, so we could say that it demands written material for its object. All we need for distinguishing between  and , however, is the fact that the object of  must be non-living whereas the object of  must be living. The more general the constraints we introduce, the less likely we are to rule out genuine readings.

There are several ways we could, in principle, improve on this approach to sense selection:

• We could try to use corpus data about co-occurrence patterns, rather than using a hand-coded taxonomy. To do this we would, clearly, need an appropriate corpus. Obtaining an appropriate dataset for Arabic would be extremely difficult, since the raw texts being sampled will necessarily fail to distinguish between the different stems, and hence it will be impossible to collect statistics relating to them automatically. In any case, obtaining such a corpus is outside the scope of the work reported here.

• We could try to encode more detailed information about individual words, and then carry out appropriate inferences with this information to choose sensible interpretations. This is clearly an appealing choice, as has been argued by a number of authors (Hirst, 1987; Asher and Lascarides, 1995; Wedekind, 1996; Gardent and Konrad, 2000). The problem here is that you have to collect and encode an enormous amount of information, which is an extremely challenging task (Lenat and Guha, 1990), and that the inference processes required for using this information can be very slow.

We therefore choose to work with the very simple form of selection restrictions described above. In many cases the combination of syntactic constraints and selection restrictions will settle the matter. Where it does not, we do not have enough information to choose between underlying forms, and we just have to make an arbitrary choice and live with the consequences.

2.5. ‘Just-in-time’ constraints

The processes outlined above all help with the task of disambiguating the text, and hence determining the diacritics. As noted, however, simply running them as a pipeline (lexical lookup → morphology → syntax → semantics) would mean that choices had to be made early, without all the relevant information, and hence that the system would perform large amounts of backtracking. At various points in the discussion above, we referred to the use of 'just-in-time' constraints to avoid this problem as far as possible. In the current section we will look more closely at this technique, and at the effects of making good decisions about when to make choices.

Just-in-time constraints are tests which cannot be checked until the necessary information is specified. A typical (cleaned-up) example is shown in Fig. 12.

Fig. 12. Constraints on subject of MSA verbal sentence.



This constraint applies to the subjects of MSA verbal sentences. It says that when you know where the start of the subject is (which you will not know until you decide which item you believe to be the subject), then if the subject appears in its canonical position immediately after the verb (start@SUBJ=end@V), the verb can be third singular even if the subject is plural, though they are still expected to agree in gender; if, on the other hand, the subject precedes the verb, then the two items should agree in every detail, and furthermore the subject should be a definite NP.

The constraint itself is well known, and is fairly straightforward. The key innovation here is the observation that the point where it should be checked is when the subject and verb are combined. You cannot check it until you know whether the subject is in canonical position or not, but at the same time you cannot afford to generate two versions of the verb, one looking for a subject in canonical position with the weaker version of the constraint and another looking for it somewhere else with the stronger version.
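The deferral can be mimicked by packaging the test as a closure that is stored with the verb and only evaluated once the subject's position is known. This is an illustrative reconstruction in Python, not the system's actual machinery; the feature names are our own:

```python
def subject_constraint(verb, subj, subj_follows_verb):
    """Agreement check for MSA verbal sentences, evaluated only once we
    know where the subject stands relative to the verb."""
    if subj_follows_verb:
        # Canonical VS order: the verb may stay third singular even with
        # a plural subject, but gender must still agree.
        return verb["gender"] == subj["gender"]
    # SV order: full agreement, and the subject must be a definite NP.
    return (verb["gender"] == subj["gender"]
            and verb["number"] == subj["number"]
            and subj["definite"])

verb = {"gender": "masc", "number": "sing"}
plural_subj = {"gender": "masc", "number": "plur", "definite": True}

# Deferred check: build the test now, run it when the position is known.
pending = lambda follows: subject_constraint(verb, plural_subj, follows)
print(pending(True))   # True: VS order tolerates the number mismatch
print(pending(False))  # False: SV order demands full agreement
```

Only one version of the verb is ever built; the choice between the weak and strong form of the constraint is resolved when the closure is finally called.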

This phenomenon is widespread. There are many constraints on individual items that cannot be checked until you know about the context in which the item is being used:

• you cannot check whether a noun should carry the definite or indefinite form of a case marker until you know whether it is the head of a construct phrase (and since in most cases the form is not written but is nonetheless phonetically different, this makes a difference for TTS);

• you do not know whether the subject and verb have to agree in number until you know whether the subject is in canonical position;

• you do not know the form of the complementiser until you know (a) what the governing verb is and (b) what the mood of the embedded clause is;

• you do not know whether the subject of a sentence is nominative or accusative until you know the form of the governing complementiser;

• you may not know the form of a noun until you know the context in which it is being used.

By declining to choose between options like these until we have the relevant information, we can avoid exploring multiple closely related hypotheses.

3. Phonetic transcription

The linguistic analysis outlined above is useful for TTS since it helps determine the underlying form corresponding to a particular surface form in a particular context. It helps with diacriticisation, and it also provides information about such matters as case marking, mood marking and the forms of complementisers. This is not, however, its sole function.

Our overall goal is to provide spoken output which is both intelligible (for which we need, at least, the diacritics) and natural. To obtain natural-sounding speech, we have to impose an appropriate intonation contour. The linguistic analysis of Section 2 also supplies the information we need for this part of the task.

Roughly speaking, the intonation of spoken Arabic is determined by the following factors:

Utterance type: the overall function of the utterance (statement, query, command) determines a pair of trend lines.

Phonetic chunking: the utterance may be made up of a number of phonetic phrases, sections each of which has its own internal prosodic shape.

Syllabification: determining the syllables that make up the utterance is important for two reasons. (i) The general tune of the utterance assigns points on the higher trend line to stressed syllables, with non-stressed syllables falling away towards the lower one. (ii) Certain local phonological effects, such as the spreading of 'emphatic' sounds,5 are affected by the syllable structure.

5 These are also known as 'coloured' sounds: the term refers to a set of phonetic properties, notably pharyngealisation, rather than to the notion that some item is particularly important. In some dialects phonemes can inherit emphatic marking from neighbouring emphatic sounds by a process of 'spreading' outlined in step 2 of the transformations below.



3.1. From structure to phonemes

We proceed in two stages. We use the structural analysis to produce an annotated sequence of phonemes. As noted above, the structural analysis provides us with the fully diacriticised form of the text, together with information about the utterance type and the division into phonetic phrases.

We obtain the utterance type directly from the syntactic analysis. A sentence containing the interrogative particle  or an interrogative pronoun such as  is assumed to be a question. For other sentences, the form of the verb determines whether we regard them as statements or commands. It is not, in fact, always possible to distinguish between statements and commands simply on the basis of the appearance of the text. We do not, however, currently have any way of making use of wider contextual information for making a choice between these two, and hence if the surface form fails to distinguish between them we assume that the utterance was a statement.

Obtaining phonetic phrases is a slightly more complex task. The situation is not helped by the fact that whilst there is fairly widespread agreement that the prosody of spoken Arabic depends on grouping the text into chunks, each of which has its own local contour, there is no consensus about the nature of these chunks. Suggestions include phrases, phrasal daughters of the main verb, and clauses (Zaki and Rajouani, 2002; Hirst and Di Cristo, 1998; Benkirane, 1998). We have taken the view that the main clause together with any non-clausal arguments in canonical position forms a single phonetic phrase, with sentential complements and constituents in non-canonical positions forming phonetic phrases of their own.

Treating sentential complements as phonetic phrases is well attested in the literature. There is also some evidence that items in non-canonical positions have their own prosodic contours. This is likely to be linked to the fact that, as in other languages (Steedman, 2000), the main reason for shifting items to such positions is to mark them as being in some way interesting. Providing them with a marked intonation pattern helps do the same job.

The linguistic analysis of Section 2 is required for spotting such items. You can hardly determine whether some item is in canonical position or not unless you have some idea of where the canonical position is and can compare the position where something actually appears with where it would normally have been expected to appear.

The result of these two stages is a sequence of phonemes, together with markers showing boundaries between morphemes, words and phonetic phrases. For , for instance, we obtain the following sequence:

?i+‘taqad+0+a & 0 & ?anna & -hu & *‘ulim+0+0+a & {‘}{a}{l}{a} & -hu

This contains a mixture of phonemes (in a fairly straightforward transliteration, with standard alphabetic characters plus the symbols ? and ‘ representing distinct phones, and 0 for an empty affix) and control symbols, where + marks a morpheme boundary, & a word boundary, and * a phonetic phrase boundary. In addition, the curly brackets {...} mark phones that cannot be stressed, and - indicates cliticisation.
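Decomposing such a string is a matter of splitting on the control symbols from the outside in. The following sketch uses the symbols as just described, but the sample string is invented, since the real sequence transliterates an Arabic example:

```python
def split_levels(seq):
    """Decompose an annotated phoneme string into phrases, words and
    morphemes, using * for phrase, & for word and + for morpheme
    boundaries; words prefixed with '-' are clitics."""
    phrases = []
    for phrase in seq.split("*"):
        words = []
        for word in phrase.split("&"):
            clitic = word.startswith("-")
            morphemes = word.lstrip("-").split("+")
            words.append({"clitic": clitic, "morphemes": morphemes})
        phrases.append(words)
    return phrases

# Invented example: one phrase boundary, a clitic pronoun in phrase one.
result = split_levels("qa+ra+?a&-hu*ka+tab+a")
print(len(result))             # 2 phonetic phrases
print(result[0][1]["clitic"])  # True: '-hu' is cliticised
```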

We then apply a series of transformations, encoded as FSTs, as follows:

(1) Obtain allophones: we replace the definite article by  if it is followed by a fricative consonant, and we delete glottal stops from sequences like  and . At this point we do various bits of bookkeeping such as deleting empty phonemes and marking short and long vowels.

(2) Find syllable boundaries and deal with emphatic sounds: we split the sequence into CVC and CV syllables, and then look for emphatic sounds. The rules here are fairly intricate:

(a) if an 'emphatic consonant' is followed by a long  or a short , then this vowel is marked as being emphatic, and the following syllable is also marked as being emphatic if its vowel is  or  (and the one after that, if it has a suitable vowel, and so on for as long as the rule applies),

(b) any syllable containing either  or  and  is marked as emphatic,



(c) any syllable containing either  or  and  is marked as being emphatic, and the same propagation of this property as in (2a) applies.

(3) Mark stress: within each phonetic phrase, the following rules are used to assign stress.

(a) if there is just one syllable then it receives the stress for the phrase,

(b) if there is a 'heavy' syllable (one with a long vowel or with two final consonants) then this will receive the stress,

(c) if there are four or more syllables, then the last one with a long vowel will receive the stress. If there is no such syllable, then the antepenultimate one gets the stress for the phrase,

(d) if there are two or three syllables, the penultimate one gets the stress.

These rules accord with McCarthy (1979)'s description of stress for Cairene speakers. Speakers of different local variants of Arabic tend to import the stress pattern from their native dialect when speaking MSA, so there is no universal stress pattern. We have implemented this set of rules partly because they are attested in the literature for Cairene speakers, and partly because one of the authors is an Egyptian native and as such is comfortable with this pattern.
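Rules (3a–d) can be sketched as follows. The syllable representation (strings with ':' marking a long vowel) is our own, and where the statement of the rules leaves a choice open (a phrase with several heavy syllables) we take the last one:

```python
VOWELS = set("aiu")

def is_heavy(syl):
    """Heavy: a long vowel (marked 'a:', 'i:', 'u:') or two final consonants."""
    return ":" in syl or (len(syl) >= 2 and
                          syl[-1] not in VOWELS and syl[-2] not in VOWELS)

def stress_position(syllables):
    """Index of the stressed syllable in a phrase, per rules (3a-d)."""
    n = len(syllables)
    if n == 1:                                    # (a) single syllable
        return 0
    heavy = [i for i, s in enumerate(syllables) if is_heavy(s)]
    if heavy:                                     # (b) heavy syllable wins
        return heavy[-1]
    if n >= 4:                                    # (c) last long vowel,
        longs = [i for i, s in enumerate(syllables) if ":" in s]
        return longs[-1] if longs else n - 3      #     else antepenultimate
    return n - 2                                  # (d) two or three: penult

print(stress_position(["ka", "tab", "a"]))  # 1 (penultimate; no heavy CVCC)
```

Note that under the CVC/CV syllabification used here a plain CVC syllable is not heavy; only long vowels or final consonant clusters count.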

(4) Assign local relative pitch. Syllables are assigned LL, L, H or HH pitch relative to their neighbours as follows:

(a) the final stressed syllable of a declarative sentence is marked LL and the final stressed syllable of an interrogative sentence is marked HH,

(b) particles and prepositions are marked L,

(c) any stressed syllable that has not already been assigned a pitch is marked H,

(d) all other syllables are marked L.

(5) Assign numerical pitch values and lengths. We use the MBROLA (Dutoit et al., 1996; Dutoit, 1997) speech synthesis system to convert our specification of the phonetic form into sounds. MBROLA is a diphone-based synthesiser which requires you to specify the phone, its duration and its pitch. We specify the duration by using the values given by Alani (1970). Unfortunately, Alani only gives values for duration when the phone occurs in initial position. In particular, his data does not provide any information about the effects of preceding or following phones, but using this data does produce intelligible speech, and collecting reliable data about phone duration in particular local phonetic contexts is beyond the scope of the work reported here. Baloul et al. (2002) and Zaki and Rajouani (2002) argue that the pitch contour for Arabic can be described by assigning upper and lower 'trend lines'. The shape of the two trend lines is determined by the type of the utterance: generally falling for declarative sentences, generally rising for interrogatives. These 'tunes' apply to each phonological phrase in an utterance, so that each phonological phrase of a declarative sentence has a general fall and each phonological phrase of an interrogative sentence has a general rise (Rifaat, 2005). HH syllables then lie above the top trend line, H syllables lie on it, and LL and L syllables lie below or on the bottom trend line respectively.

We implement this by defining the upper trend line by Pmax − (Pmax − Pmin) × (n/N). Pmax and Pmin are the observed maximum and minimum values for stressed syllables for a typical speaker (e.g. Pmax = 150 Hz, Pmin = 125 Hz for a typical male speaker), N is the number of phones in the phrase and n is the position of this phone within the phrase. A similar lower trend line is specified with the same shape but lower start and end points. Phones marked HH in the previous section are assigned pitch values a fixed interval above the top trend line, phones marked H are assigned pitch values on the top line, ones marked L are assigned values on the bottom line and ones marked LL are assigned values a fixed interval below the bottom line (for the example illustrated in the paper, and for the tests described in Section 4, this interval was set to 20 Hz). The same procedure is followed for questions, with the only difference being that the top trend line is defined by Pmin + (Pmax − Pmin) × (n/N) (i.e. rising instead of falling), with the bottom one again having the same overall shape but different start and end points.
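The pitch assignment in step (5) can be sketched numerically as follows. The trend-line formulas, the example values Pmax = 150 Hz and Pmin = 125 Hz, and the 20 Hz interval come from the text; modelling the lower line as the upper one shifted down by that same fixed interval is our simplification, since the text only says it has the same shape but lower start and end points:

```python
def pitch(mark, n, N, p_max=150.0, p_min=125.0, interval=20.0, rising=False):
    """Pitch (Hz) for the n-th of N phones in a phrase, given its relative
    mark (HH/H/L/LL). The lower trend line is sketched as the upper line
    shifted down by 'interval', which is an assumption of this sketch."""
    if rising:   # interrogative: Pmin + (Pmax - Pmin) * (n/N)
        top = p_min + (p_max - p_min) * (n / N)
    else:        # declarative:   Pmax - (Pmax - Pmin) * (n/N)
        top = p_max - (p_max - p_min) * (n / N)
    bottom = top - interval
    return {"HH": top + interval, "H": top,
            "L": bottom, "LL": bottom - interval}[mark]

# First phone of a 10-phone declarative phrase, stressed: on the top line.
print(pitch("H", 0, 10))    # 150.0
# Final stressed syllable of a declarative, marked LL: well below it.
print(pitch("LL", 9, 10))   # 150 - 22.5 - 20 - 20 = 87.5
```

Running the same marks with rising=True yields the mirror-image contour for questions.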

The output of all this for a couple of typical examples is shown in Figs. 13 and 14. The contours here were obtained by running the output of the system through the Praat speech analyser (Boersma and Weenink, 2005). Doing this has two advantages: (i) we can compare the contours that we produce with the contours for native speakers. Studying a graphical representation of the contour allows us to look closely at the effects of changes to the algorithm or to the parameters outlined above, though it should never be regarded as a complete replacement for actually listening. (ii) Using MBROLA for the final stage of generation introduces perturbations into the contour that are beyond our control. It is therefore better to look at the contour after it has been run through MBROLA than just to look at the pitch values as we specify them, since there is no guarantee that they will be exactly preserved by MBROLA.

Fig. 13. Pitch contour for (1). [axes: Time (s), 0–5.11506; Pitch (Hz), 100–150]

Fig. 14. Pitch contour for (1'). [axes: Time (s), 0–5.48231; Pitch (Hz), 100–150]

(1)

(1')

Figs. 13 and 14 contain very marked drops, which were not part of the contour that we specified. These occur at very low intensity, probably unvoiced, segments of the output, and may be artefacts of the MBROLA output (possibly arising from misclassification of voiced segments in the diphone database we are using). They are not, in any case, audible, but they illustrate the potential dangers in relying solely on graphical presentations of the pitch of the actual synthesised speech.

4. Evaluation and conclusions

The goal of the work described above was to use fine-grained linguistic analysis to help produce spoken output which was both intelligible and natural sounding from undiacriticised Arabic text. In order to assess whether we had achieved this goal we carried out a linked pair of experiments based on playing the output of two synthesisers for a set of 23 test sentences to a group of 14 subjects. The details and rationales of these experiments are given below. The set of test sentences is given in Appendix A.

4.1. Intelligibility

We asked the subjects to write down the fully diacriticised forms of the output of the two systems, and we compared what they wrote with a reference text produced by someone who had read the undiacriticised text (the subjects were not given access to this text). The reference text was required because it is not possible to compare the transcriptions directly with the input text, since this has no diacritics to compare with. The goal was therefore to produce output which was as near as possible to what would be produced by a native speaker from the same input.

The aim of this part of the experiment was largely to assess the accuracy of our diacriticisation, but also to ensure that the phonetic realisation of our diacriticised forms was understandable.


Fig. 15. Error classification for Parasite and Sakhr: Sakhr scores in parentheses.


We carried out this part of the experiment with the output of two synthesisers, namely the system described above (referred to as Parasite in the discussion below) and the commercial system Sakhr.6 The goal was to obtain a comparison between our success in this part of the task and that of one of the leading TTS systems for Arabic.

We scored the transcriptions of the spoken output by recording cases where the transcription contained a word which was of a different syntactic category from the corresponding word in the reference set; or was of the same category but had a different inflectional form (this covered variations in affixes and some internal vowel patterns); or where there was some other kind of discrepancy between the transcription of the spoken output and the reference set. The results are given in Fig. 15, where we have given the total number of errors of each kind recorded for each sentence for both systems, with the score for Parasite being given first and that for Sakhr in parentheses. The inter-subject agreement was very high (around 95% for both sets of transcriptions), so the figures in Fig. 15 simply include the most common transcription in each case.

These results show that where we are able to carry out the linguistic analysis described above, the output sometimes assigns part-of-speech tags more accurately than Sakhr and fairly frequently assigns more appropriate inflectional morphology (generally case marking). In particular, Sakhr appears to assume that verbal sentences are in canonical order, and to assign case markers on this basis, whereas our treatment of word order allows us to be more careful about allocating the syntactic roles of subject and object. The other cases where the Sakhr inflectional morphology differs from the reference text are fine-grained cases such as choosing the voice of the verb following a complementiser.

There is, of course, a major caveat here: the test sentences are fairly short, and the parser described above does become markedly slower and less reliable as sentence length increases. The most we can claim, then, is that on comparatively short sentences the kind of information we are exploiting is effective. When texts get substantially more complex, it will be necessary to use a combination of fine-grained linguistic analysis and less sensitive, but more robust, techniques.

We have included a column for 'other' errors. The main entries in this column for Parasite concern cases where the system's articulation is misleading, so that subjects have written down a phoneme which is different from the one we specified in the phonetic description. The main entries in this column for Sakhr arise from selection of the wrong sense of a word. However, given that subjects are likely to allow their intuitions about what makes sense to override the acoustic evidence in marginal cases, we suspect that there may actually be more errors of this kind in the output of both systems than are reported by our subjects.

4.2. Prosody

Assessing the intonation contour turned out to be quite difficult. In MSA, questions typically have clear lexical or structural markers, such as the presence of an interrogative particle, and the presence or absence of such markers generally overrode the effects of the intonation contour. This contrasts with English, where it is quite easy to turn a sentence whose form indicates that it is a statement into a question by imposing the appropriate contour: none of our subjects ever classified sentences as queries when we tried this for Arabic.

6 http://www.sakhr.com/L_Item/whitepaper/TTS.htm

Fig. 16. Preferred intonation contours.

We therefore chose to play each sentence with both contours (with a random order of presentation) and asked our subjects which contour sounded more natural in each case. The goal here was to see whether they chose the claim contour for declarative forms and the query contour for interrogatives. If in the majority of cases they chose the correct contour then we can at least say that our contours are appropriate. Since Sakhr imposes no overall contour, we did not record judgements on this for Sakhr. The results of this experiment are shown in Fig. 16.

Subjects were less consistent in their judgements than in the previous experiment, with a number of people judging that the query contour was more appropriate for a number of sentences whose form marked them as being declarative. Nonetheless, declarative sentences generally sounded better with the claim contour and interrogatives universally sounded better with the query contour.

The inclusion of emphatic sounds and junctures (deletion of final short vowels at the end of sentences and before glottal stops) improved the quality of the output, but they also highlighted a problem with using diphone-based synthesis. The diphone set supplied with MBROLA only includes a fairly small set of diphones involving emphatic sounds, and certain pairs that we would like to include are missing. This issue relates to pairs of phones: the set of phones includes most of the emphatic sounds, but if one of the emphatic sounds did not occur adjacent to some specific phoneme in the set that was used for obtaining the diphones, the synthesiser will fail even if the individual sounds are present. We therefore had to include a step which replaced missing emphatic pairs by non-emphatic ones, which restricted the usefulness of including them in the first place. Nonetheless, where we were able to include them they did improve the perceived quality of the generated speech.

5. Conclusions

The aim of the work reported here was to show how fine-grained linguistic analysis could help with the generation of spoken output from written undiacriticised Arabic text. The first part of the paper outlined how we obtain the linguistic analysis, with particular emphasis on the use of just-in-time constraints to manage the interaction between various levels of description. The highly underspecified nature of undiacriticised forms means that this is particularly important for Arabic, where information that is required for making decisions about diacriticisation often emerges quite late. By leaving decisions about case marking, agreement and voice until the wider context is recognised, we manage to avoid a great deal of backtracking that would otherwise be incurred.

The second part showed how to use this information in producing a narrow phonetic description for use in a diphone-based synthesiser, and considered the quality of the final synthesised output. We considered this from two points of view: do hearers perceive the same forms as would be obtained by someone reading the input text and transcribing it, and do the intonation contours that we impose match the mood of the target sentences? The results of the first part show that the linguistic analysis does produce good quality output, but there are clearly issues of scale and robustness here: we have a fairly small lexicon (about 100 roots, giving rise to about 800 semantically distinct stems), and the parser does grind to a halt with sentences of 15 or more words. It therefore seems likely that some combination of fine-grained linguistic analysis and robust statistical analysis would be more effective than either in isolation. The results of the second part show that the intonation contours that we produce are appropriate, but that the lexical/syntactic markers for mood are too strong to be easily overridden by prosody. This part of the work is thus promising but requires further investigation.

Appendix A. Test sentences

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

(12)

(13)

(14)

(15)

(16)

(17)

(18)

(19)

(20)

(21)

(22)

(23)

Appendix B. Transcriptions

The following are a typical set of examples of cases where the transcription produced for the output of the two synthesisers differs (the ? is the MBROLA code for the glottal stop).


References

Abdul-Raof, H., 1998. Subject, Theme and Agent in Modern Standard Arabic. TJ Press International, Cornwall.
Ades, A.E., Steedman, M.J., 1982. On the order of words. Linguistics and Philosophy 4, 517–558.
Aït-Kaci, H., Boyer, R., Lincoln, P., Nasr, R., 1989. Efficient implementation of lattice operations. ACM Transactions on Programming Languages 11 (1), 115–146.
Alani, S., 1970. Arabic Phonology: An Acoustical and Physiological Investigation. Mouton, The Hague.
Asher, N., Lascarides, A., 1995. Lexical disambiguation in a discourse context. Journal of Semantics.
Baader, F., Sattler, U., 2001. An overview of tableau algorithms for description logics. Studia Logica 69, 1.
Baloul, S., Alissali, M., Baudry, M., Boula de Mareuil, P., 2002. Interface syntaxe-prosodie dans un système de synthèse de la parole à partir du texte en arabe. In: XXIV Journées d'Études sur la Parole. CNRS/Université Nancy 2, Nancy.
Beesley, K., 1996. Arabic finite-state morphological analysis and generation. In: COLING-96, Copenhagen, pp. 89–94.
Benkirane, T., 1998. Intonation in Western Arabic (Morocco). In: Hirst, D.J., Di Cristo, A. (Eds.), Intonation Systems: A Survey of Twenty Languages. Cambridge University Press, Cambridge, pp. 345–359.
Boersma, P., Weenink, D., 2005. Praat: doing phonetics by computer (version 4.3.07) [computer program]. Retrieved from http://www.praat.org/.
Chomsky, N., 1981. Lectures on Government and Binding. Foris Publications, Dordrecht.
de Jong, K., Zawaydeh, B.A., 1999. Stress, duration, and intonation in Arabic word-level prosody. Journal of Phonetics 27.
Dutoit, T., 1997. An Introduction to Text-To-Speech Synthesis. Kluwer Academic Publishers, Dordrecht.
Dutoit, T., Pagel, V., Pierret, N., Bataille, F., van der Vreken, O., 1996. The MBROLA project: towards a set of high-quality speech synthesizers free of use for non-commercial purposes. In: Proc. ICSLP'96, vol. 3, Philadelphia, pp. 1393–1396.
Fall, A., 1990. Reasoning with taxonomies. Ph.D. Thesis, Simon Fraser University.
Fehri, A.F., 1993. Issues in the Structure of Arabic Clauses and Words. Kluwer Academic Publishers, Dordrecht.
Gardent, C., Konrad, K.K., 2000. Interpreting definites using model generation. Journal of Language and Computation 1 (2), 215–230.
Hirst, D.J., Di Cristo, A., 1998. Intonation Systems: A Survey of Twenty Languages. Cambridge University Press, Cambridge.
Hirst, G., 1987. Semantic Interpretation and the Resolution of Ambiguity. Studies in Natural Language Processing. Cambridge University Press, Cambridge.
Kiraz, G., 2001. Computational Nonlinear Morphology: With Emphasis on Semitic Languages. Cambridge University Press, Cambridge.
Lenat, D.B., Guha, R.V., 1990. Building Large Scale Knowledge Based Systems. Addison Wesley, Reading, MA.
Malouf, R., 1996. A constructional approach to English verbal gerunds. In: Proceedings of the Twenty-second Annual Meeting of the Berkeley Linguistics Society, Marseille. Available from: <http://hpsg.stanford.edu/rob/papers/>.
McCarthy, J., 1979. Formal problems in semitic phonology and morphology. Ph.D. Thesis, MIT.
McCarthy, J., 1993. Template form in prosodic morphology. In: Papers from the Third Annual Formal Linguistics Society of Midamerica Conference, Bloomington, Indiana University Linguistics Club.
McCarthy, J., Prince, A., 1990. Prosodic morphology and templatic morphology. In: Eid, M., McCarthy, J. (Eds.), Perspectives on Arabic Linguistics II: Papers from the Second Annual Symposium on Arabic Linguistics, Amsterdam. Benjamins, pp. 1–54.
Mohammed, A., 2000. Word order, agreement and pronominalisation in standard and Palestinian Arabic. Current Issues in Linguistic Theory, 1–81.
Nerbonne, J., Netter, K., Pollard, C. (Eds.), 1994. German in Head-Driven Phrase Structure Grammar. CSLI Lecture Notes. Center for the Study of Language and Information, Stanford.
Ohlbach, H.J., Koehler, J., 1997. Role hierarchies and number restrictions. In: Description Logics 97, Paris.
Pollard, C.J., Sag, I.A., 1988. An Information Based Approach to Syntax and Semantics: vol. 1, Fundamentals. CSLI Lecture Notes 13. Chicago University Press, Chicago.
Ramsay, A.M., 1999. Direct parsing with discontinuous phrases. Natural Language Engineering 5 (3), 271–300.
Rifaat, K., 2005. Structure of Arabic intonation: a preliminary investigation. In: Alhawary, M.T., Benmamoun, E. (Eds.), Perspectives on Arabic Linguistics XVII–XVIII. John Benjamins, Amsterdam/Philadelphia, pp. 49–67.
Sadler, L., 1996. New developments in LFG. In: Brown, K., Miller, J. (Eds.), Concise Encyclopedia of Syntactic Theories.

Elsevier Science, Oxford.Steedman, M., 2000. The Syntactic Process. MIT Press, Cambridge, MA.Wedekind, J., 1996. On inference-based procedures for lexical disambiguation. In: Tsujii, J.-I. (Ed.), Proceedings of the 16th International

Conference on Computational Linguistics (COLING-96), Copenhagen, pp. 980–985.Zaki, A., Rajouani, A., 2002. Rule based model for automatic synthesis of F0 for declarative Arabic sentences. In: Speech Prosody 2002,

Aix-en-Provence, France.