Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Voice source properties of the speech code Fant, G. and Kruckenberg, A. journal: TMH-QPSR volume: 37 number: 4 year: 1996 pages: 045-056 http://www.speech.kth.se/qpsr



TMH-QPSR 4/1996


Voice source properties of the speech code¹

Gunnar Fant and Anita Kruckenberg

Abstract

This is an outline of the knowledge we need in order to include the voice source in an advanced model of speech production with applications to text-to-speech rules. Recent results from studies of the Swedish language provide information on source properties and source-vocal tract interaction as a function of the segmental and prosodic frame within an utterance and with reference to aerodynamic conditions. Our study discusses source modelling, individual voice qualities, segmental dependencies, the influence of stress and accentuation, phrase contours, and covariation with F0.

¹ Also presented at The Acoustical Society of America and The Acoustical Society of Japan, Third Joint Meeting, Special Session 2aSC, December 1996, Honolulu, Hawaii.

Introduction

A challenge for speech research and technology is to advance our knowledge base for preserving a maximum of efficiency in speech recognition and for improving the quality of text-to-speech synthesis. This is more of a problem in formant or articulatory coded synthesis than in concatenative systems like PSOLA. However, a knowledge of factors affecting source parameters could also be adopted in concatenation strategies.

The basic problem is not that of the voice source alone. Covarying noise and various source-filter interaction phenomena also have to be considered, e.g. changes of source waveform and amplitude induced by supraglottal constrictions and the joint dependencies of both source and filter functions on the glottal state and overall aerodynamic conditions. Dynamic variations and coarticulatory phenomena within an utterance are more complex to model than speaker specific average data.

There are similarities in the loss of source efficiency, e.g. in amplitude and high frequency content, between the effects of supraglottal constriction and those of glottal abduction before a pause. These relations are best dealt with in the frame of an articulatory synthesis system where all relevant source properties become the automatic consequences of glottal and supraglottal articulations and lung pressure.

The human vocal cords exhibit a high degree of flexibility but also of instability in their vibratory modes, which are sensitive to aerodynamic as well as acoustic interactions. A large number of conditioning factors contribute to seemingly chaotic variations of small details from one voice period to the next, which add to the individual character of a voice. These are of secondary importance, but need to be studied.

We are beginning to understand most of the basic phenomena but we lack systematic and sufficiently complete descriptions. A problem is that we have very little experience from perceptual experiments. Much work is needed to gain insight into the relative perceptual salience of the various components of a source rule system.

Voice source parameterization

The basic requirement for developing voice source rules is an efficient parameterization system. We need to describe essentials with a small number of time varying parameters, with the option to go into finer details if needed, which requires an extended number of parameters. Covarying noise generation and modifications of the vocal tract transfer function must also be considered.

How do we define the source? The common concept is the pulsating glottal flow or its derivative, disregarding subglottal components and superimposed interaction ripple. The latter is essentially due to the finite presence of formant oscillations in the transglottal pressure drop, which determines the glottal flow in a square law dependency. This non-linear transform accounts for the presence of distributed spectral dips, typically a zero around 2F1, and in the time domain an extra positive peak prior to the main peak of the differentiated glottal flow function (Fant, 1986), which complicates the time domain matching of source parameters. It is more apparent in a non-leaky, relatively strong phonation than in a weak breathy phonation. Another deviation from the ideal LF-model is an extra excitation at the instant of glottal opening, which is seen in some voices. To these add the effects of supraglottal coupling in a leaky voice. All these phenomena add motivation for relying more on frequency domain than time domain derivation of source parameters.

The LF-model

The LF-model (Fant et al., 1985) is a useful approximation that has found wide use. For maximal correspondence between a natural sample and a synthetic replica it is necessary to adjust the LF-parameters to suit particular constraints such as higher pole conventions in cascaded formant synthesis systems, see Fant (1995).

The basic parameters of the LF-model are defined in the time domain but we make maximal use of frequency domain correspondences (Fant & Lin, 1988) and frequency domain matching (Fant, 1995). A consistent frequency domain system is now used at MIT (Stevens, 1994; Stevens & Hanson, 1994; Hanson, 1995). Transformations between their system and ours have been established.

The LF-model (Fig. 1) operates with three shape parameters, Rk, Rg and Ra, in addition to the voice fundamental frequency F0 and an excitation amplitude Ee. The parameter Rk determines the degree of pulse skewing; Rg and to a lesser degree Rk influence the duration of the glottal pulse, and Ra is a relative measure of the duration of the return phase following excitation at closure. Ee is the negative derivative of glottal flow at the instant of excitation, i.e. of maximal discontinuity.

The open quotient of the LF-source is

OQ = (1 + Rk)/(2Rg) + Ra    (1)

We usually disregard the Ra term in (1). However, Ra is the most important single parameter for spectrum shaping. It determines a cut-off frequency

Fa = F0/(2πRa)    (2)

where the glottal flow derivative spectrum, after an initial slope of -6 dB/oct, turns into a slope of -12 dB/oct.
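
As a quick numerical check of Eqs. (1) and (2), the following minimal Python sketch evaluates the open quotient and the cut-off frequency. The function names are illustrative and not part of the original work, and the parameters are assumed to be entered as fractions of the fundamental period (e.g. Ra = 2.36% is written 0.0236). The sample values are the default male-vowel parameters quoted later in the text.

```python
import math


def open_quotient(rk, rg, ra=0.0):
    """Eq. (1): OQ = (1 + Rk) / (2 Rg) + Ra, all parameters as fractions."""
    return (1.0 + rk) / (2.0 * rg) + ra


def fa_cutoff_hz(f0_hz, ra):
    """Eq. (2): Fa = F0 / (2 pi Ra), with Ra as a fraction of the period."""
    return f0_hz / (2.0 * math.pi * ra)


# Assumed sample values: Rk = 30.7%, Rg = 118%, Ra = 2.36% (Rd = 0.7 defaults).
rk, rg, ra = 0.307, 1.18, 0.0236
print(f"OQ without Ra: {open_quotient(rk, rg):.3f}")
print(f"OQ with Ra:    {open_quotient(rk, rg, ra):.3f}")
print(f"Fa at 100 Hz:  {fa_cutoff_hz(100.0, ra):.0f} Hz")
```

Keeping Ra as an optional argument mirrors the remark that the Ra term in (1) is usually disregarded.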

The low frequency part of the source spectrum in the region of the fundamental and the second harmonic is determined by the main shape of the glottal pulse. It is influenced by all three LF-parameters, not only by Rk and Rg but also by Ra. The upper part of the source spectrum is mainly determined by Fa.

Stevens & Hanson (1994) and Hanson (1995) quantify the low frequency region of the source spectrum by H1*, the amplitude of the voice fundamental, and H2*, the amplitude of the second harmonic, where * indicates a property of the source spectrum as distinct from the complete sound spectrum. We have established an empirical relation for deriving H1*-H2* from the open quotient (Fant, 1993).

H1* - H2* = -6 + 0.27 exp(5.5 OQ)    (3)
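
Eq. (3) is simple to evaluate directly. The sketch below is only an illustration with a hypothetical function name; OQ is assumed to be a fraction between 0 and 1 and the result is in dB.

```python
import math


def h1_minus_h2_source_db(oq):
    """Eq. (3): H1* - H2* in dB from the open quotient OQ (as a fraction)."""
    return -6.0 + 0.27 * math.exp(5.5 * oq)


# A smaller open quotient gives a relatively stronger second harmonic.
for oq in (0.415, 0.555, 0.73):
    print(f"OQ={oq:.3f}  H1*-H2* = {h1_minus_h2_source_db(oq):+.1f} dB")
```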

To facilitate spectrum matching without inverse filtering we have derived the following relations for predicting variations in H1 and H1-H2 from variations in Ra, Rk and Rg (Table 1).

These relations and the access to a codebook have facilitated frequency domain derivations of LF-parameters.

Fig. 1. The LF-model.

Table 1. Change in each of Ra, Rk and Rg needed to increase the level of the fundamental H1 by 1 dB and H1-H2 by 1 dB, keeping other parameters constant.

Parameter   dR/dH1   dR/d(H1-H2)
Ra            1.0        1.25
Rk            3.0        4.5
Rg          -12.0      -10.0

The Rd-transform

A major extension of the LF-model (Fant et al., 1994; Fant, 1995) has been to introduce a data reduction scheme whereby the waveshape parameters Rk, Rg and Ra are collapsed into a single parameter Rd.

Rd = (1/0.11)(0.5 + 1.2Rk)(Rk/(4Rg) + Ra)    (4)

This is an approximation to the basic physical definition

Rd = (Uo/Ee)(F0/110)    (5)

which is independent of the LF-model. Here, Uo is the amplitude of the glottal flow pulse above a possible steady leakage flow. The ratio Uo/Ee = Td is the effective duration in ms of the falling branch of glottal flow, see Fig. 1. At a normalising F0 = 110 Hz there is numerical identity between Rd and Td.
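
The two routes to Rd can be compared numerically. This is a minimal sketch with illustrative function names; Rk, Rg and Ra are assumed to be fractions, Td is assumed to be given in ms and F0 in Hz, and the sample values are the Rd = 0.7 defaults tabulated later in the text.

```python
def rd_from_lf(rk, rg, ra):
    """Eq. (4): Rd approximated from the LF shape parameters (as fractions)."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)


def rd_from_flow(td_ms, f0_hz):
    """Eq. (5): Rd from Td = Uo/Ee (in ms) and F0 (in Hz)."""
    return td_ms * f0_hz / 110.0


print(f"Rd from Eq. (4): {rd_from_lf(0.307, 1.18, 0.0236):.3f}")
print(f"Rd from Eq. (5) at Td = 1 ms, F0 = 110 Hz: {rd_from_flow(1.0, 110.0):.3f}")
```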

An advantage of introducing the Rd-parameter is that default values of Rk, Rg and Ra can be predicted from any particular Rd value. From statistical analysis (Fant et al., 1994; Fant, 1995) of data mainly from Gobl (1988) we have found

Rap = (-1 + 4.8Rd)/100    (6)

Rkp = (22.4 + 11.8Rd)/100    (7)

Rgp is obtained by inserting Eq. 6 and 7 into Eq. 4. These default values are summarised below.

Table 2. Default LF-parameters for a range of Rd values. Fap = F0/(2πRap) refers to F0 = 100 Hz.

Rd     Rap (%)   Fap (Hz)   Rkp (%)   Rgp (%)   OQp (%)
0.3     0.44      3600       26        179       35
0.5     1.40      1590       28.3      137       47
0.7     2.36       674       30.7      118       55.5
1.0     3.8        420       34.2      103       65
1.4     5.7        280       39.0       95       73
2.0     8.6        185       46.0       93.5     78
2.7    12.0        133       54.3       98.0     79
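
The default parameters can be regenerated from Eqs. (6) and (7), with Rgp obtained by solving Eq. (4) for Rg. The sketch below is one possible reading of that procedure, not the authors' code; the function name and the dictionary layout are assumptions, and OQp is taken from Eq. (1) without the Ra term. Small differences from the printed table are to be expected from rounding; any larger difference should be checked against the original report.

```python
import math


def default_lf_parameters(rd, f0_hz=100.0):
    """Default LF-parameters predicted from Rd (all ratios as fractions)."""
    rap = (-1.0 + 4.8 * rd) / 100.0          # Eq. (6)
    rkp = (22.4 + 11.8 * rd) / 100.0         # Eq. (7)
    # Eq. (4) solved for Rg with Rk = Rkp and Ra = Rap.
    rgp = rkp / (4.0 * (0.11 * rd / (0.5 + 1.2 * rkp) - rap))
    oqp = (1.0 + rkp) / (2.0 * rgp)          # Eq. (1) without the Ra term
    fap = f0_hz / (2.0 * math.pi * rap)      # Eq. (2)
    return {"Rap": rap, "Fap": fap, "Rkp": rkp, "Rgp": rgp, "OQp": oqp}


for rd in (0.3, 0.5, 0.7, 1.0, 1.4, 2.0, 2.7):
    p = default_lf_parameters(rd)
    print(f"Rd={rd:3.1f}  Rap={100 * p['Rap']:5.2f}%  Fap={p['Fap']:6.0f} Hz  "
          f"Rkp={100 * p['Rkp']:5.1f}%  Rgp={100 * p['Rgp']:6.1f}%  "
          f"OQp={100 * p['OQp']:5.1f}%")
```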

The predictability is often remarkably good (Fant et al., 1994; Fant, 1995). An appreciable part of the variance inherent in a complete LF-representation of a number of voice samples, e.g. from different speakers or within an utterance, is reduced by the Rd-transform, Eq. 4, which capitalises on constraints in parameter covariation. In practice, we often need to recover the complete LF-representation, which is accomplished by introducing the ratios of true and predicted parameters. These parameter coefficients are

ka = (Ra/Rap) = (Fap/Fa) (8a)

kg = (Rg/Rgp) (8b)

kk = (Rk/Rkp) (8c)

The value of kk is fully predictable from [Rd, ka, kg], which constitutes a waveshape vector. The parameter coefficients of Eq. 8 become useful for specifying contextual deviations from default rules or speaker specific glottal wave shapes, a kind of principal component analysis.
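
One way to see why kk follows from [Rd, ka, kg] is to recover the full LF-representation: Ra and Rg follow from Eq. 8a and 8b, and Eq. 4 then becomes a quadratic in Rk. The sketch below is an assumed implementation of that idea (function and key names are hypothetical, ratios are fractions), not a procedure given in the original text.

```python
import math


def lf_from_shape_vector(rd, ka=1.0, kg=1.0):
    """Recover Ra, Rg, Rk and kk from the waveshape vector [Rd, ka, kg]."""
    rap = (-1.0 + 4.8 * rd) / 100.0                      # Eq. (6)
    rkp = (22.4 + 11.8 * rd) / 100.0                     # Eq. (7)
    rgp = rkp / (4.0 * (0.11 * rd / (0.5 + 1.2 * rkp) - rap))
    ra = ka * rap                                        # Eq. (8a)
    rg = kg * rgp                                        # Eq. (8b)
    # Eq. (4) expanded as a*Rk^2 + b*Rk + c = 0.
    a = 0.3 / rg
    b = 0.125 / rg + 1.2 * ra
    c = 0.5 * ra - 0.11 * rd
    if c >= 0.0:
        raise ValueError("no positive Rk satisfies Eq. (4) for these inputs")
    rk = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return {"Ra": ra, "Rg": rg, "Rk": rk, "kk": rk / rkp}  # kk as in Eq. (8c)


print(lf_from_shape_vector(0.7))                  # default vector, kk close to 1
print(lf_from_shape_vector(0.7, ka=0.25, kg=1.5))
```

With ka=0.25 and kg=1.5 at Rd=0.7 the sketch returns Rk near 47% and Rg near 177%, in line with the speaker example discussed under "Speaker specific data".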

A breathy voice is characterised by a high ka, and a pressed voice by a high kg, a low ka and often a high kk. Females usually have higher Rd and ka than males, which implies lower Fa. Voiced consonants usually have higher Rd, higher ka and lower Rg than vowels. These data from Fant (1995, 1996) are in general agreement with those reported in Gobl (1988), Carlson et al. (1989) and Karlsson (1995).

A sequence of LF source spectra covering the main range of Rd-parameters is shown in Fig. 2.

The four Rd values, Rd=0.5, 0.7, 1.4 and 2.7, are associated with increasing open quotients OQ=0.35, 0.55, 0.73 and 0.79. They represent base forms of (1) a highly adducted, somewhat pressed male voice, (2) a normal male voice, (3) a normal female voice, and (4) breathy female phonation or breathy termination of voicing before a pause irrespective of voice type. The ka factor adds to the breathiness and the kg factor to the degree of press.

Voice source rules

The overall strategy for dealing with the influence of individual and contextual variations in connected speech is to submit the voice excitation parameter Ee and the shape vector [Rd, ka, kg] to a number of operational rules. These are conveniently handled on a decibel scale, which implies a summation of contributions from several factors and levels of analysis. Thus, the value of the source parameters at any instant of time is determined by an interpolation of data from adjacent target values specified at specific time locations within the utterance. These data are influenced by phonetic segment identities through the degree of supraglottal constriction and the state of the glottal opening, by speaker specific constants and habits, by the degree of emphasis, accentuation and stress, and by location within the utterance, especially with respect to its boundaries.

The rules take into account systematic covariations of Ee and Rd across segments and within a phrase contour and the extent to which Ee and Rd covary with F0. These F0 dependencies are of basic importance. They enable a prediction of default values of Ee(F0) valid for non-close vowels uttered at constant subglottal pressure, which determines the basic Ee(t) contour of an utterance prior to the application of contributions from all other factors mentioned above. As a result there is a general tendency towards a positive correlation between Ee and F0, which is further enhanced by the simultaneous F0 and Ee dips of voiced obstruents. The envelopes of Ee(t) and F0(t) show similar contours.

In a first approximation Rd is independent of the basic F0-contour. The covariation of Rd and Ee follows variational rules that supplement the segment specific data, especially in transitions between extreme values. A common rule is that a 1 dB increase in Ee is associated with a 0.5 dB decrease of Rd. According to Eq. 5, at constant F0, this implies that Uo varies with the square root of Ee. At a hard voiced onset Uo and Ee vary proportionally and Rd is constant. In the abducted termination of voicing before a pause or before an unvoiced consonant we find a tendency towards constant Uo, i.e. a 1 dB decrease of Ee is accompanied by a 1 dB increase of Rd. On a variational basis, we may thus define a relation

∆log(1/Rd) = Ks ∆log Ee    (9)

where the constant Ks has a normal range of variation from 0 to 1. In a transition towards a more efficient adducted phonation, an Ee increase may be attained at a lower Uo and the constant Ks attains negative values.
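
The three cases quoted above (Ks = 0.5, 0 and 1) can be reproduced with a few lines of arithmetic on the decibel scale. This is a minimal sketch with a hypothetical function name; it assumes constant F0, so that by Eq. 5 the change in Uo in dB is the sum of the changes in Rd and Ee.

```python
def rd_and_uo_change_db(delta_ee_db, ks):
    """Changes in Rd and Uo (dB) for a change in Ee (dB), at constant F0."""
    delta_rd_db = -ks * delta_ee_db            # Eq. (9)
    delta_uo_db = delta_ee_db + delta_rd_db    # Eq. (5): Uo proportional to Rd * Ee
    return delta_rd_db, delta_uo_db


print(rd_and_uo_change_db(1.0, 0.5))    # common rule: Uo follows the square root of Ee
print(rd_and_uo_change_db(1.0, 0.0))    # hard voiced onset: Rd constant
print(rd_and_uo_change_db(-1.0, 1.0))   # abducted termination: Uo constant
```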

Fig. 2. LF source spectra at F0=100 Hz and four representative Rd values.

Rule sequence

1. Define the domain and boundaries of a phrase or a breathgroup. Derive accentuations, relative stress levels and segment durations. Outline, on a semitone scale, a normalised F0 contour for the overall intonation including regions of focal accentuation. Include expected F0-perturbations due to extreme vocal tract narrowing or extreme glottal opening. This basic layout is a prerequisite generated by the text-to-phonetics rule components of the synthesis.

2. Select speaker specific data. These include a reference midfrequency F0r, of the order of 100-150 Hz for a male voice and 150-300 Hz for a female voice, and in addition a specification of the modulation range by an upper and a lower limiting F0. Select a source shape vector [Rd, ka, kg] and also formant frequency scale factors (Fant, 1975) appropriate for the sex of the speaker and the length of the vocal tract.

3. From the data in (1) and (2) construct a first approximation to the basic F0 contour within the utterance. Remap the default normalised Ee(F0) with respect to the speaker's particular F0r and calculate the corresponding basic Ee(t)-contour.

4. Add onset and offset Ee contours and deviations from the overall Ee declination not predictable from F0 alone, i.e. due to the declination of subglottal pressure, see Fig. 7.

5. Introduce tabulated segment specific reference values of Ee and [Rd, ka, kg]. Make note of the time locations within each phonetic segment where these reference data apply.

6. Add modifications with respect to the overall voice intensity level and to local variations of emphasis, accentuation and stress.

7. Interpolate Ee and the source shape vector [Rd, ka, kg] within the utterance, frame by frame or pitch synchronously. Take care to preserve coarticulations within and across segments. Observe the particular locations of the targets, which may not coincide with segment boundaries or segment centers. The abduction gesture of an aspirated unvoiced stop may start already at the onset of the preceding stressed vowel.

8. Add aspiration noise and formant bandwidth broadening as a function of Ra and ka and the particular phonetic segment. Add covariation of the vocal tract transfer function if needed.
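
Steps 4 to 7 above amount to summing contributions on a decibel scale and interpolating between targets placed at specific times. The sketch below illustrates that bookkeeping only. It assumes piecewise-linear interpolation, which the text does not specify, and all names and target values are hypothetical.

```python
from bisect import bisect_right


def interpolate_targets(targets, t):
    """Piecewise-linear interpolation of (time_s, value_db) targets.

    Targets must be sorted by time; values are held flat outside the range.
    """
    times = [time for time, _ in targets]
    if t <= times[0]:
        return targets[0][1]
    if t >= times[-1]:
        return targets[-1][1]
    i = bisect_right(times, t)
    (t0, v0), (t1, v1) = targets[i - 1], targets[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def source_level_db(t, basic_contour, corrections):
    """Basic Ee(t) contour plus additive corrections, all in dB."""
    total = interpolate_targets(basic_contour, t)
    for contour in corrections:
        total += interpolate_targets(contour, t)
    return total


# Hypothetical targets: a basic Ee(t) contour and a dip for a voiced consonant.
basic = [(0.0, 60.0), (0.5, 64.0), (1.0, 58.0)]
consonant_dip = [(0.0, 0.0), (0.4, 0.0), (0.5, -10.0), (0.6, 0.0), (1.0, 0.0)]
for t in (0.25, 0.45, 0.5):
    print(f"t={t:.2f} s  Ee={source_level_db(t, basic, [consonant_dip]):.1f} dB")
```

Because each factor is a separate contour, target times need not coincide with segment boundaries, which matches the remark in step 7.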

Supporting data

The voice source rules above constitute a framework that suggests a principal procedure. Some of the detailed contents will now be discussed.

Speaker specific data

A typical Rd vector for male vowels (Fant, 1995) is [Rd=0.7, ka=1, kg=1], which implies Ra=2.4%, Fa=680 Hz at F0=100 Hz, Rk=31% and Rg=118%. According to Eq. 3, H1*-H2*=0.2 dB. The open quotient OQ without Ra is 0.555 and with Ra included 0.58. The corresponding spectrum is shown to the left in Fig. 3.

The right hand figure pertains to a somewhat more sonorous voice with the same Rd=0.7 but with Fa=2700 Hz, Rg=177% and Rk=47%, which implies ka=0.25, kg=1.5 and kk=1.5. The greater relative dominance of the second harmonic, H1-H2=-3.4 dB, is predictable from the smaller OQ=0.415 inserted into Eq. 3. Perceptual tests indicate that the main quality difference lies in the Fa-domain, i.e. in the smaller spectral tilt in the right hand diagram. Here the overall intensity is also 2 dB higher than in the left hand diagram. This is an example of a greater efficiency at the same Ee, Uo and F0, often found in a stressed vowel situated late in a phrase.

Fig. 3. Synthesised spectra of vowel [ � :] , both with Rd=0.7, F0=100 Hz. Left: default values,ka=1, kg=1, OQ=55.5%. Right: ka =0.25, kg=1.5, OQ=41.5%.


Rd values of the order of 0.5-1.5 are found for male voices, which overlaps the distribution for female voices with Rd in the range of 0.8-2.5. Increasing Rd implies an increase of Rk and Ra, a decrease of Fa and, on the whole, a decrease of Rg.

Female voices usually have larger ka and thus lower Fa than men (Karlsson, 1992; Karlsson & Neovius, 1994; Karlsson, 1995; Fant, 1995). This is especially true of breathy, soft female voices, which also show a substantial glottal leakage and aspiration noise (Fant & Lin, 1988; Klatt & Klatt, 1990; Fant et al., 1991; Karlsson, 1992; Karlsson & Neovius, 1994; Stevens & Hanson, 1994; Fant, 1995; Hanson, 1995; Karlsson, 1995).

A voice attribute outside the domain of the LF-model found in some voices is the appearance of subglottal formants introduced by pole-zero patterns (Fant & Lin, 1988; Karlsson & Neovius, 1994). Another is the occurrence of an extra excitation from a discontinuity at the instant of glottal opening, which adds somewhat to the spectral shape and to irregularities of waveforms. It has been shown (Båvegård & Fant, 1994) that perturbations induced by a complete articulatory synthesis add to the perceived quality.

Segmental data and emphasis/deemphasis

The left hand part of Fig. 4 from Fant (1995) shows log Ee in dB, referred to as EE, versus log(1/Rd) in dB, referred to as EUF, for a corpus of Swedish vowels and consonants originating from a Swedish subject. The overall span from open vowels to stop consonants is of the order of 10 dB in 1/Rd and 20 dB in Ee. This is an example of a more general rule of covariation of 1 dB in 1/Rd with 2 dB in Ee, i.e. Ks=0.5, which we have found to be typical of dynamic variations within and across segments and as a consequence of varying voice effort. The right hand part of Fig. 4 contains a limited set of vowels from a French speaker (Karlsson, 1995). The trend is similar except for a linear relation between Ee and 1/Rd.

In a vowel, increased voice effort or emphasis will cause increased Ee and lowered Rd and ka, and often a higher F0, which contribute to an increase of Fa and thus a high frequency emphasis (Fant, 1959; Fant & Kruckenberg, 1994; Fant, 1995; Sluijter, 1995), often combined with increased kg and thus decreased open quotient (Fant, 1995, 1996). A common trend is that Rd and Ee covary with Ks=0.5. However, a high narrow vowel might reach a more extreme target, which may counteract this trend. Lexical stress without an F0 mediated accentuation will have a rather small effect on the source parameters.

Increased emphasis adds to segmental contrasts not only in terms of formant patterns but also in terms of properties of the voice source (Fant, 1993). This is the consequence of more extreme targets: open vowels become more open whereas consonants become more constricted. Voiced consonants attain a higher Rd. The narrowing of the supraglottal constriction in a voiced consonant will cause a reduction of Ee as well as 1/Rd, which counteracts an increase due to raised subglottal pressure. In addition, there enters an intensity loss due to formant frequency shifts.

Fig. 4. Segmental influence on Ee and 1/Rd in logarithmic measures. Left: Swedish reference subject. Right: French speaker.

The underlying mechanism of the source-tract interaction is a loss of transglottal pressure due to the increase of supraglottal flow resistance (Bickley & Stevens, 1986; Strik & Boves, 1992; Fant, 1995). Similar reductions of 1/Rd and Ee occur at boundaries towards termination of voicing before a pause or before an unvoiced consonant. A common denominator is a lowered subglottal pressure and abducted vocal folds (Strik & Boves, 1992). The local drop in Psub in a stressed vowel preceding an aspirated stop was demonstrated in Fig. 2 of Fant et al. (1996).

As seen in Fig. 4, voiced consonants have higher Rd and lower Ee values than vowels. This is also true of narrowly articulated vowels compared to open vowels. In Swedish, emphasis of a close long vowel will cause a more extreme target, e.g. a [j] element within an /i:/ segment, see Fig. 8.

The loss of vowel/consonant contrast associated with deemphasis occurs more frequently in connected speech than would be implied by the phonological structure of an utterance. This is also an individual speaker attribute. Incomplete oral closure causes nasal consonants to be realised as nasalised vowels and stops as approximants. These phenomena are conveniently handled in articulatory synthesis but are more difficult to structure in formant synthesis.

Continuities and coarticulation

Glottal parameters are not constant within a phonetic segment. They follow general gesture patterns of continuity and coarticulation (Gobl, 1988; Gobl & Ní Chasaide, 1988; Ní Chasaide et al., 1994). A typical example is that the onset of voicing usually has a smaller time constant than the offset towards a following unvoiced segment. This is especially the case for pre-occlusive aspiration (in the transition from a stressed vowel to an unvoiced aspirated stop), which not only produces noise at the offset of the vowel and at the initiation of the occlusion, but also imposes the typical abduction correlates of increased Rd and ka, increased bandwidths mainly affecting the first formant, and traces of subglottal formants (Fant & Lin, 1988; Fant et al., 1991). These features are fully developed at the boundary, but may be detectable already at the onset of the vowel (Gobl & Ní Chasaide, 1988).

Calculations (Fant, 1995, 1996) of the effects of glottal leakage, see Fig. 5, show that a glottal area as small as a few mm² causes a substantial increase of the formant bandwidths, mainly of B1. The following empirical formulas for bandwidth increments have been suggested (Fant, 1995).

∆B1 = 250 (F1/500)^2 (Ra/12)    (9a)

∆B2 = ∆B1 (F1/F2)/2    (9b)
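
The bandwidth increments can be evaluated as follows. This is a sketch under stated assumptions: the trailing 2 in the first formula is read as a square, Ra is taken in percent (so that Ra = 12 corresponds to the breathy end of the Rd range), and frequencies and bandwidths are taken in Hz. The function name is illustrative.

```python
def bandwidth_increments_hz(f1_hz, f2_hz, ra_percent):
    """Eq. (9a) and (9b): increments of B1 and B2 due to glottal leakage."""
    delta_b1 = 250.0 * (f1_hz / 500.0) ** 2 * (ra_percent / 12.0)
    delta_b2 = delta_b1 * (f1_hz / f2_hz) / 2.0
    return delta_b1, delta_b2


# Assumed example: F1 = 500 Hz, F2 = 1500 Hz, modal versus breathy Ra.
for ra in (2.4, 12.0):
    d_b1, d_b2 = bandwidth_increments_hz(500.0, 1500.0, ra)
    print(f"Ra={ra:4.1f}%  dB1={d_b1:6.1f} Hz  dB2={d_b2:5.1f} Hz")
```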

The weakening of F1 from a substantial bandwidth increase is a consistent spectral attribute of abduction, which explains the “F1-cutback” in the transitional phase from the release of an unvoiced consonant to a following open vowel. It also occurs regularly in voice termination before a pause.

Subglottal pressure, Ee and F0

Preliminary results from a study of subglottal pressure, Ps, in speech, measured directly with a tracheal probe, were reported in Fant (1996) and Fant et al. (1996). More recently we have gained some further insights into the covariation of subglottal pressure and speech wave parameters, including inverse filtering. Data were obtained from glissando phonations, single vowel utterances and connected speech.

Covariation of Ee and Ps in glissando phonations is shown in Fig. 6. The data support the earlier reported tendency (Fant, 1982; Fant & Kruckenberg, 1994; Fant et al., 1994) of Ee rising with the second power of F0 up to the speaker's midfrequency, F0r=130 Hz for this particular male subject, followed at F0>F0r by a decrease or increase depending on Ps. An analysis of the covarying subglottal pressure revealed a rise of Ps proportional to F0^0.7 below F0r, and above F0r a considerable variation, with Ps both increasing and decreasing.

Fig. 5. Calculated increase in formant bandwidths as a function of glottal opening.

Fig. 6. Covariation of Ee and subglottal pressure Ps with F0 in glissando phonations of a vowel [æ]. Data from a continuation of Fant et al. (1996).

Fig. 7. Stylised outline of the Ps dependent component of the Ee contour within a phrase (breathgroup).

Fig. 8. Recordings of subglottal and supraglottal pressures added to our standard analysis display. The text is "Ingrid fick brev från Arne". "Ingrid" and "brev" have accent 1. "Arne" has accent 2. At the bottom SPL with LP 1000 Hz and SPLH, high frequency preemphasis.

The glottal maximum area, as studied by means of fiberscope filming, showed a maximum just below F0r, with an increase in proportion to Ps at F0<F0r and a decrease in proportion to F0^-1 at F0>F0r. The total increase of Ee from F0=75 Hz to F0r=130 Hz was of the order of 10 dB. In this region we found Ee to be proportional to F0^1.35 and to Ps^1.1. These relations also provided a good prediction for single vowel utterances and could be used as a guide for the analysis of connected speech.
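
The power laws can be combined to check the quoted rise of about 10 dB. The sketch below assumes that Ee is an amplitude quantity, so that decibels are 20 log10 of the ratio, and that the two dependencies multiply, with Ps itself following F0 to the power 0.7 below F0r. The function name and the default arguments are illustrative.

```python
import math


def ee_rise_db(f0_low_hz, f0_high_hz, f0_exp=1.35, ps_exp=1.1, ps_f0_exp=0.7):
    """Rise of Ee in dB between two F0 values below F0r.

    Assumes Ee ~ F0**f0_exp * Ps**ps_exp with Ps ~ F0**ps_f0_exp.
    """
    total_exponent = f0_exp + ps_exp * ps_f0_exp
    return 20.0 * math.log10(f0_high_hz / f0_low_hz) * total_exponent


print(f"Combined F0 exponent: {1.35 + 1.1 * 0.7:.2f}")
print(f"Ee rise from 75 to 130 Hz: {ee_rise_db(75.0, 130.0):.1f} dB")
```

The combined exponent comes out close to 2, which agrees with the earlier reported tendency of Ee rising with the second power of F0.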

Other main findings from these aerodynamic studies are that the voice onset after a pause requires a minimum Ps of about 3.5 cm H2O, while vocal cord vibrations can continue down to very low transglottal pressures at an abducted voice offset. Voiced consonants as well as semiclosure targets in Swedish long stressed vowels show up as a significant local rise in supraglottal pressure, which reduces the transglottal pressure.

The overall declination of subglottal pressure within a phrase is of the order of 2 dB/s. With Ee approximately proportional to Ps at constant F0 we would therefore expect the same rate of declination of Ee within the phrase, see Fig. 7. The outline includes a 50 ms risetime and a faster than average declination in the last 0.5 seconds. This is the correction to the general Ee(F0) to be added in rule (4) for calculating the initial Ee(t) contour of the phrase. A typical example from a subglottal pressure recording is shown in Fig. 8.

Accentuation and stress

A general observation of interest is how Ps varies with stresses and accentuations within an utterance. We found a build-up of Ps early in the onset of a focally accentuated syllable, i.e. in the region of an anticipated syllable P-center, followed by a decrease at a higher rate than the overall Ps declination. This tendency was also found in non-focal accentuation. It explains the tendency reported in Fant & Kruckenberg (1994, 1995) that Ee does not follow the F0-contour in the focal F0 peak area but stays constant or shows a minimum at the maximal F0, and maxima whenever F0 passes through the F0r value.

Fig. 9. Illustration of focal accent 1, female subject, AK [15,18]. Observe how the intensity curves, SPL and SPLH, reflect the shape of an Ee(F0) with a maximum at F0=F0r=215 Hz and an intensity minimum at the peak F0=320 Hz.

We may now compare our earlier findings (Fant & Kruckenberg, 1995) in Fig. 9 with data from the recent aerodynamic study (Fant et al., 1996) in Fig. 10. The sentence is the same. Fig. 9 pertains to a female subject, AK, for whom we estimated an F0r of 215 Hz. For the male subject in Fig. 10, we estimated an F0r of 130 Hz. Both showed a pronounced intensity minimum in the F0-peak area of the long vowel in “Lenar”, which for the male voice was initiated by a local drop in subglottal pressure followed by a pressure restoration in the final word “igen”, marking a prosodic boundary.

Fig. 10. The same sentence as in Fig. 9 produced by the male subject, SH (Fant et al., 1996). Observe the dip in subglottal pressure synchronized with the peak F0, and the similarities with Fig. 9.

Another noteworthy observation comparing Fig. 9 and 10 is that the semitone F0-scale preserves an almost identical shape and size of the main F0 peak of the female and male utterances.

The main acoustic correlate of focal accentuation in Swedish is the local dominance of the F0 peak, usually combined with a significant increase of intensity, high frequency emphasis and increased duration of the stressed word. However, individual variations are large. The difference between focal and nonfocal stress may be signaled by F0 alone. On the whole, duration and F0 appear to be the main stress correlates. Excluding focal accentuation, the average increases of Ee and Fa are of the order of 2 dB only (Fant & Kruckenberg, 1994, 1995). In emphatic focal accentuation we encounter 3-5 times larger values. Lexical unaccented stress may be realised by duration alone.

Conclusions

In order to develop a more profound insight into the theory of speech production as a guide for future developments of speech technology we need to learn more about the voice source and its role as an integrator of segmental and prosodic structure. We have suggested a framework for the further development of voice source rules and exemplified their contents. This modelling is best understood with reference to aerodynamic constraints and the continuous gestures of pulmonary activity and glottal and supraglottal articulations.

The rules have not yet been implemented in a text-to-speech system. Also, it remains to test the relative perceptual salience of the various components of our system. Some details may not be very significant or may be perceptually masked by deficiencies in other parts of the rule system. We need a deeper insight into the dependencies of individual voice qualities on source functions and segmental and prosodic structures, especially with respect to temporal dynamics. A mere change in average values of glottal parameters is not sufficient.

Acknowledgements

This work has been financed by grants from TFR, the Swedish Research Council for Engineering Sciences, the Bank of Sweden Tercentenary Foundation, the Carl Trygger Foundation and support from Telia Promotor AB. The authors are indebted to Johan Liljencrants for valuable support in data processing.

References

Bickley CC & Stevens KN (1986). Effects of a vocal tract constriction on the glottal source: Experimental and modelling studies. Journal of Phonetics, 14: 373-382.

Båvegård M & Fant G (1994). Notes on glottal source interaction ripple. STL-QPSR, KTH, 4/1994: 63-78.

Carlsson R, Fant G, Gobl C, Granström B, Karlsson I & Lin Q (1989). Voice source rules for text-to-speech synthesis. ICASSP, I: 223-226.

Fant G (1959). Acoustic Analysis and Synthesis of Speech with Applications to Swedish. Ericsson Technics, No. 1, 1959.

Fant G (1975). Non-uniform vowel normalization. STL-QPSR 2-3/1975: 1-19.

Fant G (1982). Preliminaries to the analysis of the human voice source. STL-QPSR 4/1982: 1-27.

Fant G (1986). Glottal flow: Models and interaction. Journal of Phonetics, 14/3-4: 393-399.

Fant G (1993). Some problems in voice source analysis. Speech Communication, 13: 7-12.

Fant G (1995). The LF-model revisited. Transformations and frequency domain analysis. STL-QPSR 2-3/1995: 119-156.

Fant G (1996). The voice source in connected speech. To be published in Speech Communication.

Fant G & Kruckenberg A (1994). Notes on stress and word accent in Swedish. Proc Intl Symposium on Prosody 1994, Yokohama. Also published in STL-QPSR 2-3/1994: 125-144.

Fant G & Kruckenberg A (1995). The voice source in prosody. ICPhS 95, II: 622-625.

Fant G & Lin Q (1988). Frequency domain interpretation and derivation of glottal flow parameters. STL-QPSR 2-3/1988: 1-21.

Fant G, Hertegård S & Kruckenberg A (1996). Focal accent and subglottal pressure. TMH-QPSR 2/1996: 29-32.

Fant G, Kruckenberg A & Nord L (1991). Prosodic and segmental speaker variations. Speech Communication, 10: 521-531.

Fant G, Kruckenberg A, Liljencrants J & Båvegård M (1994). Source parameters in continuous speech. Transformation of LF-parameters. ICSLP-94, Yokohama.

Fant G, Liljencrants J & Lin Q (1985). A four-parameter model of glottal flow. STL-QPSR 4/1985: 1-13.

Gobl C (1988). Voice source dynamics in connected speech. STL-QPSR 1/1988: 123-159.

Gobl C & Ní Chasaide A (1988). The effects of adjacent voiced/voiceless consonants on the vowel voice source: a cross language study. STL-QPSR 2-3/1988: 23-59.

Hanson M (1995). Glottal characteristics of female speakers. PhD thesis, Harvard Univ., Cambridge, MA.

Karlsson I (1992). Modelling voice variations in female speech synthesis. Speech Communication, 11: 491-495.

Karlsson I (1995). Voice quality, male/female variations and speaking style. SPEECH MAPS (ESPRIT/BR No. 6975) WP1 Year 3 report, edited by CH Shadle, Delivery 26: 9-13.

Karlsson I & Neovius L (1994). VCV-sequencies in a preliminary text-to-speech system for female speech. STL-QPSR 1/1994: 47-57.

Klatt D & Klatt L (1990). Analysis, synthesis and perception of voice quality variations among female and male talkers. J Acoust Soc Am, 87: 820-857.

Ní Chasaide A, Gobl C & Monahan P (1994). Dynamic variation of the voice source in VCV sequences: intrinsic characteristics of selected vowels and consonants. SPEECH MAPS (ESPRIT/BR No. 6975) Delivery 15, Annex D.

Sluijter A (1995). Phonetic Correlates of Stress and Accent. PhD thesis, University of Leiden, The Hague: Holland Academic Graphics.

Stevens KN (1994). Prosodic influences on glottal waveform: Preliminary data. Intl Symposium on Prosody, 1994, Yokohama, 53-63.

Stevens KN & Hanson M (1994). Classification of glottal vibration from acoustic measurements. In: Fujimura O & Hirano M, eds, Vocal Fold Physiology. Singular Publ Group, 147-170.

Strik H & Boves L (1992). On the relation between source parameters and prosodic features of connected speech. Speech Communication, 11/2-3: 167-174.