[ieee africon 2007 - windhoek, south africa (2007.10.26-2007.10.28)] africon 2007 -...


Abstract—The impressionistic phonetic specification of vowel

quality is usually given in terms of the quasi-articulatory

dimensions “height,” “backness” and “rounding,” relative to the

quality of the cardinal vowels placed in a stylized vowel

quadrilateral. For some purposes, an acoustic approach to vowel

dimensions may be preferable. A system of reference vowels

created by a scaling discretization of formant space (SDF) is used

in this study to derive new acoustic dimensions closely aligned to

the traditional dimensions mentioned above. Some implications of

the new acoustic dimensions for vowel classification are

investigated, using a well-known vowel dataset. Based on this

analysis, it is argued that “lengthening” is a more appropriate

vowel dimension than “rounding.”

Index Terms— Cardinal vowels, speech analysis, vowel

dimensions, vowel formants.

I. INTRODUCTION

In the vowel chart of the International Phonetic Association

(IPA), some vowels are labeled as “rounded” [1]. Often this

is a binary distinction contrasted to “unrounded,” but it may

also be used as a continuous dimension. Thus we find the

following statement in the Handbook of the IPA: “In the back

series of cardinal vowels ([ɑ ɔ o u]) lip-rounding progressively

increases, from none on [ɑ] to close rounding on [u]” [1, p12].

Roundedness or rounding has been separated from two other

vowel dimensions (“backness” and “height” are used by the

IPA), at least since the mid nineteenth century, when A. M.

Bell broke with the i-a-u vowel triangle of Hellwag [2]. The

three labels are (at least originally) articulatory, and the

complete treatment of vowel quality in the Handbook [1] is

devoid of acoustic reference. However, the intermediate

cardinal vowels are placed at auditorily equal distances,

resulting in a quaint mix of factors leading the IPA to call their

vowel space “abstract” [1, p12]. The cardinal vowels and their

description were not invented by the IPA of course, but largely

inherited from Daniel Jones [3], [4].

In this paper, a purely acoustic approach is taken to the

problem of choosing reference vowels. By the transformation

of formant frequencies, three new acoustic dimensions are

defined, denoted by H, B, and S. The acoustic H and B map

readily to height and backness respectively, but the relation of

S to rounding is more complex. The choice of symbol S is

inspired by the fact that it contains all scale information, while

the B-H plane is normalized. S is influenced by lengthening of

the vocal tract, which may be brought about by rounding the

lips, but also by other means.

Manuscript received 5 March 2007; revised 1 June 2007.

The author is with the University of Johannesburg, Johannesburg, South

Africa (e-mail: [email protected]).

The rest of this paper will be structured as follows. The regular

geometric framework of acoustic reference vowels is briefly

introduced in Part II. In Part III the three new acoustic vowel

dimensions are developed, and illustrated using a well-known

dataset. The “lengthening” dimension is given special

consideration and contrasted with traditional “rounding.” The

conclusion drawn in Part IV is that “lengthening” is better for

general phonetic purposes.

II. REFERENCE VOWELS DERIVED FROM A SCALING

DISCRETIZATION OF FORMANT SPACE

It was mentioned above that the impressionistic reference

vowels employed by the IPA [1] are the cardinal vowels

defined by Daniel Jones [3]. The following procedure can be

used to define acoustic reference vowels (ARV), intended to

replace the cardinal vowels in an acoustic approach to

determine static vowel quality [5], [6]:

1) Measure the first three formant frequencies for every

frame of the vowel signal. Designate them as F1, F2, F3.

Acoustic formant analysis is a non-trivial exercise in

general [5], but recent advances in the field hold the

promise to provide reliable measurements [7].

“Lengthening” is a better vowel dimension than “rounding”

Hendrik F. V. Boshoff, Member, IEEE

[Figure 1: grid of diamond-shaped SDF cells labeled with IPA vowel symbols (neutral | lengthened); cell labels legible in the source include i y, e ø, œ, æ, a, u and o; axis directions log(F1), log(F2) and log(F3).]

Fig. 1. Acoustic reference vowels (ARVs) in the dimensionless plane

derived from a scaling discretization of formant space (SDF). The vowel to

the left of the vertical line in every cell is neutral and the one to the right

“lengthened.” Symbols are selected from the IPA.

1-4244-0987-X/07/$25.00 ©2007 IEEE.

2) Discretize the two ratios log(F2/F1) and log(F3/F2)

using the constant $\varphi = (1+\sqrt{5})/2$ to form a regular

grid. This reference grid is called the scaling

discretization of formant space (SDF), which partitions

vowel space into diamond shaped cells. A one-to-one

mapping is possible between the cells of this grid and

the cardinal and other IPA vowels [5], [8], [9].

3) Define the geometric center point of every cell as the

acoustic reference vowel (ARV) associated with that

cell. Allow two forms of every ARV, provisionally

called “neutral” and “lengthened.” Labels for the ARV

are taken from the IPA [6].

Fig. 1 shows the SDF, with IPA vowel symbols mapped to

every articulatorily accessible cell [10] in the grid.

The scheme of acoustic reference vowels has the attractive

property that it constitutes a ratio scale of measurement, the

highest level in Stevens’ theory [11], [6].

III. ACOUSTIC VOWEL DIMENSIONS

A. Rotation of log-formant space

Comparing Fig. 1 with the IPA vowel chart, the similarities

in vowel placement are clear. We may therefore expect the

horizontal direction to be related to “backness” and the vertical

direction to “height.” The following transformation constitutes

a simple rotation of the axes in log-formant space, and

provides a definition of the three new acoustic dimensions:

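$$
\begin{bmatrix} H \\ B \\ S \end{bmatrix}
=
\begin{bmatrix}
-\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{6}} & -\sqrt{\frac{2}{3}} & \frac{1}{\sqrt{6}} \\
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}}
\end{bmatrix}
\begin{bmatrix} \log F_1 \\ \log F_2 \\ \log F_3 \end{bmatrix}.
\tag{1}
$$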

It can easily be verified (e.g. in Matlab) that the matrix in

(1) is indeed purely a rotation, by checking that it is orthogonal

and that its determinant equals 1.
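The check can also be sketched without Matlab. The listing below is a minimal illustration, not the author's procedure. It assumes that (1) reads H = (log F3 − log F1)/√2, B = (log F1 − 2 log F2 + log F3)/√6 and S = (log F1 + log F2 + log F3)/√3, a reading reconstructed from relations (2) and (3) and from the scale values quoted in Part III-C. The helper names are hypothetical.

```python
import math

# Rows give H, B and S as combinations of (log F1, log F2, log F3).
# This row order and these signs are an assumed reading of (1).
R = [
    [-1 / math.sqrt(2), 0.0, 1 / math.sqrt(2)],               # H
    [1 / math.sqrt(6), -math.sqrt(2 / 3), 1 / math.sqrt(6)],  # B
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],   # S
]


def times_transpose(m):
    """Return m multiplied by its own transpose (3x3 nested lists)."""
    return [[sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def is_rotation(m, tol=1e-12):
    """True if m is orthogonal (m m^T = I) and has determinant +1."""
    mmt = times_transpose(m)
    orthogonal = all(abs(mmt[i][j] - (1.0 if i == j else 0.0)) < tol
                     for i in range(3) for j in range(3))
    return orthogonal and abs(det3(m) - 1.0) < tol


print(is_rotation(R))
```

Orthogonality alone would also admit a reflection, which is why the determinant is tested as well.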

The log formant frequency frame of reference and the new

acoustic dimension frame share the same origin, and the

rotation concentrates all scale information in S, leaving both H

and B dimensionless or normalized. Fig. 1 actually represents

the dimensionless B-H projection, with S perpendicular to it. Vowel

quality is therefore shown on that two-dimensional plane

where all scale information is projected onto a point. This is

exactly what we need if the vowel code is to be scaling, or

independent of absolute size.
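The scale independence of H and B can be checked in a few lines. The derivation below is a sketch that assumes (1) reads H = (log F3 − log F1)/√2, B = (log F1 − 2 log F2 + log F3)/√6 and S = (log F1 + log F2 + log F3)/√3. Multiplying every formant by a common factor k > 0 (k is introduced here only for illustration) gives:

```latex
\begin{align*}
\log(kF_i) &= \log F_i + \log k, \qquad i = 1, 2, 3,\\
H &\mapsto H + \tfrac{1}{\sqrt{2}}\,(\log k - \log k) = H,\\
B &\mapsto B + \tfrac{1}{\sqrt{6}}\,(\log k - 2\log k + \log k) = B,\\
S &\mapsto S + \tfrac{3}{\sqrt{3}}\,\log k = S + \sqrt{3}\,\log k.
\end{align*}
```

The coefficients in the H and B rows sum to zero, so a uniform change of scale moves a vowel only along S.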

B. The dimensionless B-H plane, “backness” and “height”

Even though the direction of B is as required, the zero point

is problematic. If we want all the vowels traditionally labeled

as “front” to have the same value of “backness,” we have to

start measuring on the line sloping 30º to the left of vertical,

instead of at the origin on the B axis. (Note that the same

problem exists in the IPA vowel quadrilateral.) It is possible to

define B´ in the desired way:

$$B' = B + \frac{H}{\sqrt{3}}, \tag{2}$$

which yields

$$B' = \sqrt{\frac{2}{3}}\,\log\frac{F_3}{F_2}. \tag{3}$$

This result is intuitively satisfying, as B´ depends only on

one of the log-ratios previously discretized. It may be

interpreted by saying that B´ is proportional to the logarithm of

the relative gap between the second and third formants. From

(1) it follows in the same way that H is proportional to the

logarithm of the relative gap between the first and third

formants. F3 is thus seen to play a normalizing role, and there

is a figure-and-ground shift in focus from formant positions to

the gaps between them.
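The two proportionalities can be written out explicitly. The steps below are a sketch that assumes (2) reads B′ = B + H/√3, the form consistent with the 30º slope mentioned above, and that (1) gives H = (log F3 − log F1)/√2 and B = (log F1 − 2 log F2 + log F3)/√6:

```latex
\begin{align*}
B' &= B + \frac{H}{\sqrt{3}}
    = \frac{\log F_1 - 2\log F_2 + \log F_3}{\sqrt{6}}
      + \frac{\log F_3 - \log F_1}{\sqrt{6}}\\
   &= \frac{2}{\sqrt{6}}\,(\log F_3 - \log F_2)
    = \sqrt{\tfrac{2}{3}}\,\log\frac{F_3}{F_2},\\
H  &= \frac{\log F_3 - \log F_1}{\sqrt{2}}
    = \frac{1}{\sqrt{2}}\,\log\frac{F_3}{F_1}.
\end{align*}
```

The log F1 terms cancel in B′, which is why only the gap between the second and third formants remains.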

It should be clear that H and B´ constitute normalized

acoustic dimensions which come very close to the articulatory

dimensions “height” and “backness” respectively.

The standard phonetic practice, following Ladefoged’s

influential book [12], uses the frequencies F1 and F2-F1 as

acoustic correlates of “height” and “backness” respectively.

This approach is not satisfactory however, as it depends on

absolute frequency, and vowel quality in this plane therefore

needs different calibrations for speakers of different size.

One potential problem with H and B´ is that they are not

independent (see (2)). If we want independent dimensions, we

need to work with H and B.

An interesting feature of the SDF is the shape and

arrangement of the cells. According to this discretization, it is

possible for example to find a realization of the vowel [i] in

which H takes a lower value than that of some instance of [e].

It is yet to be determined experimentally whether this

prediction is accurate; it will constitute a test of the theory.

C. The scale dimension and “rounding”

The third dimension resulting from the rotation of log-

formant space is not normalized. As mentioned before, all

scale information is indeed concentrated in S. The inevitable

result is that S combines information on both the scale of the

speaker (i.e. length of the vocal tract) and the “rounding” of

the spoken vowel. In order to determine vowel quality, these

two effects have to be teased apart. This problem could be

avoided with the other two dimensions by using the

dimensionless plane, but in the scaling direction, it is inherent.

A possible though complex clue to the “neutral” length of

the vocal tract is the time-averaged value of F0, the glottal tone

or voice frequency. More direct may be the average value of S

over speech segments containing several different vowels.

It is important to notice that S is an average of the

logarithmic values of the first three formant frequencies, which

means it will change as the length of the vocal tract varies, not

only with variations in lip-rounding.

The newly defined scaling dimension S was tested by

applying it to the well-known dataset of Peterson and Barney

(P&B) [13]. This set contains two repetitions each of

pronunciations of ten different American English vowels in the

same context ([hVd]) by 33 men, 28 women and 15 children.

Peterson and Barney measured and recorded the values of F0

to F3 for each utterance.


During initial experiments, it was found that the scale values

for each of the three groups of speakers clustered fairly well. It

was decided that the resonant frequencies of F1=500 Hz,

F2=1500 Hz and F3=2500 Hz of a uniform acoustic tube of

length 17.5 cm, closed at one end and open at the other, would

be used as reference. This very roughly corresponds to the

formants that may be expected when a man articulates the

neutral vowel schwa ([ə]), which was unfortunately not

included among the ten P&B vowels. The scale value S for this

vector of formant frequencies can be calculated using (1), and

the result is 25.6. A normalized scale value can be defined as

follows:

$$S'_m = 2 \cdot (25.6 - S) \quad \text{(men)}. \tag{4}$$

This yielded average normalized values of close to 0 for men,

−1 for women and −2 for children, for both the vowels AH (≈[ɐ]) and AA (≈[ɑ]) from the dataset. The factor 2 in (4) is

motivated purely by the fact that it gives integer values for the

women and children averages. The negative relationship

between S and S′ makes the latter more intuitive, as large

resonance frequencies (and therefore values of S) are

associated with small scales.
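The reference value can be reproduced numerically. The sketch below rests on two assumptions that this section does not state. First, the logarithms in (1) are taken to base φ, which is inferred here because it yields the quoted 25.6. Second, the speed of sound is 350 m/s, which gives 500, 1500 and 2500 Hz for a 17.5 cm tube. The function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio; used as log base (assumption)


def tube_resonances(length_m, c=350.0, n=3):
    """First n resonances in Hz of a uniform tube closed at one end.

    Uses the quarter-wavelength formula (2k - 1) * c / (4 * L);
    c = 350 m/s is an assumed speed of sound.
    """
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]


def scale_S(formants_hz):
    """S from (1): sum of the three log-formants divided by sqrt(3)."""
    return sum(math.log(f, PHI) for f in formants_hz) / math.sqrt(3)


def normalized_scale(s, offset=0.0):
    """S' = 2 (25.6 - S) + offset; offset is 0 for men, as in (4)."""
    return 2 * (25.6 - s) + offset


formants = tube_resonances(0.175)   # approximately 500, 1500 and 2500 Hz
s_ref = scale_S(formants)
print(round(s_ref, 1))              # about 25.6 with base-phi logarithms
```

With natural or base-10 logarithms the same formants give roughly 12.3 and 5.4, so the base does matter for the absolute value of S, though not for the B-H plane.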

In order to compare scale values across the groups, a

different normalization was applied to the women and children

data:

$$S'_w = 2 \cdot (25.6 - S) + 1 \quad \text{(women)}, \tag{5}$$

$$S'_c = 2 \cdot (25.6 - S) + 2 \quad \text{(children)}. \tag{6}$$

For women, this gives a neutral scale constant of 26.1, which

implies a neutral vocal tract length of 15.2 cm. For children,

the values are 26.6 and 13.3 cm. These implied lengths seem

reasonable.
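The implied lengths can be recovered as follows. This is a sketch under stated assumptions: logarithms to base φ (inferred from the value 25.6, not stated here), and a vocal tract that differs from the 17.5 cm reference tube only by a uniform scale factor, so that all three formants are multiplied by the same constant. The function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # assumed log base, inferred from the value 25.6


def neutral_scale(offset, s_men=25.6):
    """Value of S for which S' = 2 (25.6 - S) + offset is zero."""
    return s_men + offset / 2


def implied_length_cm(s_neutral, s_men=25.6, length_men_cm=17.5):
    """Length of a uniformly scaled reference tube with neutral scale s_neutral.

    A common formant factor k shifts S by sqrt(3) * log_phi(k), and the
    resonances of a uniform tube are inversely proportional to its length.
    """
    k = PHI ** ((s_neutral - s_men) / math.sqrt(3))
    return length_men_cm / k


for group, offset in (("women", 1), ("children", 2)):
    s0 = neutral_scale(offset)
    print(group, round(s0, 1), round(implied_length_cm(s0), 1))
```

Under these assumptions the offsets of 1 and 2 in (5) and (6) correspond to lengths of about 15.2 cm and 13.3 cm, in agreement with the values quoted above.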

D. Measured results in the new coordinate system

The vowels in the P&B set are not cardinal vowels, nor even

reference vowels of any sort. However, it is still interesting to

see how they map to the new acoustic coordinate system.

Fig. 2 shows the average positions of the vowel IH (≈[ɪ]) for the three groups on the dimensionless B-H plane. An

indication of the spread within each group is given by plotting

principal axis ellipses of one standard deviation around the

average point. The data clearly clusters in the cell of ARV [e].

The ellipses of the men and the children are almost

indistinguishable, while that of the women is somewhat

smaller.
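The text does not say how the ellipses were computed. One common construction, sketched below as an assumption, takes the eigenvalues and eigenvectors of the sample covariance matrix of the (B, H) points. The function name and the sample points are hypothetical and are not P&B data.

```python
import math


def principal_axis_ellipse(points):
    """Mean, one-SD semi-axes and major-axis angle of a set of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    major = math.sqrt(half_trace + root)            # SD along the major axis
    minor = math.sqrt(max(half_trace - root, 0.0))  # SD along the minor axis
    # Angle of the major axis in radians, measured from the first coordinate.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (major, minor), angle


# Hypothetical points, elongated along the first coordinate.
center, axes, angle = principal_axis_ellipse([(-2, 0), (2, 0), (0, -1), (0, 1)])
print(center, axes, angle)
```

Applied per speaker group, such a routine would give one center and one ellipse for each of the men, women and children.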

The fact that speakers on the largest and the smallest scales

give almost identical results is taken as an indication of how

well the transformation normalizes away scale differences.

The discrete cells imply that systematic differences between

groups (like that between the men and the women) can be

accommodated, as long as the patterns remain within the same

cell. The transformation therefore need not project the vowels

of different speakers in a point; only within a cell.

Fig. 3 shows the normalized scale values for the same vowel

IH, calculated using the applicable equation from (4) to (6),

and shown individually. There is little evidence of systematic

differences between the groups after transformation. Again, it

seems that the normalization has been successful. The average

normalized scale values for men, women and children are -0.26, -0.37

and -0.45 respectively.

For comparison, the corresponding pitch values measured

for each vowel are given in Fig. 4. Here the difference between

the groups is fairly obvious.

Graphs similar to those in Fig. 2 and Fig. 3 were plotted

for each of the ten vowels in the P&B dataset. While the front

vowels map quite well to the expected cells, the back vowels

show more spread. This tendency was already encountered in

the context of cardinal vowels [8], [9], where much more rigor

was employed in controlling the exact vowel production. What

is actually surprising is the extent to which this dataset matches

the ARV cells.

E. Lengthening as a vowel dimension

The normalized scale values calculated for the three groups

and ten vowel qualities in the P&B dataset are collected in

Table I. The measured values are fairly constant between

groups, taking into account that each group has been

normalized with a single empirical constant, independent of

vowel category.

[Figure 2: SDF grid of ARV cells as in Fig. 1, titled “Peterson & Barney: Vowel IH”.]

Fig. 2. Average position (crosses) and principal axis ellipses of one standard

deviation for the Peterson & Barney vowel IH, separately shown for men,

women and children. The data clearly clusters in the cell of ARV [e].

[Figure 3: scatter plot titled “Scale relative to group scaling constant: Vowel IH”; horizontal axis “Utterance #” (ticks 20 to 140), vertical axis “Normalised scale” (ticks -4 to 4).]

Fig. 3. Normalized scale calculated with (4)-(6) of the Peterson & Barney

vowel IH, separately shown for men (+), women (o) and children (*). There

is little systematic difference between the groups after normalization.

It is possible to round these scale values to the nearest

integer (a somewhat arbitrary step, as there is no reason to

expect integer values). This has been done approximately in

the fifth column of Table I. The last column contains a

“lengthening” value given to each of four levels, namely

neutral, shortened, lengthened and extra long.
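The reduction to discrete levels can be sketched as follows. The text says only that the rounding was done approximately. The rule used here, the mean of the three group values rounded to the nearest integer, is an assumption that happens to reproduce the fifth column of Table I. The function name is hypothetical.

```python
# Group averages from Table I, in the order (men, women, children).
TABLE_I = {
    "IY": (0.44, 0.02, 0.05),
    "IH": (-0.26, -0.37, -0.45),
    "EH": (-0.40, -0.94, -0.96),
    "AE": (-0.88, -1.32, -1.42),
    "AH": (-0.07, -0.07, -0.08),
    "AA": (-0.12, -0.06, -0.07),
    "AO": (0.83, 1.65, 1.46),
    "UH": (1.30, 1.64, 1.24),
    "UW": (3.00, 2.65, 2.39),
    "ER": (1.66, 1.38, 1.63),
}


def lengthening(values):
    """Return (integer level, label) for one vowel.

    Assumed rule: average over the groups, then round to the nearest integer.
    """
    level = round(sum(values) / len(values))
    if level <= -1:
        label = "shortened"
    elif level == 0:
        label = "neutral"
    elif level == 1:
        label = "lengthened"
    else:
        label = "extra long"
    return level, label


for vowel, values in TABLE_I.items():
    print(vowel, *lengthening(values))
```

Both UW (+3) and ER (+2) fall in the "extra long" class, as in the last column of the table.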

Considering this gross classification, it seems clear that there

are important differences between the new “lengthening” and

traditional “rounding.” The first surprise is IY (≈[i]), which shows no sign of being shortened. It seems that the spreading

of the lips in [i] is a compensatory move to keep the vocal tract

neutral against the lengthening caused by the frontal raising of

the tongue.

The IPA states that “[…] in the front series [a ɛ e i] the lips

are neutral for [a], and become progressively more spread

through the series to [i]” [1, p13]. The lengthening pattern in

the front series of the P&B dataset is the opposite: there is

progressive shortening from [i] to [a], as can be verified using

Table I. The only two shortened vowels are therefore in

unexpected positions, if expectation is conditioned by

rounding.

The P&B back vowels are all lengthened, including UH

(≈[ɯ]), which is considered an unrounded vowel by the IPA.

UW (≈[u]) is extra long, as could be expected, so there may

still be a lengthening contrast between the two. Also extra long

is ER, the American rhoticised schwa, presumably due to the

hunched tongue. It turns up as a “front” vowel in the ARV cell

of [ɛ], a somewhat unexpected position for a variety of

schwa.

In summary, this dataset indicates that “lengthening” is not

the same as “rounding,” although the latter may contribute to

the former.

IV. CONCLUSION

The problem with “rounding” as an articulatory feature is

that it places undue emphasis on the visible part of the vocal

tract. The “lengthening” introduced here in the context of the

scale dimension of vowels, is dependent on the whole tract. It

takes into account compensatory movements of all vocal

organs, of which the lips are only one set.

If the main emphasis is on teaching articulation, “rounding”

may be important. However, if one is engaged in creating a

system of vowel classification, “lengthening” gives a better

summary of all contributions to formant change. Additionally,

in our coordinate system, the scale dimension arises naturally

as the remainder of the transformation, after splitting off the two

dimensions that are normalized.

I therefore propose that “lengthening” is a more appropriate

vowel dimension than “rounding,” for general phonetic

purposes.

REFERENCES

[1] Handbook of the International Phonetic Association. Cambridge:

Cambridge University Press, 1999.

[2] H. F. V. Boshoff and E. C. Botha, “On the structure of vowel space: A

genealogy of general phonetic concepts,” Proceedings ICSLP 5, Sydney,

1998.

[3] D. Jones, An outline of English phonetics, 9th Ed. Cambridge, Heffer,

1964.

[4] D. Abercrombie, Elements of general phonetics. Edinburgh: Edinburgh

University Press, 1967.

[5] H. F. V. Boshoff, The acoustic discretization of vowel space, Ph D

Thesis, University of Stellenbosch, 1997. In Afrikaans.

[6] H. F. V. Boshoff, “A new acoustic reference frame for vowels,”

Proceedings XIVth International Congress on Phonetic Sciences, San

Francisco, 1999.

[7] L. Deng, L. J. Lee, H. Attias, and A. Acero, “Adaptive Kalman filtering and

smoothing for tracking vocal tract resonances using a continuous-valued

hidden dynamic model,” IEEE Transactions on Audio, Speech and Language Processing, pp. 13-32, Jan. 2007.

[Figure 4: scatter plot titled “Pitch: Vowel IH”; horizontal axis “Utterance #” (ticks 20 to 140), vertical axis “Frequency [Hz]”.]

Fig. 4. Pitch values of the individual Peterson & Barney vowel utterances

IH, separately shown for men (+), women (o) and children (*).

TABLE I

NORMALIZED SCALE VALUES

Vowel           Men     Women   Children  Trunc  Lengthening
IY               0.44    0.02    0.05       0    neutral
IH              -0.26   -0.37   -0.45       0    neutral
EH              -0.40   -0.94   -0.96      -1    shortened
AE              -0.88   -1.32   -1.42      -1    shortened
AH              -0.07   -0.07   -0.08       0    neutral
AA              -0.12   -0.06   -0.07       0    neutral
AO               0.83    1.65    1.46      +1    lengthened
UH               1.30    1.64    1.24      +1    lengthened
UW               3.00    2.65    2.39      +3    extra long
ER               1.66    1.38    1.63      +2    extra long
Scale constant  25.6    26.1    26.6

Normalized scale values of the Peterson & Barney vowel dataset, averaged

over speaker class, and reduced to discrete values of “lengthening.”

[8] H. F. V. Boshoff and E. C. Botha, “An acoustical analysis of the

cardinal vowels as spoken by Daniel Jones,” Proceedings S.A.

Symposium on Communications & Signal Processing, pp. 73-78.

IEEE, Piscataway, 1998.

[9] H. F. V. Boshoff and E. C. Botha, “The cardinal vowels revisited: Test

of a scaling discretization of formant space,” Proceedings Ninth Annual

S.A. Workshop Pattern Recognition, pp. 1-6, PRASA, Stellenbosch, 1998.

[10] H. F. V. Boshoff and E. C. Botha, “Investigating the limits of vowel

articulation,” South African Journal of Science, 2000.

[11] S. S. Stevens, “On the theory of scales of measurement,” Science, 103,

677-680, 1946.

[12] P. Ladefoged, A course in phonetics. Orlando: Harcourt Brace. 3rd ed.

1993.

[13] G. E. Peterson and H. L. Barney, “Control methods used in a study of

vowels,” Journal of the Acoustical Society of America, 24, 175-184,

1952.

Copyright Information

© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists,

or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
