[ieee africon 2007 - windhoek, south africa (2007.10.26-2007.10.28)] africon 2007 -...
“Lengthening” is a better vowel dimension than “rounding”
Hendrik F. V. Boshoff, Member, IEEE
Manuscript received 5 March 2007; revised 1 June 2007.
The author is with the University of Johannesburg, Johannesburg, South
Africa (e-mail: [email protected]).
1-4244-0987-X/07/$25.00 ©2007 IEEE.
Abstract—The impressionistic phonetic specification of vowel
quality is usually given in terms of the quasi-articulatory
dimensions “height,” “backness” and “rounding,” relative to the
quality of the cardinal vowels placed in a stylized vowel
quadrilateral. For some purposes, an acoustic approach to vowel
dimensions may be preferable. A system of reference vowels
created by a scaling discretisation of formant space (SDF) is used
in this study to derive new acoustic dimensions closely aligned to
the traditional dimensions mentioned above. Some implications of
the new acoustic dimensions for vowel classification are
investigated, using a well-known vowel dataset. Based on this
analysis, it is argued that “lengthening” is a more appropriate
vowel dimension than “rounding.”
Index Terms— Cardinal vowels, speech analysis, vowel
dimensions, vowel formants.
I. INTRODUCTION
In the vowel chart of the International Phonetic Association
(IPA), some vowels are labeled as “rounded” [1]. Often this
is a binary distinction contrasted with “unrounded,” but it may
also be used as a continuous dimension. Thus we find the
following statement in the Handbook of the IPA: “In the back
series of cardinal vowels ([ɑ ɔ o u]) lip-rounding progressively
increases, from none on [ɑ] to close rounding on [u]” [1, p12].
Roundedness or rounding has been separated from two other
vowel dimensions (“backness” and “height” are used by the
IPA), at least since the mid-nineteenth century, when A. M.
Bell broke with the i-a-u vowel triangle of Hellwag [2]. The
three labels are (at least originally) articulatory, and the
complete treatment of vowel quality in the Handbook [1] is
devoid of acoustic reference. However, the intermediate
cardinal vowels are placed at auditorily equal distances,
resulting in a quaint mix of factors leading the IPA to call their
vowel space “abstract” [1, p12]. The cardinal vowels and their
description were not invented by the IPA, of course, but largely
inherited from Daniel Jones [3], [4].
In this paper, a purely acoustic approach is taken to the
problem of choosing reference vowels. By the transformation
of formant frequencies, three new acoustic dimensions are
defined, denoted by H, B, and S. The acoustic H and B map
readily to height and backness respectively, but the relation of
S to rounding is more complex. The choice of symbol S is
inspired by the fact that it contains all scale information, while
the B-H plane is normalized. S is influenced by lengthening of
the vocal tract, which may be brought about by rounding the
lips, but also by other means.
The rest of this paper is structured as follows. The regular
geometric framework of acoustic reference vowels is briefly
introduced in Part II. In Part III the three new acoustic vowel
dimensions are developed, and illustrated using a well-known
dataset. The “lengthening” dimension is given special
consideration and contrasted with traditional “rounding.” The
conclusion drawn in Part IV is that “lengthening” is better for
general phonetic purposes.
II. REFERENCE VOWELS DERIVED FROM A SCALING
DISCRETIZATION OF FORMANT SPACE
It was mentioned above that the impressionistic reference
vowels employed by the IPA [1] are the cardinal vowels
defined by Daniel Jones [3]. The following procedure can be
used to define acoustic reference vowels (ARV), intended to
replace the cardinal vowels in an acoustic approach to
determining static vowel quality [5], [6]:
1) Measure the first three formant frequencies for every
frame of the vowel signal. Designate them as F1, F2, F3.
Acoustic formant analysis is a non-trivial exercise in
general [5], but recent advances in the field hold the
promise of providing reliable measurements [7].
2) Discretise the two ratios log(F2/F1) and log(F3/F2)
using the constant φ = (1 + √5)/2 to form a regular
grid. This reference grid is called the scaling
discretization of formant space (SDF), which partitions
vowel space into diamond-shaped cells. A one-to-one
mapping is possible between the cells of this grid and
the cardinal and other IPA vowels [5], [8], [9].
3) Define the geometric center point of every cell as the
acoustic reference vowel (ARV) associated with that
cell. Allow two forms of every ARV, provisionally
called “neutral” and “lengthened.” Labels for the ARV
are taken from the IPA [6].
[Fig. 1 image: grid of diamond-shaped cells labelled with IPA vowel symbols, drawn against the log(F1), log(F2) and log(F3) axes.]
Fig. 1. Acoustic reference vowels (ARVs) in the dimensionless plane
derived from a scaling discretization of formant space (SDF). The vowel to
the left of the vertical line in every cell is neutral and the one to the right
“lengthened.” Symbols are selected from the IPA.
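The second step of the procedure can be illustrated with a minimal Python sketch. It only assumes what the text states, namely that the two log-ratios are discretised with the constant φ. The grid spacing of one unit of log φ, the alignment of the cell boundaries at integer values, and the names `sdf_coordinates` and `sdf_cell` are assumptions made for illustration; they are not taken from [5] or [6].

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # discretisation constant from step 2


def log_phi(x: float) -> float:
    """Logarithm to base phi."""
    return math.log(x) / math.log(PHI)


def sdf_coordinates(f1: float, f2: float, f3: float) -> tuple[float, float]:
    """Return log(F2/F1) and log(F3/F2), measured in units of log(phi)."""
    return log_phi(f2 / f1), log_phi(f3 / f2)


def sdf_cell(f1: float, f2: float, f3: float) -> tuple[int, int]:
    """Hypothetical cell index: floor of each log-ratio.

    ASSUMPTION: cell boundaries sit at integer multiples of log(phi).
    The source does not specify the alignment of the grid.
    """
    x, y = sdf_coordinates(f1, f2, f3)
    return math.floor(x), math.floor(y)


# Formants of a uniform 17.5 cm tube and of a tube scaled by 1/1.2.
print(sdf_cell(500.0, 1500.0, 2500.0))
print(sdf_cell(600.0, 1800.0, 3000.0))
```

Because only ratios of formants enter, both calls give the same cell index, which is the sense in which the grid is independent of absolute scale.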
Fig. 1 shows the SDF, with IPA vowel symbols mapped to
every articulatorily accessible cell [10] in the grid.
The scheme of acoustic reference vowels has the attractive
property that it constitutes a ratio scale of measurement, the
highest level in Stevens’ theory [11], [6].
III. ACOUSTIC VOWEL DIMENSIONS
A. Rotation of log-formant space
Comparing Fig. 1 with the IPA vowel chart, the similarities
in vowel placement are clear. We may therefore expect the
horizontal direction to be related to “backness” and the vertical
direction to “height.” The following transformation constitutes
a simple rotation of the axes in log-formant space, and
provides a definition of the three new acoustic dimensions:
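\[
\begin{bmatrix} H \\ B \\ S \end{bmatrix}
=
\begin{bmatrix}
-\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{6}} & -\sqrt{\frac{2}{3}} & \frac{1}{\sqrt{6}} \\
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}}
\end{bmatrix}
\begin{bmatrix} \log F_1 \\ \log F_2 \\ \log F_3 \end{bmatrix}.
\tag{1}
\]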
It can easily be verified (e.g. in Matlab) that the matrix in
(1) is indeed purely a rotation, by checking that it is orthogonal
(so that its norm equals 1) and that its determinant is +1.
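The check can also be sketched in plain Python rather than Matlab. The matrix entries below are read from (1), with the two minus signs placed so that H grows with log(F3/F1) and B falls with log F2; that placement is an assumption based on the surrounding text. The helper names are hypothetical.

```python
import math

# Rows give H, B and S in terms of (log F1, log F2, log F3), as in (1).
# ASSUMPTION: sign placement inferred from the discussion of H and B'.
R = [
    [-1 / math.sqrt(2), 0.0, 1 / math.sqrt(2)],               # H
    [1 / math.sqrt(6), -math.sqrt(2 / 3), 1 / math.sqrt(6)],  # B
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],   # S
]


def gram(m):
    """Return the 3x3 product of m with its own transpose."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def is_rotation(m, tol=1e-12):
    """True if m is orthogonal (m m^T = I) with determinant +1."""
    g = gram(m)
    orthonormal = all(
        abs(g[r][c] - (1.0 if r == c else 0.0)) < tol
        for r in range(3)
        for c in range(3)
    )
    return orthonormal and abs(det3(m) - 1.0) < tol


print(is_rotation(R))
```

Testing the determinant as well as orthogonality matters because a reflection also has unit norm; flipping the sign of a single row of `R` makes `is_rotation` fail.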
The log formant frequency frame of reference and the new
acoustic dimension frame share the same origin, and the
rotation concentrates all scale information in S, leaving both H
and B dimensionless or normalized. Fig. 1 actually represents
the dimensionless B-H projection, with S perpendicular to it. Vowel
quality is therefore shown on the two-dimensional plane in which
all scale information is projected onto a single point. This is
exactly what we need if the vowel code is to be scaling, that is,
independent of absolute size.
B. The dimensionless B-H plane, “backness” and “height”
Even though the direction of B is as required, the zero point
is problematic. If we want all the vowels traditionally labeled
as “front” to have the same value of “backness,” we have to
start measuring on the line sloping 30° to the left of vertical,
instead of at the origin on the B axis. (Note that the same
problem exists in the IPA vowel quadrilateral.) It is possible to
define B´ in the desired way:
\[ B' = B + \frac{H}{\sqrt{3}}, \tag{2} \]
which yields
\[ B' = \sqrt{\frac{2}{3}} \, \log\frac{F_3}{F_2}. \tag{3} \]
This result is intuitively satisfying, as B´ depends only on
one of the log-ratios previously discretised. It may be
interpreted by saying that B´ is proportional to the logarithm of
the relative gap between the second and third formants. From
(1) it follows in the same way that H is proportional to the
logarithm of the relative gap between the first and third
formants. F3 is thus seen to play a normalizing role, and there
is a figure-and-ground shift in focus from formant positions to
the gaps between them.
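The gap interpretation can be checked numerically with a short sketch. It assumes the rows of (1) as H = (log F3 − log F1)/√2, B = (log F1 − 2 log F2 + log F3)/√6 and S = (log F1 + log F2 + log F3)/√3, and takes logarithms to base φ (an inference from the scale value of 25.6 quoted later; the base only rescales the results). The formant values are illustrative numbers, not measurements from any dataset.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def log_phi(x: float) -> float:
    """Logarithm to base phi (assumed base, see lead-in)."""
    return math.log(x) / math.log(PHI)


def hbs(f1: float, f2: float, f3: float) -> tuple[float, float, float]:
    """Rotate (log F1, log F2, log F3) into (H, B, S) as in (1)."""
    l1, l2, l3 = log_phi(f1), log_phi(f2), log_phi(f3)
    h = (l3 - l1) / math.sqrt(2)
    b = (l1 - 2 * l2 + l3) / math.sqrt(6)
    s = (l1 + l2 + l3) / math.sqrt(3)
    return h, b, s


def b_prime(f1: float, f2: float, f3: float) -> float:
    """Shifted backness B' = B + H / sqrt(3), as in (2)."""
    h, b, _ = hbs(f1, f2, f3)
    return b + h / math.sqrt(3)


# Illustrative formants (Hz) and the same vowel from a tract 1.2 times shorter.
f = (300.0, 2300.0, 3000.0)
g = tuple(1.2 * x for x in f)

print(b_prime(*f), math.sqrt(2 / 3) * log_phi(f[2] / f[1]))  # the two agree
print(hbs(*f)[:2], hbs(*g)[:2])                              # H and B unchanged
print(hbs(*g)[2] - hbs(*f)[2])                               # only S shifts
```

Scaling all three formants by a common factor leaves H, B and B´ untouched and shifts S by √3 times the logarithm of that factor, which is the normalizing role of F3 described above.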
It should be clear that H and B´ constitute normalized
acoustic dimensions which come very close to the articulatory
dimensions “height” and “backness” respectively.
The standard phonetic practice, following Ladefoged’s
influential book [12], uses the frequencies F1 and F2-F1 as
acoustic correlates of “height” and “backness” respectively.
This approach is not satisfactory, however, as it depends on
absolute frequency, and vowel quality in this plane therefore
needs different calibrations for speakers of different size.
One potential problem with H and B´ is that they are not
independent (see (2)). If we want independent dimensions, we
need to work with H and B.
An interesting feature of the SDF is the shape and
arrangement of the cells. According to this discretization, it is
possible for example to find a realization of the vowel [i] in
which H takes a lower value than that of some instance of [e].
It is yet to be determined experimentally whether this
prediction is accurate, and it will constitute a test of the theory.
C. The scale dimension and “rounding”
The third dimension resulting from the rotation of log-
formant space is not normalized. As mentioned before, all
scale information is indeed concentrated in S. The inevitable
result is that S combines information on both the scale of the
speaker (i.e. length of the vocal tract) and the “rounding” of
the spoken vowel. In order to determine vowel quality, these
two effects have to be teased apart. This problem could be
avoided with the other two dimensions by using the
dimensionless plane, but in the scaling direction, it is inherent.
A possible though complex clue to the “neutral” length of
the vocal tract is the time-averaged value of F0, the glottal tone
or voice frequency. More direct may be the average value of S
over speech segments containing several different vowels.
It is important to notice that S is proportional to the average of
the logarithms of the first three formant frequencies, which
means that it changes whenever the length of the vocal tract
varies, not only with variations in lip-rounding.
The newly defined scaling dimension S was tested by
applying it to the well-known dataset of Peterson and Barney
(P&B) [13]. This set contains two repetitions each of
pronunciations of ten different American English vowels in the
same context ([hVd]) by 33 men, 28 women and 15 children.
Peterson and Barney measured and recorded the values of F0
to F3 for each utterance.
During initial experiments, it was found that the scale values
for each of the three groups of speakers clustered fairly well. It
was decided that the resonant frequencies of F1=500 Hz,
F2=1500 Hz and F3=2500 Hz of a uniform acoustic tube of
length 17.5 cm, closed at one end and open at the other, would
be used as reference. This very roughly corresponds to the
formants that may be expected when a man articulates the
neutral vowel schwa ([ə]), which was unfortunately not
included among the ten P&B vowels. The scale value S for this
vector of formant frequencies can be calculated using (1), and
the result is 25.6. A normalized scale value can be defined as
follows:
\[ S'_m = 2 \cdot (25.6 - S) \quad \text{(men)}. \tag{4} \]
This yielded average normalized values of close to 0 for men,
−1 for women and −2 for children, for both the vowels AH (≈[ɐ])
and AA (≈[ɑ]) from the dataset. The factor 2 in (4) is
motivated purely by the fact that it gives integer values for the
women and children averages. The negative relationship
between S and S′ makes the latter more intuitive, as high
resonance frequencies (and therefore large values of S) are
associated with small scales.
In order to compare scale values across the groups, a
different normalization was applied to the women and children
data:
\[ S'_w = 2 \cdot (25.6 - S) + 1 \quad \text{(women)}, \tag{5} \]
\[ S'_c = 2 \cdot (25.6 - S) + 2 \quad \text{(children)}. \tag{6} \]
For women, this gives a neutral scale constant of 26.1, which
implies a neutral vocal tract length of 15.2 cm. For children,
the values are 26.6 and 13.3 cm. These implied lengths seem
reasonable.
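The numbers in this subsection can be reproduced with a short sketch. Two things are assumed here and not stated in the text: a speed of sound of 350 m/s for the tube resonances, and logarithms to base φ in (1), which is the base that yields the quoted value of 25.6. The function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
SPEED_OF_SOUND_CM_S = 35000.0  # ASSUMPTION: 350 m/s, not given in the text


def tube_formants(length_cm: float, n: int = 3) -> list[float]:
    """Resonances (2k - 1) c / (4 L) of a uniform tube closed at one end."""
    return [(2 * k - 1) * SPEED_OF_SOUND_CM_S / (4 * length_cm)
            for k in range(1, n + 1)]


def scale_s(formants: list[float]) -> float:
    """S row of (1), with logarithms to base phi (assumed base)."""
    return sum(math.log(f) / math.log(PHI) for f in formants) / math.sqrt(3)


def normalized_scale(s: float, offset: int = 0) -> float:
    """Equations (4) to (6): offset 0 for men, 1 for women, 2 for children."""
    return 2 * (25.6 - s) + offset


def implied_length_cm(scale_constant: float,
                      ref_length_cm: float = 17.5,
                      ref_constant: float = 25.6) -> float:
    """Tract length whose neutral S equals scale_constant.

    Scaling every formant by k shifts S by sqrt(3) * log_phi(k),
    and tube resonances are inversely proportional to length.
    """
    k = PHI ** ((scale_constant - ref_constant) / math.sqrt(3))
    return ref_length_cm / k


print(tube_formants(17.5))                   # reference formants in Hz
print(round(scale_s(tube_formants(17.5)), 1))
print(round(implied_length_cm(26.1), 1), round(implied_length_cm(26.6), 1))
```

Under these assumptions the reference tube gives S close to 25.6, and the scale constants 26.1 and 26.6 correspond to tract lengths close to the 15.2 cm and 13.3 cm quoted above.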
D. Measured results in the new coordinate system
The vowels in the P&B set are not cardinal vowels, nor even
reference vowels of any sort. However, it is still interesting to
see how they map to the new acoustic coordinate system.
Fig. 2 shows the average positions of the vowel IH (≈[ɪ]) for
the three groups on the dimensionless B-H plane. An
indication of the spread within each group is given by plotting
principal axis ellipses of one standard deviation around the
average point. The data clearly clusters in the cell of ARV [e].
The ellipses of the men and the children are almost
indistinguishable, while that of the women is somewhat
smaller.
The fact that speakers on the largest and the smallest scales
give almost identical results is taken as an indication of how
well the transformation normalizes away scale differences.
The discrete cells imply that systematic differences between
groups (like that between the men and the women) can be
accommodated, as long as the patterns remain within the same
cell. The transformation therefore need not project the vowels
of different speakers onto a single point, only into the same cell.
Fig. 3 has the normalized scale values for the same vowel
IH, calculated using the applicable equation from (4) to (6),
and shown individually. There is little evidence of systematic
differences between the groups after transformation. Again, it
seems that the normalization has been successful. The average
normalized scale values for men, women and children are -0.26,
-0.37 and -0.45 respectively.
For comparison, the corresponding pitch values measured
for each vowel are given in Fig. 4. Here the difference between
the groups is fairly obvious.
Graphs similar to those in Fig. 2 and Fig. 3 were plotted
for each of the ten vowels in the P&B dataset. While the front
vowels map quite well to the expected cells, the back vowels
show more spread. This tendency was already encountered in
the context of cardinal vowels [8], [9], where much more rigor
was employed in controlling the exact vowel production. What
is actually surprising is the extent to which this dataset matches
the ARV cells.
E. Lengthening as a vowel dimension
The normalized scale values calculated for the three groups
and ten vowel qualities in the P&B dataset are collected in
Table I. The measured values are fairly constant between
groups, taking into account that each group has been
normalized with a single empirical constant, independent of
vowel category.
[Fig. 2 image: “Peterson & Barney: Vowel IH”, plotted on the grid of ARV cells.]
Fig. 2. Average position (crosses) and principal axis ellipses of one standard
deviation for the Peterson & Barney vowel IH, separately shown for men,
women and children. The data clearly clusters in the cell of ARV [e].
[Fig. 3 image: “Scale relative to group scaling constant: Vowel IH”; horizontal axis: Utterance #; vertical axis: Normalised scale.]
Fig. 3. Normalized scale calculated with (4)-(6) of the Peterson & Barney
vowel IH, separately shown for men (+), women (o) and children (*). There
is little systematic difference between the groups after normalization.
It is possible to round these scale values to the nearest
integer (a somewhat arbitrary step, as there is no reason to
expect integer values). This has been done approximately in
the fifth column of Table I. The last column assigns each vowel
one of four “lengthening” levels, namely neutral, shortened,
lengthened and extra long.
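One way to reproduce the fifth and sixth columns of Table I is sketched below. The group values are copied from Table I; taking the unweighted mean of the three groups before rounding is an assumption (the text only says the rounding was done approximately), as is the rule that any level of 2 or more counts as extra long.

```python
# Normalized scale values from Table I: (men, women, children).
TABLE_I = {
    "IY": (0.44, 0.02, 0.05),
    "IH": (-0.26, -0.37, -0.45),
    "EH": (-0.40, -0.94, -0.96),
    "AE": (-0.88, -1.32, -1.42),
    "AH": (-0.07, -0.07, -0.08),
    "AA": (-0.12, -0.06, -0.07),
    "AO": (0.83, 1.65, 1.46),
    "UH": (1.30, 1.64, 1.24),
    "UW": (3.00, 2.65, 2.39),
    "ER": (1.66, 1.38, 1.63),
}


def lengthening(values):
    """Round the mean of the group values and map it to a level name.

    ASSUMPTION: unweighted mean over the three groups; levels of 2 or
    more are "extra long" and levels of -1 or less are "shortened".
    """
    level = round(sum(values) / len(values))
    if level <= -1:
        label = "shortened"
    elif level == 0:
        label = "neutral"
    elif level == 1:
        label = "lengthened"
    else:
        label = "extra long"
    return level, label


for vowel, values in TABLE_I.items():
    print(vowel, *lengthening(values))
```

With these assumptions the integer levels match the fifth column of Table I for all ten vowels, so the rounding there is consistent with averaging across the three groups.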
Considering this gross classification, it seems clear that there
are important differences between the new “lengthening” and
traditional “rounding.” The first surprise is IY (≈[i]), which
shows no sign of being shortened. It seems that the spreading
of the lips in [i] is a compensatory move to keep the vocal tract
neutral against the lengthening caused by the frontal raising of
the tongue.
The IPA states that “[…] in the front series [a ɛ e i] the lips
are neutral for [a], and become progressively more spread
through the series to [i]” [1, p13]. The lengthening pattern in
the front series of the P&B dataset is the opposite: there is
progressive shortening from [i] to [a], as can be verified using
Table I. The only two shortened vowels are therefore in
unexpected positions, if expectation is conditioned by
rounding.
The P&B back vowels are all lengthened, including UH
(≈[ɯ]), which is considered an unrounded vowel by the IPA.
UW (≈[u]) is extra long, as could be expected, so there may
still be a lengthening contrast between the two. Also extra long
is ER, the American rhoticised schwa, presumably due to the
hunched tongue. It turns up as a “front” vowel in the ARV cell
of [ɛ], a somewhat unexpected position for a variety of
schwa.
In summary, this dataset indicates that “lengthening” is not
the same as “rounding,” although the latter may contribute to
the former.
IV. CONCLUSION
The problem with “rounding” as an articulatory feature is
that it places undue emphasis on the visible part of the vocal
tract. The “lengthening” introduced here in the context of the
scale dimension of vowels is dependent on the whole tract. It
takes into account compensatory movements of all vocal
organs, of which the lips are only one set.
If the main emphasis is on teaching articulation, “rounding”
may be important. However, if one is engaged in creating a
system of vowel classification, “lengthening” gives a better
summary of all contributions to formant change. Additionally,
in our coordinate system, the scale dimension arises naturally
as the remainder of the transformation, after splitting off the two
dimensions that are normalized.
I therefore propose that “lengthening” is a more appropriate
vowel dimension than “rounding,” for general phonetic
purposes.
[Fig. 4 image: “Pitch: Vowel IH”; horizontal axis: Utterance #; vertical axis: Frequency [Hz].]
Fig. 4. Pitch values of the individual Peterson & Barney vowel utterances
IH, separately shown for men (+), women (o) and children (*).
TABLE I
NORMALIZED SCALE VALUES
Vowel            Men     Women   Children  Rounded  Lengthening
IY               0.44    0.02    0.05       0       neutral
IH              -0.26   -0.37   -0.45       0       neutral
EH              -0.40   -0.94   -0.96      -1       shortened
AE              -0.88   -1.32   -1.42      -1       shortened
AH              -0.07   -0.07   -0.08       0       neutral
AA              -0.12   -0.06   -0.07       0       neutral
AO               0.83    1.65    1.46      +1       lengthened
UH               1.30    1.64    1.24      +1       lengthened
UW               3.00    2.65    2.39      +3       extra long
ER               1.66    1.38    1.63      +2       extra long
Scale constant  25.6    26.1    26.6
Normalized scale values of the Peterson & Barney vowel dataset, averaged
over speaker class, and reduced to discrete values of “lengthening.”
REFERENCES
[1] Handbook of the International Phonetic Association. Cambridge:
Cambridge University Press, 1999.
[2] H. F. V. Boshoff and E. C. Botha, “On the structure of vowel space: A
genealogy of general phonetic concepts,” Proceedings ICSLP 5, Sydney,
1998.
[3] D. Jones, An outline of English phonetics, 9th ed. Cambridge: Heffer,
1964.
[4] D. Abercrombie, Elements of general phonetics. Edinburgh: Edinburgh
University Press, 1967.
[5] H. F. V. Boshoff, The acoustic discretization of vowel space, Ph.D.
thesis, University of Stellenbosch, 1997. In Afrikaans.
[6] H. F. V. Boshoff, “A new acoustic reference frame for vowels,”
Proceedings XIVth International Congress on Phonetic Sciences, San
Francisco, 1999.
[7] L. Deng, L. J. Lee, H. Attias, and A. Acero, “Adaptive Kalman filtering
and smoothing for tracking vocal tract resonances using a
continuous-valued hidden dynamic model,” IEEE Transactions on Audio,
Speech and Language Processing, pp. 13–32, Jan. 2007.
[8] H. F. V. Boshoff and E. C. Botha, “An acoustical analysis of the
cardinal vowels as spoken by Daniel Jones,” Proceedings S.A.
Symposium on Communications & Signal Processing, pp. 73–78.
Piscataway: IEEE, 1998.
[9] H. F. V. Boshoff and E. C. Botha, “The cardinal vowels revisited: Test
of a scaling discretization of formant space,” Proceedings Ninth Annual
S.A. Workshop Pattern Recognition, pp. 1–6, PRASA, Stellenbosch, 1998.
[10] H. F. V. Boshoff and E. C. Botha, “Investigating the limits of vowel
articulation,” South African Journal of Science, 2000.
[11] S. S. Stevens, “On the theory of scales of measurement,” Science, 103,
677–680, 1946.
[12] P. Ladefoged, A course in phonetics, 3rd ed. Orlando: Harcourt Brace,
1993.
[13] G. E. Peterson and H. L. Barney, “Control methods used in a study of
vowels,” Journal of the Acoustical Society of America, 24, 175–184,
1952.