neuromorphic detection of vowel representation spaces · 2018. 2. 11. · the full vowel triangle...

Neuromorphic Detection of Vowel Representation Spaces

Pedro Gómez-Vilda1, José Manuel Ferrández-Vicente2, Victoria Rodellar-Biarge1, Agustín Álvarez-Marquina1, Luis Miguel Mazaira-Fernández1, Rafael Martínez-Olalla1, and Cristina Muñoz-Mulas1

1 Grupo de Informática Aplicada al Tratamiento de Señal e Imagen, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo, s/n, 28660 Madrid
pedro@pino.datsi.fi.upm.es

2 Dpto. Electrónica, Tecnología de Computadoras, Univ. Politécnica de Cartagena, 30202, Cartagena

Abstract. In this paper a layered architecture to spot and characterize vowel segments in running speech is presented. The detection process is based on neuromorphic principles, such as the use of Hebbian units in layers to implement lateral inhibition, band probability estimation and mutual exclusion. Results are presented showing how the association between the acoustic set of patterns and the phonologic set of symbols may be created. Possible applications of this methodology are to be found in speech event spotting, in the study of pathological voice and in speaker biometric characterization, among others.

1 Introduction

Speech processing is evolving from classical paradigms, more or less statistically oriented, to psycho- and physiological paradigms more closely inspired by the facts of speech perception [1]. Vowel representation spaces are especially important within speech perception. These may be formally defined as mappings from the space of acoustic representations at the cortical level to the set of perceptual symbols defined as vowels at the phonologic or linguistic level [12]. These relations can be expressed using graphs and Self Organizing Maps [10]. The aim of the present work is to mimic some of the most plausible physiological mechanisms used in the Auditory Pathways and Centres of Human Perception for vowel spotting and characterization [11]. The detection and characterization of vowel spaces is of great importance in many applications, such as pathological characterization or forensic speaker recognition; therefore the present work concentrates on the detection and characterization of specific vowel representation spaces by neuromorphic methods. The paper is organized as follows: a brief description of the nature of vowels, based on formant characteristics and dynamics, is given in section 2. In section 3 the layers of a Neuromorphic Speech Processing Architecture based on Hebbian Units [7] implementing the detection paradigms are presented. In section 4 some results from simulations are given, accompanied by a brief discussion. Conclusions are presented in section 5.

2 Nature and Structure of Vowels

Speech may be described as a time-running acoustic succession of events (or phonetic sequence, see Fig. 2, top) [7]. Each event is associated with an oversimplified phonation paradigm composed of vowels and non-vowels. The acoustic-phonetic nature of these beads is based on the association of the first two resonances of the Vocal Tract, which are referred to as 'formants' and denoted F1 and F2. F1, in the range of 200-800 Hz, is the lowest. F2 sweeps a wider range, from 500 to 3000 Hz. From this point of view the nature of vowels may be described by formant stability during a time interval longer than 30 ms, and by relative position in the F2 vs F1 space, in what is often called the 'Vowel Triangle' (see Fig. 1).

Non-vowel sounds are characterized by unstable (dynamic) formants, by not having a representation inside the vowel triangle, or by lacking a neat F2 vs F1 pattern. Sounds such as [u>, j , 6, d, J, g,p, t, c, k,/3, S, £, 7, r, r] are included in the first class. The second class comprises sounds that are vowel-like in their stability, such as [/, A, T, v, z, m, n, n, rj], but whose representation spaces lie outside the area delimited by the triangle [i, a, u]. The third group includes unvoiced sounds such as [/, s, ¡p, 9, f, x, ], which are articulated without phonation (vocal fold vibration) and produce smeared pseudo-formants in the spectrum resulting from turbulent air flow in the vocal tract. The International Phonetic Alphabet (IPA) [2] has been used,

[Figure 1 axes: Second Formant (F2: 500-3000 Hz) vs. First Formant (F1: 200-800 Hz)]

Fig. 1. Subset of the Reference Vowel Triangle for the case under study. The plot of F2 (ordinate) vs F1 (abscissa) is the one classically used in Linguistics. The vowel set i, e, a, o, u is sometimes referred to as the cardinal set. The number of vowels differentiated by a listener (full line) depends on the phonologic coding of each language. Other acoustic realizations (dashed line) are commonly assigned to nearby phonologic representations. For instance, in the case under study the acoustic realization [æ] in Spanish could be perceptually assigned by a listener to /a/.

[Figure 2 panels: time series; First and Second Formants over the spectrogram; formant plots with axes First Formant (Hz) and Second Formant (Hz)]

Fig. 2. Top: time series of the utterance -es hábil un solo día- ([esal3lLOvnsolodias]) uttered by a male speaker. Middle: Adaptive Linear Prediction Spectrogram (grey background) and first two formants (superimposed in color). The color dots mark the positions of each pair (F1, F2) from green (the oldest) to red (the most recent). An approximate phonetic labeling is given as a reference. Bottom Left: Formant plot of F2 vs F1. Bottom Right: Same plot as a Formant Chart commonly used in Linguistics. The black circles give the centroids of the vowel triangle extremes and its center of gravity. The blue triangle and circles give the limit positions of the five cardinal vowels /i/, /e/, /a/, /o/, /u/ (male speaker in blue, female in magenta). These plots show the formant trajectories of the utterance. There is color correspondence between the bottom and middle templates to track formant trajectories on the time axis.

with symbols between square brackets [a] and bars /a/ denoting phones (acoustic representations) and phonologic representations, respectively. A target sentence is used as an example in Fig. 2, which reproduces a spectrogram with both static and dynamic formant patterns. The sentence -es hábil un solo día- covers the full vowel triangle in Spanish, although acoustically some of the vowels are not extreme. Formants are characterized in this spectrogram (middle template) by darker energy envelope peaks. What can be observed in the figure is that the vowels and vowel-like sounds correspond to stable positions of the formants.

3 Neuromorphic Computing for Speech Processing

The term 'neuromorphic' refers to the emulation of information processing by neurologic systems. As far as speech is concerned, it has to do with the neuronal units and circuits found in the Auditory Pathways and Centres. The functionality of these structures is becoming better understood as neurophysiology advances [3] [13] [15]. Preliminary work has been carried out on the characterization of speech dynamics by the Auditory Cortex for consonant description [4] [5], where a Neuromorphic Speech Processing Architecture (NSPA) based on Hebbian Units [7] was proposed and widely discussed. The present paper focuses on the sections of the NSPA specifically devoted to vowel characterization. A general description is given in Fig. 3.

Fig. 3. Vowel processing and representation sections of the Neuromorphic Speech Processing Architecture described in [4] [5]. Upper data-flow pipeline: Spectrogram Estimation Front-End, Lateral Inhibition Formant Profiling, Tonotopic Band Tracking, Vowel Band Grouping, Vowel Assignment by Mutual Exclusion and Vowel Temporal Clipping. Lower data-flow pipeline: Static Formant Tracking and Temporal Static Masking (see text for a detailed description).

Spectrogram Estimation Front-End. This section provides a spectral description of speech $s(n)$ evolving in the time domain (a spectrogram such as the one in Fig. 2, middle). A matrix $X_{CF}(m,n)$ is produced describing frequency activity in time (where $n$ is the time index) as the result of a linear layer of characteristic frequency (CF) units. These units may be seen as roughly related to nerve fibres in the Auditory Periphery, each one reacting to a specific frequency channel (where $m$ is the frequency index). In the present case Linear Predictive Coding has been used to build the spectrogram:

$$X_{CF}(m,n) = 20 \log_{10} \left| 1 + \sum_{k=1}^{K} a_{k,n}\, e^{-jmk\Omega\tau} \right|^{-1} \tag{1}$$

where $a_{k,n}$, $1 \le k \le K$, is the set of coefficients of the equivalent $K$-order Inverse Filter, $\Omega$ the frequency resolution (separation between channels) and $\tau$ the sampling interval.
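As an illustration of how one frame of such an LPC log-spectrum could be evaluated, the following is a minimal sketch, not the authors' implementation. It assumes the inverse filter has the form $1 + \sum_k a_k e^{-jmk\Omega\tau}$ (the sign convention is an assumption), that $\Omega$ is given in rad/s, and that the coefficients are already available; the function name `lpc_log_spectrum` and the test resonance are hypothetical.

```python
import cmath
import math

def lpc_log_spectrum(a, num_channels, omega, tau):
    """Log-magnitude LPC spectrum (dB) of one frame, channels m = 0..num_channels-1.

    a     -- inverse-filter coefficients a_1..a_K of this frame (a_0 = 1 assumed)
    omega -- channel separation in rad/s
    tau   -- sampling interval in s
    """
    spectrum = []
    for m in range(num_channels):
        # Inverse filter response: 1 + sum_k a_k * exp(-j*m*k*omega*tau)
        inverse = 1.0 + sum(
            a_k * cmath.exp(-1j * m * k * omega * tau)
            for k, a_k in enumerate(a, start=1)
        )
        # 20*log10(1/|A|) equals -20*log10(|A|)
        spectrum.append(-20.0 * math.log10(abs(inverse)))
    return spectrum

# Hypothetical test frame: a single resonance near 500 Hz, sampled at 8000 Hz
fs = 8000.0
r, theta = 0.95, 2.0 * math.pi * 500.0 / fs
a = [-2.0 * r * math.cos(theta), r * r]
delta_f = fs / 512.0  # assumed channel spacing in Hz
spec = lpc_log_spectrum(a, 256, 2.0 * math.pi * delta_f, 1.0 / fs)
peak_channel = max(range(len(spec)), key=lambda m: spec[m])
print(peak_channel, peak_channel * delta_f)  # channel index and its frequency in Hz
```

The spectral peak of the test frame falls in the channel closest to the resonance, which is the behaviour the CF layer relies on to expose formants.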

Lateral Inhibition Formant Profiling. The activity of neighbouring fibres is reduced by lateral inhibition [6], so that formant descriptions are represented at the lowest cost:

$$X_{LI}(m) = u\left( \sum_{i=-r}^{r} w_{LI}(i)\, X_{CF}(m+i) - \vartheta_{LI}(m) \right) \tag{2}$$

where $w_{LI}(i)$ are the weights of the lateral inhibition connections. Typically, for a set of five weights ($r=2$) these may be set to configurations such as $\{-1/6, -1/3, 1, -1/3, -1/6\}$, reproducing the classical Mexican Hat. The function implicit in (2) may be seen as a Hebbian Unit modelling membrane integration and threshold ($\vartheta$) by weighted average and nonlinear conforming. Therefore $u(\cdot)$ is a nonlinear activation function (step or sigmoid) firing if membrane activity exceeds a specific threshold $\vartheta_{LI}(m)$.
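The effect of the Mexican Hat weights can be seen in a small sketch of (2). It assumes a step activation, a single threshold shared by all channels, and clamping of the indices at the borders of the frequency axis; none of these choices is stated in the text, and the function name `lateral_inhibition` is hypothetical.

```python
MEXICAN_HAT = (-1 / 6, -1 / 3, 1.0, -1 / 3, -1 / 6)  # five weights, r = 2

def lateral_inhibition(x_cf, weights=MEXICAN_HAT, threshold=0.5):
    """Step-activated lateral inhibition over one spectral frame x_cf."""
    r = len(weights) // 2
    last = len(x_cf) - 1
    fired = []
    for m in range(len(x_cf)):
        membrane = 0.0
        for i in range(-r, r + 1):
            idx = min(max(m + i, 0), last)  # clamp at the borders (assumption)
            membrane += weights[i + r] * x_cf[idx]
        fired.append(1 if membrane > threshold else 0)
    return fired

# A broad spectral peak is reduced to the single channel at its top
print(lateral_inhibition([0, 0, 1, 4, 1, 0, 0]))  # [0, 0, 0, 1, 0, 0, 0]
```

Because the five weights sum to zero, flat regions of the spectrum produce no membrane activity, so a strictly positive threshold is used here to avoid firing on rounding noise.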

Tonotopic Band Tracking. Vowel detection is based on the combination of activity from neighbouring CF fibres by band tracking units (BTUs), implemented as Hebbian Units:

$$X_{BT}(s) = u\left( \sum_{i=-\beta_s}^{\beta_s} w_{BT}(i,s)\, X_{LI}(\gamma_s + i) - \vartheta_{BT}(s) \right) \tag{3}$$

where $s$ is the band index, and $\gamma_s$ and $\beta_s$ are the indices of the centre frequency and half the bandwidth, respectively. In this case the weights of the summation $w_{BT}$ are selected to reproduce the output probability of the band according to a marginal probability density function (gaussian, with $\mu_s$ and $\sigma_s$ the band mean and standard deviation):

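$$w_{BT}(i,s) = \Gamma(\xi_i \mid \mu_s, \sigma_s) = \frac{1}{\sigma_s\sqrt{2\pi}}\, e^{-\frac{(\xi_i-\mu_s)^2}{2\sigma_s^2}} \tag{4}$$

$$-\beta_s\Omega \le \xi_i \le \beta_s\Omega; \quad \xi_i = i\Omega; \quad \mu_s = \gamma_s\Omega; \quad \sigma_s = \beta_s\Omega$$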
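A band tracking unit with gaussian weights might be sketched as follows. This is an illustration under assumptions: the gaussian is taken to be centred on the band centre (the offset $i\Omega$ is measured from the centre channel), the standard deviation is $\beta_s\Omega$, and the activation is a step; the function names and the toy frame are hypothetical.

```python
import math

def gaussian_band_weights(beta, omega):
    """Gaussian weights for offsets i = -beta..beta, with sigma = beta * omega."""
    sigma = beta * omega
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [
        norm * math.exp(-((i * omega) ** 2) / (2.0 * sigma ** 2))
        for i in range(-beta, beta + 1)
    ]

def band_tracking_unit(x_li, gamma, beta, omega, threshold):
    """Return (membrane activity, firing) of the band centred on channel gamma."""
    w = gaussian_band_weights(beta, omega)
    membrane = sum(w[i + beta] * x_li[gamma + i] for i in range(-beta, beta + 1))
    return membrane, (1 if membrane > threshold else 0)

# Toy frame of 20 channels with one active formant channel
inside = [0.0] * 20
inside[10] = 1.0    # formant at the band centre
outside = [0.0] * 20
outside[16] = 1.0   # formant outside the band 7..13
print(band_tracking_unit(inside, 10, 3, 16.0, 0.005))
print(band_tracking_unit(outside, 10, 3, 16.0, 0.005))
```

Only the frame whose formant lies inside the band drives the unit above threshold; the membrane value before thresholding is what the text interprets as a band probability.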

Vowel Band Grouping. Once a sufficient number of BTUs have tuned to their respective frequency spaces, they must be combined among themselves to represent vowel activity as ordered pairs $\{X_{BT}(i), X_{BT}(j)\}$. This combination strategy is very much language-dependent, based on a previous agreement among the speakers of the language. As a matter of fact each language has developed its own encoding table, which finds its counterpart in the representation spaces to be found in the Auditory Centres. As an example, the encoding table for the five cardinal vowels [a, e, i, o, u] of standard Spanish is shown in Table 1. Other languages are known to have a larger symbol system, in which case the phonological vowel set would be correspondingly larger.

Table 1. Phonol. Formant Association Table for Spanish

BTU's F2/F1 (Hz)   550-850   700-1100   900-1500   1400-2400   1700-2900
220-440            /u/       aliased    aliased    aliased     /i/
300-600            void      /o/        aliased    /e/         void
550-950            void      void       /a/        void        void

This configuration is the result of averaging estimations from 8 male speakers; a similar table could be produced for female speakers. The positions marked as 'void' correspond to non-vowel sounds (second class), whereas the positions marked as 'aliased' may be ascribed to nearby valid vowel representation spaces showing a larger probability function with respect to the acoustic model.
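The association table can be read as a lookup from a formant pair to candidate vowels. The sketch below is illustrative only: the band limits are those of Table 1, the placement of /u/, /i/ and /e/ in the cells is an assumption based on their usual formant positions, and the function name `candidate_vowels` is hypothetical.

```python
F1_BANDS = [(220, 440), (300, 600), (550, 950)]            # rows (Hz)
F2_BANDS = [(550, 850), (700, 1100), (900, 1500),
            (1400, 2400), (1700, 2900)]                     # columns (Hz)
# Cell labels as read from Table 1 (vowel placement partly assumed)
TABLE = [
    ["/u/", "aliased", "aliased", "aliased", "/i/"],
    ["void", "/o/", "aliased", "/e/", "void"],
    ["void", "void", "/a/", "void", "void"],
]

def candidate_vowels(f1, f2):
    """Vowels whose (F1 band, F2 band) cell contains the formant pair."""
    found = set()
    for row, (lo1, hi1) in enumerate(F1_BANDS):
        for col, (lo2, hi2) in enumerate(F2_BANDS):
            label = TABLE[row][col]
            if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2 and label.startswith("/"):
                found.add(label)
    return sorted(found)

print(candidate_vowels(700, 1200))  # ['/a/']
print(candidate_vowels(350, 2000))  # ['/e/', '/i/']
```

Because the bands overlap, a single formant pair may fall into more than one vowel cell, as in the second call; this kind of ambiguity is what a subsequent exclusion stage has to resolve.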

Vowel Assignment by Mutual Exclusion. The vowel representation spaces must be unambiguously coded to bear plausible meaning to the listener. Therefore a strong exclusion mechanism is proposed, which is activated each time enough activity is detected simultaneously by several units in a specific acoustic space, so that the vowel showing the largest activity or detection probability reacts as a 'winner-takes-all', silencing other possible vowel candidates. A neural circuit combines band activities in pairs according to the following paradigm:

$$X_p(\nu) = w_{p1} X_{BT}(s_1) \otimes w_{p2} X_{BT}(s_2)$$

$$X_a(\nu) = u\left(X_p(\nu) - \vartheta_a(\nu)\right) \tag{5}$$

where $X_p(\nu)$ may be seen as the activation probability for vowel $\nu$ given the input template $X_{CF}(m)$, and $\nu$ is the index to the set of vowels in the phonological system:

$$X_p(\nu) = p(\nu \mid X_{CF}(m)); \quad \nu \in \{u, o, a, i, e\} \tag{6}$$

In turn, weights $w_{p1}$ and $w_{p2}$ encode the relative probabilities of the respective formants in the detection of the vowel. The symbol ($\otimes$) represents the logical operator and, which may also be implemented by a Hebbian Unit. The mutual exclusion among representation spaces is governed by the following combination paradigm:

$$X_e(\nu) = u\left(E(\nu_1,\nu_2)\, X_a(\nu) - \vartheta_e(\nu)\right) \tag{7}$$

where $E(\nu_1,\nu_2)$ is the mutual exclusion matrix, pre-wired in the present case as:

E(ν1, ν2) = [5 × 5 matrix; its entries take the values +1.0, -1.0, -0.2 and 0.0, but their row and column positions are not legible in this copy]   (8)

The elements in the main diagonal are set to +1.0, each vowel probability exciting the next unit (solid arrow in Fig. 3), whereas it acts as a strong, weak or neutral inhibitory input (-1.0, -0.2, +0.0) to other vowels (dashed arrows). Equation (7) is a discriminant function [8] based on Bayesian Decision Theory using log likelihood ratios:

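$$L_e(\nu) = \log \frac{p(X_{CF} \mid \nu)}{[\text{denominator not legible}]}; \qquad X_e(\nu) = \begin{cases} 1, & L_e(\nu) > \vartheta_e(\nu) \\ 0, & L_e(\nu) \le \vartheta_e(\nu) \end{cases} \tag{9}$$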
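The winner-takes-all behaviour of the exclusion stage can be illustrated with a small sketch. The matrix below is hypothetical: it only follows the stated pattern (+1.0 on the diagonal, -1.0, -0.2 or 0.0 elsewhere) and is not the matrix of (8); the vowel order, the threshold and the function name are also assumptions.

```python
VOWELS = ["u", "o", "a", "i", "e"]

# Hypothetical exclusion matrix (rows: receiving vowel, columns: sending vowel).
# These are NOT the published entries of (8); only the value set is taken from the text.
E_ASSUMED = [
    [ 1.0, -1.0, -0.2,  0.0,  0.0],
    [-1.0,  1.0, -1.0,  0.0, -0.2],
    [-0.2, -1.0,  1.0, -0.2, -1.0],
    [ 0.0,  0.0, -0.2,  1.0, -1.0],
    [ 0.0, -0.2, -1.0, -1.0,  1.0],
]

def mutual_exclusion(x_a, exclusion, threshold=0.05):
    """Step-activated exclusion layer: each vowel is excited by itself and inhibited by rivals."""
    fired = []
    for row in exclusion:
        membrane = sum(e * x for e, x in zip(row, x_a))
        fired.append(1 if membrane > threshold else 0)
    return fired

# /i/ and /e/ are both active, /e/ more strongly; only /e/ survives
x_a = [0.0, 0.0, 0.1, 0.6, 0.9]
x_e = mutual_exclusion(x_a, E_ASSUMED)
print(dict(zip(VOWELS, x_e)))  # {'u': 0, 'o': 0, 'a': 0, 'i': 0, 'e': 1}
```

With strong mutual inhibition between neighbouring vowels, the weaker candidate is pushed below threshold while the stronger one keeps a positive net activity.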

Vowel Temporal Clipping. This step adds the stability property demanded for vowel sounds. A control signal $Z_{V/C}(n)$, marking the temporal segments or intervals where formants are stable within some limits, is used to inhibit or enable the expression of each vowel by logical and functions ($\otimes$) as defined in (5):

$$X_d(\nu) = u\left(Z_{V/C} \otimes X_e(\nu) - \vartheta_d(\nu)\right) \tag{10}$$

Static Formant Tracking. The temporal clipping signal is estimated by tracking the segments where the first two formants remain relatively stable. This activity is captured using mask-based neuromorphic units, as already explained in [4] [5], which process the spectrogram as a true auditory image [8]:

$$X_{SF}(m,n) = u\left( \sum_{p=-P}^{P} \sum_{q=0}^{Q} w_{SF}(p,q)\, X(m+p, n-q) - \vartheta_{SF}(m) \right) \tag{11}$$

The weight matrix $w_{SF}(p,q)$ is a bell-shaped histogram displaced along the time index ($q$). Practical values for $P$ and $Q$ are 4 and 8, respectively, resulting in a 9×9 mask.
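The following sketch shows how a 9×9 mask of this kind responds to a stable formant and to a moving one. It is an illustration under assumptions: the bell shape is modelled as a separable gaussian peaked at $p = 0$, $q = 0$, the widths `sigma_p` and `sigma_q` are arbitrary, samples outside the image are ignored, and the function names and toy spectrograms are hypothetical.

```python
import math

def static_mask(P=4, Q=8, sigma_p=1.5, sigma_q=3.0):
    """Normalised (2P+1) x (Q+1) bell-shaped mask (separable gaussian, assumed shape)."""
    mask = [
        [math.exp(-p * p / (2 * sigma_p ** 2)) * math.exp(-q * q / (2 * sigma_q ** 2))
         for q in range(Q + 1)]
        for p in range(-P, P + 1)
    ]
    total = sum(sum(row) for row in mask)
    return [[w / total for w in row] for row in mask]

def static_formant_activity(X, m, n, mask, P=4, Q=8):
    """Membrane activity at channel m, frame n; X is indexed as X[channel][frame]."""
    acc = 0.0
    for p in range(-P, P + 1):
        for q in range(Q + 1):
            mm, nn = m + p, n - q
            if 0 <= mm < len(X) and 0 <= nn < len(X[0]):  # ignore samples off the image
                acc += mask[p + P][q] * X[mm][nn]
    return acc

mask = static_mask()
# Toy 20 x 20 spectrograms: a formant fixed at channel 10, and one rising a channel per frame
stable = [[1.0 if m == 10 else 0.0 for n in range(20)] for m in range(20)]
moving = [[1.0 if m == n - 5 else 0.0 for n in range(20)] for m in range(20)]
a_stable = static_formant_activity(stable, 10, 15, mask)
a_moving = static_formant_activity(moving, 10, 15, mask)
print(round(a_stable, 3), round(a_moving, 3))
```

The stable ridge stays under the central row of the mask for all nine past frames and accumulates roughly twice the activity of the moving ridge, so a threshold placed between the two values separates static from dynamic formants.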

Temporal Static Masking. Stability has to be detected separately on the first two formants and then combined. Two independent units, $\varphi_1$ and $\varphi_2$, are tuned to two frequency bands centred at $(\gamma_1, \gamma_2)$ with half bandwidths $(\beta_1, \beta_2)$, similarly to (3):

$$X_{BF}(\varphi) = u\left( \sum_{i=-\beta_\varphi}^{\beta_\varphi} w_{BF}(i,\varphi)\, X_{SF}(\gamma_\varphi + i) - \vartheta_{BF}(\varphi) \right) \tag{12}$$

The weights of the integration function are fixed as gaussian distributions following (4). The fusion of formant masking units is carried out by a classical and operator:

$$Z_{V/C} = u\left( X_{BF}(\varphi_1) \otimes X_{BF}(\varphi_2) - \vartheta_{V/C} \right) \tag{13}$$

This signal is used in (10) to validate the intervals of formant stable activity which can be associated to vowel representation spaces.
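The fusion in (13) and the clipping in (10) can be sketched on binary per-frame signals. This is a simplified illustration: all signals are assumed to be already thresholded to 0/1, the and operator is taken as a bitwise AND, and the function names and frame values are hypothetical.

```python
def fuse_stability(stable_f1, stable_f2):
    """Masking signal: a frame is stable only if both formant bands are stable."""
    return [a & b for a, b in zip(stable_f1, stable_f2)]

def temporal_clip(z_vc, x_e):
    """Keep vowel firings only on frames enabled by the masking signal."""
    return {vowel: [z & x for z, x in zip(z_vc, firing)]
            for vowel, firing in x_e.items()}

# Six hypothetical frames
z_vc = fuse_stability([0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 0, 1])
x_e = {"e": [1, 1, 0, 0, 0, 0], "i": [0, 0, 1, 1, 1, 1]}
x_d = temporal_clip(z_vc, x_e)
print(z_vc)  # [0, 1, 1, 0, 0, 1]
print(x_d)   # {'e': [0, 1, 0, 0, 0, 0], 'i': [0, 0, 1, 0, 0, 1]}
```

Vowel firings on frames where either formant is still moving are suppressed, which restricts each detected vowel to its most stable segment.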


4 Results and Discussion

In what follows some results from processing the model sentence in Fig. 2 with the proposed structure are shown. The details of the architecture are the following: $1 \le m \le M = 512$ CF fibre units are used, defining a frequency resolution of 16 Hz for a sampling frequency of 8000 Hz. A spectrum frame is produced every 2 ms, defining a stream of approximately 500 frames per second. The dimensions of the BTUs are defined as in Table 1. An example of the operation of BTUs $X_{BT}(220-440)$ and $X_{BT}(1800-3000)$ and the formant fusion unit $X_a(/i/)$ is shown in Fig. 4.
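The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that the 16 Hz resolution is the sampling frequency divided by the number of CF units, which is one plausible reading of the text rather than a stated fact; the variable names are illustrative.

```python
fs_hz = 8000        # sampling frequency
num_units = 512     # CF fibre units
hop_ms = 2          # one spectrum frame every 2 ms

resolution_hz = fs_hz / num_units          # channel spacing under the assumed reading
frames_per_second = 1000 // hop_ms         # frame rate
samples_per_hop = fs_hz * hop_ms // 1000   # new samples between consecutive frames

print(resolution_hz, frames_per_second, samples_per_hop)  # 15.625 500 16
```

Under this reading the channel spacing is 15.625 Hz, which rounds to the quoted 16 Hz, and each frame advances by 16 samples.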

[Figure 4 panel title: First Formant Activity]

Fig. 4. Top: Activity of BTU $X_{BT}(220-440)$: input activity at the unit membrane before (blue) and after integration (red), and firing after threshold (green). Middle: idem for $X_{BT}(1800-3000)$. Bottom: Fusion of both BTUs in unit $X_a(/i/)$ (in green). The spectrogram is given as a reference.

This unit selects vowel segments corresponding to [I] or [i], and to [e] (first segment, between 0.04 and 0.13 s). This is consistent with the ability of any BTU to capture activity from acoustic spaces overlapping in part with neighbouring units, as explained before. When the respective activities of both $X_a(/e/)$ and $X_a(/i/)$ are subject to mutual exclusion, the first segment is assigned to /e/ (cyan) and the last two are captured by /i/ (blue), as seen in Fig. 5. Vowel detection is evident after this operation.

The use of the temporal static masking signal $Z_{V/C}$ helps in removing certain ambiguities in vowel-consonant assignments, as may be seen in Fig. 6. The vowel intervals have been delimited to the most stable segments of the utterance. Table 2 gives a detailed description of the detection process.

[Figure 5 panel titles: Relative Vowel Probabilities; Vowel Detection by Probability Mutual Exclusion]

Fig. 5. Top: Probability estimates for the five vowels at layer $X_a(\nu)$. The first two formants are superimposed for reference, as given by layer $X_{LI}(m)$. Bottom: Activity of layer $X_e(\nu)$. Vowel color reference: /i/-blue, /e/-cyan, /a/-green, /o/-yellow, /u/-red.

Table 2. Vowel detection results

Interval (s)   Observations
0.04-0.13      [e] is detected
0.13-0.21      void (sibilant [s])
0.21-0.27      [a] is detected
0.27-0.30      [æ] is detected as /e/
0.31-0.35      void (approximant [β])
0.35-0.41      [i] is detected
0.41-0.50      void (lateral [l])
0.50-0.53      [v] is detected as /o/
0.53-0.69      void (nasal [n] and a sibilant [s])
0.69-0.76      [o] is detected as /o/
0.76-0.77      void (lateral [l])
0.77-0.86      [o] is detected as /o/
0.86-0.89      void (approximant [δ])
0.89-0.96      [i] is detected
0.96-1.05      unstable [i → e → æ] is fragmentarily detected as /e/

[Figure 6 panel titles: Vowel Presence Detection from Stable CF Units; Specific Vowel Activity from Band CF Units; horizontal axes: Time (sec.)]

Fig. 6. Top: Output activity of the temporal masking unit $Z_{V/C}$. Bottom: Activity of layer $X_d(\nu)$.

5 Conclusions

Through the present work it has been shown that vowel characterization can be carried out based on the criteria of formant stability and relative position inside the vowel triangle of the speaker, using neuromorphic (Hebbian) processing units (neurons). It has also been shown that band categorization can be carried out using gaussians as marginal distributions. From this point of view the membrane activity of band categorization neurons (after integration) may be regarded as conditional probabilities. Output firing rates are to be seen as the results of decision-making algorithms when mutual exclusion is applied to competing conditional probabilities. The process relies strongly on the use of lateral inhibition to profile formants and to establish vowel representation spaces in a "winner-takes-all" strategy. This implies a decision problem which may produce unexpected results, as in the interval 0.50-0.53, where a rather obscure vowel [v] is mistaken for /o/. This fact demands a short explanation: although the resulting vowel space is not fully represented by /o/, the acoustic-phonetic space controlled by this symbol is so ubiquitous that it is able to seize the surrounding space, which is not seriously contested by any of the other vowel representations except /u/ (see the mutual exclusion matrix in (8)). This result is deliberately left 'as-is' to highlight the eager seizing (aliasing or usurpation) of unclaimed representation spaces by vowels that are strongly established from the phonological point of view. This behaviour may explain the difficulties that speakers with reduced vowel representation spaces have in recognizing much richer vowel systems of foreign origin. The utility of these results is to be found in automatic phonetic labeling of the speech trace for speech spotting, as well as in the detection of the speaker's identity [14], where stable characteristic vowel segments are sought for contrastive similarity tests.

Acknowledgements

This work is being funded by grants TEC2009-14123-C04-03 from Plan Nacional de I+D+i, Ministry of Science and Technology of Spain, and CCG06-UPM/TIC-0028 from CAM/UPM.

References

1. Acero, A.: New Machine Learning Approaches to Speech Recognition. In: FALA 2010, Vigo, Spain, November 10-12 (2010); ISBN: 978-84-8158-510-0

2. http://www.arts.gla.ac.uk/IPA/ipachart.html

3. Barbour, D.L., Wang, X.: Temporal Coherence Sensitivity in Auditory Cortex. J. Neurophysiol. 88, 2684-2699 (2002)

4. Gómez, P., Ferrández, J.M., Rodellar, V., Fernández, R.: Time-frequency Representations in Speech Perception. Neurocomputing 72, 820-830 (2009)

5. Gómez, P., Ferrández, J.M., Rodellar, V., Alvarez, A., Mazaira, L.M., Olalla, R., Muñoz, C.: Neuromorphic detection of speech dynamics. Neurocomputing 74(8), 1191-1202 (2011)

6. Greenberg, S., Ainsworth, W.H.: Speech processing in the auditory system: an overview. In: Greenberg, W.A.S. (ed.) Speech Processing in the Auditory System, pp. 1-62. Springer, New York (2004)

7. Hebb, D.O.: The Organization of Behavior. Wiley, New York (1949)

8. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing. Prentice-Hall, Upper Saddle River (2001)

9. Jahne, B.: Digital Image Processing. Springer, Berlin (2005)

10. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1997)

11. Munkong, R., Juang, B.H.: Auditory Perception and Cognition. IEEE Signal Proc. Magazine, 98-117 (May 2008)

12. O'Shaughnessy, D.: Speech Communication. Human and Machine. Addison-Wesley, Reading (2000)

13. Palmer, A., Shamma, S.: Physiological Representation of Speech. In: Greenberg, S., Ainsworth, W., Popper, A. (eds.), pp. 163-230. Springer, New York (2004)

14. Rose, P., Kinoshita, Y., Alderman, T.: Realistic Extrinsic Forensic Speaker Discrimination with the Diphthong /aɪ/. In: Proc. 11th Austr. Int. Conf. on Speech Sci. and Tech., pp. 329-334 (December 2006)

15. Shamma, S.: Physiological foundations of temporal integration in the perception of speech. J. Phonetics 31, 495-501 (2003)
