Analysis of Stressed Speech on Teager Energy Operator (TEO)

Bhagyalaxmi Jena1 and Sudhansu Sekhar Singh2

1Electronics and Communication Engineering, Silicon Institute of Technology, Bhubaneswar

2School of Electronics Engineering, KIIT University, Bhubaneswar

January 11, 2018

1 INTRODUCTION

In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal.

In a speech communication system, the speech signal is transmitted, stored, and processed in many ways. The representation of the speech signal must be such that the information content can easily be extracted by human listeners or automatically by machine.

2 STRESS

Stress can be defined as any condition that causes a speaker to vary speech production from neutral conditions. If a speaker is in a quiet room with no task obligations, then the speech produced is considered neutral. With this definition, two stress effect areas emerge: perceptual and physiological.


International Journal of Pure and Applied Mathematics, Volume 118, No. 16, 2018, 667-680. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.


Perceptually induced stress results when a speaker perceives his or her environment to be different from normal, such that the intention to produce speech varies from neutral conditions. The causes of perceptually induced stress include emotion, environmental noise (i.e., the Lombard effect), and actual task workload (e.g., a pilot in an aircraft cockpit).

Physiologically induced stress is the result of a physical impact on the human body that results in deviations from neutral speech production despite intentions. Causes of physiological stress can include vibration, G-force, and drug interactions.

2.1 STRESSED SPEECH

Stress is a psychological state that is a response to a threat or a demanded task and is normally accompanied by specific emotions (fear, anger, sorrow, etc.) [1]. Stress may be induced by external factors (workload, noise, vibration, sleep loss, etc.) and also by internal factors (such as emotion or fatigue state). Changes in emotion can affect the behavior of speech involuntarily. Thus, stressed speech can be defined as any deviation in speech with respect to the neutral style. This deviation can be in the form of speaking style, selection and usage of words, duration of sentence, etc.

In the linguistic sense, stress is the relative emphasis that may be given to certain syllables in a word, or to certain words in a phrase or sentence. Stress is typically signaled by such properties as increased loudness and vowel length, full articulation of the vowel, and changes in pitch. The terms stress and accent are often used synonymously, but they are sometimes distinguished, with certain specific kinds of prominence (such as pitch accent, variously defined) being considered to fall under accent but not under stress.

In this case, stress specifically may be called stress accent or dynamic accent.

A study group on speech and language technology recently completed a three-year project on the effect of stress on speech production and system performance. For this purpose various speech databases were collected. A definition of various states of stress and the corresponding type of stressor is proposed [2]. Results are reported from analysis and assessment studies performed with the databases collected for this project. The primary goal of the study reported here was to identify the effect of various types of stress on the effectiveness of communication.

3 ANALYTICAL MODEL

There are different algorithms that can be used to observe the difference in patterns in different domains. Some of these algorithms are the HMM algorithm, the LPC algorithm, and the TEO algorithm.

3.1 Hidden Markov Model (HMM)

In its discrete form, a hidden Markov process can be visualized as a generalization of the urn problem with replacement (where each item from the urn is returned to the original urn before the next step). Consider this example: in a room that is not visible to an observer there is a genie. The room contains urns X1, X2, X3, each of which contains a known mix of balls, each ball labeled y1, y2, y3, ... The genie chooses an urn in that room and randomly draws a ball from that urn. It then puts the ball onto a conveyor belt, where the observer can observe the sequence of the balls but not the sequence of urns from which they were drawn. The genie has some procedure to choose urns; the choice of the urn for the n-th ball depends only upon a random number and the choice of the urn for the (n − 1)-th ball. The choice of urn does not directly depend on the urns chosen before this single previous urn; therefore, this is called a Markov process. It can be described by the upper part of Fig. 3.1(a).

The Markov process itself cannot be observed; only the sequence of labeled balls can. This arrangement is therefore called a "hidden Markov process". This is illustrated by the lower part of the diagram, where one can see that balls y1, y2, y3, y4 can be drawn at each state. Even if the observer knows the composition of the urns and has just observed a sequence of three balls, e.g. y1, y2 and y3 on the conveyor belt, the observer still cannot be sure which urn (i.e., at which state) the genie has drawn the third ball from. However, the observer can work out other information, such as the likelihood that the third ball came from each of the urns.
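The likelihood mentioned above can be computed with the forward recursion of a hidden Markov model. The sketch below is only an illustration: the start, transition, and emission probabilities are hypothetical numbers chosen for the example, and the function name `forward_posteriors` is not taken from the paper.

```python
import numpy as np

def forward_posteriors(start, trans, emit, observations):
    """Return P(urn at step t | balls seen up to step t) for every step t."""
    # Weight the initial urn probabilities by how likely each urn is to
    # produce the first observed ball, then normalize.
    alpha = start * emit[:, observations[0]]
    alpha = alpha / alpha.sum()
    posteriors = [alpha]
    for ball in observations[1:]:
        # Markov step: the next urn depends only on the previous urn.
        alpha = (alpha @ trans) * emit[:, ball]
        alpha = alpha / alpha.sum()
        posteriors.append(alpha)
    return np.array(posteriors)

# Hypothetical model: 3 urns (hidden states), 4 ball labels y1..y4.
start = np.array([1 / 3, 1 / 3, 1 / 3])
trans = np.array([[0.6, 0.3, 0.1],      # row i: P(next urn | current urn i)
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
emit = np.array([[0.7, 0.1, 0.1, 0.1],  # row i: P(ball label | urn i)
                 [0.1, 0.6, 0.2, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])

observed = [0, 1, 2]  # the balls y1, y2, y3 seen on the conveyor belt
posteriors = forward_posteriors(start, trans, emit, observed)
print(posteriors[-1])  # likelihood of each urn for the third ball
```

The last row of `posteriors` gives, for each urn, the probability that the third ball came from it given the three observed balls.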


Fig. 3.1(a) - Hidden Markov Process

3.2 Linear Predictive Coding (LPC)

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model. It is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good-quality speech at a low bit rate, and it provides extremely accurate estimates of speech parameters.

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), with occasional added hissing and popping sounds (sibilants and plosive sounds). Although apparently crude, this model is actually a close approximation of the reality of speech production. The glottis (the space between the vocal folds) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, which give rise to formants, or enhanced frequency bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.

The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.

Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames; generally, 30 to 50 frames per second give intelligible speech with good compression.
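The analysis and synthesis steps described above can be sketched for a single frame as follows. This is a minimal illustration and not the implementation used in the paper: it assumes the autocorrelation method with the Levinson-Durbin recursion, and the function names, the predictor order of 2, and the synthetic test signal are all hypothetical choices.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.

    Returns inverse-filter coefficients a (a[0] == 1), so that the residue
    is e(n) = sum_j a[j] * x(n - j), and the final prediction error.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a, err

def inverse_filter(frame, a):
    """Remove the modeled resonances; what is left is the residue."""
    return np.convolve(frame, a)[:len(frame)]

def synthesize(residue, a):
    """Run the residue back through the all-pole (tube) filter."""
    order = len(a) - 1
    out = np.zeros(len(residue))
    for n in range(len(residue)):
        past = sum(a[j] * out[n - j] for j in range(1, order + 1) if n - j >= 0)
        out[n] = residue[n] - past
    return out

# Hypothetical test signal: white noise shaped by a known two-pole filter.
rng = np.random.default_rng(0)
signal = synthesize(rng.standard_normal(4000), np.array([1.0, -1.3, 0.6]))

a_hat, _ = lpc_autocorrelation(signal, order=2)
residue = inverse_filter(signal, a_hat)
rebuilt = synthesize(residue, a_hat)
print(a_hat)  # should be close to the filter used to shape the noise
```

Because the residue is kept in full here, synthesis reproduces the frame; a low bit-rate coder would instead transmit a compact description of the residue.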

3.3 Teager Energy Operator (TEO)

The Teager Energy Operator, which provides a measure of the energy of a speech signal, was motivated by experiments in speech and hearing by Teager and Teager (1980, 1981, 1983, and 1990). In these experiments Teager demonstrated that the airflow in the vocal tract is separated and adheres to the walls of the vocal tract. Given these observations, the geometry of the vocal tract, and the results of some experiments with whistle cavities, Teager proposed a model of speech production. In this model, air exits the glottis as a jet and attaches to the nearest wall of the vocal tract. As the air passes over the cavity between the true vocal folds and the false vocal folds, vortices of air are created. The bulk of the air continues propagating towards the lips while adhering to the walls of the vocal tract.

Just like the normal energy operator, the Teager energy operator (TEO) is also used to calculate the energy of a signal.

The TEO for a continuous time signal is defined as:

$$\psi(x(t)) = \left(\frac{d}{dt}x(t)\right)^{2} - x(t)\left(\frac{d^{2}}{dt^{2}}x(t)\right)$$

The TEO for a discrete-time signal is defined as:

$$\psi(x(n)) = x(n)^{2} - x(n+1)\,x(n-1)$$


where x(n) is the sampled speech signal and ψ is the Teager energy operator.

The Teager energy operator is a nonlinear operator which was introduced to calculate instantaneous energy with improved signal-to-noise ratio (SNR). There is always some noise associated with the recording and processing equipment. With TEO, the noisy part present in the signal is suppressed while calculating its corresponding energy. That is, the noise energy is not taken into consideration. In contrast, the normal squared energy operator takes the input signal along with the noise present and finds the energy.
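The discrete operator x(n)^2 − x(n+1)x(n−1) can be sketched in a few lines. The function name `teager_energy` and the test tone are hypothetical. The example relies on a known property of the operator: for a pure tone A cos(Ωn + φ) the output is the constant A² sin²(Ω), so it depends on both amplitude and frequency.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy: psi(n) = x(n)^2 - x(n+1) * x(n-1).

    The first and last samples have no two-sided neighbours, so the
    output is two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]

# Hypothetical test tone: amplitude 2, digital frequency 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
n = np.arange(200)
tone = amplitude * np.cos(omega * n + 0.5)

psi = teager_energy(tone)
expected = amplitude ** 2 * np.sin(omega) ** 2
print(psi[:3], expected)
```

For speech, the same function would typically be applied frame by frame and summarized, for example by the mean of `psi` over each frame.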

4 WORK APPROACH

To analyze the signal under different conditions, it is required to create a database of speech signals. Sound recording is an electrical or mechanical inscription of sound waves, such as spoken voice, singing, instrumental music, or sound effects. The two main classes of sound recording technology are analog recording and digital recording [3]. Acoustic analog recording is achieved by a small microphone diaphragm that can detect changes in atmospheric pressure (acoustic sound waves) and record them as a graphic representation of the sound waves on a medium such as a phonograph (in which a stylus senses grooves on a record). In magnetic tape recording, the sound waves vibrate the microphone diaphragm and are converted into a varying electric current, which is then converted to a varying magnetic field by an electromagnet, which makes a representation of the sound as magnetized areas on a plastic tape with a magnetic coating on it.

A wide range of speech databases is available for the development of speech synthesis and recognition and for linguistic research. We have created a database of 10 male and 10 female speakers aged 20-25 years who were subjected to exam stress. Their speech was recorded just before the examination and an hour after the examination. Because the pattern of speech changes with the content of the utterance, the same phrase, "The weather is too hot today", was used throughout to make the analysis precise. The complete database was observed and the change in the pattern of speech was studied relative to normal speech.

The created database is processed with different algorithms, with which we estimate different parameters of a signal in different domains, i.e. the time domain and the frequency domain.

WORK DONE

Stress is an extremely important factor in speech perception. Stressed syllables are generally the best-articulated syllables in each word. Therefore, stressed syllables provide islands of sound reliability in the normal blur of speech.

Vowels are usually longer and louder in stressed syllables. More importantly, they tend to keep their full vowel value. By contrast, the vowels in unstressed syllables (reduced syllables at fast speaking rates) all tend to move towards a neutral or central vowel sound, like the schwa sound in "about". So, as voice technology continues to mature, it becomes important to understand how stress and emotion influence speech production in actual environments.

5 ANALYSIS OF SPEECH

5.1 WINDOW FUNCTION

A window function is a mathematical tool that limits the input signal. That is, it passes only a defined interval of the input signal while suppressing the signal outside that interval. Thus, we can say that a window function is a kind of time-domain filter which allows only a defined interval of the signal to pass while attenuating the signal falling outside the defined interval.

There are many types of window functions, such as the rectangular, Hamming, Hanning, and Blackman windows.

A rectangular window is defined as:

$$w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where N is the total number of samples of the signal.

The window function used in this paper is the Hamming window, chosen for its spectral efficiency, which is discussed below. The Hamming window is defined as:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (2)$$

where N is the total number of samples of the input signal.

From digital communications, we know that a band-limited signal is not time-limited and a time-limited signal is not band-limited. As a result, if we do not apply any window explicitly, we are implicitly using a rectangular window: the signal is abruptly time-limited, which causes considerable spectral leakage in the frequency domain and hence a loss of information. If we use a Hamming window instead, we compromise on the amplitude of the signal near the frame edges, but we get a better frequency-domain representation of the signal with less spectral leakage [10].
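The leakage argument can be checked numerically. The sketch below is an illustration under stated assumptions: the frame length of 256 samples, the tone placed halfway between two FFT bins (a worst case for leakage), and the helper `leakage_fraction` are all hypothetical choices. It measures how much spectral power falls outside a few bins around the spectral peak.

```python
import numpy as np

N = 256                      # assumed frame length
n = np.arange(N)
# Tone with 10.5 cycles per frame: it does not fit a whole number of periods.
tone = np.cos(2 * np.pi * 10.5 * n / N)

def leakage_fraction(windowed, peak_halfwidth=3):
    """Fraction of spectral power lying outside the bins around the peak."""
    power = np.abs(np.fft.rfft(windowed)) ** 2
    peak = int(np.argmax(power))
    lo = max(peak - peak_halfwidth, 0)
    hi = peak + peak_halfwidth + 1
    return 1.0 - power[lo:hi].sum() / power.sum()

rect_leak = leakage_fraction(tone * np.ones(N))      # no explicit window
hamm_leak = leakage_fraction(tone * np.hamming(N))   # Hamming window
print(rect_leak, hamm_leak)
```

With these settings the Hamming-windowed frame keeps a larger share of its power near the tone frequency than the rectangular one.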

The reduction in signal amplitude caused by the window function can be mitigated by overlapping successive windows. This not only gives a better approximation of the signal but also reduces the spectral leakage. In this paper, we have used window functions with 50% overlap.
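Framing with overlapping windows can be sketched as follows. The frame length of 200 samples and the function name `frame_signal` are assumptions made for the example; only the 50% overlap and the Hamming window come from the text. NumPy's `np.hamming` uses the common form 0.54 − 0.46 cos(2πn/(N − 1)).

```python
import numpy as np

def frame_signal(x, frame_len, overlap=0.5):
    """Split x into Hamming-windowed frames with the given fractional overlap."""
    x = np.asarray(x, dtype=float)
    hop = int(frame_len * (1.0 - overlap))   # step between frame starts
    if hop < 1:
        raise ValueError("overlap too large for this frame length")
    if len(x) < frame_len:
        return np.empty((0, frame_len)), hop
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames, hop

# Hypothetical input: a constant signal, so each frame shows the window shape.
frames, hop = frame_signal(np.ones(1000), frame_len=200, overlap=0.5)
print(frames.shape, hop)
```

With 50% overlap, every sample that one window attenuates near its edge lies close to the center of a neighbouring window, which is why overlapping recovers the amplitude lost at the frame edges.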

Fig. 5.1(a) Windowing


5.2 OUTPUT OF THE TEAGER ENERGY OPERATION (TEO)

A database of 5 men and 5 women was made in order to carry out the operations. Their voices were recorded with a voice recording tool. Pre-emphasis of the signal was done and noise cancellation was performed [12].

A speech signal having a high pitch was chosen to carry out the analysis in order to get clear outputs.
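The paper does not state which pre-emphasis filter was used. A common choice, assumed here purely for illustration, is the first-order difference y(n) = x(n) − αx(n−1) with α = 0.97; the function name and the coefficient are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y(n) = x(n) - alpha * x(n - 1).

    alpha = 0.97 is an assumed, commonly used value; the first sample is
    passed through unchanged because it has no predecessor.
    """
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return x
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

# A slowly varying (constant) input is strongly attenuated, while a rapidly
# alternating input is boosted: the filter emphasizes high frequencies.
slow = pre_emphasis(np.ones(5))
fast = pre_emphasis(np.array([1.0, -1.0, 1.0, -1.0]))
print(slow, fast)
```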

Fig. 5.2(a)- Teager Energy of Normal Speech

Fig. 5.2(b) - Teager Energy of Stressed Speech


Fig. 5.2(c) Teager Energy of Female Speech

Fig. 5.2(d) Teager Energy of Male Speech

Fig. 5.2(a) and Fig. 5.2(b) show the Teager energy of the normal speech signal and the stressed speech signal, respectively. Fig. 5.2(c) and Fig. 5.2(d) compare the normal and stressed speech signals of the 5 female and 5 male speakers, respectively. In both cases, the Teager energy of the stressed speech is higher than that of the normal speech.

6 CONCLUSION

Voice activity detection systems also find use in Artificial Intelligence, which is basically the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.


Cognitive Computing is the simulation of human thought processes in a computerized model. These platforms encompass natural language processing, speech and vision, dialogue and narrative generation.

References

[1] D. A. Cairns and J. H. L. Hansen, Nonlinear analysis and detection of speech under stressed conditions, J. Acoust. Soc. Amer., vol. 96, pp. 3392-3400, 1994.

[2] V. Mohan et al., Analysis & Synthesis of Speech Signal Using Matlab, International Journal of Advancements in Research & Technology, Volume 2, Issue 5, May 2013.

[3] M. Sigmund, Introducing the database ExamStress for speech under stress, Proceedings of 7th IEEE Nordic Signal Processing Symposium (NORSIG 2006). Reykjavik, pp. 290-293, 2006.

[4] T. Johnstone and K. Scherer, The effects of emotions on voice quality, Proceedings of 14th International Congress of Phonetic Science. San Francisco, pp. 2029-2032, 1999.

[5] D. Ververidis and C. Kotropoulos, Emotional speech recognition: Resources, features, and methods, Speech Communication, vol. 48, No. 9, pp. 1162-1181, 2006.

[6] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice-Hall, 1993.

[7] Cowie, R., Cornelius, R.R., 2003. Describing the emotional states that are expressed in speech. Speech Comm. 40 (1), 5-32. Cowie, R., Douglas-Cowie, E., 1996. Automatic statistical. Rep. 236, Univ. of Hamburg.

[8] Flanagan, J.L., 1972. Speech Analysis, Synthesis and Perception. Second ed. Springer-Verlag, NY.

[9] Heuft, B., Portele, T., Rauth, M., 1996. Emotions in time domain synthesis. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP 96), Vol. 3, pp. 1974-1977.


[10] Markel, J.D., Gray, A.H., 1976. Linear Prediction of Speech. Springer-Verlag, NY.

[11] Quatieri, T.F., 2002. Discrete-Time Speech Signal Processing. Prentice-Hall, NJ.

[12] Rahurkar, M., Hansen, J.H.L., 2002. Frequency band analysis for stress detection using a Teager energy operator based feature. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP 02), Vol. 3, pp. 2021-2024.

[13] Steeneken, H.J.M., Hansen, J.H.L., 1999. Speech under stress conditions: overview of the effect of speech production and on system performance. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 99), Phoenix, Vol. 4, pp. 2079-2082.

[14] Womack, B.D., Hansen, J.H.L., 1996. Classification of speech under stress using target driven features. Speech Comm. 20, 131-150.

[15] Zhou, G., Hansen, J.H.L., Kaiser, J.F., 2001. Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Processing 9 (3), 201-216.

[16] Deller, J. R., Hansen, J. H. L., Proakis, J. G., 2000. Discrete-Time Processing of Speech Signals. N.Y.: Wiley.

[17] M. Sigmund, Voice Recognition by Computer. Tectum Verlag, Marburg, 2003.

[18] M. Sigmund and P. Matějka, An environment for automatic speech signal labelling, Proceedings of 28th IASTED International Conference on Applied Informatics. Innsbruck, pp. 298-301, 2002.

[19] A. Nagoor Kani, 2005. Signals & Systems. Tata McGraw Hill Education.

[20] Sanjit K Mitra, 2009. Digital signal processing, A computer base approach, Tata McGraw Hill.


[21] Lawrence R. Rabiner, Ronald W. Schafer, 2003. Digital Processing of Speech Signals. AT&T.

[22] Alan V. Oppenheim, Alan S. Willsky, S. Hamid Nawab, 2005. Signal & Systems. PHI Learning.
