A Fundamental Examination of
Computer Speech Recognition Acoustic Signal Processing
Ebe Helm
ABSTRACT
Speech recognition is arguably one of the longest-evolving of the computer-related sciences. Indeed, many of its advances were both attained and inhibited decades ago, while the mathematics and techniques developed to reach its goals waited for computer processing capability to catch up [1]. This paper illustrates and examines some of the most basic concepts and challenges encountered in acoustic signal processing for computer speech recognition.
Introduction
The intent of this investigation was to make direct observations of acoustic waveforms acted upon by simple algorithms and functions written at the programmatic level. Using a general-purpose programming language (in this case C++) as a platform afforded a more adaptable environment than pre-defined, fixed-application software. The program is used to develop a series of functions grouped into acoustic signal filtering, metrics, rules applied to individual waveforms (word segments), and finally processing of the resultant signatures in both the amplitude and time domains. These functions are also applied to the frequency profiles of the waveforms to compare the acoustic spectrum with the amplitude data. The same test phrase, “One Testing Two Three,” is used in all of the figures that follow. The computer system and integral audio components used were generic; no special equipment, microphones, or other processing was used. The audio is captured in Pulse Code Modulation (PCM) format at 20 kHz/16-bit resolution. The duration of captured audio is constant at two seconds, providing 40,000 samples.
Note: The basic equations discussed below employ the Cyrillic capital letter И, suggested by Valerii Salov in [2], to denote function iteration as perhaps a more aesthetic and effective alternative to pseudocode.
1. Noise Filtering
The first thing apparent when repeating the same phrase into the system was that the waveforms changed significantly with the background noise floor. The predominant cause was observed to be air movement, commonly known as HVAC noise. The ‘noise filter’ equation given below describes a nested loop in which m is the number of iterations performed on an array of i = 40,000 samples. Using a variation of the low-pass filter described next, an inverse waveform is created and added to both the raw and low-pass waveforms. This destructive interference appeared most effective against lower-frequency noise when m is approximately 1.5 times the low-pass iteration count; in this case m = 20, as seen in Fig. 1b below.
noise filter:
И(n = 0→m) [ И(x = 0→i) n0f(x) = ( f(x−n) + f(x) + f(x+n) ) / 3  →  И(x = 0→i) f(x) = f(x) − n0f(x) ]
Fig. 1a: Raw and Low-Pass data lines without noise filtering
Fig. 1b: Noise filtering applied
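The noise filter above can be sketched as follows. This is a minimal illustration, with two assumptions: the iteration variable n runs from 1 to m (a spacing of 0 would make the three-point average equal to the signal itself), and neighbour indices are clamped at the array edges.

```cpp
#include <cstddef>
#include <vector>

// Iterated noise filter sketch: each pass builds a smoothed estimate n0f
// of the low-frequency content at spacing n, then subtracts it from the
// signal -- the 'destructive interference' step described in the text.
std::vector<double> noise_filter(std::vector<double> f, int m) {
    const std::size_t i = f.size();
    for (int n = 1; n <= m; ++n) {
        std::vector<double> n0f(i, 0.0);
        for (std::size_t x = 0; x < i; ++x) {
            // Clamp the neighbours at the array edges.
            std::size_t lo = (x >= static_cast<std::size_t>(n)) ? x - n : 0;
            std::size_t hi = (x + n < i) ? x + n : i - 1;
            n0f[x] = (f[lo] + f[x] + f[hi]) / 3.0;
        }
        for (std::size_t x = 0; x < i; ++x)
            f[x] -= n0f[x];  // subtract the smoothed (noise) estimate
    }
    return f;
}
```

A constant (DC) offset, the lowest frequency of all, is removed in a single pass, since the smoothed estimate of a constant signal equals the signal itself.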
2. Low-Pass Filtering - Glottal Pulse
The human voice is essentially understood to comprise two independent sound sources acting together to produce speech. Exhaled breath, modulated by shaping the mouth, lips, and tongue, produces higher frequencies with somewhat lower energy than the larynx/glottis. The lower frequencies and higher relative energy of the glottal pulse produce a unique signature that may itself provide a distinguishing feature for pattern recognition. The following equation simply repeats the averaging of every three values in the array until the higher frequencies are smoothed out, leaving the lower-frequency sine wave characteristically produced by the glottis. In the example below, the glottal pulse became well defined as m approached 14 iterations, yielding a fundamental frequency (for this test subject) of approximately 130 Hz.
low pass:
И(n = 0→m) [ И(x = 0→i) l0f(x) = ( f(x−n) + f(x) + f(x+n) ) / 3 ]  →  F0 = 1/τ(s), or F0 = 1000/τ(ms)
Fig. 2: Glottal Pulse Sine Wave after Low-Pass filtering
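A sketch of the iterated smoother, following the equation above, with the same two assumptions as before (n taken as the pass number 1..m, edges clamped). The fundamental frequency then follows from the period of the recovered sine wave.

```cpp
#include <cstddef>
#include <vector>

// Repeated three-point averaging: each pass feeds the smoothed result
// back in, progressively removing the higher frequencies until the
// glottal sine wave remains.
std::vector<double> low_pass(std::vector<double> f, int m) {
    const std::size_t i = f.size();
    for (int n = 1; n <= m; ++n) {
        std::vector<double> l0f(i, 0.0);
        for (std::size_t x = 0; x < i; ++x) {
            std::size_t lo = (x >= static_cast<std::size_t>(n)) ? x - n : 0;
            std::size_t hi = (x + n < i) ? x + n : i - 1;
            l0f[x] = (f[lo] + f[x] + f[hi]) / 3.0;
        }
        f = l0f;  // the smoothed pass becomes the input to the next
    }
    return f;
}

// F0 = 1000 / tau(ms): fundamental frequency from the measured period.
double f0_from_period_ms(double tau_ms) { return 1000.0 / tau_ms; }
```

A period of roughly 7.7 ms, for example, corresponds to about 130 Hz, in line with the test subject's measured fundamental.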
3. High-Pass Filtering - Soft Palate Sounds Recovery
Sixteen-bit PCM audio data yields values that can range as much as ±32,767. For display purposes these values were constrained to ±100. Because of the logarithmic nature of audio data, this means that the lower-energy soft palate sounds would almost completely disappear into the noise floor. Recovery of the high-frequency soft palate sounds can, however, be accomplished by applying another filter; most often an array of filters is applied [3]. The high-pass filter illustrated below combines aspects of the previous two filters. One iteration of the low-pass (smoothing) function creates a mask of all the lowest frequencies, which in turn is subtracted from the original raw data to produce a waveform consisting of only the highest frequencies. Fig. 3 below shows a comparison of the raw data with the high-pass result. The signature of the two waveforms is now distinctly different. The high-frequency plot now clearly shows energy where the otherwise missing soft-palate sounds of t and th occur.
hi pass:
И(x = 0→i) h0f(x) = f(x) − ( f(x−1) + f(x) + f(x+1) ) / 3
Fig. 3: Soft palate sounds t and th recovered using High-Pass filtering
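The single-pass high-pass filter can be sketched directly from the equation above (edge samples clamped, an assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// One pass of three-point smoothing gives the low-frequency mask, which
// is subtracted from the raw data, leaving only the highest frequencies.
std::vector<double> high_pass(const std::vector<double>& f) {
    const std::size_t i = f.size();
    std::vector<double> h0f(i, 0.0);
    for (std::size_t x = 0; x < i; ++x) {
        std::size_t lo = (x > 0) ? x - 1 : 0;
        std::size_t hi = (x + 1 < i) ? x + 1 : i - 1;
        h0f[x] = f[x] - (f[lo] + f[x] + f[hi]) / 3.0;
    }
    return h0f;
}
```

A constant signal is fully rejected, while a rapidly alternating one passes through, which is exactly the behaviour that lets the soft-palate energy reappear.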
4. Waveform Profile
From this point forward it becomes more practical to work with data that is quantized into a more manageable form representing the acoustic envelope. This is accomplished by plotting the average of the absolute values of successive groups of samples. In Fig. 4 the full two seconds of audio (40,000 samples) is displayed as 1,000 pixels, a compression of 40:1.
profile:
И(x = 0→i) S0f(x) = Σ(k = x−j → x+j) |f(k)|  →  И(x = 0→i) p0f(x) = S0f(x) / j
Fig. 4: Waveform Profile
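The envelope step can be sketched with a block-averaging variant of the profile equation above (non-overlapping windows of width w are an assumption; the text's 40:1 compression corresponds to w = 40):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Quantize the waveform into an amplitude envelope: each output point is
// the mean of the absolute values of w consecutive samples. With w = 40,
// the 40,000-sample capture collapses to 1,000 pixels.
std::vector<double> profile(const std::vector<double>& f, std::size_t w) {
    std::vector<double> p;
    for (std::size_t x = 0; x + w <= f.size(); x += w) {
        double s = 0.0;
        for (std::size_t k = 0; k < w; ++k)
            s += std::fabs(f[x + k]);
        p.push_back(s / w);
    }
    return p;
}
```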
5. Noise Floor Attenuation
One of the more difficult aspects of processing audio for speech recognition is actually separating a speaker’s voice from the noise floor. Just as the energy level of an individual’s voice can change unpredictably, so can the background noise, even from one word to the next. A linear or constant threshold, even one that changes as the background noise is sampled, proves to be something of a challenge. For the purposes of this exercise the data was attenuated using the natural log as a coefficient. Although somewhat effective, this was still unreliable, often requiring manual adjustment of the number of iterations.
An interesting observation led to the supposition that the attenuation level might be determined not by trying to decide what is noise and what is voice data (amplitude), but rather by data length (time domain). As seen in Fig. 5 below, an un-attenuated waveform 39,412 samples in length is repeatedly attenuated until the longest segment is shorter than what might be expected of the longest spoken word. That is, it may be reasonable to assume that the longest spoken word should not be longer than perhaps 800 ms. If the signal is attenuated until all segments are below that length, the noise may be found to be effectively suppressed, with only the desired data remaining.
Fig. 5: Noise Floor Attenuation
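The length-driven scheme can be sketched as below. The fixed subtraction step is an assumption for illustration (the text attenuated with a natural-log coefficient); the stopping criterion is the one described above: the longest contiguous run of non-zero envelope samples must fall below the longest plausible word, e.g. 800 ms = 16,000 samples at 20 kHz.

```cpp
#include <cstddef>
#include <vector>

// Longest contiguous run of non-zero samples in the envelope.
std::size_t longest_run(const std::vector<double>& e) {
    std::size_t best = 0, cur = 0;
    for (double v : e) {
        cur = (v > 0.0) ? cur + 1 : 0;
        if (cur > best) best = cur;
    }
    return best;
}

// Repeatedly pull the envelope toward zero until no segment is longer
// than max_len samples -- noise runs die first, leaving word segments.
std::vector<double> attenuate_by_length(std::vector<double> e,
                                        std::size_t max_len,
                                        double step) {
    while (longest_run(e) > max_len)
        for (double& v : e)
            v = (v > step) ? v - step : 0.0;  // clamp at the floor
    return e;
}
```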
6. Common Floor Rule
Rules can now be formulated to make decisions on how the data is to be manipulated. The first of these assumes a relationship between overlapping wave segments and groups them accordingly. The Common Floor Rule includes only the floor area shared by all three lines (Fig. 6b). The interesting consideration here is that by not recombining these three waveforms, a greater degree of overall pattern uniqueness might be retained.
Fig. 6a: Separate Raw, Low and High frequencies
Fig. 6b: Common floor rule applied - 𝐹𝑙𝑜𝑜𝑟 = ∀ {𝑅𝑎𝑤 ∧ 𝐿𝑜𝑤 ∧ 𝐻𝑖𝑔ℎ ∶ 𝑓(𝑥) = 0}
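Expressed as a predicate over the three lines, the Common Floor Rule might look like this minimal sketch (exact zero as the floor test is an assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// A sample belongs to the common floor only where the Raw, Low, and High
// lines are all at zero.
std::vector<bool> common_floor(const std::vector<double>& raw,
                               const std::vector<double>& low,
                               const std::vector<double>& high) {
    std::vector<bool> fl(raw.size(), false);
    for (std::size_t x = 0; x < raw.size(); ++x)
        fl[x] = (raw[x] == 0.0 && low[x] == 0.0 && high[x] == 0.0);
    return fl;
}
```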
7. Orphan Segment Rule
The test phrase “One, Testing, Two, Three” is now shown in Fig. 7a below, incorrectly plotted as five word segments. Observation indicated that leading soft palate sounds are frequently isolated, with no corresponding Raw or Low frequency signal present. The next rule applied adds these orphans to the segment that follows. In this case the isolated th completes ree and becomes the 4th word, three, as shown in Fig. 7b.
Two other rules (omitted here for brevity) were applied to remove noise fragments and to separate closely adjoining segments. There is the consideration that too many rules would suggest something wrong with the approach. In this last case, however, separating individual words in ‘connected speech’ extends beyond the reach of simple amplitude signal processing. Recognizing individual words when they are run together in normal speaking has been the quandary in this science for over half a century.
Fig. 7a: Orphan Segment Rule
Fig. 7b: Orphan Segment Rule - 𝐹𝑙𝑜𝑜𝑟 ≠ ∀ {𝑅𝑎𝑤 ∧ 𝐿𝑜𝑤 ∶ 𝑓(𝑥) = 0 , 𝐻𝑖𝑔ℎ: 𝑓(𝑥) ≠ 0}
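The merge step of the Orphan Segment Rule can be sketched as follows. The Segment struct is a hypothetical representation introduced for illustration; a segment flagged as high-frequency-only (no Raw or Low energy) is absorbed into the segment that follows it, as the isolated th completes ree to form three.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical segment record: sample range plus the orphan flag set when
// only the High line carries energy across the segment.
struct Segment {
    std::size_t start, end;
    bool orphan;
};

// Attach each orphan to its following segment by extending that segment's
// start back over the orphan's range.
std::vector<Segment> apply_orphan_rule(const std::vector<Segment>& segs) {
    std::vector<Segment> out;
    for (std::size_t k = 0; k < segs.size(); ++k) {
        if (segs[k].orphan && k + 1 < segs.size()) {
            Segment merged = segs[k + 1];
            merged.start = segs[k].start;  // extend the next segment back
            out.push_back(merged);
            ++k;  // the orphan and its successor are now one segment
        } else {
            out.push_back(segs[k]);
        }
    }
    return out;
}
```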
It is important to note at this point that these illustrations relate only to audio signal processing for the purposes of extracting and distinguishing individual words (or segments) in speech. They are in fact only relevant as a beginning to the process of speech recognition. The term ‘speech recognition’ is itself somewhat ambiguous, even as to what is being recognized. The patterns shown in the figures above certainly contain features unique enough for a ‘voice command system’, where they are taken as a whole phrase.
8. Frequency v. Amplitude
The two techniques used to measure the frequency of the data were counting slope transitions (peak counting) and counting the number of times the signal transitions between positive and negative (zero crossings). Both methods were applied in the program to all three lines: Raw, Low, and High frequencies. Normally, and in current systems, a much broader and more detailed analysis of the audio spectrum is made, employing Mel Frequency Cepstral Coefficients (MFCC) derived from Fast Fourier Transforms (FFT) [4-10]. This significant increase in data challenges even the methods for rendering and displaying the data as two-dimensional graphs, and extends to the use of sonograms and, in some cases, to those who can actually read them [1].
On a more basic level, this program was simply used to observe the relative fidelity between utterances in both amplitude and frequency. It was interesting to see that while amplitude will vary from one utterance to the next, frequency tends to retain fidelity, somewhat analogous to AM versus FM radio. A sampling window of 50 ms is used to create the frequency profiles shown in Fig. 8 below.
Fig. 8: Two test phrase samples compared. Amp (black), Peaks (red), Zeros (blue)
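The zero-crossing measure used for the frequency profiles can be sketched as below. The frequency estimate in the comment is an assumption based on the stated capture parameters: at 20 kHz a 50 ms window holds 1,000 samples, and since a full cycle crosses zero twice, frequency ≈ crossings / (2 × 0.05 s).

```cpp
#include <cstddef>
#include <vector>

// Count sign changes (zero crossings) across a window of samples; the
// same counter can be run per 50 ms window over the Raw, Low, and High
// lines to build their frequency profiles.
std::size_t zero_crossings(const std::vector<double>& f) {
    std::size_t c = 0;
    for (std::size_t x = 1; x < f.size(); ++x)
        if ((f[x - 1] < 0.0) != (f[x] < 0.0))
            ++c;
    return c;
}
```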
9. Time Domain Companding
The most important of the concepts examined here is the one saved for last, and it can best be illustrated after all the previous steps have been taken. It is also one of the most interesting: even with the deliberate intent to speak the same words or phrase repeatedly, it is simply impossible to repeat a phrase to an exact match. True, sometimes very close results are seen, but never a perfect match. In this way, human speech has a quality that is seemingly infinitely elastic. It stretches, not only in length, but within itself from one phoneme to the next. Yet somehow we manage to recognize these sounds with a very high degree of acuity. Certainly cognitive functions play the dominant role; however, the line of demarcation between hearing and then perceiving what is said, even in artificial terms, may perhaps lie in the timing.
The last function applied to the data was to compress or expand (compand) each segment to a constant length and to constrain the positioning of each segment on the time line. Figure 9a below shows a comparison of the raw amplitude profiles of the test phrase spoken twice. The variations in segment length and relative separation on the time line are apparent. On occasion two utterances of the same phrase may appear to match almost exactly, but this is rare. Figure 9b presents the same data after the companding function is applied as described above. The several steps taken to get to this point serve to illustrate both the enigma and the challenge in this science: how to match patterns that match, but never really match? Never quite exactly.
Fig. 9a: Two utterances of the same phrase “One, testing, two, three”
Fig. 9b: The same data from 9a overlaid after time domain companding
For the purposes of this illustration, the companding length values were kept constant at 8000 samples for all word segments, with a separation of 1000 between them. Some variation on this, in a simple phrase recognition system, might prove effective in distinguishing long and short words. This begins to suggest the concepts of speech recognition as evolving from pattern matching a phrase taken as a whole, versus recognizing individual words spoken one at a time, versus the ability to discern the meaning of continuous connected speech: something that has led the evolution of the science away from the amplitude domain and toward looking at speech as a frequency signature.
There is a timeline of over forty years of wonderful explorations in mathematics devoted to solving this enigma. Dynamic Time Warping (DTW) stretches and squeezes the individual segments trying to find a fit. Mel Frequency Cepstral Coefficients (MFCC) and Fast Fourier Transforms (FFT) sample the data in 10 ms frames to calculate and produce an image of the audio spectrum. Hidden Markov Models (HMM) are used on multiple levels to predict not only which word may follow, but which phoneme is most likely next. All of these techniques are applied in digesting speech input to match patterns in the frequency domain at the phonetic level using the audio spectrum, all intended to bring the data to the next level and evolution: that of artificial intelligence and deep learning.
References
[1] Elmer P., Lewenstein M., Musello D., His Master’s (Digital) Voice, Time, 4/1/1985, Vol.
125 Issue 13, p83. 2p
[2] Salov, Valerii, Notation for Iteration of Functions, Iteral, Cornell University Library, 2012
[3] Y.L. Chow, M.O. Dunham, O.A. Kimball, M.A. Krasner, G.F. Kubala, J. Makhoul, P.J.
Price, S. Roucos, and R.M. Schwarz, BYBLOS: The BBN Continuous Speech Recognition
System, IEEE Conference Publications, 1987, vol. 12, pp89-92
[4] S. Levinson, AT&T Bell Laboratories, Murray Hill, New Jersey, Continuous speech
recognition by means of acoustic/ Phonetic classification obtained from a hidden Markov
model, Acoustics, Speech, and Signal Processing, IEEE International Conference on
ICASSP '87.
[5] Mark Gales and Steve Young, The Application of Hidden Markov Models in Speech
Recognition, Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195–304
[6] D.B. Paul, Speech Recognition Using Hidden Markov Models, The Lincoln Laboratory
Journal, Volume 3, Number 1 (1990)
[7] Wayne Ward, Hidden Markov Models In Speech Recognition, Carnegie Mellon University
Pittsburgh, PA
[8] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, Voice Recognition Algorithms
using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW)
Techniques, JOURNAL OF COMPUTING, VOLUME 2, ISSUE 3, MARCH 2010
[9] Lawrence R. Rabiner, Aaron E. Rosenberg, and Stephen E. Levinson, Considerations in
dynamic time warping algorithms for discrete word recognition, The Journal of the
Acoustical Society of America, Volume 63, Issue S1
[10] Eamonn J. Keogh and Michael J. Pazzani, Derivative Dynamic Time Warping,
Proceedings of the 2001 SIAM International Conference on Data Mining