A Fundamental Examination of
Computer Speech Recognition Acoustic Signal Processing
Ebe Helm
ABSTRACT
Speech recognition is arguably one of the longest-evolving of the computer-related sciences. Indeed, many of its advances were both attained and inhibited decades ago, while the mathematics and techniques developed to reach its goals waited for computer processing capability to catch up [1]. This paper illustrates and examines some of the most basic concepts and challenges encountered in acoustic signal processing for computer speech recognition.
Introduction
The intent of this investigation was to make direct observations of acoustic waveforms acted upon by simple algorithms and functions written at the programmatic level. Using a general-purpose programming language (in this case C++) as a platform afforded a more adaptable environment than pre-defined, fixed-application software. The program is used to develop a series of functions grouped into acoustic signal filtering, metrics, rules applied to individual waveforms (word segments), and finally processing of the resultant signatures in both the amplitude and time domains. These functions are also applied to the frequency profiles of the waveforms to compare the acoustic spectrum with the amplitude data. The same test phrase, “One Testing Two Three,” is used in all of the figures that follow. The computer system and integral audio components used were generic; no special equipment, microphones, or other processing was used. The audio is captured in Pulse Code Modulation (PCM) format at 20 kHz/16-bit resolution. The duration of captured audio is constant at two seconds, providing 40,000 samples.
Note: The basic equations discussed below employ the Cyrillic capital letter И, suggested by Valerii Salov in [2], to denote function iteration as perhaps a more aesthetic and effective alternative to pseudocode.
1. Noise Filtering
The first thing apparent when repeating the same phrase into the system was that the waveforms changed significantly with the background noise floor. The predominant cause was observed to be air movement, commonly known as HVAC noise. The ‘noise filter’ equation given below describes a nested loop in which m is the number of iterations performed on an array of i = 40,000 samples. Using a variation of the low-pass filter described next, an inverse waveform is created and added to both the raw and low-pass waveforms. This destructive interference appeared most effective against lower-frequency noise when m is approximately 1.5 times the low-pass iteration count; in this case m = 20, as seen in Fig. 1b below.
noise filter:
И(n = 0→m) [ И(x = 0→i) n0f(x) = ( f(x−n) + f(x) + f(x+n) ) / 3  →  И(x = 0→i) f(x) = f(x) − n0f(x) ]
Fig. 1a: Raw and Low-Pass data lines without noise filtering
Fig. 1b: Noise filtering applied
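The noise filter above can be sketched as follows. This is a minimal illustration, with two assumptions: the iteration variable n runs from 1 to m (a spacing of 0 would make the three-point average equal to the signal itself), and neighbour indices are clamped at the array edges.

```cpp
#include <cstddef>
#include <vector>

// Iterated noise filter sketch: each pass builds a smoothed estimate n0f
// of the low-frequency content at spacing n, then subtracts it from the
// signal -- the 'destructive interference' step described in the text.
std::vector<double> noise_filter(std::vector<double> f, int m) {
    const std::size_t i = f.size();
    for (int n = 1; n <= m; ++n) {
        std::vector<double> n0f(i, 0.0);
        for (std::size_t x = 0; x < i; ++x) {
            // Clamp the neighbours at the array edges.
            std::size_t lo = (x >= static_cast<std::size_t>(n)) ? x - n : 0;
            std::size_t hi = (x + n < i) ? x + n : i - 1;
            n0f[x] = (f[lo] + f[x] + f[hi]) / 3.0;
        }
        for (std::size_t x = 0; x < i; ++x)
            f[x] -= n0f[x];  // subtract the smoothed (noise) estimate
    }
    return f;
}
```

A constant (DC) offset, the lowest frequency of all, is removed in a single pass, since the smoothed estimate of a constant signal equals the signal itself.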
2. Low-Pass Filtering - Glottal Pulse
The human voice is essentially understood to comprise two independent sound sources acting together to produce speech. Exhaled breath, modulated by shaping the mouth, lips, and tongue, produces higher frequencies with somewhat lower energy than the larynx/glottis. The lower frequencies and higher relative energy of the glottal pulse produce a unique signature that may itself provide a distinguishing feature for pattern recognition. The following equation simply repeats the averaging of every three values in the array until the higher frequencies are smoothed out, leaving the lower-frequency sine wave characteristically produced by the glottis. In the example below, the glottal pulse became well defined as m approached 14 iterations, yielding a fundamental frequency (for this test subject) of approximately 130 Hz.
low pass:
И(n = 0→m) [ И(x = 0→i) l0f(x) = ( f(x−n) + f(x) + f(x+n) ) / 3 ]  →  F0 = 1/τ(s), or F0 = 1000/τ(ms)
Fig. 2: Glottal Pulse Sine Wave after Low-Pass filtering
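A sketch of the iterated smoother, following the equation above, with the same two assumptions as before (n taken as the pass number 1..m, edges clamped). The fundamental frequency then follows from the period of the recovered sine wave.

```cpp
#include <cstddef>
#include <vector>

// Repeated three-point averaging: each pass feeds the smoothed result
// back in, progressively removing the higher frequencies until the
// glottal sine wave remains.
std::vector<double> low_pass(std::vector<double> f, int m) {
    const std::size_t i = f.size();
    for (int n = 1; n <= m; ++n) {
        std::vector<double> l0f(i, 0.0);
        for (std::size_t x = 0; x < i; ++x) {
            std::size_t lo = (x >= static_cast<std::size_t>(n)) ? x - n : 0;
            std::size_t hi = (x + n < i) ? x + n : i - 1;
            l0f[x] = (f[lo] + f[x] + f[hi]) / 3.0;
        }
        f = l0f;  // the smoothed pass becomes the input to the next
    }
    return f;
}

// F0 = 1000 / tau(ms): fundamental frequency from the measured period.
double f0_from_period_ms(double tau_ms) { return 1000.0 / tau_ms; }
```

A period of roughly 7.7 ms, for example, corresponds to about 130 Hz, in line with the test subject's measured fundamental.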
3. High-Pass Filtering - Soft Palate Sounds Recovery
Sixteen-bit PCM audio data yields values that can range as much as ±32,767. For display purposes these values were constrained to ±100. Because of the logarithmic nature of audio data, this means that the lower-energy soft palate sounds would almost completely disappear into the noise floor. Recovery of the high-frequency soft palate sounds can, however, be accomplished by applying another filter; most often an array of filters is applied [3]. The high-pass filter illustrated below combines aspects of the previous two filters. One iteration of the low-pass (smoothing) function creates a mask of all the lowest frequencies, which in turn is subtracted from the original raw data to produce a waveform consisting of only the highest frequencies. Fig. 3 below shows a comparison of the raw data with the high-pass result. The signature of the two waveforms is now distinctly different. The high-frequency plot now clearly shows energy where the otherwise missing soft-palate sounds of t and th occur.
hi pass:
И(x = 0→i) h0f(x) = f(x) − ( f(x−1) + f(x) + f(x+1) ) / 3
Fig. 3: Soft palate sounds t and th recovered using High-Pass filtering
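The single-pass high-pass filter can be sketched directly from the equation above (edge samples clamped, an assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// One pass of three-point smoothing gives the low-frequency mask, which
// is subtracted from the raw data, leaving only the highest frequencies.
std::vector<double> high_pass(const std::vector<double>& f) {
    const std::size_t i = f.size();
    std::vector<double> h0f(i, 0.0);
    for (std::size_t x = 0; x < i; ++x) {
        std::size_t lo = (x > 0) ? x - 1 : 0;
        std::size_t hi = (x + 1 < i) ? x + 1 : i - 1;
        h0f[x] = f[x] - (f[lo] + f[x] + f[hi]) / 3.0;
    }
    return h0f;
}
```

A constant signal is fully rejected, while a rapidly alternating one passes through, which is exactly the behaviour that lets the soft-palate energy reappear.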
4. Waveform Profile
From this point forward it becomes more practical to work with data that is quantized into a more manageable form representing the acoustic envelope. This is accomplished by plotting the average of the absolute values of successive groups of samples. In Fig. 4 the full two seconds of audio (40,000 samples) is displayed as 1,000 pixels, a compression of 40:1.
profile:
И(x = 0→i) S0f(x) = Σ(k = x−j → x+j) |f(k)|  →  И(x = 0→i) p0f(x) = S0f(x) / j
Fig. 4: Waveform Profile
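The envelope step can be sketched with a block-averaging variant of the profile equation above (non-overlapping windows of width w are an assumption; the text's 40:1 compression corresponds to w = 40):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Quantize the waveform into an amplitude envelope: each output point is
// the mean of the absolute values of w consecutive samples. With w = 40,
// the 40,000-sample capture collapses to 1,000 pixels.
std::vector<double> profile(const std::vector<double>& f, std::size_t w) {
    std::vector<double> p;
    for (std::size_t x = 0; x + w <= f.size(); x += w) {
        double s = 0.0;
        for (std::size_t k = 0; k < w; ++k)
            s += std::fabs(f[x + k]);
        p.push_back(s / w);
    }
    return p;
}
```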
5. Noise Floor Attenuation
One of the more difficult aspects of processing audio for speech recognition is actually separating a speaker’s voice from the noise floor. Just as the energy level of an individual’s voice can change unpredictably, so can the background noise, even from one word to the next. A linear or constant threshold, even one that changes as the background noise is sampled, proves to be something of a challenge. For the purposes of this exercise the data was attenuated using the natural log as a coefficient. Although somewhat effective, this was still unreliable, often requiring manual adjustment of the number of iterations.
An interesting observation led to the supposition that the attenuation level might be determined not by trying to decide what is noise and what is voice data (amplitude), but rather by data length (time domain). As seen in Fig. 5 below, an un-attenuated waveform 39,412 samples in length is repeatedly attenuated until the longest segment is shorter than what might be expected of the longest spoken word. That is, it may be reasonable to assume that the longest spoken word should not be longer than perhaps 800 ms. If the signal is attenuated until all segments are below that length, the noise may be found to be effectively suppressed, with only the desired data remaining.
Fig. 5: Noise Floor Attenuation
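The length-driven scheme can be sketched as below. The fixed subtraction step is an assumption for illustration (the text attenuated with a natural-log coefficient); the stopping criterion is the one described above: the longest contiguous run of non-zero envelope samples must fall below the longest plausible word, e.g. 800 ms = 16,000 samples at 20 kHz.

```cpp
#include <cstddef>
#include <vector>

// Longest contiguous run of non-zero samples in the envelope.
std::size_t longest_run(const std::vector<double>& e) {
    std::size_t best = 0, cur = 0;
    for (double v : e) {
        cur = (v > 0.0) ? cur + 1 : 0;
        if (cur > best) best = cur;
    }
    return best;
}

// Repeatedly pull the envelope toward zero until no segment is longer
// than max_len samples -- noise runs die first, leaving word segments.
std::vector<double> attenuate_by_length(std::vector<double> e,
                                        std::size_t max_len,
                                        double step) {
    while (longest_run(e) > max_len)
        for (double& v : e)
            v = (v > step) ? v - step : 0.0;  // clamp at the floor
    return e;
}
```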
6. Common Floor Rule
Rules can now be formulated to make decisions on how the data is to be manipulated. The first of these assumes a relationship between overlapping wave segments and groups them accordingly. The Common Floor Rule includes only the floor area shared by all three lines (Fig. 6b). The interesting consideration here is that by not recombining these three waveforms, a greater degree of overall pattern uniqueness might be retained.
Fig. 6a: Separate Raw, Low and High frequencies
Fig. 6b: Common floor rule applied - 𝐹𝑙𝑜𝑜𝑟 = ∀ {𝑅𝑎𝑤 ∧ 𝐿𝑜𝑤 ∧ 𝐻𝑖𝑔ℎ ∶ 𝑓(𝑥) = 0}
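Expressed as a predicate over the three lines, the Common Floor Rule might look like this minimal sketch (exact zero as the floor test is an assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// A sample belongs to the common floor only where the Raw, Low, and High
// lines are all at zero.
std::vector<bool> common_floor(const std::vector<double>& raw,
                               const std::vector<double>& low,
                               const std::vector<double>& high) {
    std::vector<bool> fl(raw.size(), false);
    for (std::size_t x = 0; x < raw.size(); ++x)
        fl[x] = (raw[x] == 0.0 && low[x] == 0.0 && high[x] == 0.0);
    return fl;
}
```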
7. Orphan Segment Rule
The test phrase “One, Testing, Two, Three” is now shown in Fig. 7a below, incorrectly plotted as five word segments. Observation indicated that leading soft palate sounds are frequently isolated, with no corresponding Raw or Low frequency signal present. The next rule applied adds these orphans to the segment that follows. In this case the isolated th completes ree and becomes the 4th word, three, as shown in Fig. 7b.
Two other rules (omitted here for brevity) were applied to remove noise fragments and to separate closely adjoining segments. There is the consideration that too many rules would suggest something wrong with the approach. In this last case, however, separating individual words in ‘connected speech’ extends beyond the reach of simple amplitude signal processing. Recognizing individual words when they are run together in normal speaking has been the quandary in this science for over half a century.
Fig. 7a: Orphan Segment Rule
Fig. 7b: Orphan Segment Rule - 𝐹𝑙𝑜𝑜𝑟 ≠ ∀ {𝑅𝑎𝑤 ∧ 𝐿𝑜𝑤 ∶ 𝑓(𝑥) = 0 , 𝐻𝑖𝑔ℎ: 𝑓(𝑥) ≠ 0}
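The merge step of the Orphan Segment Rule can be sketched as follows. The Segment struct is a hypothetical representation introduced for illustration; a segment flagged as high-frequency-only (no Raw or Low energy) is absorbed into the segment that follows it, as the isolated th completes ree to form three.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical segment record: sample range plus the orphan flag set when
// only the High line carries energy across the segment.
struct Segment {
    std::size_t start, end;
    bool orphan;
};

// Attach each orphan to its following segment by extending that segment's
// start back over the orphan's range.
std::vector<Segment> apply_orphan_rule(const std::vector<Segment>& segs) {
    std::vector<Segment> out;
    for (std::size_t k = 0; k < segs.size(); ++k) {
        if (segs[k].orphan && k + 1 < segs.size()) {
            Segment merged = segs[k + 1];
            merged.start = segs[k].start;  // extend the next segment back
            out.push_back(merged);
            ++k;  // the orphan and its successor are now one segment
        } else {
            out.push_back(segs[k]);
        }
    }
    return out;
}
```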
It is important to note at this point that these illustrations relate only to audio signal processing for the purposes of extracting and distinguishing individual words (or segments) in speech. They are in fact only relevant as a beginning to the process of speech recognition. The term ‘speech recognition’ is itself somewhat ambiguous, even as to what is being recognized. The patterns shown in the figures above certainly contain features unique enough for a ‘voice command system’, where they are taken as a whole phrase.
8. Frequency v. Amplitude
The two techniques used to measure the frequency of the data were counting slope transitions (peak counting) and counting the number of times the signal transitions between positive and negative (zero crossings). Both methods were applied in the program to all three lines: Raw, Low, and High frequencies. Normally, and in current systems, a much broader and more detailed analysis of the audio spectrum is made, employing Mel Frequency Cepstral Coefficients (MFCC) derived from Fast Fourier Transforms (FFT) [4-10]. This significant increase in data challenges even the methods for rendering and displaying the data as two-dimensional graphs, and extends to the use of sonograms and, in some cases, to those who can actually read them [1].
On a more basic level, this program was simply used to observe the relative fidelity between utterances in both amplitude and frequency. It was interesting to see that while amplitude will vary from one utterance to the next, frequency tends to retain fidelity, somewhat analogous to AM versus FM radio. A sampling window of 50 ms is used to create the frequency profiles shown in Fig. 8 below.
Fig. 8: Two test phrase samples compared. Amp (black), Peaks (red), Zeros (blue)
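The zero-crossing measure used for the frequency profiles can be sketched as below. The frequency estimate in the comment is an assumption based on the stated capture parameters: at 20 kHz a 50 ms window holds 1,000 samples, and since a full cycle crosses zero twice, frequency ≈ crossings / (2 × 0.05 s).

```cpp
#include <cstddef>
#include <vector>

// Count sign changes (zero crossings) across a window of samples; the
// same counter can be run per 50 ms window over the Raw, Low, and High
// lines to build their frequency profiles.
std::size_t zero_crossings(const std::vector<double>& f) {
    std::size_t c = 0;
    for (std::size_t x = 1; x < f.size(); ++x)
        if ((f[x - 1] < 0.0) != (f[x] < 0.0))
            ++c;
    return c;
}
```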
9. Time Domain Companding
The most important of the concepts examined here is the one saved for last, and it can best be illustrated after all the previous steps have been taken. It is also one of the most interesting: even with the deliberate intent to speak the same words or phrase repeatedly, it is simply impossible to repeat a phrase to an exact match. True, sometimes very close results are seen, but never a perfect match. In this way, human speech has a quality that is seemingly infinitely elastic. It stretches, not only in length, but within itself from one phoneme to the next. Yet somehow we manage to recognize these sounds with a very high degree of acuity. Certainly cognitive functions play the dominant role; however, the line of demarcation between hearing and then perceiving what is said, even in artificial terms, may perhaps lie in the timing.
The last function applied to the data was to compress or expand (compand) each segment to a constant length and to constrain the positioning of each segment on the time line. Figure 9a below shows a comparison of the raw amplitude profiles of the test phrase spoken twice. The variations in segment length and relative separation on the time line are apparent. On occasion two utterances of the same phrase may appear to match almost exactly, but this is rare. Figure 9b presents the same data after the companding function is applied as described above. The several steps taken to get to this point serve to illustrate both the enigma and the challenge in this science: how to match patterns that match, but never really match? Never quite exactly.
Fig. 9a: Two utterances of the same phrase “One, testing, two, three”
Fig. 9b: The same data from 9a overlaid after time domain companding
For the purposes of this illustration, the companding length values were kept constant at 8000 samples for all word segments, with a separation of 1000 between them. Some variation on this, in a simple phrase recognition system, might prove effective in distinguishing long and short words. This begins to suggest the concepts of speech recognition as evolving from pattern matching a phrase taken as a whole, versus recognizing individual words spoken one at a time, versus the ability to discern the meaning of continuous connected speech: something that has led the evolution of the science away from the amplitude domain and toward looking at speech as a frequency signature.
There is a timeline of over forty years of wonderful explorations in mathematics devoted to solving this enigma. Dynamic Time Warping (DTW) stretches and squeezes the individual segments trying to find a fit. Mel Frequency Cepstral Coefficients (MFCC) and Fast Fourier Transforms (FFT) sample the data in 10 ms frames to calculate and produce an image of the audio spectrum. Hidden Markov Models (HMM) are used on multiple levels to predict not only which word may follow, but which phoneme is most likely next. All of these techniques are applied in digesting speech input to match patterns in the frequency domain at the phonetic level using the audio spectrum, all intended to bring the data to the next level and evolution: that of artificial intelligence and deep learning.
References
[1] Elmer P., Lewenstein M., Musello D., His Master’s (Digital) Voice, Time, 4/1/1985, Vol.
125 Issue 13, p83. 2p
[2] Salov, Valerii, Notation for Iteration of Functions, Iteral, Cornell University Library, 2012
[3] Y.L. Chow, M.O. Dunham, O.A. Kimball, M.A. Krasner, G.F. Kubala, J. Makhoul, P.J.
Price, S. Roucos, and R.M. Schwarz, BYBLOS: The BBN Continuous Speech Recognition
System, IEEE Conference Publications, 1987, vol. 12, pp89-92
[4] S. Levinson, AT&T Bell Laboratories, Murray Hill, New Jersey, Continuous speech
recognition by means of acoustic/ Phonetic classification obtained from a hidden Markov
model, Acoustics, Speech, and Signal Processing, IEEE International Conference on
ICASSP '87.
[5] Mark Gales and Steve Young, The Application of Hidden Markov Models in Speech
Recognition, Foundations and Trends in Signal Processing Vol. 1, No. 3 (2007) 195–304
[6] D.B. Paul, Speech Recognition Using Hidden Markov Models, The Lincoln Laboratory
Journal, Volume 3, Number 1 (1990)
[7] Wayne Ward, Hidden Markov Models In Speech Recognition, Carnegie Mellon University
Pittsburgh, PA
[8] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, Voice Recognition Algorithms
using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW)
Techniques, JOURNAL OF COMPUTING, VOLUME 2, ISSUE 3, MARCH 2010
[9] Lawrence R. Rabiner, Aaron E. Rosenberg, and Stephen E. Levinson, Considerations in
dynamic time warping algorithms for discrete word recognition, The Journal of the
Acoustical Society of America, Volume 63, Issue S1
[10] Eamonn J. Keogh and Michael J. Pazzani, Derivative Dynamic Time Warping,
Proceedings of the 2001 SIAM International Conference on Data Mining