Speech & NLP (Fall 2014): Basic Spectral Analysis & Spoken Word Recognition
Speech & NLP
Basic Spectral Analysis
&
Spoken Word Recognition
Vladimir Kulyukin
Outline
Spectral Analysis
Correlations between Air Pressure, Amplitude, & ZCR
Silence vs Non-Silence
WAV File Format
Spoken Word Recognition
Spectral Analysis
Waveform
[Figure: a waveform; x-axis: time, y-axis: air pressure]
Each point on the timeline is a frame, e.g., a 16-bit value that records air pressure at that point in time
Spectral Features
A waveform can be decomposed into a distribution of different frequencies
A distribution of frequencies is a spectrum
Spectral features are time slices of a waveform that represent it as a spectrum
Air Pressure
The human ear receives as input a complex
series of changes in air pressure (waveform)
Changes in air pressure are caused by air
passing through the glottis, the nostrils, and the
mouth
Some of these changes are speaker-dependent,
others are language-dependent
Frequency & Amplitude
Frequency is the number of times per some
unit of time that a wave cycles (repeats itself)
A typical unit of time is one second
Cycles per second are called Hertz (Hz)
Amplitude is a measure of air pressure
Human Perception of Frequencies
The human ear can perceive sound waves in
the frequency range [20Hz, 20,000Hz]
Frequencies below 20Hz are inaudible and
are called infrasonic
Frequencies above 20,000Hz are also
inaudible and are called ultrasonic
Wave Motion
Waves transport energy and information but not matter
Light waves and radio waves are periodic electromagnetic waves that carry both electric and magnetic energy; sound waves are periodic mechanical waves
When sound waves propagate through a medium, the molecules of the medium collide and vibrate with one another but maintain the same average position
Energy is transported through the medium even though there is no net particle displacement (this is a true wonder of nature, is it not?)
Fourier’s Insight
Every complex wave can be represented as a sum of many simple waves of different frequencies
The Fourier transform is a mathematical method to separate out each frequency component of a wave
It has been experimentally shown that different phones have different spectra, i.e., different distributions of frequencies, which makes their recognition possible (at least, sometimes)
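To make the insight concrete, here is a minimal sketch of a naive discrete Fourier transform that recovers the magnitude of each frequency component (illustrative code, not part of the course materials; the class and method names are made up):

public final class Spectrum {
  // Naive DFT magnitude spectrum: O(N^2), for illustration only.
  // Bin k corresponds to the frequency k * sampleRate / N in Hz.
  public static double[] magnitudeSpectrum(double[] signal) {
    final int n = signal.length;
    double[] magnitudes = new double[n / 2 + 1]; // real input: bins 0..N/2 suffice
    for (int k = 0; k < magnitudes.length; k++) {
      double re = 0.0, im = 0.0;
      for (int t = 0; t < n; t++) {
        double angle = 2.0 * Math.PI * k * t / n;
        re += signal[t] * Math.cos(angle);
        im -= signal[t] * Math.sin(angle);
      }
      magnitudes[k] = Math.sqrt(re * re + im * im);
    }
    return magnitudes;
  }
}

In practice one would use an FFT, which computes the same spectrum in O(N log N) time.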
Pitch, Frequency, Loudness, Amplitude
Pitch & Frequency: pitch is the perceptual
equivalent of frequency
Sounds with higher frequencies are perceived to
have higher pitches
Loudness & Amplitude: loudness is the
perceptual equivalent of amplitude
Sounds with higher amplitudes are perceived as
louder
Sound Wave Interpretation
Humans can both understand and transcribe
sound waves
An important implication of the above statement
is that sound waves must contain sufficient
information to make understanding possible
What exactly is this information that makes
sound wave interpretation possible?
ZCR: Zero Crossing Rate
Zero Crossing Rate (ZCR) is a feature that (quite possibly, but not certainly) describes the information contained in the waveform
ZCR is the number of times, in a given sample, that the amplitude crosses the horizontal line at 0
Amplitude is another feature that (quite possibly, but not certainly) describes the information contained in the waveform
ZCR & Amplitude of Voiced & Unvoiced Speech
           ZCR    Amplitude
Voiced     LOW    HIGH
Unvoiced   HIGH   LOW
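As a concrete illustration, both measures can be computed from a buffer of frames as in the sketch below (a hypothetical helper class; it mirrors, but is not, the ZeroCrossingRate and WavFileManip helpers used in the code later in this lecture):

public final class SilenceFeatures {
  // ZCR: the number of sign changes in the buffer, divided by the
  // buffer's duration in seconds, giving crossings per second.
  public static double zeroCrossingRate(double[] buffer, double seconds) {
    int crossings = 0;
    for (int i = 1; i < buffer.length; i++) {
      if ((buffer[i - 1] >= 0.0) != (buffer[i] >= 0.0)) { crossings++; }
    }
    return crossings / seconds;
  }

  // Average absolute amplitude: the mean of |amplitude| over the buffer.
  public static double averageAbsAmplitude(double[] buffer) {
    double sum = 0.0;
    for (double v : buffer) { sum += Math.abs(v); }
    return sum / buffer.length;
  }
}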
Correlations between Air Pressure, ZCR, & Amplitude
Air Pressure, Amplitude, & ZCR of the Syllable ‘CA’ in ‘Calcium’
[Figure: waveform of ‘CA’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘AL’ in ‘Calcium’
[Figure: waveform of ‘AL’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘CI’ in ‘Calcium’
[Figure: complete air pressure profile of ‘CI’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘UM’ in ‘Calcium’
[Figure: waveform of ‘UM’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of ‘Calcium’
[Figure: waveform of ‘CALCIUM’ with its amplitude and ZCR plots]
Silence vs Non-Silence
Hello, Silence, My Old Friend!
Suppose we have two measures: zero crossing rate and amplitude
Can we use those two measures to separate silence from non-silence?
We can find two thresholds and assume that if a sample of frames (each frame, e.g., a 16-bit representation of amplitude at a specific point in time) has ZCR and average amplitude below those thresholds, then that sample is silence
Detection of Silence & Non-Silence
silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
  read a specific number of frames into buffer;
  compute ZCR and average amplitude of buffer;
  if ( ZCR and average amplitude are below specific thresholds ) {
    add the buffer to silence_buffer;
  } else {
    add the buffer to non_silence_buffer;
  }
}
Silence vs Non-Silence
// Create a buffer of frame_sample_size frames per channel
double[][] buffer = new double[numChannels][frame_sample_size];
int framesRead;
int framesWritten;
long sample_rate = inWavFile.getSampleRate();
// normalizer is the duration of one buffer in seconds, so that
// ZCR comes out in crossings per second
double normalizer = WavFileManip.convertFrameSampleSizeToSeconds((int) sample_rate, frame_sample_size);
int currFrameSampleNum = 0;
double currZCR = 0.0;
double totalAvrgAmp = 0.0;
double currAmp = 0.0;
do {
  framesRead = inWavFile.readFrames(buffer, frame_sample_size);
  currZCR = ZeroCrossingRate.computeZCR01(buffer[channel_num], normalizer);
  currAmp = WavFileManip.computeAvrgAbsAmplitude(buffer[channel_num]);
  if (framesRead > 0) { currFrameSampleNum++; }
  // in silence, both the zero crossing rate and the average amplitude are low
  if (currZCR <= zcr_thresh && currAmp <= amp_thresh) {
    totalAvrgAmp += currAmp;
    // copy the silent frames to the output WAV file and log the stats
    // (outWavFile, bos, and tabbed_output are defined in the surrounding code)
    framesWritten = outWavFile.writeFrames(buffer, framesRead);
    bos.write(tabbed_output.getBytes());
  }
} while (framesRead != 0);
Example: Extracted Silence Waveform of ‘CALCIUM’
[Figure: waveform of the silence extracted from the waveform of ‘CALCIUM’]
Example: Extracted Non-Silence Waveform of ‘CALCIUM’
[Figure: waveform of the non-silence extracted from the waveform of ‘CALCIUM’]
WAV File Format
RIFF: Resource Interchange File Format
RIFF (http://en.wikipedia.org/wiki/Resource_Interchange_File_Format) is a generic file format for storing data in labeled (tagged) chunks
A chunk is a fragment of information that contains a header and a data area
The header contains data parameters: size, type of data, comments, etc.
The data area is a sequence of data fragments that can be interpreted with the header’s information
Common file formats such as WAV and AVI (both RIFF-based), and chunk-structured formats such as PNG, are encoded in terms of chunks
RIFF: Resource Interchange File Format
<WAVE-form> ->
RIFF(‘WAVE’,
  <FMT-CHUNK>         // format
  [<FACT-CHUNK>]      // fact chunk
  [<CUE-CHUNK>]       // cue points
  [<PLAYLIST-CHUNK>]  // playlist
  [<ASSOC-DATA-LIST>] // associated data list
  <WAVE-DATA>)        // data
Format Chunk & Fact Chunk
<FMT-CHUNK>: mandatory; includes the sample encoding, the number of bits per channel, and the sample rate
[<FACT-CHUNK>]: fact chunk; when present, gives the number of samples for a (typically compressed) coding scheme
[<CUE-CHUNK>]: cue chunk; when present, may identify significant sample numbers in the wav file
[<PLAYLIST-CHUNK>]: playlist; when present, allows the samples to be played out of order
[<ASSOC-DATA-LIST>]: associated data list; allows one to associate labels and notes with cues
<WAVE-DATA>: wave data; mandatory, contains the actual samples
Linear Pulse Code Modulation (LPCM)
LPCM is the most common WAV audio format for uncompressed audio
LPCM is used in audio CDs
On an audio CD, LPCM stores 2 channels of audio samples, each channel sampled 44,100 times per second with 16 bits per sample
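These parameters pin down the CD data rate: 44,100 samples/second x 2 channels x 2 bytes per sample = 176,400 bytes per second, i.e., roughly 10 MB per minute of stereo audio.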
Spoken Word Recognition
General Outline
Given a directory of audio files with spoken words, process
each file into a table that maps specific words (or phrases) to
digital signal vectors
These signal vectors can be pre-processed to eliminate
silences
An input audio file is taken and digitized into a digital signal
vector
The input vector is compared against the digital vectors in
the table
Using Non-Silence Arrays
Get the amplitude array from the input waveform
Get the amplitude arrays for each waveform stored in
the dictionary
Use a similarity metric to match the amplitude array of
the input waveform and each waveform stored in the
dictionary
Various similarity metrics can be used: cosine, HMMs,
dynamic time warping, etc.
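As one concrete choice, here is a minimal sketch of cosine similarity between two amplitude arrays (illustrative code; comparing over the common prefix when lengths differ is an assumption of this sketch, not something prescribed above):

public final class Cosine {
  // Cosine similarity of x and y, compared over their common prefix.
  public static double similarity(double[] x, double[] y) {
    int n = Math.min(x.length, y.length);
    double dot = 0.0, normX = 0.0, normY = 0.0;
    for (int i = 0; i < n; i++) {
      dot += x[i] * y[i];
      normX += x[i] * x[i];
      normY += y[i] * y[i];
    }
    if (normX == 0.0 || normY == 0.0) { return 0.0; } // avoid division by zero
    return dot / (Math.sqrt(normX) * Math.sqrt(normY));
  }
}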
Using Non-Silence Segments
Represent each waveform as a sequence of non-silence segments:
waveform = [seg_1, seg_2, seg_3, …, seg_n]
For example, in Java, a waveform can be represented as ArrayList<double[]>
Matching Non-Silence Segments
Let $W^k_n = (s^k_1, \ldots, s^k_n)$ be waveform number $k$ represented as a sequence of $n$ segments. Then

$$\mathrm{match\_segments}(W^1_n, W^2_m) = \sum_{i=1}^{\min\{n,m\}} \mathrm{sim}(s^1_i, s^2_i)$$
Matching Two Waveform Segments X & Y
DTW(X, Y) – Cost of an Optimal Warping Path

$$DTW(X, Y) = \min\{\, c_p(X, Y) \mid p \text{ is a warping path} \,\}$$
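A minimal sketch of this definition via the standard dynamic programming recurrence, assuming |x[i] - y[j]| as the local cost (the cost function and names are assumptions of the sketch):

public final class DTW {
  // Cost of an optimal warping path between x and y, computed over the
  // full (N+1) x (M+1) cost matrix with |x[i] - y[j]| as the local cost.
  public static double cost(double[] x, double[] y) {
    int n = x.length, m = y.length;
    double[][] d = new double[n + 1][m + 1];
    for (double[] row : d) { java.util.Arrays.fill(row, Double.POSITIVE_INFINITY); }
    d[0][0] = 0.0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double local = Math.abs(x[i - 1] - y[j - 1]);
        // extend the cheapest of the three predecessor paths
        d[i][j] = local + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
      }
    }
    return d[n][m];
  }
}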
Matching Non-Silence Segments
$$\mathrm{match\_segments}(W^1_n, W^2_m) = \sum_{i=1}^{\min\{n,m\}} DTW(s^1_i, s^2_i)$$

where, as before, $W^k_n = (s^k_1, \ldots, s^k_n)$ is waveform number $k$ represented as a sequence of $n$ segments.
Optimizations
If we use DTW to compute the similarity between the digital audio input vector and the vectors in the table, it is vital to keep the vectors as short as possible without sacrificing precision
Possible suggestions: decreasing the sampling rate and merging samples into super-features (e.g., Haar coefficients, Fourier coefficients, etc.; see the sketch after this list)
Parallelizing similarity computations
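Here is a minimal sketch of the second suggestion: merging each block of k consecutive samples into one super-feature, using the block mean (the mean is, up to scale, the Haar approximation coefficient; the block size k is a tuning parameter assumed here):

public final class Downsample {
  // Replace each block of k consecutive samples by its mean,
  // shortening the vector by a factor of about k.
  public static double[] blockMeans(double[] signal, int k) {
    int n = (signal.length + k - 1) / k; // ceil(length / k)
    double[] out = new double[n];
    for (int b = 0; b < n; b++) {
      int start = b * k;
      int end = Math.min(start + k, signal.length);
      double sum = 0.0;
      for (int i = start; i < end; i++) { sum += signal[i]; }
      out[b] = sum / (end - start);
    }
    return out;
  }
}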
DTW Matching Window Optimization
The computation of DTW can be optimized so that only the cells within a specific window (e.g., a fixed-width band around the main diagonal, the Sakoe-Chiba band) are considered
Smaller Matrix Optimization
You may have realized by now that if we care only about the total cost of warping sequence X with sequence Y, we do not need to compute the entire N x M cost matrix: we need only two columns
The storage savings are huge, but the running time remains the same, O(N x M)
We can also normalize the DTW cost by N x M to keep it low and comparable across sequences of different lengths
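A sketch of the two-column trick (stored here as two rows), again assuming |x[i] - y[j]| as the local cost:

public final class DTWLowMemory {
  // Same O(N*M) running time as the full-matrix version, but only two
  // rows of the cost matrix are kept, so storage is O(M).
  public static double cost(double[] x, double[] y) {
    int n = x.length, m = y.length;
    double[] prev = new double[m + 1];
    double[] curr = new double[m + 1];
    java.util.Arrays.fill(prev, Double.POSITIVE_INFINITY);
    prev[0] = 0.0; // row 0 of the full matrix
    for (int i = 1; i <= n; i++) {
      java.util.Arrays.fill(curr, Double.POSITIVE_INFINITY);
      for (int j = 1; j <= m; j++) {
        double local = Math.abs(x[i - 1] - y[j - 1]);
        curr[j] = local + Math.min(prev[j - 1], Math.min(prev[j], curr[j - 1]));
      }
      double[] tmp = prev; prev = curr; curr = tmp; // swap rows
    }
    // optionally return prev[m] / ((double) n * m) to normalize, as noted above
    return prev[m];
  }
}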
N-Grams & HMMs
Spoken Word Recognition with N-Grams
General formula: approximate the probability of a sequence $w_1, \ldots, w_N$ by conditioning each item on only the previous $n-1$ items:

$$P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

For example:
$n = 1$ (unigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i)$
$n = 2$ (bigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})$
$n = 3$ (trigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})$
$n = 4$ (4-grams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$
Spoken Word Recognition with Bigrams
Let $f_1$ and $f_2$ be two features extracted from waveforms. Let $C(f)$ be the count of $f$ in an audio corpus, and let $V$ be the dictionary size. The maximum likelihood bigram estimate is

$$P(f_2 \mid f_1) = \frac{C(f_1 f_2)}{C(f_1)}$$

With add-one smoothing over the dictionary of size $V$:

$$P(f_2 \mid f_1) = \frac{C(f_1 f_2) + 1}{C(f_1) + V}$$
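A minimal counting sketch of these estimates, with features represented as strings (all class and method names here are illustrative):

import java.util.HashMap;
import java.util.Map;

public final class BigramModel {
  private final Map<String, Integer> unigrams = new HashMap<>();
  private final Map<String, Integer> bigrams = new HashMap<>();
  private final int vocabSize; // V, the dictionary size

  public BigramModel(int vocabSize) { this.vocabSize = vocabSize; }

  // Count all unigrams and adjacent bigrams in one feature sequence.
  public void observe(String[] features) {
    for (int i = 0; i < features.length; i++) {
      unigrams.merge(features[i], 1, Integer::sum);
      if (i > 0) {
        bigrams.merge(features[i - 1] + " " + features[i], 1, Integer::sum);
      }
    }
  }

  // Add-one smoothed estimate of P(f2 | f1) = (C(f1 f2) + 1) / (C(f1) + V).
  public double prob(String f1, String f2) {
    int c12 = bigrams.getOrDefault(f1 + " " + f2, 0);
    int c1 = unigrams.getOrDefault(f1, 0);
    return (c12 + 1.0) / (c1 + vocabSize);
  }
}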
Spoken Word Recognition with HMMs
Let $y$ be a sequence of measures extracted from a waveform; for example, $y$ can be a sequence of non-silence segments or a sequence of phones extracted from a waveform. The recognized word is

$$\hat{w} = \operatorname*{argmax}_{w \in L} P(w \mid y) = \operatorname*{argmax}_{w \in L} \frac{P(y \mid w)\,P(w)}{P(y)} = \operatorname*{argmax}_{w \in L} P(y \mid w)\,P(w)$$

where $L$ is the lexicon; the denominator $P(y)$ can be dropped because it is the same for every $w$.
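The decoding rule itself is a simple argmax once the likelihoods and priors have been estimated; the sketch below assumes they are supplied as maps of log-probabilities keyed by word (the surrounding scaffolding is illustrative):

import java.util.Map;

public final class BayesDecoder {
  // argmax over w in L of P(y | w) * P(w), computed in log space
  // to avoid underflow; returns null if the lexicon is empty.
  public static String decode(Map<String, Double> logLikelihoods,
                              Map<String, Double> logPriors) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Double> e : logLikelihoods.entrySet()) {
      Double logPrior = logPriors.get(e.getKey());
      if (logPrior == null) { continue; } // word not in the language model
      double score = e.getValue() + logPrior;
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }
}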
References
M. Müller. Information Retrieval for Music and Motion, Ch. 4. Springer. ISBN 978-3-540-74047-6.
R. G. Bachu et al. "Separation of Voiced and Unvoiced Using Zero Crossing Rate and Energy of the Speech Signal." American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.
D. Jurafsky & J. Martin. Speech and Language Processing. Prentice Hall.