Speech & NLP (Fall 2014): Basic Spectral Analysis & Spoken Word Recognition
Speech & NLP
Basic Spectral Analysis
&
Spoken Word Recognition
Vladimir Kulyukin
Outline
Spectral Analysis
Correlations between Air Pressure, Amplitude, & ZCR
Silence vs Non-Silence
WAV File Format
Spoken Word Recognition
Spectral Analysis
Waveform
[Figure: a waveform; x-axis: time, y-axis: air pressure]
Each point on the timeline is a frame, e.g., a 16-bit value that records air pressure at that point in time
Spectral Features
A waveform can be decomposed into a distribution of different frequencies
A distribution of frequencies is a spectrum
Spectral features are time slices of a waveform that represent it as a spectrum
Air Pressure
The human ear receives as input a complex
series of changes in air pressure (waveform)
Changes in air pressure are caused by air
passing through the glottis, the nostrils, and the
mouth
Some of these changes are speaker-dependent,
others are language-dependent
Frequency & Amplitude
Frequency is the number of times per some
unit of time that a wave cycles (repeats itself)
A typical unit of time is one second
Cycles per second are called Hertz (Hz)
Amplitude is a measure of air pressure
Human Perception of Frequencies
The human ear can perceive sound waves in
the frequency range [20Hz, 20,000Hz]
Frequencies below 20Hz are inaudible and
are called infrasonic
Frequencies above 20,000Hz are also
inaudible and are called ultrasonic
Wave Motion
Waves transport energy and information but not matter
Light waves and radio waves are periodic electromagnetic waves that carry both electric and magnetic energy; sound waves are periodic mechanical waves
When sound waves propagate through a medium, the molecules of the medium collide and vibrate with one another but maintain the same average position
Energy is transported through the medium even though there is no net particle displacement (this is a true wonder of nature, is it not?)
Fourier’s Insight
Every complex wave can be represented as a sum of many simple waves of different frequencies
The Fourier transform is a mathematical method to separate out each frequency component of a wave
It has been experimentally shown that different phones have different spectra, i.e., different distributions of frequencies, which makes their recognition possible (at least, sometimes)
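To make the insight concrete, here is a minimal sketch of a naive discrete Fourier transform that recovers the magnitude of each frequency component (illustrative code, not part of the course materials; the class and method names are made up):

public final class Spectrum {
  // Naive DFT magnitude spectrum: O(N^2), for illustration only.
  // Bin k corresponds to the frequency k * sampleRate / N in Hz.
  public static double[] magnitudeSpectrum(double[] signal) {
    final int n = signal.length;
    double[] magnitudes = new double[n / 2 + 1]; // real input: bins 0..N/2 suffice
    for (int k = 0; k < magnitudes.length; k++) {
      double re = 0.0, im = 0.0;
      for (int t = 0; t < n; t++) {
        double angle = 2.0 * Math.PI * k * t / n;
        re += signal[t] * Math.cos(angle);
        im -= signal[t] * Math.sin(angle);
      }
      magnitudes[k] = Math.sqrt(re * re + im * im);
    }
    return magnitudes;
  }
}

In practice one would use an FFT, which computes the same spectrum in O(N log N) time.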
Pitch, Frequency, Loudness, Amplitude
Pitch & Frequency: pitch is the perceptual
equivalent of frequency
Sounds with higher frequencies are perceived to
have higher pitches
Loudness & Amplitude: loudness is the
perceptual equivalent of amplitude
Sounds with higher amplitudes are perceived as
louder
Sound Wave Interpretation
Humans can both understand and transcribe
sound waves
An important implication of the above statement
is that sound waves must contain sufficient
information to make understanding possible
What exactly is this information that makes
sound wave interpretation possible?
ZCR: Zero Crossing Rate
Zero Crossing Rate (ZCR) is a feature that (quite possibly, but not certainly) describes the information contained in the waveform
ZCR is the number of times, in a given sample, that the amplitude crosses the horizontal line at 0
Amplitude is another feature that (quite possibly, but not certainly) describes the information contained in the waveform
ZCR & Amplitude of Voiced & Unvoiced Speech
           ZCR    Amplitude
Voiced     LOW    HIGH
Unvoiced   HIGH   LOW
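As a concrete illustration, both measures can be computed from a buffer of frames as in the sketch below (a hypothetical helper class; it mirrors, but is not, the ZeroCrossingRate and WavFileManip helpers used in the code later in this lecture):

public final class SilenceFeatures {
  // ZCR: the number of sign changes in the buffer, divided by the
  // buffer's duration in seconds, giving crossings per second.
  public static double zeroCrossingRate(double[] buffer, double seconds) {
    int crossings = 0;
    for (int i = 1; i < buffer.length; i++) {
      if ((buffer[i - 1] >= 0.0) != (buffer[i] >= 0.0)) { crossings++; }
    }
    return crossings / seconds;
  }

  // Average absolute amplitude: the mean of |amplitude| over the buffer.
  public static double averageAbsAmplitude(double[] buffer) {
    double sum = 0.0;
    for (double v : buffer) { sum += Math.abs(v); }
    return sum / buffer.length;
  }
}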
Correlations between Air Pressure, ZCR, & Amplitude
Air Pressure, Amplitude, & ZCR of the Syllable ‘CA’ in ‘Calcium’
[Figure: waveform of ‘CA’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘AL’ in ‘Calcium’
[Figure: waveform of ‘AL’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘CI’ in ‘Calcium’
[Figure: complete air pressure profile of ‘CI’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of the Syllable ‘UM’ in ‘Calcium’
[Figure: waveform of ‘UM’ with its amplitude and ZCR plots]
Air Pressure, Amplitude, & ZCR of ‘Calcium’
[Figure: waveform of ‘CALCIUM’ with its amplitude and ZCR plots]
Silence vs Non-Silence
Hello, Silence, My Old Friend!
Suppose we have two measures: zero crossing rate and amplitude
Can we use those two measures to separate silence from non-silence?
We can find two thresholds and assume that if a sample of frames (each frame, e.g., a 16-bit representation of amplitude at a specific point in time) has ZCR and average amplitude below those thresholds, then that sample is silence
Detection of Silence & Non-Silence
silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
  read a specific number of frames into buffer;
  compute ZCR and average amplitude of buffer;
  if ( ZCR and average amplitude are below specific thresholds ) {
    add the buffer to silence_buffer;
  } else {
    add the buffer to non_silence_buffer;
  }
}
Silence vs Non-Silence
// Create a buffer of frame_sample_size frames per channel
double[][] buffer = new double[numChannels][frame_sample_size];
int framesRead;
int framesWritten;
long sample_rate = inWavFile.getSampleRate();
// normalizer is the duration of one buffer in seconds, so that
// ZCR comes out in crossings per second
double normalizer = WavFileManip.convertFrameSampleSizeToSeconds((int) sample_rate, frame_sample_size);
int currFrameSampleNum = 0;
double currZCR = 0.0;
double totalAvrgAmp = 0.0;
double currAmp = 0.0;
do {
  framesRead = inWavFile.readFrames(buffer, frame_sample_size);
  currZCR = ZeroCrossingRate.computeZCR01(buffer[channel_num], normalizer);
  currAmp = WavFileManip.computeAvrgAbsAmplitude(buffer[channel_num]);
  if (framesRead > 0) { currFrameSampleNum++; }
  // in silence, both the zero crossing rate and the average amplitude are low
  if (currZCR <= zcr_thresh && currAmp <= amp_thresh) {
    totalAvrgAmp += currAmp;
    // copy the silent frames to the output WAV file and log the stats
    // (outWavFile, bos, and tabbed_output are defined in the surrounding code)
    framesWritten = outWavFile.writeFrames(buffer, framesRead);
    bos.write(tabbed_output.getBytes());
  }
} while (framesRead != 0);
Example: Extracted Silence Waveform of ‘CALCIUM’
[Figure: waveform of the silence extracted from the waveform of ‘CALCIUM’]
Example: Extracted Non-Silence Waveform of ‘CALCIUM’
[Figure: waveform of the non-silence extracted from the waveform of ‘CALCIUM’]
WAV File Format
RIFF: Resource Interchange File Format
RIFF (http://en.wikipedia.org/wiki/Resource_Interchange_File_Format) is a generic file format for storing data in labeled (tagged) chunks
A chunk is a fragment of information that contains a header and a data area
The header contains data parameters: size, type of data, comments, etc.
The data area is a sequence of data fragments that can be interpreted with the header’s information
Common file formats such as WAV and AVI (both RIFF-based), and chunk-structured formats such as PNG, are encoded in terms of chunks
RIFF: Resource Interchange File Format
<WAVE-form> ->
RIFF(‘WAVE’,
  <FMT-CHUNK>         // format
  [<FACT-CHUNK>]      // fact chunk
  [<CUE-CHUNK>]       // cue points
  [<PLAYLIST-CHUNK>]  // playlist
  [<ASSOC-DATA-LIST>] // associated data list
  <WAVE-DATA>)        // data
Format Chunk & Fact Chunk
<FMT-CHUNK>: mandatory; includes the sample encoding, the number of bits per channel, and the sample rate
[<FACT-CHUNK>]: fact chunk; when present, gives the number of samples for a (typically compressed) coding scheme
[<CUE-CHUNK>]: cue chunk; when present, may identify significant sample numbers in the wav file
[<PLAYLIST-CHUNK>]: playlist; when present, allows the samples to be played out of order
[<ASSOC-DATA-LIST>]: associated data list; allows one to associate labels and notes with cues
<WAVE-DATA>: wave data; mandatory, contains the actual samples
Linear Pulse Code Modulation (LPCM)
LPCM is the most common WAV audio format for uncompressed audio
LPCM is used in audio CDs
On an audio CD, LPCM stores 2 channels of audio samples, each channel sampled 44,100 times per second with 16 bits per sample
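These parameters pin down the CD data rate: 44,100 samples/second x 2 channels x 2 bytes per sample = 176,400 bytes per second, i.e., roughly 10 MB per minute of stereo audio.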
Spoken Word Recognition
General Outline
Given a directory of audio files with spoken words, process
each file into a table that maps specific words (or phrases) to
digital signal vectors
These signal vectors can be pre-processed to eliminate
silences
An input audio file is taken and digitized into a digital signal
vector
The input vector is compared against the digital vectors in
the table
Using Non-Silence Arrays
Get the amplitude array from the input waveform
Get the amplitude arrays for each waveform stored in
the dictionary
Use a similarity metric to match the amplitude array of
the input waveform and each waveform stored in the
dictionary
Various similarity metrics can be used: cosine, HMMs,
dynamic time warping, etc.
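As one concrete choice, here is a minimal sketch of cosine similarity between two amplitude arrays (illustrative code; comparing over the common prefix when lengths differ is an assumption of this sketch, not something prescribed above):

public final class Cosine {
  // Cosine similarity of x and y, compared over their common prefix.
  public static double similarity(double[] x, double[] y) {
    int n = Math.min(x.length, y.length);
    double dot = 0.0, normX = 0.0, normY = 0.0;
    for (int i = 0; i < n; i++) {
      dot += x[i] * y[i];
      normX += x[i] * x[i];
      normY += y[i] * y[i];
    }
    if (normX == 0.0 || normY == 0.0) { return 0.0; } // avoid division by zero
    return dot / (Math.sqrt(normX) * Math.sqrt(normY));
  }
}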
Using Non-Silence Segments
Represent each waveform as a sequence of non-silence segments:
waveform = [seg_1, seg_2, seg_3, …, seg_n]
For example, in Java, a waveform can be represented as ArrayList<double[]>
Matching Non-Silence Segments
Let $W^k_n = (s^k_1, \ldots, s^k_n)$ be waveform number $k$ represented as a sequence of $n$ segments. Then

$$\mathrm{match\_segments}(W^1_n, W^2_m) = \sum_{i=1}^{\min\{n,m\}} \mathrm{sim}(s^1_i, s^2_i)$$
Matching Two Waveform Segments X & Y
DTW(X, Y) – Cost of an Optimal Warping Path

$$DTW(X, Y) = \min\{\, c_p(X, Y) \mid p \text{ is a warping path} \,\}$$
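A minimal sketch of this definition via the standard dynamic programming recurrence, assuming |x[i] - y[j]| as the local cost (the cost function and names are assumptions of the sketch):

public final class DTW {
  // Cost of an optimal warping path between x and y, computed over the
  // full (N+1) x (M+1) cost matrix with |x[i] - y[j]| as the local cost.
  public static double cost(double[] x, double[] y) {
    int n = x.length, m = y.length;
    double[][] d = new double[n + 1][m + 1];
    for (double[] row : d) { java.util.Arrays.fill(row, Double.POSITIVE_INFINITY); }
    d[0][0] = 0.0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double local = Math.abs(x[i - 1] - y[j - 1]);
        // extend the cheapest of the three predecessor paths
        d[i][j] = local + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
      }
    }
    return d[n][m];
  }
}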
Matching Non-Silence Segments
$$\mathrm{match\_segments}(W^1_n, W^2_m) = \sum_{i=1}^{\min\{n,m\}} DTW(s^1_i, s^2_i)$$

where, as before, $W^k_n = (s^k_1, \ldots, s^k_n)$ is waveform number $k$ represented as a sequence of $n$ segments.
Optimizations
If we use DTW to compute the similarity between the digital audio input vector and the vectors in the table, it is vital to keep the vectors as short as possible without sacrificing precision
Possible suggestions: decreasing the sampling rate and merging samples into super-features (e.g., Haar coefficients, Fourier coefficients, etc.; see the sketch after this list)
Parallelizing similarity computations
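Here is a minimal sketch of the second suggestion: merging each block of k consecutive samples into one super-feature, using the block mean (the mean is, up to scale, the Haar approximation coefficient; the block size k is a tuning parameter assumed here):

public final class Downsample {
  // Replace each block of k consecutive samples by its mean,
  // shortening the vector by a factor of about k.
  public static double[] blockMeans(double[] signal, int k) {
    int n = (signal.length + k - 1) / k; // ceil(length / k)
    double[] out = new double[n];
    for (int b = 0; b < n; b++) {
      int start = b * k;
      int end = Math.min(start + k, signal.length);
      double sum = 0.0;
      for (int i = start; i < end; i++) { sum += signal[i]; }
      out[b] = sum / (end - start);
    }
    return out;
  }
}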
DTW Matching Window Optimization
The computation of DTW can be optimized so that only the cells within a specific window (e.g., a fixed-width band around the main diagonal, the Sakoe-Chiba band) are considered
Smaller Matrix Optimization
You may have realized by now that if we care only about the total cost of warping sequence X with sequence Y, we do not need to compute the entire N x M cost matrix: we need only two columns
The storage savings are huge, but the running time remains the same, O(N x M)
We can also normalize the DTW cost by N x M to keep it low and comparable across sequences of different lengths
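A sketch of the two-column trick (stored here as two rows), again assuming |x[i] - y[j]| as the local cost:

public final class DTWLowMemory {
  // Same O(N*M) running time as the full-matrix version, but only two
  // rows of the cost matrix are kept, so storage is O(M).
  public static double cost(double[] x, double[] y) {
    int n = x.length, m = y.length;
    double[] prev = new double[m + 1];
    double[] curr = new double[m + 1];
    java.util.Arrays.fill(prev, Double.POSITIVE_INFINITY);
    prev[0] = 0.0; // row 0 of the full matrix
    for (int i = 1; i <= n; i++) {
      java.util.Arrays.fill(curr, Double.POSITIVE_INFINITY);
      for (int j = 1; j <= m; j++) {
        double local = Math.abs(x[i - 1] - y[j - 1]);
        curr[j] = local + Math.min(prev[j - 1], Math.min(prev[j], curr[j - 1]));
      }
      double[] tmp = prev; prev = curr; curr = tmp; // swap rows
    }
    // optionally return prev[m] / ((double) n * m) to normalize, as noted above
    return prev[m];
  }
}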
N-Grams & HMMs
Spoken Word Recognition with N-Grams
General formula: approximate the probability of a sequence $w_1, \ldots, w_N$ by conditioning each item on only the previous $n-1$ items:

$$P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

For example:
$n = 1$ (unigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i)$
$n = 2$ (bigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})$
$n = 3$ (trigrams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})$
$n = 4$ (4-grams): $P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$
Spoken Word Recognition with Bigrams
Let $f_1$ and $f_2$ be two features extracted from waveforms. Let $C(f)$ be the count of $f$ in an audio corpus, and let $V$ be the dictionary size. The maximum likelihood bigram estimate is

$$P(f_2 \mid f_1) = \frac{C(f_1 f_2)}{C(f_1)}$$

With add-one smoothing over the dictionary of size $V$:

$$P(f_2 \mid f_1) = \frac{C(f_1 f_2) + 1}{C(f_1) + V}$$
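A minimal counting sketch of these estimates, with features represented as strings (all class and method names here are illustrative):

import java.util.HashMap;
import java.util.Map;

public final class BigramModel {
  private final Map<String, Integer> unigrams = new HashMap<>();
  private final Map<String, Integer> bigrams = new HashMap<>();
  private final int vocabSize; // V, the dictionary size

  public BigramModel(int vocabSize) { this.vocabSize = vocabSize; }

  // Count all unigrams and adjacent bigrams in one feature sequence.
  public void observe(String[] features) {
    for (int i = 0; i < features.length; i++) {
      unigrams.merge(features[i], 1, Integer::sum);
      if (i > 0) {
        bigrams.merge(features[i - 1] + " " + features[i], 1, Integer::sum);
      }
    }
  }

  // Add-one smoothed estimate of P(f2 | f1) = (C(f1 f2) + 1) / (C(f1) + V).
  public double prob(String f1, String f2) {
    int c12 = bigrams.getOrDefault(f1 + " " + f2, 0);
    int c1 = unigrams.getOrDefault(f1, 0);
    return (c12 + 1.0) / (c1 + vocabSize);
  }
}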
Spoken Word Recognition with HMMs
Let $y$ be a sequence of measures extracted from a waveform; for example, $y$ can be a sequence of non-silence segments or a sequence of phones extracted from a waveform. The recognized word is

$$\hat{w} = \operatorname*{argmax}_{w \in L} P(w \mid y) = \operatorname*{argmax}_{w \in L} \frac{P(y \mid w)\,P(w)}{P(y)} = \operatorname*{argmax}_{w \in L} P(y \mid w)\,P(w)$$

where $L$ is the lexicon; the denominator $P(y)$ can be dropped because it is the same for every $w$.
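The decoding rule itself is a simple argmax once the likelihoods and priors have been estimated; the sketch below assumes they are supplied as maps of log-probabilities keyed by word (the surrounding scaffolding is illustrative):

import java.util.Map;

public final class BayesDecoder {
  // argmax over w in L of P(y | w) * P(w), computed in log space
  // to avoid underflow; returns null if the lexicon is empty.
  public static String decode(Map<String, Double> logLikelihoods,
                              Map<String, Double> logPriors) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Double> e : logLikelihoods.entrySet()) {
      Double logPrior = logPriors.get(e.getKey());
      if (logPrior == null) { continue; } // word not in the language model
      double score = e.getValue() + logPrior;
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }
}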
References
M. Müller. Information Retrieval for Music and Motion, Ch. 4. Springer. ISBN 978-3-540-74047-6.
R. G. Bachu et al. "Separation of Voiced and Unvoiced Using Zero Crossing Rate and Energy of the Speech Signal." American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.
D. Jurafsky & J. Martin. Speech and Language Processing. Prentice Hall.