Speech & NLP (Fall 2014): Basic Spectral Analysis & Spoken Word Recognition

Vladimir Kulyukin

Page 1: Speech & NLP: Basic Spectral Analysis & Spoken Word Recognition

Vladimir Kulyukin

Page 2: Outline

Spectral Analysis

Correlations between Air Pressure, Amplitude, & ZCR

Silence vs Non-Silence

WAV File Format

Spoken Word Recognition

Page 3: Spectral Analysis

Page 4: Waveform

[Figure: waveform plot; x-axis: time, y-axis: air pressure]

Each point on the timeline is a frame, e.g., an array of 16-bit samples (one per channel) that records air pressure at that point in time.

Page 5: Spectral Features

A waveform is a distribution of different frequencies.

A distribution of frequencies is a spectrum.

Spectral features are time slices of a waveform that represent it as a spectrum.

Page 6: Air Pressure

The human ear receives as input a complex series of changes in air pressure (a waveform).

Changes in air pressure are caused by air passing through the glottis, the nostrils, and the mouth.

Some of these changes are speaker-dependent; others are language-dependent.

Page 7: Frequency & Amplitude

Frequency is the number of times per some unit of time that a wave cycles (repeats itself).

A typical unit of time is one second; cycles per second are called Hertz (Hz).

Amplitude is a measure of air pressure.

Page 8: Human Perception of Frequencies

The human ear can perceive sound waves in the frequency range [20 Hz, 20,000 Hz].

Frequencies below 20 Hz are inaudible and are called infrasonic.

Frequencies above 20,000 Hz are also inaudible and are called ultrasonic.

Page 9: Wave Motion

Waves transport energy and information but not matter.

Light waves and radio waves are periodic electromagnetic waves that carry both electric and magnetic energy; sound waves are periodic mechanical waves.

When sound waves propagate through a medium, the molecules of the medium collide and vibrate against one another but maintain the same average position.

Energy is transported through the medium even though there is no net particle displacement (this is a true wonder of nature, is it not?).

Page 10: Fourier's Insight

Every complex wave can be represented as a sum of many simple waves of different frequencies.

The Fourier transform is a mathematical method for separating out each frequency component of a wave.

It has been experimentally shown that different phones have different spectra, i.e., different frequency distributions, which makes their recognition possible (at least, sometimes).
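
To make Fourier's insight concrete, here is a minimal sketch of a naive discrete Fourier transform in Java (the deck's later snippets are in Java). It computes the magnitude of each frequency component of a real signal; it is O(N^2) and purely illustrative, since real systems use an FFT library. The class and method names are mine, not from the course codebase.

// Naive DFT: magnitude of each frequency bin of a real-valued signal.
// O(N^2) and illustrative only; production code should use an FFT library.
public final class NaiveDft {
    public static double[] magnitudes(double[] signal) {
        int n = signal.length;
        double[] mags = new double[n];
        for (int k = 0; k < n; k++) {          // frequency bin k
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {      // correlate with the k-th sinusoid
                double angle = 2.0 * Math.PI * k * t / n;
                re += signal[t] * Math.cos(angle);
                im -= signal[t] * Math.sin(angle);
            }
            mags[k] = Math.hypot(re, im);      // spectral magnitude of bin k
        }
        return mags;
    }
}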

Page 11: Pitch, Frequency, Loudness, Amplitude

Pitch & Frequency: pitch is the perceptual equivalent of frequency; sounds with higher frequencies are perceived to have higher pitches.

Loudness & Amplitude: loudness is the perceptual equivalent of amplitude; sounds with higher amplitudes are perceived as louder.

Page 12: Sound Wave Interpretation

Humans can both understand and transcribe sound waves.

An important implication of this is that sound waves must contain sufficient information to make understanding possible.

What exactly is this information that makes sound wave interpretation possible?

Page 13: ZCR: Zero Crossing Rate

Zero Crossing Rate (ZCR) is a feature that (quite possibly, but not certainly) describes the information contained in the waveform.

ZCR is the number of times, in a given sample, that the amplitude crosses the horizontal line at 0.

Amplitude is another feature that (quite possibly, but not certainly) describes the information contained in the waveform.
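
As a concrete illustration, here is a minimal Java sketch of a ZCR computation over one buffer of samples. The course codebase has its own version (ZeroCrossingRate.computeZCR01 on page 24), so this stand-in is an assumption about what such a method does: it normalizes the crossing count by the buffer's duration in seconds.

// Illustrative zero-crossing rate: counts sign changes in a buffer of
// samples and normalizes by the buffer's duration in seconds.
public final class Zcr {
    public static double zeroCrossingRate(double[] frames, double seconds) {
        int crossings = 0;
        for (int i = 1; i < frames.length; i++) {
            // A crossing occurs when consecutive samples differ in sign.
            if ((frames[i - 1] < 0 && frames[i] >= 0) ||
                (frames[i - 1] >= 0 && frames[i] < 0)) {
                crossings++;
            }
        }
        return crossings / seconds; // crossings per second
    }
}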

Page 14: ZCR & Amplitude of Voiced & Unvoiced Speech

           ZCR     Amplitude
Voiced     LOW     HIGH
Unvoiced   HIGH    LOW

Page 15: Correlations between Air Pressure, ZCR, & Amplitude

Page 16: Air Pressure, Amplitude, & ZCR of the Syllable 'CA' in 'Calcium'

[Figure: amplitude and ZCR profiles over the waveform of 'CA']

Page 17: Air Pressure, Amplitude, & ZCR of the Syllable 'AL' in 'Calcium'

[Figure: amplitude and ZCR profiles over the waveform of 'AL']

Page 18: Air Pressure, Amplitude, & ZCR of the Syllable 'CI' in 'Calcium'

[Figure: amplitude and ZCR profiles over the complete air pressure profile of 'CI']

Page 19: Air Pressure, Amplitude, & ZCR of the Syllable 'UM' in 'Calcium'

[Figure: amplitude and ZCR profiles over the waveform of 'UM']

Page 20: Air Pressure, Amplitude, & ZCR of 'Calcium'

[Figure: amplitude and ZCR profiles over the waveform of 'CALCIUM']

Page 21: Silence vs Non-Silence

Page 22: Hello, Silence, My Old Friend!

Suppose we have two measures: zero-crossing rate and amplitude.

Can we use these two measures to separate silence from non-silence?

We can pick two thresholds and assume that if a sample of frames (e.g., a 16-bit representation of amplitude at a specific point in time) has ZCR and amplitude below those thresholds, then that sample is silence; the next slide sketches this in pseudocode.

Page 23: Detection of Silence & Non-Silence

silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
    read a specific number of frames into buffer;
    compute the ZCR and the average amplitude of buffer;
    if ( the ZCR and the average amplitude are below specific thresholds ) {
        add the buffer to silence_buffer;
    } else {
        add the buffer to non_silence_buffer;
    }
}

Page 24: Silence vs Non-Silence

// Fragment from the course codebase; inWavFile, outWavFile, bos, tabbed_output,
// numChannels, channel_num, frame_sample_size, zcr_thresh, and amp_thresh
// are defined by the surrounding program.
// Create a buffer of frame_sample_size frames per channel
double[][] buffer = new double[numChannels][frame_sample_size];
int framesRead;
int framesWritten;
long sample_rate = inWavFile.getSampleRate();
// normalizer is the buffer's length in seconds
double normalizer = WavFileManip.convertFrameSampleSizeToSeconds((int) sample_rate, frame_sample_size);
int currFrameSampleNum = 0;
double currZCR = 0;
double totalAvrgAmp = 0.0;
double currAmp = 0.0;
do {
    framesRead = inWavFile.readFrames(buffer, frame_sample_size);
    currZCR = ZeroCrossingRate.computeZCR01(buffer[channel_num], normalizer);
    currAmp = WavFileManip.computeAvrgAbsAmplitude(buffer[channel_num]);
    if (framesRead > 0) { currFrameSampleNum++; }
    // In silence, both the zero-crossing rate and the average amplitude are low,
    // so this branch collects the silence frames; flip the test to extract non-silence.
    if (currZCR <= zcr_thresh && currAmp <= amp_thresh) {
        totalAvrgAmp += currAmp;
        framesWritten = outWavFile.writeFrames(buffer, framesRead);
        bos.write(tabbed_output.getBytes());
    }
} while (framesRead != 0);

Page 25: Example: Extracted Silence Waveform of 'CALCIUM'

[Figure: waveform of the silence extracted from the waveform of 'CALCIUM']

Page 26: Example: Extracted Non-Silence Waveform of 'CALCIUM'

[Figure: waveform of the non-silence extracted from the waveform of 'CALCIUM']

Page 27: WAV File Format

Page 28: RIFF: Resource Interchange File Format

RIFF (http://en.wikipedia.org/wiki/Resource_Interchange_File_Format) is a generic file format for storing data in labeled (tagged) chunks.

A chunk is a fragment of information that contains a header and a data area.

The header contains data parameters: size, type of data, comments, etc.

The data area is a sequence of data fragments that can be interpreted with the header's information.

Common media formats such as WAV, PNG, MP3, AVI, etc. are encoded in terms of chunks.

Page 29: RIFF: Resource Interchange File Format

<WAVE-form>
    RIFF('WAVE',
         <FMT-CHUNK>          // format
         [<FACT-CHUNK>]       // fact chunk
         [<CUE-CHUNK>]        // cue points
         [<PLAYLIST-CHUNK>]   // playlist
         [<ASSOC-DATA-LIST>]  // associated data list
         <WAVE-DATA>)         // data

Page 30: Format Chunk & Fact Chunk

<FMT-CHUNK>: mandatory; includes the sample encoding, the number of bits per channel, and the sample rate.

[<FACT-CHUNK>]: fact chunk; when present, gives the number of samples of a coding scheme.

[<CUE-CHUNK>]: cue chunk; when present, may identify significant sample numbers present in the WAV file.

[<PLAYLIST-CHUNK>]: playlist; when present, allows the samples to be played out of order.

[<ASSOC-DATA-LIST>]: associated data list; allows one to associate labels and notes with cues.

<WAVE-DATA>: wave data; mandatory, contains the actual samples.
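
To make the chunk layout concrete, here is a minimal Java sketch that reads the RIFF header and the fmt chunk of a canonical PCM WAV file (a 12-byte RIFF header immediately followed by a fmt chunk with a 16-byte body). That fixed layout is an assumption; real files may interleave the optional chunks above, so robust code should walk the chunk list instead.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch: reads the RIFF header and the 'fmt ' chunk of a canonical
// PCM WAV file. RIFF stores multi-byte integers in little-endian order.
public final class WavHeaderSketch {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            byte[] header = new byte[36];
            in.readFully(header); // 12-byte RIFF header + 24-byte 'fmt ' chunk
            ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            byte[] tag = new byte[4];
            buf.get(tag);                         // "RIFF"
            int riffSize = buf.getInt();          // size of the rest of the file
            buf.get(tag);                         // "WAVE"
            buf.get(tag);                         // "fmt "
            int fmtSize = buf.getInt();           // 16 for plain PCM
            short audioFormat = buf.getShort();   // 1 = LPCM
            short numChannels = buf.getShort();   // e.g., 2 for stereo
            int sampleRate = buf.getInt();        // e.g., 44100
            int byteRate = buf.getInt();          // sampleRate * numChannels * bitsPerSample / 8
            short blockAlign = buf.getShort();    // bytes per frame across all channels
            short bitsPerSample = buf.getShort(); // e.g., 16
            System.out.printf("format=%d channels=%d rate=%d bits=%d%n",
                    audioFormat, numChannels, sampleRate, bitsPerSample);
        }
    }
}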

Page 31: Linear Pulse Code Modulation (LPCM)

LPCM is the most common WAV audio format for uncompressed audio.

LPCM is used in audio CDs.

On an audio CD, LPCM stores 2 channels of audio samples, sampled 44,100 times per second, with 16 bits per sample each.
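
Those parameters pin down the raw data rate: 44,100 samples/s × 2 channels × 16 bits/sample = 1,411,200 bits/s, i.e., 176,400 bytes of sample data per second of audio.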

Page 32: Spoken Word Recognition

Page 33: General Outline

Given a directory of audio files with spoken words, process each file into a table that maps specific words (or phrases) to digital signal vectors.

These signal vectors can be pre-processed to eliminate silences.

An input audio file is taken and digitized into a digital signal vector.

The input vector is compared against the digital vectors in the table.

Page 34: Using Non-Silence Arrays

Get the amplitude array from the input waveform.

Get the amplitude arrays for each waveform stored in the dictionary.

Use a similarity metric to match the amplitude array of the input waveform against each waveform stored in the dictionary.

Various similarity metrics can be used: cosine similarity, HMMs, dynamic time warping, etc.; a cosine sketch follows below.
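
Of the metrics just listed, cosine similarity is the simplest. A minimal sketch, assuming the two amplitude arrays have been trimmed or padded to equal length (the slides do not say how lengths are reconciled):

// Illustrative cosine similarity between two equal-length amplitude arrays.
// Returns a value in [-1, 1]; 1 means the arrays point in the same direction.
public final class Cosine {
    public static double similarity(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY)); // undefined for all-zero input
    }
}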

Page 35: Using Non-Silence Segments

Represent each waveform as a sequence of non-silence segments:

waveform = [seg_1, seg_2, seg_3, …, seg_n]

For example, in Java, a waveform can be represented as an ArrayList<double[]>.
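
A minimal sketch of building that ArrayList<double[]> representation; it reuses the thresholding idea from page 23 but, for brevity, tests only average amplitude (not ZCR), and the names are illustrative, not from the course codebase:

import java.util.ArrayList;
import java.util.Arrays;

// Illustrative segmentation: consecutive buffers whose average absolute
// amplitude exceeds a threshold are accumulated into one non-silence segment.
public final class Segmenter {
    public static ArrayList<double[]> nonSilenceSegments(double[] wave, int bufSize, double ampThresh) {
        ArrayList<double[]> segments = new ArrayList<>();
        int segStart = -1; // start of the current non-silence segment, or -1 if none
        for (int i = 0; i < wave.length; i += bufSize) {
            int end = Math.min(i + bufSize, wave.length);
            double avgAmp = 0.0;
            for (int j = i; j < end; j++) { avgAmp += Math.abs(wave[j]); }
            avgAmp /= (end - i);
            if (avgAmp > ampThresh) {
                if (segStart < 0) { segStart = i; }                  // a segment begins
            } else if (segStart >= 0) {
                segments.add(Arrays.copyOfRange(wave, segStart, i)); // a segment ends
                segStart = -1;
            }
        }
        if (segStart >= 0) { segments.add(Arrays.copyOfRange(wave, segStart, wave.length)); }
        return segments;
    }
}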

Page 36: Matching Non-Silence Segments

Let $W^k_n = (s^k_1, \ldots, s^k_n)$ be waveform number $k$ represented as a sequence of $n$ segments. Then

$$\mathit{match\_segments}\big(W^1_n = (s^1_1, \ldots, s^1_n),\; W^2_m = (s^2_1, \ldots, s^2_m)\big) = \sum_{i=1}^{\min\{n,\,m\}} \mathit{sim}(s^1_i, s^2_i)$$
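
A direct Java rendering of this formula, with sim left pluggable since the slide keeps it abstract (for instance, the cosine sketch from page 34 could be passed as Cosine::similarity):

import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Sums a pluggable similarity over the first min(n, m) aligned segment pairs.
public final class SegmentMatcher {
    public static double matchSegments(List<double[]> w1, List<double[]> w2,
                                       ToDoubleBiFunction<double[], double[]> sim) {
        int k = Math.min(w1.size(), w2.size());
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            total += sim.applyAsDouble(w1.get(i), w2.get(i));
        }
        return total;
    }
}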

Page 37: Matching Two Waveform Segments X & Y

[Figure: two waveform segments X and Y and an alignment between them]

Page 38: Matching Two Waveform Segments X & Y

[Figure: two waveform segments X and Y and an alignment between them, continued]

Page 39: DTW(X, Y) – Cost of an Optimal Warping Path

$$DTW(X, Y) = \min\{\, c_p(X, Y) \mid p \text{ is a warping path} \,\}$$
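
The standard dynamic-programming computation of this cost, as in Muller, Ch. 4 (see References); this sketch uses |x - y| as the local cost, a choice the slide does not fix:

// Classic O(N*M) dynamic-programming DTW between two real-valued sequences,
// with local cost |x - y| and admissible steps (i-1,j), (i,j-1), (i-1,j-1).
public final class Dtw {
    public static double cost(double[] x, double[] y) {
        int n = x.length, m = y.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) { java.util.Arrays.fill(row, Double.POSITIVE_INFINITY); }
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double local = Math.abs(x[i - 1] - y[j - 1]);
                // extend the cheapest of the three admissible predecessor paths
                d[i][j] = local + Math.min(d[i - 1][j - 1],
                                  Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m]; // cost of an optimal warping path
    }
}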

Page 40: Matching Non-Silence Segments

Let $W^k_n = (s^k_1, \ldots, s^k_n)$ be waveform number $k$ represented as a sequence of $n$ segments. Instantiating the abstract $\mathit{sim}$ of page 36 with DTW:

$$\mathit{match\_segments}\big(W^1_n = (s^1_1, \ldots, s^1_n),\; W^2_m = (s^2_1, \ldots, s^2_m)\big) = \sum_{i=1}^{\min\{n,\,m\}} DTW(s^1_i, s^2_i)$$

Page 41: Optimizations

If we use DTW to compute the similarity between the digital audio input vector and the vectors in the table, it is vital to keep the vectors as short as possible without sacrificing precision.

Possible approaches: decreasing the sampling rate and merging samples into super-features (e.g., Haar coefficients, Fourier coefficients, etc.).

Parallelizing similarity computations.

Page 42: DTW Matching Window Optimization

The computation of DTW can be optimized so that only the cells within a specific window around the diagonal are considered (see the sketch after the next slide, which combines this window with the two-column storage trick).

Page 43: Smaller Matrix Optimization

You may have realized by now that if we care only about the total cost of warping sequence X with sequence Y, we do not need to compute the entire N x M cost matrix – we need only two columns.

The storage savings are huge, but the running time remains the same – O(N x M).

We can also normalize the DTW cost by N x M to keep it low.
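
A sketch combining the two optimizations: the matching window of page 42 (here a Sakoe-Chiba-style band of half-width w, my choice of window shape) and the two-column storage trick of this page, with the final cost normalized by N x M as suggested above:

// DTW with a band of half-width w around the diagonal and O(M) storage:
// only two rows of the cost matrix exist at any time.
public final class BandedDtw {
    public static double cost(double[] x, double[] y, int w) {
        int n = x.length, m = y.length;
        w = Math.max(w, Math.abs(n - m)); // the band must admit at least one path
        double[] prev = new double[m + 1];
        double[] curr = new double[m + 1];
        java.util.Arrays.fill(prev, Double.POSITIVE_INFINITY);
        prev[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            java.util.Arrays.fill(curr, Double.POSITIVE_INFINITY);
            int lo = Math.max(1, i - w), hi = Math.min(m, i + w);
            for (int j = lo; j <= hi; j++) {
                double local = Math.abs(x[i - 1] - y[j - 1]);
                curr[j] = local + Math.min(prev[j - 1],
                                  Math.min(prev[j], curr[j - 1]));
            }
            double[] tmp = prev; prev = curr; curr = tmp; // swap the two rows
        }
        return prev[m] / ((double) n * m); // normalized by N x M, per this slide
    }
}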

Page 44: N-Grams & HMMs

Page 45: Spoken Word Recognition with N-Grams

General formula: an N-gram model conditions each unit on the previous $N-1$ units,

$$P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})$$

so that the probability of a sequence $w_1, \ldots, w_N$ factors as:

unigram: $P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)$

bigram: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$

trigram: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_{N-2}, w_{N-1})$

4-gram: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_{N-3}, w_{N-2}, w_{N-1})$

Page 46: Spoken Word Recognition with Bigrams

Let $f_{n-1}$ and $f_n$ be two features extracted from waveforms. Let $C(f_{n-1} f_n)$ be the count of the pair in the audio corpus, and let $V$ be the dictionary size. Then

$$P(f_n \mid f_{n-1}) = \frac{C(f_{n-1} f_n)}{\sum_{i=1}^{V} C(f_{n-1} f_i)} = \frac{C(f_{n-1} f_n)}{C(f_{n-1})}$$

since summing the bigram counts over all possible successors $f_i$ gives the unigram count $C(f_{n-1})$.
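
A minimal sketch of estimating these bigram probabilities from counts; the string-keyed maps are my representation choice, and any count store would do:

import java.util.HashMap;
import java.util.Map;

// Illustrative maximum-likelihood bigram estimate from raw counts:
// P(next | prev) = C(prev next) / C(prev).
public final class BigramModel {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    public void observe(String prev, String next) {
        unigramCounts.merge(prev, 1, Integer::sum);
        bigramCounts.merge(prev + " " + next, 1, Integer::sum);
    }

    public double probability(String prev, String next) {
        int cPrev = unigramCounts.getOrDefault(prev, 0);
        if (cPrev == 0) { return 0.0; } // unseen context
        return bigramCounts.getOrDefault(prev + " " + next, 0) / (double) cPrev;
    }
}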

Page 47: Spoken Word Recognition with HMMs

$$\hat{w} = \operatorname*{argmax}_{w \in L} P(w \mid y) = \operatorname*{argmax}_{w \in L} \frac{P(y \mid w)\, P(w)}{P(y)} = \operatorname*{argmax}_{w \in L} P(y \mid w)\, P(w)$$

Here $y$ is a sequence of measures from a waveform; for example, $y$ can be a sequence of non-silence segments, or a sequence of phones extracted from a waveform.

Page 48: References

M. Muller. Information Retrieval for Music and Motion, Ch. 4. Springer. ISBN 978-3-540-74047-6.

R. G. Bachu et al. "Separation of Voiced and Unvoiced Using Zero Crossing Rate and Energy of the Speech Signal." American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.

D. Jurafsky & J. H. Martin. Speech and Language Processing. Prentice Hall.