Presented at ICASSP-97, Munich, vol. 3, pp. 1647-1650.
THE MODULATION SPECTROGRAM: IN PURSUIT OF AN INVARIANT
REPRESENTATION OF SPEECH
Steven Greenberg*† and Brian E. D. Kingsbury*‡
*International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, USA
†Department of Linguistics, ‡Department of Electrical Engineering and Computer Sciences,
University of California at Berkeley, Berkeley, CA 94704, USA
{steveng, [email protected]}
ABSTRACT
Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
1. INTRODUCTION
Human listeners are able to reliably decode phonetic information carried by the speech signal across a wide range of acoustic conditions and speaker characteristics. This perceptual stability is not captured by traditional representations of speech, which tend to emphasize the minute spectro-temporal details of the speech signal. Speaker variability and distortions such as spectral shaping, background noise, and reverberation that typically exert little or no influence on the intelligibility of speech drastically alter such conventional speech representations as the sound spectrogram. This disparity between perceptual stability and representational lability constitutes a fundamental challenge for models of speech perception and recognition. A speech representation insensitive to speaker variability and acoustic distortion would be a powerful tool for the study of human speech perception and for research in speech coding and automatic speech recognition.

A key step for representing speech in a stable fashion is to focus on the elements of the signal encoding phonetic information. By suppressing phonetically irrelevant elements of the signal, the variability of the representation is reduced. There is significant evidence that much of the phonetic information is encoded by slow changes in gross spectral structure that characterize the low-frequency portion of the modulation spectrum of speech. In the late 1930's the developers of the vocoder found that it was possible to synthesize intelligible, high-quality speech based on a ten-channel spectral estimate with roughly 300-Hz resolution that was low-pass filtered at 25 Hz [1]. More recently, in a study on the intelligibility of temporally-smeared speech, Drullman and colleagues have demonstrated that modulations at rates above 16 Hz are not required for speech intelligibility [2]. A representation that focuses on slow modulations in speech also has compelling parallels to the dynamics of speech production, in which the articulators move at rates of 2-12 Hz [3], and to the sensitivity of auditory cortical neurons to amplitude modulations at rates below 20 Hz [4].
2. THE MODULATION SPECTROGRAM
We have developed a new representational format for speech, the modulation spectrogram, that displays and encodes the signal in terms of the distribution of slow modulations across time and frequency. Although not intended as an auditory model, the representation captures many important properties of the auditory cortical representation of speech. The modulation spectrogram represents modulation frequencies in the speech signal between 0 and 8 Hz, with a peak sensitivity at 4 Hz, corresponding closely to the long-term modulation spectrum of speech. The modulation spectrogram is computed in critical-band-wide channels [5] to match the frequency resolution of the auditory system, incorporates a simple automatic gain control, and emphasizes spectro-temporal peaks.

Figure 1 illustrates the signal processing procedure used to produce the modulation spectrogram. Incoming speech, sampled at 8 kHz, is analyzed into approximately critical-band-wide channels via an FIR filter bank. The filters are trapezoidal in shape, and there is minimal overlap between adjacent channels. Within each channel the signal envelope is derived by half-wave rectification and low-pass filtering (the half-power cutoff frequency is 28 Hz). Each channel envelope signal is downsampled to 80 Hz and then normalized by the average envelope level in that channel, measured over the entire utterance. The modulations of the normalized envelope signals are analyzed by computing the FFT over a 250-ms Hamming window every 12.5 ms in order to capture the dynamic properties of the signal. Finally, the squared magnitudes of the 4-Hz coefficients of the FFTs are plotted in spectrographic format, with log energy encoded by color. Note that the display portrays modulation energy from 0-8 Hz. The effective filter response for the 4-Hz component is down by 10 dB at 0 and 8 Hz. A threshold is used in the energy-to-color mapping: the peak 30 dB of the signal is mapped to a color axis, while levels more than 30 dB below the global peak are mapped to the color for -30 dB. Bilinear smoothing is used to produce the final image.
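The pipeline above can be sketched in NumPy/SciPy. The band edges, filter lengths, and the function name `modulation_spectrogram` are illustrative assumptions; the paper fixes only the stages themselves (critical-band filter bank, half-wave rectification, 28-Hz low-pass, 80-Hz downsampling, long-term normalization, 250-ms/12.5-ms FFT analysis of the 4-Hz component, and 30-dB limiting):

```python
import numpy as np
from scipy.signal import firwin, lfilter

def modulation_spectrogram(x, fs=8000, n_bands=14):
    """Sketch of the modulation-spectrogram pipeline; band edges and
    filter lengths are illustrative, not taken from the paper."""
    # 1. Approximately critical-band-wide FIR band-pass filter bank.
    edges = np.geomspace(100.0, 3500.0, n_bands + 1)   # assumed band edges
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = lfilter(firwin(129, [lo, hi], pass_zero=False, fs=fs), 1.0, x)
        # 2. Envelope: half-wave rectification, then 28-Hz low-pass.
        env = lfilter(firwin(129, 28.0, fs=fs), 1.0, np.maximum(band, 0.0))
        # 3. Downsample to 80 Hz; normalize by the long-term channel average.
        env = env[:: fs // 80]
        envelopes.append(env / (env.mean() + 1e-12))
    # 4. FFT over a 250-ms Hamming window (20 samples at 80 Hz) every
    #    12.5 ms (1 sample at 80 Hz); the bin spacing is 80/20 = 4 Hz,
    #    so index 1 is the 4-Hz component whose squared magnitude is kept.
    win = np.hamming(20)
    frames = [[np.abs(np.fft.rfft(env[i:i + 20] * win)[1]) ** 2
               for i in range(len(env) - 19)]
              for env in envelopes]
    msg = 10.0 * np.log10(np.asarray(frames) + 1e-12)  # log energy
    # 5. Limit the range to the peak 30 dB, as in the color mapping.
    return np.maximum(msg, msg.max() - 30.0)
```

With one second of 8-kHz input this yields one row per channel and one column per 12.5-ms frame, already floored 30 dB below the global peak.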
3. REPRESENTATIONAL STABILITY
[Figure 1. Diagram of the processing currently used to produce modulation spectrograms: critical-band FIR filter bank; half-wave rectification; low-pass filtering (cutoff = 28 Hz); 100x downsampling; normalization by the long-term average; FFT; squared magnitude; limiting to the peak 30 dB and bilinear interpolation to produce the image.]

The modulation spectrographic representation of speech is more stable than the conventional spectrographic representation in low signal-to-noise ratio (SNR) and reverberant conditions. Several processing steps contribute to this stability. The emphasis of modulations in the range of 0-8 Hz, with peak sensitivity at 4 Hz, acts as a matched filter that passes only signals with temporal dynamics characteristic of speech. The critical-band-like frequency resolution of the representation expands the representation of the low-frequency, high-energy portions of the speech signal, while the thresholding used in the color mapping emphasizes the spectro-temporal peaks in the speech signal that rise above the noise floor.
Figure 2 illustrates the stability of the modulation spectrographic representation of speech by comparing conventional narrow-band spectrograms¹ and modulation spectrograms for clean and noisy versions of the utterance "Tell me about the Thai barbecue." The noisy sample was produced by mixing the clean sample with pink noise at a SNR of 0 dB. Both the modulation spectrograms and narrow-band spectrograms cover approximately the same range of frequencies. However, the modulation spectrogram frequency axis is nonlinear, in accordance with the human spatial frequency coordinates described in [5]. While the narrow-band spectrogram of the clean speech sample clearly portrays features of the speech signal such as onsets, formant trajectories, and harmonic structure, these features are all but lost in the narrow-band spectrogram of the noisy speech, where only a few spectro-temporal peaks stand out above the noise. In contrast to the spectrographic representation, the modulation spectrogram of the clean speech provides only a coarse picture of the energy distribution in time and frequency. The fine details, such as harmonicity, are not preserved. A comparison of the modulation spectrograms for the clean and noisy speech samples illustrates the stability of the representation. The major features of the modulation spectrogram observed for clean speech are preserved in the modulation spectrogram for speech embedded in intense levels of noise.
¹The narrow-band spectrograms were computed by pre-emphasizing the speech, sampled at 8 kHz, with the filter H(z) = 1 - 0.97z⁻¹, then performing 512-point FFTs with a 64-ms Hamming window and a 16-ms window step. A lower threshold of -30 dB was applied in the energy-to-color mapping.
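The footnote's recipe is compact enough to sketch directly. The function name is hypothetical, and applying the -30 dB floor relative to the global peak is an assumption (the footnote says only that a lower threshold of -30 dB was used):

```python
import numpy as np

def narrowband_spectrogram(x, fs=8000):
    """Narrow-band spectrogram per the footnote: pre-emphasis,
    512-point FFTs, 64-ms Hamming window, 16-ms step, -30 dB floor."""
    # Pre-emphasis filter H(z) = 1 - 0.97 z^-1.
    y = np.append(x[0], x[1:] - 0.97 * x[:-1])
    win_len = int(0.064 * fs)   # 64-ms Hamming window = 512 samples at 8 kHz
    hop = int(0.016 * fs)       # 16-ms window step = 128 samples
    win = np.hamming(win_len)
    # Squared 512-point FFT magnitudes for each windowed frame.
    frames = [np.abs(np.fft.rfft(y[i:i + win_len] * win, 512)) ** 2
              for i in range(0, len(y) - win_len + 1, hop)]
    sgram = 10.0 * np.log10(np.asarray(frames).T + 1e-12)
    # Lower threshold of -30 dB (assumed relative to the global peak).
    return np.maximum(sgram, sgram.max() - 30.0)
```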
4. AUTOMATIC RECOGNITION BASED ON MODULATION SPECTROGRAPHIC FEATURES
A similar representational stability is observed for reverberant speech, and has been demonstrated in tests with an automatic speech recognition system. In these tests, the performance of a hybrid hidden Markov model/multilayer perceptron (HMM/MLP) recognizer trained on clean speech is measured on clean and reverberant speech. The performance of recognizers using different front-end feature extraction methods is compared on the two test sets. Aside from the front-end processing, the recognizers are identical, using similarly-sized MLPs for phonetic probability estimation, and the same HMM word models and class bigram grammar for speech decoding. Further details on the recognition experiments are provided in [6].

Table 1 compares the performance of a recognizer using features based on the modulation spectrogram² with the performance of a recognizer that uses PLP features [7].
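As a quick arithmetic check of the figure quoted in the Table 1 caption, the roughly 23% relative reduction follows directly from the two deletion rates reported for reverberant speech:

```python
# Deletion rates on reverberant speech, from Table 1.
plp_deletions = 33.8   # PLP features
msg_deletions = 25.9   # modulation-spectrogram features

# Relative reduction: (33.8 - 25.9) / 33.8 = 23.4%, matching the caption.
relative_reduction = (plp_deletions - msg_deletions) / plp_deletions * 100.0
print(f"{relative_reduction:.1f}% relative reduction in deletions")
```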
5. THE IMPORTANCE OF THE SYLLABLE IN SPEECH RECOGNITION
A central problem in speech science is the explication of the process by which the brain is able to go from sound to meaning. The traditional models posit a complex and somewhat arbitrary series of operations that advance from the acoustic signal to phonemic units, from phonemic units to words, and from words to meaning through a language's grammar. However, even a cursory examination of the statistical properties of speech indicates that the relationship between sound and symbol is anything but arbitrary. Instead, it appears that speech is organized into syllable-like units at both the acoustic and lexical levels, and that these
²The features are computed in quarter-octave bands, the modulation transfer function of the system is flat between 0 and 8 Hz, and no thresholding is applied to the output. The most important difference between these features is the absence of thresholding. If thresholding is used for automatic recognition, the recognition performance on clean speech degrades. However, the stability of the representational format is enhanced by some degree of thresholding.
[Figure 2 panels (time 0-1.2 s, frequency in Hz): clean speech: modulation spectrogram and 0 dB SNR speech: modulation spectrogram (frequency axis 225-3500 Hz); clean speech: narrowband spectrogram and noisy speech: narrowband spectrogram (frequency axis 0-4000 Hz).]
These patterns are far more clearly delineated in the original color versions, which are available in the CD-ROM version of the proceedings and at http://www.icsi.berkeley.edu/~bedk/ICASSP97_fig2_color.gif
Figure 2. A comparison of the modulation spectrogram and narrow-band spectrogram for clean and noisy speech.
syllable-like units are the basis for lexical access from the acoustics of the speech signal.

It has been previously suggested that the broad peak at 4 Hz in the modulation spectrum corresponds to the average syllable rate [8]. Recently, we have found a more specific correlation between the distribution of low-frequency modulations in speech and the statistical distribution of syllable durations in spoken discourse [9]. It has also been shown that the concentrations of energy in the modulation spectrographic display correspond to syllabic nuclei. Thus, it appears that the modulation spectrogram robustly extracts information pertaining to the syllabic segmentation of speech, and that this information is of some utility in recognizing speech under adverse acoustic conditions [10].

Two common objections to a syllabic representation of English are the relatively complex and heterogeneous syllable structure of English and the large number of syllables required to cover the lexical inventory. However, these theoretical concerns are not borne out in practice. In spoken English, over 80% of the syllables are of the canonical CV, CVC, VC, and V forms, and many of the remainder reduce to these formats by processes of assimilation and reduction. In written English, only 12 syllables comprise over 25% of all syllable occurrences, and 339 syllables account for 75% of all syllable occurrences [11]. Spoken English employs a similarly reduced syllabic inventory [12, 13].

The robust encoding of syllabic structure by low-frequency modulations in speech, the sensitivity of the human auditory system to these modulations, and the statistics demonstrating that, in practice, English has a relatively simple syllabic structure and relies on a small subset of the possible syllables all support a model of real-time human speech perception in which auditory mechanisms parse the speech signal into syllable-like units and a core vocabulary of a few hundred, highly familiar syllables supports efficient lexical access. This model is described in more detail in [14].
6. CONCLUSIONS
We have developed a new representational format for speech that captures many important properties of the auditory cortical representation of speech, namely selectivity for the slow modulations in the signal that encode phonetic information, critical-band frequency analysis, automatic gain control, and sensitivity to spectro-temporal peaks in the signal. These signal processing strategies produce a representation with greater stability in low-SNR and reverberant conditions than conventional speech representations. The enhanced stability of the modulation spectrogram provides a potentially useful tool for research in human speech perception, speech coding, and automatic speech recognition.
REFERENCES
[1] Homer Dudley. Remaking speech. JASA, 11(2):169-177, October 1939.
[2] Rob Drullman, Joost M. Festen, and Reinier Plomp. Effect of temporal envelope smearing on speech reception. JASA, 95(2):1053-1064, February 1994.
[3] Caroline L. Smith, Catherine P. Browman, Richard S. McGowan, and Bruce Kay. Extracting dynamic parameters from speech movement data. JASA, 93(3):1580-1588, March 1993.
[4] Christoph E. Schreiner and John V. Urbas. Representation of amplitude modulation in the auditory cortex
                     PLP               mod. spectrogram
                 clean    reverb       clean    reverb
substitutions     9.2%    33.5%        11.8%    37.7%
deletions         3.2%    33.8%         3.5%    25.9%
insertions        3.5%     2.7%         2.1%     2.6%
total            15.8%    70.1%        17.8%    66.1%

Table 1. Comparison of PLP and modulation spectrographic features for recognition of clean and reverberant speech. For the number of words in the test set (2426), the difference in performance on clean speech is not statistically significant, while the difference in performance on the reverberant speech is. Note that the performance improvement in reverberation for the modulation spectrographic features over PLP features comes almost entirely from a 23% relative reduction in the deletion rate.
of the cat. I. The anterior auditory field (AAF). Hearing Research, 21(3):227-241, 1986.
[5] Donald D. Greenwood. Critical bandwidth and the frequency coordinates of the basilar membrane. JASA, 33:1344-1356, 1961.
[6] Brian E. D. Kingsbury and Nelson Morgan. Recognizing reverberant speech with RASTA-PLP. In Proc. ICASSP-97. IEEE, 1997.
[7] Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. JASA, 87(4):1738-1752, April 1990.
[8] Tammo Houtgast and Herman J. M. Steeneken. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility. JASA, 77(3):1069-1077, March 1985.
[9] Steven Greenberg, Joy Hollenback, and Dan Ellis. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. In Proc. ICSLP-96, 1996.
[10] Su-Lin Wu, Michael L. Shire, Steven Greenberg, and Nelson Morgan. Integrating syllable boundary information into speech recognition. In Proc. ICASSP-97. IEEE, 1997.
[11] Godfrey Dewey. Relative Frequency of English Speech Sounds, volume 4 of Harvard Studies in Education. Harvard University Press, Cambridge, 1923.
[12] Norman R. French, Charles W. Carter, Jr., and Walter Koenig, Jr. The words and sounds of telephone conversations. The Bell System Technical Journal, IX:290-325, April 1930.
[13] Steven Greenberg, Joy Hollenback, and Dan Ellis. The Switchboard transcription project. Technical report, International Computer Science Institute, 1997.
[14] Steven Greenberg. Understanding speech understanding: Towards a unified theory of speech perception. In William Ainsworth and Steven Greenberg, editors, Proc. of the ESCA Workshop on the Auditory Basis of Speech Perception, pages 1-8. ESCA, 1996.