Presented at ICASSP-97, Munich, vol. 3, pp. 1647-1650.
THE MODULATION SPECTROGRAM: IN PURSUIT OF AN INVARIANT
REPRESENTATION OF SPEECH
Steven Greenberg*† and Brian E. D. Kingsbury*‡
*International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, USA
†Department of Linguistics, ‡Department of Electrical Engineering and Computer Sciences,
University of California at Berkeley, Berkeley, CA 94704, USA
{steveng, [email protected]}
ABSTRACT
Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
1. INTRODUCTION
Human listeners are able to reliably decode phonetic information carried by the speech signal across a wide range of acoustic conditions and speaker characteristics. This perceptual stability is not captured by traditional representations of speech, which tend to emphasize the minute spectro-temporal details of the speech signal. Speaker variability and distortions such as spectral shaping, background noise, and reverberation that typically exert little or no influence on the intelligibility of speech drastically alter such conventional speech representations as the sound spectrogram. This disparity between perceptual stability and representational lability constitutes a fundamental challenge for models of speech perception and recognition. A speech representation insensitive to speaker variability and acoustic distortion would be a powerful tool for the study of human speech perception and for research in speech coding and automatic speech recognition.

A key step for representing speech in a stable fashion is to focus on the elements of the signal encoding phonetic information. By suppressing phonetically irrelevant elements of the signal, the variability of the representation is reduced. There is significant evidence that much of the phonetic information is encoded by slow changes in gross spectral structure that characterize the low-frequency portion of the modulation spectrum of speech. In the late 1930's the developers of the vocoder found that it was possible to synthesize intelligible, high-quality speech based on a ten-channel spectral estimate with roughly 300-Hz resolution that was low-pass filtered at 25 Hz [1]. More recently, in a study on the intelligibility of temporally-smeared speech, Drullman and colleagues have demonstrated that modulations at rates above 16 Hz are not required for speech intelligibility [2]. A representation that focuses on slow modulations in speech also has compelling parallels to the dynamics of speech production, in which the articulators move at rates of 2-12 Hz [3], and to the sensitivity of auditory cortical neurons to amplitude modulations at rates below 20 Hz [4].
2. THE MODULATION SPECTROGRAM
We have developed a new representational format for speech, the modulation spectrogram, that displays and encodes the signal in terms of the distribution of slow modulations across time and frequency. Although not intended as an auditory model, the representation captures many important properties of the auditory cortical representation of speech. The modulation spectrogram represents modulation frequencies in the speech signal between 0 and 8 Hz, with a peak sensitivity at 4 Hz, corresponding closely to the long-term modulation spectrum of speech. The modulation spectrogram is computed in critical-band-wide channels [5] to match the frequency resolution of the auditory system, incorporates a simple automatic gain control, and emphasizes spectro-temporal peaks.

Figure 1 illustrates the signal processing procedure used to produce the modulation spectrogram. Incoming speech, sampled at 8 kHz, is analyzed into approximately critical-band-wide channels via an FIR filter bank. The filters are trapezoidal in shape, and there is minimal overlap between adjacent channels. Within each channel the signal envelope is derived by half-wave rectification and low-pass filtering (the half-power cutoff frequency is 28 Hz). Each channel envelope signal is downsampled to 80 Hz and then normalized by the average envelope level in that channel, measured over the entire utterance. The modulations of the normalized envelope signals are analyzed by computing the FFT over a 250-ms Hamming window every 12.5 ms in order to capture the dynamic properties of the signal. Finally, the squared magnitudes of the 4-Hz coefficients of the FFTs are plotted in spectrographic format, with log energy encoded by color. Note that the display portrays modulation energy from 0-8 Hz. The effective filter response for the 4-Hz component is down by 10 dB at 0 and 8 Hz. A threshold is used in the energy-to-color mapping: the peak 30 dB of the signal is mapped to a color axis, while levels more than 30 dB below the global peak are mapped to the color for -30 dB. Bilinear smoothing is used to produce the final image.
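The pipeline above can be sketched in NumPy/SciPy. The band edges, filter lengths, and the function name `modulation_spectrogram` are illustrative assumptions; the paper fixes only the stages themselves (critical-band filter bank, half-wave rectification, 28-Hz low-pass, 80-Hz downsampling, long-term normalization, 250-ms/12.5-ms FFT analysis of the 4-Hz component, and 30-dB limiting):

```python
import numpy as np
from scipy.signal import firwin, lfilter

def modulation_spectrogram(x, fs=8000, n_bands=14):
    """Sketch of the modulation-spectrogram pipeline; band edges and
    filter lengths are illustrative, not taken from the paper."""
    # 1. Approximately critical-band-wide FIR band-pass filter bank.
    edges = np.geomspace(100.0, 3500.0, n_bands + 1)   # assumed band edges
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = lfilter(firwin(129, [lo, hi], pass_zero=False, fs=fs), 1.0, x)
        # 2. Envelope: half-wave rectification, then 28-Hz low-pass.
        env = lfilter(firwin(129, 28.0, fs=fs), 1.0, np.maximum(band, 0.0))
        # 3. Downsample to 80 Hz; normalize by the long-term channel average.
        env = env[:: fs // 80]
        envelopes.append(env / (env.mean() + 1e-12))
    # 4. FFT over a 250-ms Hamming window (20 samples at 80 Hz) every
    #    12.5 ms (1 sample at 80 Hz); the bin spacing is 80/20 = 4 Hz,
    #    so index 1 is the 4-Hz component whose squared magnitude is kept.
    win = np.hamming(20)
    frames = [[np.abs(np.fft.rfft(env[i:i + 20] * win)[1]) ** 2
               for i in range(len(env) - 19)]
              for env in envelopes]
    msg = 10.0 * np.log10(np.asarray(frames) + 1e-12)  # log energy
    # 5. Limit the range to the peak 30 dB, as in the color mapping.
    return np.maximum(msg, msg.max() - 30.0)
```

With one second of 8-kHz input this yields one row per channel and one column per 12.5-ms frame, already floored 30 dB below the global peak.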
3. REPRESENTATIONAL STABILITY
[Figure 1. Diagram of the processing currently used to produce modulation spectrograms: critical-band FIR filter bank; half-wave rectification; low-pass filtering (cutoff = 28 Hz); 100x downsampling; normalization by the long-term average; FFT; squared magnitude; limiting to the peak 30 dB and bilinear interpolation to produce the image.]

The modulation spectrographic representation of speech is more stable than the conventional spectrographic representation in low signal-to-noise ratio (SNR) and reverberant conditions. Several processing steps contribute to this stability. The emphasis of modulations in the range of 0-8 Hz, with peak sensitivity at 4 Hz, acts as a matched filter that passes only signals with temporal dynamics characteristic of speech. The critical-band-like frequency resolution of the representation expands the representation of the low-frequency, high-energy portions of the speech signal, while the thresholding used in the color mapping emphasizes the spectro-temporal peaks in the speech signal that rise above the noise floor.
Figure 2 illustrates the stability of the modulation spectrographic representation of speech by comparing conventional narrow-band spectrograms¹ and modulation spectrograms for clean and noisy versions of the utterance "Tell me about the Thai barbecue." The noisy sample was produced by mixing the clean sample with pink noise at a SNR of 0 dB. Both the modulation spectrograms and narrow-band spectrograms cover approximately the same range of frequencies. However, the modulation spectrogram frequency axis is nonlinear, in accordance with the human spatial frequency coordinates described in [5]. While the narrow-band spectrogram of the clean speech sample clearly portrays features of the speech signal such as onsets, formant trajectories, and harmonic structure, these features are all but lost in the narrow-band spectrogram of the noisy speech, where only a few spectro-temporal peaks stand out above the noise. In contrast to the spectrographic representation, the modulation spectrogram of the clean speech provides only a coarse picture of the energy distribution in time and frequency. The fine details, such as harmonicity, are not preserved. A comparison of the modulation spectrograms for the clean and noisy speech samples illustrates the stability of the representation. The major features of the modulation spectrogram observed for clean speech are preserved in the modulation spectrogram for speech embedded in intense levels of noise.
¹The narrow-band spectrograms were computed by pre-emphasizing the speech, sampled at 8 kHz, with the filter H(z) = 1 - 0.97z⁻¹, then performing 512-point FFTs with a 64-ms Hamming window and a 16-ms window step. A lower threshold of -30 dB was applied in the energy-to-color mapping.
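The footnote's recipe is compact enough to sketch directly. The function name is hypothetical, and applying the -30 dB floor relative to the global peak is an assumption (the footnote says only that a lower threshold of -30 dB was used):

```python
import numpy as np

def narrowband_spectrogram(x, fs=8000):
    """Narrow-band spectrogram per the footnote: pre-emphasis,
    512-point FFTs, 64-ms Hamming window, 16-ms step, -30 dB floor."""
    # Pre-emphasis filter H(z) = 1 - 0.97 z^-1.
    y = np.append(x[0], x[1:] - 0.97 * x[:-1])
    win_len = int(0.064 * fs)   # 64-ms Hamming window = 512 samples at 8 kHz
    hop = int(0.016 * fs)       # 16-ms window step = 128 samples
    win = np.hamming(win_len)
    # Squared 512-point FFT magnitudes for each windowed frame.
    frames = [np.abs(np.fft.rfft(y[i:i + win_len] * win, 512)) ** 2
              for i in range(0, len(y) - win_len + 1, hop)]
    sgram = 10.0 * np.log10(np.asarray(frames).T + 1e-12)
    # Lower threshold of -30 dB (assumed relative to the global peak).
    return np.maximum(sgram, sgram.max() - 30.0)
```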
4. AUTOMATIC RECOGNITION BASED ON MODULATION SPECTROGRAPHIC FEATURES
A similar representational stability is observed for reverberant speech, and has been demonstrated in tests with an automatic speech recognition system. In these tests, the performance of a hybrid hidden Markov model/multilayer perceptron (HMM/MLP) recognizer trained on clean speech is measured on clean and reverberant speech. The performance of recognizers using different front-end feature extraction methods is compared on the two test sets. Aside from the front-end processing, the recognizers are identical, using similarly-sized MLPs for phonetic probability estimation, and the same HMM word models and class bigram grammar for speech decoding. Further details on the recognition experiments are provided in [6].

Table 1 compares the performance of a recognizer using features based on the modulation spectrogram² with the performance of a recognizer that uses PLP features [7].
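As a quick arithmetic check of the figure quoted in the Table 1 caption, the roughly 23% relative reduction follows directly from the two deletion rates reported for reverberant speech:

```python
# Deletion rates on reverberant speech, from Table 1.
plp_deletions = 33.8   # PLP features
msg_deletions = 25.9   # modulation-spectrogram features

# Relative reduction: (33.8 - 25.9) / 33.8 = 23.4%, matching the caption.
relative_reduction = (plp_deletions - msg_deletions) / plp_deletions * 100.0
print(f"{relative_reduction:.1f}% relative reduction in deletions")
```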
5. THE IMPORTANCE OF THE SYLLABLE IN SPEECH RECOGNITION
A central problem in speech science is the explication of the process by which the brain is able to go from sound to meaning. The traditional models posit a complex and somewhat arbitrary series of operations that advance from the acoustic signal to phonemic units, from phonemic units to words, and from words to meaning through a language's grammar. However, even a cursory examination of the statistical properties of speech indicates that the relationship between sound and symbol is anything but arbitrary. Instead, it appears that speech is organized into syllable-like units at both the acoustic and lexical levels, and that these
²The features are computed in quarter-octave bands, the modulation transfer function of the system is flat between 0 and 8 Hz, and no thresholding is applied to the output. The most important difference between these features is the absence of thresholding. If thresholding is used for automatic recognition, the recognition performance on clean speech degrades. However, the stability of the representational format is enhanced by some degree of thresholding.
[Figure 2 panels (time 0-1.2 s, frequency in Hz): clean speech: modulation spectrogram and 0 dB SNR speech: modulation spectrogram (frequency axis 225-3500 Hz); clean speech: narrowband spectrogram and noisy speech: narrowband spectrogram (frequency axis 0-4000 Hz).]
These patterns are far more clearly delineated in the original color versions, which are available in the CD-ROM version of the proceedings and at http://www.icsi.berkeley.edu/~bedk/ICASSP97_fig2_color.gif
Figure 2. A comparison of the modulation spectrogram and narrow-band spectrogram for clean and noisy speech.
syllable-like units are the basis for lexical access from the acoustics of the speech signal.

It has been previously suggested that the broad peak at 4 Hz in the modulation spectrum corresponds to the average syllable rate [8]. Recently, we have found a more specific correlation between the distribution of low-frequency modulations in speech and the statistical distribution of syllable durations in spoken discourse [9]. It has also been shown that the concentrations of energy in the modulation spectrographic display correspond to syllabic nuclei. Thus, it appears that the modulation spectrogram robustly extracts information pertaining to the syllabic segmentation of speech, and that this information is of some utility in recognizing speech under adverse acoustic conditions [10].

Two common objections to a syllabic representation of English are the relatively complex and heterogeneous syllable structure of English and the large number of syllables required to cover the lexical inventory. However, these theoretical concerns are not borne out in practice. In spoken English, over 80% of the syllables are of the canonical CV, CVC, VC, and V forms, and many of the remainder reduce to these formats by processes of assimilation and reduction. In written English, only 12 syllables comprise over 25% of all syllable occurrences, and 339 syllables account for 75% of all syllable occurrences [11]. Spoken English employs a similarly reduced syllabic inventory [12, 13].

The robust encoding of syllabic structure by low-frequency modulations in speech, the sensitivity of the human auditory system to these modulations, and the statistics demonstrating that, in practice, English has a relatively simple syllabic structure and relies on a small subset of the possible syllables all support a model of real-time human speech perception in which auditory mechanisms parse the speech signal into syllable-like units and a core vocabulary of a few hundred, highly familiar syllables supports efficient lexical access. This model is described in more detail in [14].
6. CONCLUSIONS
We have developed a new representational format for speech that captures many important properties of the auditory cortical representation of speech, namely selectivity for the slow modulations in the signal that encode phonetic information, critical-band frequency analysis, automatic gain control, and sensitivity to spectro-temporal peaks in the signal. These signal processing strategies produce a representation with greater stability in low-SNR and reverberant conditions than conventional speech representations. The enhanced stability of the modulation spectrogram provides a potentially useful tool for research in human speech perception, speech coding, and automatic speech recognition.
REFERENCES
[1] Homer Dudley. Remaking speech. JASA, 11(2):169-177, October 1939.
[2] Rob Drullman, Joost M. Festen, and Reinier Plomp. Effect of temporal envelope smearing on speech reception. JASA, 95(2):1053-1064, February 1994.
[3] Caroline L. Smith, Catherine P. Browman, Richard S. McGowan, and Bruce Kay. Extracting dynamic parameters from speech movement data. JASA, 93(3):1580-1588, March 1993.
[4] Christoph E. Schreiner and John V. Urbas. Representation of amplitude modulation in the auditory cortex
                     PLP               mod. spectrogram
                 clean    reverb       clean    reverb
substitutions     9.2%    33.5%        11.8%    37.7%
deletions         3.2%    33.8%         3.5%    25.9%
insertions        3.5%     2.7%         2.1%     2.6%
total            15.8%    70.1%        17.8%    66.1%

Table 1. Comparison of PLP and modulation spectrographic features for recognition of clean and reverberant speech. For the number of words in the test set (2426), the difference in performance on clean speech is not statistically significant, while the difference in performance on the reverberant speech is. Note that the performance improvement in reverberation for the modulation spectrographic features over PLP features comes almost entirely from a 23% relative reduction in the deletion rate.
of the cat. I. The anterior auditory field (AAF). Hearing Research, 21(3):227-241, 1986.
[5] Donald D. Greenwood. Critical bandwidth and the frequency coordinates of the basilar membrane. JASA, 33:1344-1356, 1961.
[6] Brian E. D. Kingsbury and Nelson Morgan. Recognizing reverberant speech with RASTA-PLP. In Proc. ICASSP-97. IEEE, 1997.
[7] Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. JASA, 87(4):1738-1752, April 1990.
[8] Tammo Houtgast and Herman J. M. Steeneken. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility. JASA, 77(3):1069-1077, March 1985.
[9] Steven Greenberg, Joy Hollenback, and Dan Ellis. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. In Proc. ICSLP-96, 1996.
[10] Su-Lin Wu, Michael L. Shire, Steven Greenberg, and Nelson Morgan. Integrating syllable boundary information into speech recognition. In Proc. ICASSP-97. IEEE, 1997.
[11] Godfrey Dewey. Relative Frequency of English Speech Sounds, volume 4 of Harvard Studies in Education. Harvard University Press, Cambridge, 1923.
[12] Norman R. French, Charles W. Carter, Jr., and Walter Koenig, Jr. The words and sounds of telephone conversations. The Bell System Technical Journal, IX:290-325, April 1930.
[13] Steven Greenberg, Joy Hollenback, and Dan Ellis. The Switchboard transcription project. Technical report, International Computer Science Institute, 1997.
[14] Steven Greenberg. Understanding speech understanding: Towards a unified theory of speech perception. In William Ainsworth and Steven Greenberg, editors, Proc. of the ESCA Workshop on the Auditory Basis of Speech Perception, pages 1-8. ESCA, 1996.