Speech Preprocessing, SpeechProduction, Cepstrum
Jan Cernocky, Valentina Hubeika
{cernocky|ihubeika}@fit.vutbr.cz
FIT BUT Brno
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno1/46
Agenda
• Speech Parameterization
– Preprocessing– Basic Parameters: short-time energy, zero crossing rate.
• Speech production and its model.
• Spectrogram.
• Separation of excitation and modification – cepstrum.
• Approximation of cepstra according to the human auditory system’s response– MFCC.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno2/46
PARAMETERIZATION
• Goal: express a signal on a limited number of values – a) “parameterization”, b)
“feature extraction”.
• a) representation based on findings in signals processing (filter banks, Fourier
transform, etc.) ⇒ non-parametric representation.
• b) representation based on findings about speech production ⇒ parametric
representation.
BUT:
• b) also makes use of the techniques of non-parametric representation, thus difficult
(sometimes unfavorable) to distinguish between the two groups.
• The calculated values are anyway usually called parameters.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno3/46
Parameters
• scalar per frame – one number calculated from a speech frame (short-time energy or
zero crossing rate).
• vector per frame – a set of numbers (vector) calculated from a speech frame. When
having a sequence of frames, parameters are usually stored in matrices.
time
index ...
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno4/46
PRE-PROCESSING
Mean Normalization
Direct current offset (dc-offset) carries no useful information. Moreover, can carry
disturbing information (when calculating energy).
0 200 400 600 800 1000 1200 1400 1600 1800−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5x 10
4
n
s(n)
dc-offset removal!
s′[n] = s[n] − µs, µs must be estimated. (1)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno5/46
Mean Value Off-Line
Here, equivalent to the average value:
s =1
N
N∑
n=1
s[n] (2)
0 200 400 600 800 1000 1200 1400 1600 18005256.5
5257
5257.5
5258
5258.5
5259
n
mea
n of
f−lin
e
0 200 400 600 800 1000 1200 1400 1600 1800−2
−1.5
−1
−0.5
0
0.5
1
1.5
2x 10
4
n
s’(n)
offl
ine
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno6/46
Mean Value On-Line
The whole signal is not (yet) available: either is too long or there is a flow of new values..
s[n] = γs[n − 1] + (1 − γ)s[n], (3)
where γ −→ 1. This is equivalent to passing a signal through a filter with the impulse
response:
h = [(1 − γ) (1 − γ)γ (1 − γ)γ2 . . .]. (4)
Defined as the geometric progression: the initial element is a0 = 1 − γ and the quotient is
q = γ. The sum is thus:∞∑
n=0
h[n] =a0
1 − q=
1 − γ
1 − γ= 1, (5)
(this is what we originally expected ,).
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno7/46
Example, γ = 0.99 (see the first computer lab):
0 200 400 600 800 1000 1200 1400 1600 18000
1000
2000
3000
4000
5000
6000
n
mea
n on
−lin
e, γ
=0.9
9
0 200 400 600 800 1000 1200 1400 1600 1800−2
−1.5
−1
−0.5
0
0.5
1
1.5
2x 10
4
n
s’(n)
onl
ine
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno8/46
Preemphasis
Increases the magnitude of the higher frequencies with respect to the magnitude of the
lower frequencies. Equalization of the speech frequency characteristics (the magnitude
decreases towards higher frequencies). Rather a historical operation.
A simple first-order FIR filter:
H(z) = 1 − κz−1, (6)
where κ ∈ [0.9, 1]. Calculated difference between the two neighboring samples. The
magnitude frequency characteristics for κ=0.95:
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno9/46
0 0.1 0.2 0.3 0.4 0.50
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
normalized frequency
|H(f)
|, κ=
0.95
Passing through the defined filter:
s′[n] = s[n] − κs[n − 1] (7)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno10/46
0 20 40 60 80 100 120 140 160 180−2
−1.5
−1
−0.5
0
0.5
1
1.5
2x 10
4
n
orig
inal
0 20 40 60 80 100 120 140 160 180−1
−0.5
0
0.5
1
1.5x 10
4
n
orig
inal
⇒ The processed signal (after applying preemphasis) contains of more higher frequencies.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno11/46
FRAMES
• Why?
• Speech signal is considered as random, parameter estimation methods require
stationary signals.
• Thus dividing the signal into shorter segments (segments, micro-segments, frames)
within which the signal behaves (we hope) as stationary.
• Frame parameters: length lram, overlap pram, frame shift sram = lram − pram.
Frame Length
1. short enough to assume the signal (within the given length) is stationary.
2. BUT: long enough to provide accurate estimation of the desired parameters (features).
⇒ trade off (momentum of the articulation tract), typical length 20–25 ms (160–200
samples for Fs =8000 Hz).
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno12/46
Overlap
• small or none: , fast time shift in the signal, low memory/processor demands, / the
difference of the parameter values of the neighboring frames can be significant.
• large: , slow time shift, smooth change in the parameter values, / high
memory/processor demands, alike parameter values (violates the independency
assumption!).
⇒ tradeoff, typical length 10 ms, thus 100 frames per second, centi-second vectors.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno13/46
How many frames per segment of the length N ?
No overlap, pram = 0
lram
N
ramp =0
Nram =
⌊
N
lram
⌋
, (8)
. . . ⌊·⌋ denotes the operation ’floor’.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno14/46
Frames overlap, pram 6= 0
lram
ramp
N
rams
Nram = 1 +
⌊
N − lram
sram
⌋
(9)
. . . the signal must be at least one frame long.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno15/46
Signal Segmentation - Windowing Function
Select a frame of a signal using a window - window(ing) function:
Rectangular – no change of the signal, selection only:
w[n] =
1 pro 0 ≤ n ≤ lram − 1
0 otherwise(10)
Hamming – suppresses the signal at the sides of the window, selection and weighting:
w[n] =
0.54 − 0.46 cos2πn
lram − 1pro 0 ≤ n ≤ lram − 1
0 otherwise(11)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno16/46
How does windowing change the spectrum of the selected segment? A product in time
domain corresponds to convolution of the speech spectrum with the window spectrum.
X(f) = S(f) ⋆ W (f) (12)
Comparison of the rectangular and the Hamming window:
−50 0 50 100 150 200 2500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
n
rect
angu
lar,
l ram
=160
−0.05 0 0.050
20
40
60
80
100
120
140
160
normalized frequency
rect
angu
lar −
spe
ctru
m
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno17/46
−50 0 50 100 150 200 2500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
n
Ham
min
g, l
ram
=160
−0.05 0 0.050
10
20
30
40
50
60
70
80
90
normalized frequency
Ham
min
g −
spec
trum
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno18/46
BASIC PARAMETERS OF SPEECH SIGNAL
all the parameters will be derived for single frames. For each frame:
• scalar ⇒ a row vector.
• vector ⇒ matrix, columns contain dimensions of the parameter vector, rows contain a
sequence of values in a particular dimension over time (time in frames).
Average Short-Time Energy
E =1
lram
lram−1∑
n=0
x2[n] (13)
• speech activity detector.
• separation of phonemes to voiced (high energy) and unvoiced (low energy).
• often we use log-energy.
• careful with noise and low-energy phonemes.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno19/46
Example: “letajıcı prase”
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000−3
−2
−1
0
1
2
3
x 104
n
s(n)
0 20 40 60 80 100 1200
2
4
6
8
10
12
14
16
x 107
frames
lin energy
0 20 40 60 80 100 1206
8
10
12
14
16
18
frames
log energy
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno20/46
Zero-Crossing Rate
. . . rate of sign changes of the signal within a given frame.
Z =1
2
lram−1∑
n=1
| sign x[n] − sign x[n − 1]|, (14)
where sign (x) is the sign function defined as:
sign x[n] =
+1 pro x[n] ≥ 0
−1 pro x[n] < 0(15)
How does it work? The function | signx[n] − sign x[n − 1]| results in 2 when there is a
change in the sign between the samples x[n − 1] and x[n]:
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno21/46
0 20 40 60 80 100 120 140 160−8000
−6000
−4000
−2000
0
2000
4000
6000
8000
10000
12000
n
zero cro
ssings
• distinguishing between the voiced (low zero-crossing rate) and unvoiced (high rate,
rather like in noise).
• very sensitive to noise. . .
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno22/46
Example: “letajıcı prase”
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000−3
−2
−1
0
1
2
3
x 104
n
s(n)
0 20 40 60 80 100 120
10
20
30
40
50
60
70
frames
zero cros
sings
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno23/46
HUMAN VOCAL APPARATUS AND ITS MODEL
(Adopted from Wikipedia )
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno24/46
Organs and their Models in Digital Processing
• lungs — energy source — signals: none
• larynx — energy modulation – signals: excitation.
– opened vocal cords — noise.
– vibrating vocal cords — periodic signal (tone). fundamental frequency:
males 90–120 Hz
females 150–300 Hz
children 350–400 Hz
• vocal (articulatory) tract — modification tract — signals: filter.
– pharynx.– velum.– tongue.– oral and nasal cavity.– teeth.– lips.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno25/46
Model
±°²¯¡¡µ
±°²¯
±°²¯
¡¡µ
-
-
?
6
- -currentsource
Σ
= vibrating vocal cords
= lungs
= open vocal cords
= articulatory tract
impulsgenerator
noisegenerator
lineartransfersystem(filter)
speech
Transfer system: linear filter - usually IIR.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno26/46
Vocal Tract Model in Time and Frequency Domain
cca -12 dB/okt.
h(t)
t
b)
G(f)d)
f f
H(f)
f
F1F2
F3
f)
F0
t t
s(t)c)a)
g(t) T0
e)S(f)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno27/46
Top part – time behaviour, bottom part – spectrum.
• a) and d) excitation: T0 is the period, F0 is the fundamental frequency (pitch).
• b) and e) articulation tract: F1 to F3 are the formants (resonance frequency of the
vocal tract), are given by the physical configuration of the vocal tract.
• c) and f) the resulting signal and its spectrum.
The resulting signal is given in the time domain by convolution:
s(t) = g(t) ⋆ h(t) =
∫ +∞
−∞
g(τ)h(t − τ)dτ. (16)
Convolution in time domain corresponds to product in frequency:
S(f) = G(f)H(f). (17)
A relevant task in speech processing is de-convolution; the goal is to separate excitation
and modification.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno28/46
SPECTROGRAM
One spectrum is not enough (speech is non-nonstationary) ⇒ representation of the
spectrum (strictly speaking PSD) behaviour over time:
• segment speech into frames.
• estimate the PSD for each frame, usually using DFT.
• depict:
– horizontal axes represents time (“rough” time in frames).– vertical axes represents frequency.– color represents energy.
Depending on the frame length we talk about:
• long-term spectrogram.
• short-term spectrogram.
/ Drawback of DFT: Fine scale in frequency and time domain cannot be satisfied
simultaneously
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno29/46
long-term: specgram(s,256,8000,hamming(256),200);
0 0.2 0.4 0.6 0.8 1 1.20
500
1000
1500
2000
2500
3000
3500
4000
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno30/46
short-term: specgram(s,256,8000,hamming(50));
0 0.2 0.4 0.6 0.8 1 1.20
500
1000
1500
2000
2500
3000
3500
4000
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno31/46
CEPSTRUM
. . . separates excitation from modification – convenient for encoding; dropping excitation
frequency in speech processing (excitation carries information dependent on speaker,
mood,. . . )
What can we do.. 1: filter off frequency lower than 400 Hz and get rid of the fundamental
frequency. . . BAD IDEA:
0 500 1000 1500 2000 2500 3000 3500 4000100
120
140
160
180
200
220
240
260
280
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno32/46
• fundamental frequency folds are present
along the whole spectrum.
• we can loose the fist formant.
• land line band starts on 300 Hz and we
still can recognize the pitch.
• . . . so we need a better approach.
⇒ Cepstrum
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno33/46
Challenge
Excitation e(t) is convoluted with the filter (modification) impulse response:
s(t) = g(t) ⋆ h(t) =
∫ +∞
−∞
g(τ)h(t − τ)dτ, (18)
which in frequency domain corresponds to product:
S(f) = G(f)H(f). (19)
we cannot well separate the two components in either domain. Solution: non-linearity,
which can translate product to summation.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno34/46
Definition of Cepstrum
lnG(f) =
+∞∑
n=−∞
c(n)e−j2πfn (20)
The c(n) values are the cepstral coefficients. Since G(f) is an even function, c(n) are
real and the following holds:
c(n) = c(−n) (21)
The sum in the equation is the definition of DFT, hence we can compute the c(n) as:
c(n) = F−1 [lnG(f)] (22)
DFT-cepstrum
c(n) = F−1{
ln |F [s(n)]|2}
, (23)
• spectrum −→ cepstrum.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno35/46
Can it really “break” convolution?
s(n) = e(n) ⋆ h(n), (24)
S(f) = E(f)H(f) a thus |S(f)|2 = |E(f)|2 |H(f)|2. (25)
For the cepstrum calculation we make use of the linearity of the inverse Fourier transform:
F−1(a + b) = F−1(a) + F−1(b).
It results in:
c(n) = F−1{
ln[|E(f)|2 |H(f)|2]}
= F−1{
ln |E(f)|2 + ln |H(f)|2}
= (26)
= F−1{
ln |E(f)|2}
+ F−1{
ln |H(f)|2}
= ce(n) + ch(n) (27)
(28)
Convolution becames summation. The coefficients ce(n) and ch(n) are separable in
frequency, which allows to separate them by windowing.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno36/46
signal, |F [s(n)]|2, ln |F [s(n)]|2, F−1{
ln |F [s(n)]|2}
.
50 100 150 200 250 300 350 400 450 500
−1.5
−1
−0.5
0
0.5
1
1.5
x 104
0 500 1000 1500 2000 2500 3000 3500 40000
5
10
15x 10
11
0 500 1000 1500 2000 2500 3000 3500 400012
14
16
18
20
22
24
26
28
30
50 100 150 200 250 300 350 400 450 500
0
2
4
6
8
10
12
14
16
18
20
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno37/46
0 20 40 60 80 100 120 140 160 180 200
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
excitation
1st multipleof lag
2nd multipleof lag
filter
c0 is log−energy
For the sampling frequency Fs =8000 Hz, we can separate excitation from modification in
frequency domain using threshold of 30.
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno38/46
Excitation only – set the cepstra related to modification to zero:
modified cepstrum, ln |F [s(n)]|2, |F [s(n)]|, signal (after IDFT, phases of the
original signal are used).
50 100 150 200 250 300 350 400 450 500
−0.25
−0.2
−0.15
−0.1
−0.05
0
0.05
0.1
0.15
0.2
0.25
0 500 1000 1500 2000 2500 3000 3500 4000−7
−6
−5
−4
−3
−2
−1
0
1
2
3
0 100 200 300 400 500 6000
0.5
1
1.5
2
2.5
3
3.5
4
50 100 150 200 250 300 350 400 450 500
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno39/46
Modification only – set the cepstra related to excitation to zero:
modified cepstrum, ln |F [s(n)]|2, |F [s(n)]|, signal (after IDFT, phases are set to
zero).
50 100 150 200 250 300 350 400 450 500
0
2
4
6
8
10
12
14
16
18
20
0 500 1000 1500 2000 2500 3000 3500 400016
17
18
19
20
21
22
23
24
25
26
0 100 200 300 400 500 6000
0.5
1
1.5
2
2.5
3
3.5x 10
5
0 10 20 30 40 50 60 70 80 90−2
−1
0
1
2
3
4
5
x 104
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno40/46
Mel-frequency cepstrum – MFCC
• DFT has equivalent frequency resolution along the axes.
• Human ear has higher resolution on lower frequencies than on higher frequencies.
• We want to adjust cepstrum to human hearing.
how do we do it?
• Place filters along the frequency axes non-linearly, calculate the energy on the output,
use the calculated values instead of DFT when calculating cepstra.
• Calculate non-linear frequency axes and place the filters on the modified axes linearly,
. . .
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno41/46
Convert Hertz to Mel (to transform the frequency axes):
FMel = 2959 log10(1 +FHz
700) (29)
0 500 1000 1500 2000 2500 3000 3500 40000
500
1000
1500
2000
2500
frequency [Hz]
freq
uenc
y [M
el]
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno42/46
When linearly placed on the Mel axes, the filters correspond to the non-linearly placed
filters on the Hertz axes:
m m m m m m1 2 3 4 5 6
frequency
1
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno43/46
Energy estimation:
1. construct filter bank, filter the input signal in time domain and calculate
energy:∑
n s2i (n) . . . TOO DIFFICULT.
2. apply DFT, power, multiply by a triangular window and add up. (used in the HTK
toolkit).
Inverse FT can be realized by the discrete cosine transform (DCT) . . . (without derivation:
makes use of symmetry of the spectrum and the fact that the result should be real):
cmf (n) =
K∑
i=1
log mk cos[
n(k − 0.5)π
K
]
(30)
⇒ Mel-frequency cepstral coefficients (MFCC)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno44/46
Short
Term
DFT
Mel
Filter
Bank
.
.
.
.
.
ln
Sl(1,t)
Sl(2,t)
Sl(23,t)
InputSpeech DCT
C1
C2
C13
| . |
| . |
| . | ln
ln
.
.
.
.
.
S(1,t)
S(2,t)
S(129,t)
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno45/46
0 5 10 15 20 25−1
−0.5
0
0.5
1a) Segment of speech signal for vowel ’iy’
time [ms]0 5 10 15 20 25
−0.5
0
0.5b) Speech segment after preemphasis and windowing
time [ms]
0 500 1000 1500 2000 2500 3000 3500 40000
1
2
3
4
5
6c) Fourier spectrum of speech segment
frequency [Hz]1 3 5 7 9 11 13 15 17 19 21 23
0
5
10
15
20d) Filter bank energies − smoothed spectrum
band number
1 3 5 7 9 11 13 15 17 19 21 23−4
−3
−2
−1
0
1
2
3
4e) Log of filter bank energies
band number2 4 6 8 10 12
−5
0
5f) Mel frefuency cepstral coefficients
mel quefrency
Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno46/46