A 12-Week Project in Speech Coding and Recognition, by Fu-Tien Hsiao and Vedrana Andersen
TRANSCRIPT
Overview
An Introduction to Speech Signals (Vedrana)
Linear Prediction Analysis (Fu)
Speech Coding and Synthesis (Fu)
Speech Recognition (Vedrana)
AN INTRODUCTION TO SPEECH SIGNALS
Speech Production
Flow of air from the lungs, vibrating vocal cords, speech production cavities, lips, sound wave
Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)
AN INTRODUCTION TO SPEECH SIGNALS
Speech Signals
Sampling frequency 8 – 16 kHz
Short-time stationary assumption (frames 20 – 40 ms)
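The short-time assumption is usually implemented by cutting the signal into short frames. A minimal sketch (the function name and the 30 ms frame / 15 ms hop sizes are illustrative choices within the 20 – 40 ms range, not values from the project):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into overlapping short-time frames (hypothetical helper)."""
    frame_len = int(fs * frame_ms / 1000)      # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)              # 120 samples at 8 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(8000)          # one second of noise at 8 kHz
frames = frame_signal(x)
print(frames.shape)                # (65, 240): 65 frames of 240 samples
```

Each row of `frames` can then be analyzed as if it came from a stationary process.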
AN INTRODUCTION TO SPEECH SIGNALS
Model for Speech Production
Excitation (periodic, noisy)
Vocal tract filter (nasal cavity, oral cavity, pharynx)
AN INTRODUCTION TO SPEECH SIGNALS
Voiced and Unvoiced Sounds
Voiced sounds: periodic excitation, pitch period
Unvoiced sounds: noise-like excitation
Short-time measures: power and zero-crossing
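The two short-time measures can be sketched as follows (a hypothetical illustration: a voiced-like frame is simulated with a low-frequency sinusoid and an unvoiced-like frame with low-amplitude white noise):

```python
import numpy as np

def short_time_power(frame):
    # Average energy per sample in the frame.
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    # Approximate fraction of adjacent samples with a sign change.
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

fs = 8000
n = np.arange(240)                                # one 30 ms frame
voiced = np.sin(2 * np.pi * 100 * n / fs)         # periodic, low frequency
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(240)         # noise-like, low amplitude

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))   # True
print(short_time_power(voiced) > short_time_power(unvoiced))       # True
```

Voiced frames tend to show high power and a low zero-crossing rate; unvoiced frames show the opposite, which is what makes these two measures useful for the voiced/unvoiced decision.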
AN INTRODUCTION TO SPEECH SIGNALS
Frequency Domain
Pitch, harmonics (excitation)
Formants, envelope (vocal tract filter)
Harmonic product spectrum
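The harmonic product spectrum can be sketched as follows (a minimal illustration on a synthetic harmonic frame; the 50 Hz lower bound and the three-harmonic depth are arbitrary choices, not values from the project):

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=3):
    """Pitch estimate via the harmonic product spectrum: multiply the
    magnitude spectrum by its decimated copies so energy piles up at f0."""
    spec = np.abs(np.fft.rfft(frame))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        dec = spec[::h]
        hps[:len(dec)] *= dec
    lo = int(50 * len(frame) / fs)            # skip the near-DC region
    hi = len(spec) // n_harmonics             # bins where all harmonics exist
    k = lo + int(np.argmax(hps[lo:hi]))
    return k * fs / len(frame)

fs = 8000
n = np.arange(2048)
f0 = 125.0
# A harmonic-rich "voiced" frame: fundamental plus three harmonics.
frame = sum(np.sin(2 * np.pi * f0 * h * n / fs) / h for h in range(1, 5))
print(hps_pitch(frame, fs))   # 125.0 (f0 falls exactly on bin 32)
```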
AN INTRODUCTION TO SPEECH SIGNALS
Speech Spectrograms
Time-varying formant structure
Narrowband / wideband
LINEAR PREDICTION ANALYSIS
Categories
Vocal Tract Filter
Linear Prediction Analysis
Error Minimization
Levinson-Durbin Recursion
Residual sequence u(n)
LINEAR PREDICTION ANALYSIS
Vocal Tract Filter (1)
Vocal tract filter: what if we assume an all-pole filter?
Input: periodic impulse train, $U_g(z)$; output: speech, $S(z)$

$$H(z) = \frac{S(z)}{U_g(z)}$$
LINEAR PREDICTION ANALYSIS
Vocal Tract Filter (2)
Autoregressive model (all-pole filter):

$$s(n) = a_1 s(n-1) + a_2 s(n-2) + \dots + a_p s(n-p) + A u_g(n)$$
$$S(z) = a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z) + A U_g(z)$$
$$H(z) = \frac{S(z)}{U_g(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

where $p$ is called the model order. Speech is a linear combination of past samples and an extra part, $A u_g(n)$.
LINEAR PREDICTION ANALYSIS
Linear Prediction Analysis (1)
Goal: how do we find the coefficients $a_k$ in this all-pole model?
[Figure: the impulse $A u_g(n)$ drives the all-pole model to produce the speech $s(n)$; an unknown "?" box produces the error $e(n)$]
Physical model vs. analysis system: the $a_k$ here are fixed but unknown! We try to find $\alpha_k$ to estimate $a_k$.
LINEAR PREDICTION ANALYSIS
Linear Prediction Analysis (2)
What is really inside the "?" box? A predictor ($P(z)$, an FIR filter), where
$$\hat{s}(n) = \alpha_1 s(n-1) + \alpha_2 s(n-2) + \dots + \alpha_p s(n-p)$$
If $\alpha_k \approx a_k$, then $e(n) \approx A u_g(n)$.
Predictive error: $e(n) = s(n) - \hat{s}(n)$
[Figure: the original $s(n)$ feeds $P(z)$ to give the predictive $\hat{s}(n)$; subtracting it from $s(n)$ gives $e(n)$, so the analysis filter is $A(z) = 1 - P(z)$]
LINEAR PREDICTION ANALYSIS
Linear Prediction Analysis (3)
If we can find a predictor generating the smallest error $e(n)$, which is close to $A u_g(n)$, then we can use $A(z)$ to estimate the filter coefficients.
[Figure: $e(n) \approx A u_g(n)$ drives $1/A(z)$ to give $\hat{s}(n)$, very similar to the vocal tract model]
LINEAR PREDICTION ANALYSIS
Error Minimization (1)
Problem: how to find the minimum error? Energy of the error:

$$E = \sum_n e^2(n), \quad \text{where } e(n) = s(n) - \hat{s}(n) \text{ is a function of the } \alpha_i$$

Since $E$ is a quadratic function of the $\alpha_i$, we can find its smallest value by setting, for each $i$,

$$\partial E / \partial \alpha_i = 0$$

By differentiation, we define $\varphi(i, k) = \sum_n s(n-i)\, s(n-k)$; this is actually an autocorrelation of $s(n)$.
LINEAR PREDICTION ANALYSIS
Error Minimization (2)

$$\sum_n s(n)\, s(n-i) = \sum_{k=1}^{p} \alpha_k \sum_n s(n-k)\, s(n-i)$$

With $\varphi(i, k) = \sum_n s(n-k)\, s(n-i)$ for $1 \le i \le p$, $1 \le k \le p$, this is a set of linear equations.
LINEAR PREDICTION ANALYSIS
Error Minimization (3)
Hence, let's discuss the linear equations in matrix form:

$$\begin{pmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix} = \begin{pmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{pmatrix}$$

The linear prediction coefficients $\alpha_k$ are our goal. How can we solve this efficiently?
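The system can be solved directly with a generic linear solver, which is the 'brute force' route. A sketch on a synthetic AR(2) signal with known coefficients (all names and values are illustrative, not from the project):

```python
import numpy as np

def lpc_normal_equations(s, p):
    """Solve the autocorrelation normal equations R alpha = r directly
    with a generic O(p^3) linear solver (the 'brute force' route)."""
    r = np.array([s[:len(s) - k] @ s[k:] for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Synthetic AR(2) signal with known coefficients (1.3, -0.4).
rng = np.random.default_rng(1)
s = np.zeros(5000)
e = rng.standard_normal(5000)
for n in range(2, 5000):
    s[n] = 1.3 * s[n - 1] - 0.4 * s[n - 2] + e[n]

alpha = lpc_normal_equations(s, 2)
print(alpha)   # estimates close to (1.3, -0.4)
```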
LINEAR PREDICTION ANALYSIS
Levinson-Durbin Recursion (1)
In the matrix, the L-D recursion method is based on the following characteristics: symmetric, Toeplitz.
Hence we can solve the system in O(p²) instead of O(p³).
Don't forget our objective, which is to find the $\alpha_k$ to simulate the vocal tract filter.
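A minimal sketch of the recursion (the sign convention below returns the predictor coefficients αk; the toy autocorrelation values are illustrative):

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: solves the Toeplitz normal equations
    in O(p^2) by growing the predictor one order at a time.
    Returns the predictor coefficients alpha_k and the final error energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0                      # A(z) convention: 1 + sum_j a[j] z^-j
    err = r[0]
    for i in range(1, p + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err              # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:], err              # alpha_k = -a[k]

# Toy autocorrelation values (illustrative); the direct matrix solve of
# [[2, 1], [1, 2]] alpha = [1, 0.5] gives the same answer, [0.5, 0].
r = np.array([2.0, 1.0, 0.5])
alpha, err = levinson_durbin(r, 2)
print(alpha, err)   # alpha = [0.5, 0], err = 1.5
```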
LINEAR PREDICTION ANALYSIS
Levinson-Durbin Recursion (2)
In the exercise, we solve the system both by 'brute force' and by L-D recursion; there is no difference in the resulting parameters.
[Figure: error energy vs. predictor order]
LINEAR PREDICTION ANALYSIS
Residual sequence u(n)
After knowing the filter coefficients, we can find the residual sequence $u(n)$ by inverse filtering: passing $s(n)$ through $A(z)$ yields $u(n)$.
Try comparing the original $s(n)$ with the residual $u(n)$.
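Inverse filtering is just running s(n) through the FIR analysis filter A(z). A sketch (the AR(2) coefficients are illustrative) that checks the recovered residual equals the excitation which generated the signal:

```python
import numpy as np

def residual(s, alpha):
    """Inverse filtering: run s(n) through the FIR analysis filter
    A(z) = 1 - sum_k alpha_k z^-k to get the residual u(n)."""
    u = s.copy()
    for k in range(1, len(alpha) + 1):
        u[k:] -= alpha[k - 1] * s[:-k]
    return u

# Build a signal from a known AR(2) filter and known excitation e(n),
# then check that inverse filtering recovers e(n).
alpha = np.array([1.3, -0.4])
rng = np.random.default_rng(3)
e = rng.standard_normal(200)
s = np.zeros(200)
for n in range(200):
    s[n] = e[n]
    for k in (1, 2):
        if n - k >= 0:
            s[n] += alpha[k - 1] * s[n - k]

u = residual(s, alpha)
print(np.allclose(u, e))   # True
```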
SPEECH CODING AND SYNTHESIS
Categories
Analysis-by-Synthesis
Perceptual Weighting Filter
Linear Predictive Coding
Multi-Pulse Linear Prediction
Code-Excited Linear Prediction (CELP)
CELP Experiment
Quantization
SPEECH CODING AND SYNTHESIS
Analysis-by-Synthesis (1)
Analyze the speech by estimating an LP synthesis filter
Compute a residual sequence as an excitation signal to reconstruct the signal
Encoder/decoder: parameters like the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded
SPEECH CODING AND SYNTHESIS
Analysis-by-Synthesis (2)
Frame by frame; without error minimization, or with error minimization.
[Figure: encoder block diagram: $s(n)$ enters LP analysis; an excitation generator drives the LP synthesis filter to produce $\hat{s}(n)$; the error $e(n) = s(n) - \hat{s}(n)$ feeds error minimization; the LP parameters and excitation parameters are sent to the channel]
SPEECH CODING AND SYNTHESIS
Perceptual Weighting Filter (1)
Perceptual masking effect: within the formant regions, one is less sensitive to noise
Idea: design a filter that de-emphasizes the error in the formant regions
Result: synthetic speech with more error near formant peaks but less error elsewhere
SPEECH CODING AND SYNTHESIS
Perceptual Weighting Filter (2)
In the frequency domain: LP synthesis filter vs. PW filter.
Perceptual weighting coefficient: for $\alpha = 1$, no filtering; as $\alpha$ decreases, more filtering is applied. $\alpha$ depends on perception.

$$Q(z) = \frac{1 - P(z)}{1 - P(z/\alpha)} = \frac{A(z)}{A(z/\alpha)}$$
SPEECH CODING AND SYNTHESIS
Perceptual Weighting Filter (3)
In the z-domain: LP filter vs. PW filter.
Numerator: generates zeros at the original poles of the LP synthesis filter.
Denominator: places the poles closer to the origin; $\alpha$ determines the distance.
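The pole-scaling effect can be checked numerically. In this sketch (coefficients are illustrative) the denominator A(z/α) is obtained by scaling the k-th coefficient of A(z) by α^k, which moves every pole toward the origin by the factor α:

```python
import numpy as np

def pw_filter(a, alpha):
    """Q(z) = A(z) / A(z/alpha): the numerator is A(z) itself; the
    denominator scales the k-th coefficient by alpha**k."""
    num = np.asarray(a, dtype=float)
    den = num * alpha ** np.arange(len(num))
    return num, den

a = np.array([1.0, -1.3, 0.4])          # A(z) with roots (= LP poles) 0.8, 0.5
num, den = pw_filter(a, 0.8)
print(np.round(sorted(np.roots(den)), 4))   # poles 0.5, 0.8 scaled by 0.8 -> 0.4, 0.64
```

With α = 1 the denominator equals the numerator and the filter does nothing, matching the "no filtering" case on the previous slide.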
SPEECH CODING AND SYNTHESIS
Linear Predictive Coding (1)
Based on the methods above: the PW filter and analysis-by-synthesis.
If the excitation signal ≈ an impulse train during voicing, we can get a reconstructed signal very close to the original.
More often, however, the residual is far from an impulse train.
SPEECH CODING AND SYNTHESIS
Linear Predictive Coding (2)
Hence, there are many kinds of coding trying to improve this; they primarily differ in the type of excitation signal. Two kinds:
Multi-Pulse Linear Prediction
Code-Excited Linear Prediction (CELP)
SPEECH CODING AND SYNTHESIS
Multi-Pulse Linear Prediction (1)
Concept: represent the residual sequence by placing impulses so as to make $\hat{s}(n)$ closer to $s(n)$.
[Figure: $s(n)$ enters LP analysis; the excitation generator produces the multi-pulse $u(n)$, which drives the LP synthesis filter to give $\hat{s}(n)$; the difference $s(n) - \hat{s}(n)$ passes through the PW filter into error minimization]
SPEECH CODING AND SYNTHESIS
Multi-Pulse Linear Prediction (2)
s1: Estimate the LPC filter without excitation
s2: Place one impulse (placement and amplitude)
s3: A new error is determined
s4: Repeat s2-s3 until reaching a desired minimum error
[Figure: original, synthetic, multi-pulse and error signals at each step]
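The steps s2-s4 can be sketched as a greedy search (a simplified illustration without the PW filter; the impulse response, frame length, and pulse positions are toy values, not from the project):

```python
import numpy as np

def multipulse_excitation(target, h, n_pulses):
    """Greedy multi-pulse search: each round places one impulse, choosing
    position and amplitude to best cancel the remaining error through
    the synthesis impulse response h."""
    N = len(target)
    u = np.zeros(N)
    err = target.astype(float).copy()
    for _ in range(n_pulses):
        best = (0, 0.0, np.inf)
        for pos in range(N):
            y = np.zeros(N)
            seg = h[:N - pos]
            y[pos:pos + len(seg)] = seg          # response of a unit pulse at pos
            amp = (y @ err) / (y @ y)            # optimal amplitude for this pos
            e = np.sum((err - amp * y) ** 2)
            if e < best[2]:
                best = (pos, amp, e)
        pos, amp, _ = best
        y = np.zeros(N)
        seg = h[:N - pos]
        y[pos:pos + len(seg)] = seg
        u[pos] += amp
        err -= amp * y
    return u, err

# Toy check: a frame generated from two known impulses is recovered.
h = np.array([1.0, 0.6, 0.36, 0.216])      # toy synthesis impulse response
true_u = np.zeros(20)
true_u[3], true_u[11] = 2.0, -1.0
target = np.convolve(true_u, h)[:20]
u, err = multipulse_excitation(target, h, 2)
print(np.flatnonzero(u), np.allclose(err, 0))   # pulses at 3 and 11, zero error
```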
SPEECH CODING AND SYNTHESIS
Code-Excited Linear Prediction (1)
The differences from the multi-pulse generator:
Represent the residual $v(n)$ by codewords (exhaustive search) from a codebook of zero-mean Gaussian sequences
Consider the primary pitch pulses, which are predictable over consecutive periods
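The exhaustive codebook search can be sketched as follows (a simplified illustration that ignores the pitch synthesis and PW filters; the codebook size, frame length, and impulse response are toy values, not from the project):

```python
import numpy as np

def codebook_search(target, codebook, h):
    """Exhaustive CELP-style search: pass each codeword through the
    synthesis impulse response h, pick the (index, gain) pair that
    minimizes the squared error against the target frame."""
    best_i, best_g, best_err = 0, 0.0, np.inf
    for i, c in enumerate(codebook):
        y = np.convolve(c, h)[:len(target)]
        g = (y @ target) / (y @ y)              # optimal gain for this codeword
        err = np.sum((target - g * y) ** 2)
        if err < best_err:
            best_i, best_g, best_err = i, g, err
    return best_i, best_g, best_err

rng = np.random.default_rng(5)
codebook = rng.standard_normal((16, 40))        # 16 zero-mean Gaussian codewords
h = np.array([1.0, 0.5, 0.25])                  # toy synthesis impulse response
target = 2.0 * np.convolve(codebook[7], h)[:40] # frame built from codeword 7
idx, gain, err = codebook_search(target, codebook, h)
print(idx, gain)   # finds codeword 7 with gain 2.0
```

Only the codeword index and gain need to be transmitted, which is what makes the scheme bandwidth-efficient.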
SPEECH CODING AND SYNTHESIS
Code-Excited Linear Prediction (2)
[Figure: encoder block diagram: $s(n)$ enters LP analysis, giving the LP parameters; the Gaussian excitation codebook produces $v(n)$, the pitch synthesis filter (driven by the pitch estimate) produces $u(n)$, and the LP synthesis filter produces $\hat{s}(n)$; the error $s(n) - \hat{s}(n)$ passes through the PW filter into error minimization]
SPEECH CODING AND SYNTHESIS
CELP Experiment (1)
An experiment with CELP.
[Figure: original signal (blue), excitation signal (below), reconstructed signal (green)]
SPEECH CODING AND SYNTHESIS
CELP Experiment (2)
Test the quality for different settings:
1. LPC model order: initial M = 10, test M = 2
2. PW coefficient
SPEECH CODING AND SYNTHESIS
CELP Experiment (3)
3. Codebook (L, K)
K: codebook size. K strongly influences the computation time: reducing K from 1024 to 256 cuts the time from 13 to 6 seconds. Initial (40, 1024), test (40, 16).
L: length of the random signal. L determines the number of subblocks in the frame.
SPEECH CODING AND SYNTHESIS
Quantization
With quantization: 16000 bps CELP vs. 9600 bps CELP.
Trade-off: bandwidth efficiency vs. speech quality.
SPEECH RECOGNITION
Dimensions of Difficulty
Speaker dependent / independent
Vocabulary size (small, medium, large)
Discrete words / continuous utterance
Quiet / noisy environment
SPEECH RECOGNITION
Feature Extraction
Overlapping frames
Feature vector for each frame: mel-cepstrum, difference cepstrum, energy, difference energy
SPEECH RECOGNITION
Vector Quantization
K-means algorithm
Observation sequence for the whole word
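A minimal K-means sketch (mock 2-D feature vectors stand in for real mel-cepstral features; names, cluster positions, and sizes are illustrative). Each frame's feature vector is replaced by the index of its nearest codeword, which yields the discrete observation sequence for the HMM:

```python
import numpy as np

def kmeans_codebook(X, K, iters=20, seed=0):
    """Minimal K-means: returns a codebook of K centroids and the
    codeword index (observation symbol) for each feature vector."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

# Mock feature vectors: two tight, well-separated clusters.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
codebook, obs = kmeans_codebook(X, 2)
print(obs[:5], obs[-5:])   # each cluster maps to a single symbol
```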
SPEECH RECOGNITION
Hidden Markov Model (2) Probability of transition
State transition matrix
State probability vector
State equation
SPEECH RECOGNITION
Hidden Markov Model (3) Probability of observing
Observation probability matrix
Observation probability vector
Observation equation
SPEECH RECOGNITION
Hidden Markov Model (4)
Discrete observation hidden Markov model
Two HMM problems: the training problem and the recognition problem
SPEECH RECOGNITION
Recognition using HMM (1)
Determining the probability that a given HMM produced the observation sequence.
Using straightforward computation over all possible paths: $S^T$ of them for $S$ states and $T$ time steps.
[Figure: trellis of state sequences (states vs. time)]
SPEECH RECOGNITION
Recognition using HMM (2)
Forward-backward algorithm, using only the forward part.
Forward partial observation probability: $\alpha_t(i) = P(o_1 \dots o_t,\ \text{state } i \text{ at time } t \mid \text{model})$
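The forward recursion can be sketched directly (every transition and observation probability below is illustrative, not from the project). Its cost is O(S²T), versus the Sᵀ paths of the brute-force sum:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass: alpha[i] accumulates P(o_1..o_t, state i at t | model).
    The observation likelihood is the sum over states at the last step."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate one time step
    return alpha.sum()

# Tiny two-state example.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition matrix
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation probability matrix
pi = np.array([0.5, 0.5])                 # initial state probabilities
p = forward(A, B, pi, [0, 1, 0])
print(round(p, 6))   # 0.099375, equal to the brute-force sum over all 8 paths
```

In recognition, this likelihood is computed for each word's HMM and the word with the highest probability wins.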
SPEECH RECOGNITION
Training HMM
No known analytical way.
Forward-backward (Baum-Welch) re-estimation, a hill-climbing algorithm.
Re-estimates the HMM parameters in such a way that the probability of the observation sequence increases.
Method: uses the forward and backward probabilities to compute state transition probabilities and observation probabilities, then re-estimates the model to improve the probability.
Need for scaling: the products of probabilities in the forward and backward passes underflow for long sequences.