Reporter: Shih-Hsiang (士翔)
Posted on 05-Jan-2016
Introduction
• The speech signal carries information from many sources
  – Not all of this information is relevant or important for speech recognition
  – Feature extraction is therefore the first crucial step
• Acoustic features may greatly affect the performance of a speech recognizer
  – Discriminability
  – Robustness
  – Complexity
• MFCCs are used almost as "standard" acoustic parameters in currently available speech recognition systems
  – They do not cope well with noisy speech
  – Common remedies: Wiener filtering, spectral subtraction, RASTA, PMC, MLLR, etc.
• In this paper, the authors present the differential power spectrum (DPS) for speech recognition
Definition of the differential power spectrum
The received signal is modeled as

y(t) = s(t) * h(t) + v(t) = x(t) + v(t)

where
y(t): received speech signal
s(t): original clean speech signal
h(t): impulse response of the transmission channel
x(t): the noise-free (channel-filtered) speech signal, x(t) = s(t) * h(t)
v(t): ambient noise

For a frame y(n) (0 ≤ n < N, where N is the frame length), the short-time power spectrum is

Y(ω) = Σ_τ r_y(τ) e^(−jωτ)

where ω is the radian frequency and r_y(τ) is the short-time autocorrelation.
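The short-time power spectrum above can be computed per frame; a minimal sketch (the Hamming window and the FFT length K are assumptions of this sketch, not specified by the slides):

```python
import numpy as np

def frame_power_spectrum(frame, K=512):
    """Short-time power spectrum Y(k) of one speech frame y(n), 0 <= n < N."""
    w = np.hamming(len(frame))        # taper the frame to reduce leakage
    Y = np.fft.rfft(frame * w, n=K)   # K-point FFT, positive-frequency bins
    return np.abs(Y) ** 2             # power spectrum |Y(k)|^2
```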
Definition of the differential power spectrum (cont.)
Assuming the noise and speech signal are mutually uncorrelated,

Y(ω) = X(ω) + V(ω)

The differential power spectrum (DPS) is defined, in the continuous frequency domain, as the derivative of the power spectrum:

D(ω) = dY(ω)/dω = dX(ω)/dω + dV(ω)/dω
Definition of the differential power spectrum (cont.)
Its discrete counterpart can be approximated by the following difference equation:

D(k) ≈ Σ_{l=−P}^{O} b_l Y(k + l),  0 ≤ k < K

where P and O are the orders of the difference equation, the b_l's are real-valued weighting coefficients, and K is the length of the FFT.
Definition of the differential power spectrum (cont.)
D(k) = Y(k) – Y(k+1)
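This first-order difference, together with the two higher-order variants (DPS2, DPS3) that appear later in the slides, can be sketched as follows; zero-padding the boundary bins is an assumption of this sketch:

```python
import numpy as np

def differential_power_spectrum(Y, variant=1):
    """First-order differences of a power spectrum Y[k]."""
    D = np.zeros_like(Y)
    if variant == 1:       # DPS1: D(k) = Y(k) - Y(k+1)
        D[:-1] = Y[:-1] - Y[1:]
    elif variant == 2:     # DPS2: D(k) = Y(k) - Y(k+2)
        D[:-2] = Y[:-2] - Y[2:]
    elif variant == 3:     # DPS3: D(k) = Y(k-2) + Y(k-1) - Y(k+1) - Y(k+2)
        D[2:-2] = Y[:-4] + Y[1:-3] - Y[3:-1] - Y[4:]
    else:
        raise ValueError("variant must be 1, 2, or 3")
    return D
```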
Representing DPS into speech features
• Three problems arise:
  – The selection of proper orders of the difference equation
  – The determination of the weights b_l
  – How the DPS should be converted into a few parameters
• An optimal solution to any of these three problems is difficult to achieve
• For the first two problems, the authors propose three special forms:
  – DPS1: D(k) = Y(k) − Y(k+1)
  – DPS2: D(k) = Y(k) − Y(k+2)
  – DPS3: D(k) = Y(k−2) + Y(k−1) − Y(k+1) − Y(k+2)
• For the third problem, the DPS is converted into cepstral coefficients:
  – An absolute-value operation makes the negative parts positive
  – The magnitude of the DPS is passed through a mel-frequency filter bank
  – The logarithmic filter-bank outputs are compressed into a feature vector
Representing DPS into speech features (cont.)
Comparison with the cepstral liftering technique
• If x_i is the i-th cepstral coefficient, then the corresponding liftered cepstral coefficient is given by y_i = w_i x_i, where the w_i define the lifter
• Various types of lifters have been proposed in the literature:
  – Linear lifter: w_i = i
  – Statistical lifter: w_i = 1/σ̂_i
  – Sinusoidal lifter: w_i = 1 + (D/2) sin(πi/D)
  – Exponential lifter: w_i = i^s exp(−i²/(2r²))
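The four lifter families named above can be sketched as weight generators; the default parameter values (D, s, r) and the per-coefficient standard deviations σ̂_i are illustrative assumptions of this sketch:

```python
import numpy as np

def lifter_weights(n, kind, D=22, s=1.0, r=10.0, sigma=None):
    """Cepstral lifter weights w_i for i = 1..n."""
    i = np.arange(1, n + 1, dtype=float)
    if kind == "linear":          # w_i = i
        return i
    if kind == "statistical":     # w_i = 1 / sigma_i (inverse std per coef.)
        return 1.0 / np.asarray(sigma, dtype=float)
    if kind == "sinusoidal":      # w_i = 1 + (D/2) sin(pi * i / D)
        return 1.0 + (D / 2.0) * np.sin(np.pi * i / D)
    if kind == "exponential":     # w_i = i^s exp(-i^2 / (2 r^2))
        return i ** s * np.exp(-(i ** 2) / (2.0 * r ** 2))
    raise ValueError(kind)
```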
Comparison with the cepstral liftering technique (cont.)
Type of lifter     | SNR = ∞ | 30 dB | 25 dB | 20 dB | 15 dB
-------------------|---------|-------|-------|-------|------
No lifter          |  93.0   | 70.6  | 55.9  | 37.2  | 24.0
Linear lifter      |  94.0   | 90.8  | 86.6  | 80.1  | 70.6
Statistical lifter |  93.9   | 86.7  | 78.3  | 68.3  | 55.3
Sinusoidal lifter  |  94.5   | 85.9  | 78.9  | 68.7  | 51.5
Exponential lifter |  94.3   | 90.1  | 85.1  | 78.9  | 68.1

Effect of cepstral liftering on the performance of a DTW-based speech recognizer
Comparison with the cepstral liftering technique (cont.)
• But liftering has no effect in the recognition process when a Mahalanobis distance is used (as in HMM-based recognizers):

d(x; x̂, Σ̂_x) = (x − x̂)^t Σ̂_x^{−1} (x − x̂)   (Mahalanobis distance, HMM)

When liftered cepstral coefficients are used,

d(y; ŷ, Σ̂_y) = (y − ŷ)^t Σ̂_y^{−1} (y − ŷ)

Expressing the lifter as a weighting matrix W, with

y = Wx,  ŷ = Wx̂,  Σ̂_y = W Σ̂_x W^t

it follows that d(y; ŷ, Σ̂_y) = d(x; x̂, Σ̂_x)
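This invariance is easy to verify numerically; the random covariance and the diagonal weighting matrix W below are illustrative choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x, xm = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sx = A @ A.T + d * np.eye(d)                 # symmetric positive-definite cov.
W = np.diag(np.arange(1.0, d + 1.0))         # invertible lifter matrix

def mahalanobis(v, mean, cov):
    diff = v - mean
    return diff @ np.linalg.solve(cov, diff)

# lifter the features, the mean, and the covariance consistently
y, ym, Sy = W @ x, W @ xm, W @ Sx @ W.T
# the Mahalanobis distance is unchanged by the liftering
assert np.isclose(mahalanobis(x, xm, Sx), mahalanobis(y, ym, Sy))
```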
Comparison with the cepstral liftering technique (cont.)
• In the DPS-based cepstrum
Comparison with the spectral subtraction
• Spectral subtraction (SS) can be formulated as

|X̂(k)|² = max(|Y(k)|² − α|V̂(k)|², β|Y(k)|²)

where β is the spectral flooring factor and α controls the amount of noise subtracted from the noisy signal
• For speech recognition, it was found that SS operated within each band-pass filter could yield more consistent improvement for MFCC features against noise
  – E_Y(k) is the output of the k-th band-pass filter when Y(k) is passed through the filter
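A minimal sketch of power spectral subtraction consistent with the description above; the default over-subtraction factor α and flooring factor β are illustrative:

```python
import numpy as np

def spectral_subtraction(Y_pow, noise_pow, alpha=2.0, beta=0.01):
    """Power spectral subtraction: subtract alpha * noise estimate,
    then floor the result at beta * |Y(k)|^2."""
    est = Y_pow - alpha * noise_pow
    return np.maximum(est, beta * Y_pow)
```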
Experiments
• In this paper the authors conduct a number of speech recognition experiments:
  – Isolated speech recognition
  – SNR improvement
  – Connected digits recognition
  – Phone recognition
  – Evaluation on the AURORA task
Experiments - Isolated speech recognition
• TI46 database – an isolated-words database from TI
  – 16 speakers (8 male / 8 female)
  – The vocabulary consists of:
    • 10 isolated digits from ‘ZERO’ to ‘NINE’
    • 26 isolated English letters from ‘A’ to ‘Z’
    • 10 isolated words: “ENTER, ERASE, GO, HELP, NO, RUBOUT, REPEAT, STOP, START, YES”
  – 26 utterances of each word from each speaker (10 for training / 16 for testing)
• In this experiment, four sets of features are considered:
  – MFCC
  – DPSCC1
  – DPSCC2
  – DPSCC3
Experiments - Isolated speech recognition (cont.)
• The DPS-based features yield performance at least comparable to the standard MFCCs
• For both MFCCs and DPSCCs, including dynamic and acceleration features greatly improves performance
Experiments - SNR improvement
• Clean speech signals are taken from the TI46 database
• Lynx noise is taken from the NOISEX database
• Power spectrum based
• DPS based
Experiments - SNR improvement (cont.)
The average SNR_D is approximately 4 dB higher than SNR_Y
Experiments - Connected digits recognition
• TI connected-digits database – contains digit strings uttered by adult and child speakers
  – The vocabulary consists of 11 words: the 10 digits and an “oh”
  – Each speaker uttered 77 sequences of these words
• Noise is added to the test-set speech while the training speech is kept clean
  – Wide-band stationary speech noise, machine-gun noise, Lynx noise
• Four sets of feature vectors are investigated:
  – MFCC
  – DPSCC
  – MFCC + CMN
  – DPSCC + CMN
Experiments - Connected digits recognition (cont.)
• Compared with MFCCs, DPSCC yields at least comparable performance in clean conditions
• In most strong-noise conditions, DPSCC outperforms MFCC
• CMN effectively improves the robustness of both feature sets
Experiments - Phone recognition
• TIMIT phoneme-based continuous speech database
  – Contains a total of 6300 sentences
  – 10 sentences spoken by each of 630 speakers from the 8 major dialect regions of the US
  – Phonetic recognition is performed on the database over the set of 39 classes commonly used for evaluation
• Noise is added to the test-set speech while the training speech is kept clean
  – Wide-band stationary speech noise, machine-gun noise, Lynx noise
• Two feature sets are used:
  – MFCC + CMN (39 coefficients)
  – DPSCC + CMN (39 coefficients)
Experiments - Phone recognition (cont.)
• The MFCC and DPSCC features yield comparable results in clean and weak-noise conditions
• The DPSCC features slightly outperform the MFCC features in strong-noise conditions
Experiments - Evaluation on AURORA task
• Noise signals are recorded at different places:
  – Suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station
• Two training modes are defined:
  – Training on clean data only
    • 8440 utterances (55 male / 55 female speakers)
    • Signals are filtered with the G.712 characteristic, with no noise added
  – Training on clean as well as noisy data (multi-condition)
    • The 8440 utterances are split into 20 subsets (422 utterances each)
    • Suburban train, babble, car, and exhibition hall noises are added to the 20 subsets at 5 different SNRs (20, 15, 10, 5 dB and the clean condition)
• Three test sets are defined: Test Set A, Test Set B, and Test Set C
Experiments - Evaluation on AURORA task (cont.)
• With the use of CMN, the average word error rate is reduced by 8.8%
• When SS is used together with CMN, the average performance improves by 19.3%
• The DPS-based cepstrum outperforms MFCC, and also yields slightly better performance than SS
Discussion and conclusion
• DPS also preserves the spectral information needed to discriminate among different linguistic units (e.g. phonemes and words)
• DPS had a higher SNR than the power spectrum, especially for voiced frames
  – DPS-based features should therefore be more resilient to noise than power-spectrum-based features
• The DPSCC yields performance at least comparable to the conventional MFCCs
  – In most cases, it outperforms MFCC
• Compared to the estimation of MFCC, the extraction of DPSCC requires (K/2 − 1) more addition (subtraction) and absolute-value operations per frame
  – This increase in computational complexity is negligible for today’s computers