Speech Enhancement by Online Non-negative Spectrogram Decomposition in Non-stationary Noise Environments

Zhiyao Duan (1), Gautham J. Mysore (2), Paris Smaragdis (2,3)
1. EECS Department, Northwestern University
2. Advanced Technology Labs, Adobe Systems Inc.
3. University of Illinois at Urbana-Champaign

Presentation at Interspeech, September 11, 2012
Classical Speech Enhancement

Typical algorithms:
- Spectral subtraction
- Wiener filtering
- Statistical-model-based (e.g. MMSE)
- Subspace algorithms

Properties:
- Do not require clean speech for training (only pre-learn the noise model)
- Online algorithms, good for real-time applications
- Cannot deal with non-stationary noise (e.g. keyboard noise, bird noise): most of them model noise with a single spectrum

Non-negative Spectrogram Decomposition (NSD)

NSD uses a dictionary of basis spectra to model a non-stationary sound source: the spectrogram (e.g. of keyboard noise) is approximated as the product of a dictionary and its activation weights. Decomposition criterion: minimize the approximation error (e.g. the KL divergence).
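As an illustration of this criterion, here is a minimal KL-divergence NMF with multiplicative updates. This is a generic sketch, not the authors' exact implementation; the function name, `n_bases`, and the random initialization are assumptions.

```python
import numpy as np

def nmf_kl(V, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Decompose a magnitude spectrogram V (freq x time) as V ~ W @ H,
    minimizing the generalized KL divergence between V and W @ H.
    W holds the basis spectra (the dictionary), H the activation weights."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_bases)) + eps
    H = rng.random((n_bases, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates; each step is guaranteed not to
        # increase the generalized KL divergence.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H
```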
NSD for Source Separation

Decomposing the mixture spectrogram (keyboard noise + speech) against the concatenation of a noise dictionary and a speech dictionary yields noise weights and speech weights. The speech dictionary together with the speech weights then reconstructs the separated speech.
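The reconstruction step can be sketched with a Wiener-style soft mask built from the speech part of the decomposition, a common practice in NMF-based separation; the slide does not specify the exact reconstruction used, so treat this as an assumption.

```python
import numpy as np

def separate_speech(V, W_speech, H_speech, W_noise, H_noise, eps=1e-12):
    """Given a mixture magnitude spectrogram V and a joint decomposition
    into speech and noise parts, build a Wiener-style soft mask from the
    speech reconstruction and apply it to the mixture."""
    S = W_speech @ H_speech           # speech reconstruction
    N = W_noise @ H_noise             # noise reconstruction
    mask = S / (S + N + eps)          # soft mask, values in [0, 1]
    return mask * V                   # separated speech magnitude
```

The masked magnitude is then combined with the mixture phase for inverse-STFT resynthesis.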
Semi-supervised NSD for Speech Enhancement

Properties:
- Capable of dealing with non-stationary noise
- Does not require clean speech for training (only pre-learns the noise model)
- Offline algorithm: learning the speech dictionary requires access to the whole noisy speech

Training: a noise-only excerpt is decomposed into the noise dictionary and its activation weights. Separation: the noisy speech is decomposed with the trained noise dictionary held fixed, learning the speech dictionary and all activation weights.
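This training/separation split can be sketched as a semi-supervised variant of KL-NMF in which the pre-trained noise dictionary stays fixed. A generic sketch under that assumption; the names and initialization are illustrative.

```python
import numpy as np

def semi_supervised_nmf(V, W_noise, n_speech_bases, n_iter=200, eps=1e-12, seed=0):
    """Offline semi-supervised decomposition: the noise dictionary W_noise
    (pre-trained on a noise-only excerpt) stays fixed; only the speech
    dictionary W_s and the activation weights H are updated."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Kn = W_noise.shape[1]
    W_s = rng.random((F, n_speech_bases)) + eps
    H = rng.random((Kn + n_speech_bases, T)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_noise, W_s])
        WH = W @ H + eps
        # All activation weights (noise and speech) are updated...
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        # ...but only the speech columns of the dictionary.
        Hs = H[Kn:]
        W_s *= ((V / WH) @ Hs.T) / (Hs.sum(axis=1, keepdims=True).T + eps)
    return W_s, H
```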
Proposed Online Algorithm

Objective: decompose the current mixture frame.
Constraint on the speech dictionary: prevent it from overfitting the current mixture frame.

At each frame, the trained noise dictionary is fixed and the weights of previous frames were already calculated; only the speech dictionary and the noise and speech weights of the current frame are estimated. The weighted buffer frames act as the constraint, the current frame as the objective.
EM Algorithm for Each Frame

From frame t to frame t+1:
- E step: calculate the posterior probabilities of the latent components.
- M step: a) update the speech dictionary; b) calculate the activation weights of the current frame.
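The E/M structure for one frame can be sketched in a PLCA-style formulation. This is deliberately simplified: the dictionaries are held fixed here and only the frame's activation weights are estimated, whereas the actual algorithm also re-estimates the speech dictionary from the buffer.

```python
import numpy as np

def plca_frame_em(v, W, n_iter=50, eps=1e-12):
    """EM for one mixture frame v (magnitude spectrum, length F) under a
    PLCA-style model: v is explained as sum_z P(z) * P(f|z), where the
    columns of W are the (fixed) basis spectra P(f|z).  Returns the
    activation weights P(z) for this frame."""
    F, K = W.shape
    Wn = W / (W.sum(axis=0, keepdims=True) + eps)  # each basis sums to 1
    pz = np.full(K, 1.0 / K)                       # uniform initialization
    for _ in range(n_iter):
        # E step: posterior over latent components at every frequency bin
        joint = Wn * pz                            # (F, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M step: re-estimate the activation weights from the posterior
        pz = (v[:, None] * post).sum(axis=0)
        pz /= pz.sum() + eps
    return pz
```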
Update the Speech Dictionary through a Prior

- Each basis spectrum is a discrete (categorical) distribution.
- Its conjugate prior is a Dirichlet distribution.
- The old dictionary serves as an exemplar/guide for the new dictionary, its influence controlled by the prior strength.

M step for each speech basis spectrum: combine the statistics calculated from decomposing the spectrogram (the likelihood part) with the old basis spectrum weighted by the prior strength (the prior part), then renormalize.
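With a Dirichlet conjugate prior, the MAP M-step for a basis spectrum reduces to adding the scaled old spectrum to the expected counts and renormalizing. A hedged sketch; `stats`, `w_old`, and `alpha` are illustrative names, not the paper's notation.

```python
import numpy as np

def update_basis_with_prior(stats, w_old, alpha, eps=1e-12):
    """MAP M-step for one speech basis spectrum P(f|z).  `stats` holds the
    expected counts for this basis gathered in the E step (likelihood part);
    `w_old` is the previous basis spectrum acting as an exemplar through a
    Dirichlet prior (prior part); `alpha` is the prior strength.  Larger
    alpha keeps the new basis closer to the old one."""
    w_new = stats + alpha * w_old       # likelihood part + prior part
    return w_new / (w_new.sum() + eps)  # renormalize to a distribution
```

A larger `alpha` restricts the speech dictionary more, which (as the next slide shows) trades stronger noise reduction against stronger speech distortion.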
Prior Strength Affects Enhancement

[Figure: enhancement results versus prior strength and number of EM iterations]

A stronger prior yields a more restricted speech dictionary: the prior, rather than the likelihood, determines the dictionary, giving better noise reduction but stronger speech distortion. A weaker prior leaves more noise but distorts the speech less.

Experiments

- Non-stationary noise corpus: 10 kinds (birds, casino, cicadas, computer keyboard, eating chips, frogs, jungle, machine guns, motorcycles and ocean)
- Speech corpus: the NOIZEUS dataset [1]; 6 speakers (3 male and 3 female), each 15 seconds
- Noisy speech: 5 SNRs (-10, -5, 0, 5, 10 dB)
- All combinations of noise, speaker and SNR generate 300 files, about 300 × 15 seconds ≈ 1.25 hours
[1] Loizou, P. (2007), Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, FL.
Comparisons with Classical Algorithms

Baselines:
- KLT: subspace algorithm
- logMMSE: statistical-model-based
- MB: spectral subtraction
- Wiener-as: Wiener filtering

Metrics (higher is better):
- PESQ: an objective speech quality metric that correlates well with human perception
- SDR: a source separation metric that measures the fidelity of the enhanced speech to the uncorrupted speech
Examples (keyboard noise, SNR = 0 dB; larger values indicate better performance):

Method                   | PESQ | SDR (dB)
Spectral subtraction     | 1.41 | 1.82
Wiener filtering         | 1.03 | 0.27
Statistical-model-based  | 1.13 | 0.70
Subspace algorithm       | 0.93 | 0.18
Proposed                 | 2.14 | 9.62

Noise Reduction vs. Speech Distortion

BSS_EVAL: broadly used source separation metrics
- Signal-to-Distortion Ratio (SDR): measures both noise reduction and speech distortion
- Signal-to-Interference Ratio (SIR): measures noise reduction
- Signal-to-Artifacts Ratio (SAR): measures speech distortion
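A simplified, self-contained version of these metrics can be sketched with scalar projections. The official BSS_EVAL toolkit uses time-varying filtered projections, so its numbers differ; this sketch only illustrates the definitions.

```python
import numpy as np

def bss_eval_simple(est, src, noise, eps=1e-12):
    """Simplified BSS_EVAL-style metrics.  Decomposes the estimate into a
    target part (projection onto the clean speech), an interference part
    (projection of the remainder onto the noise) and an artifact part
    (whatever is left), then forms SDR/SIR/SAR in dB."""
    s_target = (est @ src) / (src @ src + eps) * src
    resid = est - s_target
    e_interf = (resid @ noise) / (noise @ noise + eps) * noise
    e_artif = resid - e_interf

    def db(num, den):
        return 10 * np.log10((num @ num + eps) / (den @ den + eps))

    sdr = db(s_target, e_interf + e_artif)      # total distortion
    sir = db(s_target, e_interf)                # noise reduction
    sar = db(s_target + e_interf, e_artif)      # speech distortion
    return sdr, sir, sar
```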
Examples (bird noise, SNR = 10 dB):
SDR measures both noise reduction and speech distortion; SIR measures noise reduction; SAR measures speech distortion.
Larger values indicate better performance.

Conclusions

A novel algorithm for speech enhancement:
- Online algorithm, good for real-time applications
- Does not require clean speech for training (only pre-learns the noise model)
- Deals with non-stationary noise
- Updates the speech dictionary through a Dirichlet prior; the prior strength controls the tradeoff between noise reduction and speech distortion

The proposed method thus combines the advantages of classical algorithms with those of the semi-supervised non-negative spectrogram decomposition algorithm.
Complexity and Latency

Parameters

Buffer Frames

Buffer frames are used to constrain the speech dictionary. They should be neither too many nor too old; we use the 60 most recent frames (about 1 second). They should contain speech signals.
How do we judge whether a mixture frame contains speech (Voice Activity Detection)?

Voice Activity Detection (VAD)

Decompose the mixture frame using only the trained noise dictionary:
- If the reconstruction error is large, the frame probably contains speech: it goes into the buffer, and semi-supervised separation (the proposed algorithm, with the up-to-date speech dictionary and the trained noise dictionary) is applied.
- If the reconstruction error is small, the frame probably contains no speech: it does not go into the buffer, and supervised separation (with the trained noise dictionary only) is applied.
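The VAD-plus-buffer logic above can be sketched as follows. The threshold value and the function names are assumptions (the slides give no threshold), and the reconstruction error here is the generalized KL divergence, matching the decomposition criterion used earlier.

```python
import numpy as np
from collections import deque

def noise_only_error(v, W_noise, n_iter=50, eps=1e-12):
    """Generalized KL reconstruction error of frame v when explained only
    by the noise dictionary (weights fitted with multiplicative updates,
    W_noise fixed)."""
    h = np.full(W_noise.shape[1], 1.0 / W_noise.shape[1])
    for _ in range(n_iter):
        Wh = W_noise @ h + eps
        h *= (W_noise.T @ (v / Wh)) / (W_noise.sum(axis=0) + eps)
    Wh = W_noise @ h + eps
    return float(np.sum(v * np.log((v + eps) / Wh) - v + Wh))

def vad_and_buffer(frames, W_noise, threshold, buffer_len=60):
    """For each incoming frame: if the noise-only reconstruction error is
    large, the frame probably contains speech, joins the buffer, and would
    be separated semi-supervised; otherwise it is treated as noise-only
    (supervised separation).  The buffer keeps the most recent speech
    frames (60 by default, about 1 second, as on the Buffer Frames slide)."""
    buffer = deque(maxlen=buffer_len)
    decisions = []
    for v in frames:
        has_speech = noise_only_error(v, W_noise) > threshold
        if has_speech:
            buffer.append(v)
        decisions.append(bool(has_speech))
    return decisions, buffer
```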