Speech Enhancement by Online Non-negative Spectrogram Decomposition in Non-stationary Noise Environments

Zhiyao Duan (1), Gautham J. Mysore (2), Paris Smaragdis (2,3)
1. EECS Department, Northwestern University
2. Advanced Technology Labs, Adobe Systems Inc.
3. University of Illinois at Urbana-Champaign

Presentation at Interspeech, September 11, 2012
Classical Speech Enhancement

Typical algorithms:
- Spectral subtraction
- Wiener filtering
- Statistical-model-based (e.g. MMSE)
- Subspace algorithms

Properties:
- Do not require clean speech for training (only pre-learn the noise model)
- Online algorithms, good for real-time applications
- Cannot deal with non-stationary noise (e.g. keyboard noise, bird noise): most of them model noise with a single spectrum

Non-negative Spectrogram Decomposition (NSD)

NSD uses a dictionary of basis spectra to model a non-stationary sound source: the spectrogram (e.g. of keyboard noise) is approximated as the product of a dictionary and its activation weights. Decomposition criterion: minimize the approximation error (e.g. the KL divergence).
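As an illustration of this criterion, here is a minimal KL-divergence NMF with multiplicative updates. This is a generic sketch, not the authors' exact implementation; the function name, `n_bases`, and the random initialization are assumptions.

```python
import numpy as np

def nmf_kl(V, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Decompose a magnitude spectrogram V (freq x time) as V ~ W @ H,
    minimizing the generalized KL divergence between V and W @ H.
    W holds the basis spectra (the dictionary), H the activation weights."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_bases)) + eps
    H = rng.random((n_bases, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates; each step is guaranteed not to
        # increase the generalized KL divergence.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H
```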
NSD for Source Separation

Decomposing the mixture spectrogram (keyboard noise + speech) against the concatenation of a noise dictionary and a speech dictionary yields noise weights and speech weights. The speech dictionary together with the speech weights then reconstructs the separated speech.
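The reconstruction step can be sketched with a Wiener-style soft mask built from the speech part of the decomposition, a common practice in NMF-based separation; the slide does not specify the exact reconstruction used, so treat this as an assumption.

```python
import numpy as np

def separate_speech(V, W_speech, H_speech, W_noise, H_noise, eps=1e-12):
    """Given a mixture magnitude spectrogram V and a joint decomposition
    into speech and noise parts, build a Wiener-style soft mask from the
    speech reconstruction and apply it to the mixture."""
    S = W_speech @ H_speech           # speech reconstruction
    N = W_noise @ H_noise             # noise reconstruction
    mask = S / (S + N + eps)          # soft mask, values in [0, 1]
    return mask * V                   # separated speech magnitude
```

The masked magnitude is then combined with the mixture phase for inverse-STFT resynthesis.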
Semi-supervised NSD for Speech Enhancement

Properties:
- Capable of dealing with non-stationary noise
- Does not require clean speech for training (only pre-learns the noise model)
- Offline algorithm: learning the speech dictionary requires access to the whole noisy speech

Training: a noise-only excerpt is decomposed into the noise dictionary and its activation weights. Separation: the noisy speech is decomposed with the trained noise dictionary held fixed, learning the speech dictionary and all activation weights.
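This training/separation split can be sketched as a semi-supervised variant of KL-NMF in which the pre-trained noise dictionary stays fixed. A generic sketch under that assumption; the names and initialization are illustrative.

```python
import numpy as np

def semi_supervised_nmf(V, W_noise, n_speech_bases, n_iter=200, eps=1e-12, seed=0):
    """Offline semi-supervised decomposition: the noise dictionary W_noise
    (pre-trained on a noise-only excerpt) stays fixed; only the speech
    dictionary W_s and the activation weights H are updated."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Kn = W_noise.shape[1]
    W_s = rng.random((F, n_speech_bases)) + eps
    H = rng.random((Kn + n_speech_bases, T)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_noise, W_s])
        WH = W @ H + eps
        # All activation weights (noise and speech) are updated...
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        # ...but only the speech columns of the dictionary.
        Hs = H[Kn:]
        W_s *= ((V / WH) @ Hs.T) / (Hs.sum(axis=1, keepdims=True).T + eps)
    return W_s, H
```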
Proposed Online Algorithm

Objective: decompose the current mixture frame.
Constraint on the speech dictionary: prevent it from overfitting the current mixture frame.

At each frame, the trained noise dictionary is fixed and the weights of previous frames were already calculated; only the speech dictionary and the noise and speech weights of the current frame are estimated. The weighted buffer frames act as the constraint, the current frame as the objective.
EM Algorithm for Each Frame

From frame t to frame t+1:
- E step: calculate the posterior probabilities of the latent components.
- M step: a) update the speech dictionary; b) calculate the activation weights of the current frame.
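The E/M structure for one frame can be sketched in a PLCA-style formulation. This is deliberately simplified: the dictionaries are held fixed here and only the frame's activation weights are estimated, whereas the actual algorithm also re-estimates the speech dictionary from the buffer.

```python
import numpy as np

def plca_frame_em(v, W, n_iter=50, eps=1e-12):
    """EM for one mixture frame v (magnitude spectrum, length F) under a
    PLCA-style model: v is explained as sum_z P(z) * P(f|z), where the
    columns of W are the (fixed) basis spectra P(f|z).  Returns the
    activation weights P(z) for this frame."""
    F, K = W.shape
    Wn = W / (W.sum(axis=0, keepdims=True) + eps)  # each basis sums to 1
    pz = np.full(K, 1.0 / K)                       # uniform initialization
    for _ in range(n_iter):
        # E step: posterior over latent components at every frequency bin
        joint = Wn * pz                            # (F, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M step: re-estimate the activation weights from the posterior
        pz = (v[:, None] * post).sum(axis=0)
        pz /= pz.sum() + eps
    return pz
```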
Update the Speech Dictionary through a Prior

- Each basis spectrum is a discrete (categorical) distribution.
- Its conjugate prior is a Dirichlet distribution.
- The old dictionary serves as an exemplar/guide for the new dictionary, its influence controlled by the prior strength.

M step for each speech basis spectrum: combine the statistics calculated from decomposing the spectrogram (the likelihood part) with the old basis spectrum weighted by the prior strength (the prior part), then renormalize.
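With a Dirichlet conjugate prior, the MAP M-step for a basis spectrum reduces to adding the scaled old spectrum to the expected counts and renormalizing. A hedged sketch; `stats`, `w_old`, and `alpha` are illustrative names, not the paper's notation.

```python
import numpy as np

def update_basis_with_prior(stats, w_old, alpha, eps=1e-12):
    """MAP M-step for one speech basis spectrum P(f|z).  `stats` holds the
    expected counts for this basis gathered in the E step (likelihood part);
    `w_old` is the previous basis spectrum acting as an exemplar through a
    Dirichlet prior (prior part); `alpha` is the prior strength.  Larger
    alpha keeps the new basis closer to the old one."""
    w_new = stats + alpha * w_old       # likelihood part + prior part
    return w_new / (w_new.sum() + eps)  # renormalize to a distribution
```

A larger `alpha` restricts the speech dictionary more, which (as the next slide shows) trades stronger noise reduction against stronger speech distortion.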
Prior Strength Affects Enhancement

[Figure: enhancement results versus prior strength and number of EM iterations]

A stronger prior yields a more restricted speech dictionary: the prior, rather than the likelihood, determines the dictionary, giving better noise reduction but stronger speech distortion. A weaker prior leaves more noise but distorts the speech less.

Experiments

- Non-stationary noise corpus: 10 kinds (birds, casino, cicadas, computer keyboard, eating chips, frogs, jungle, machine guns, motorcycles and ocean)
- Speech corpus: the NOIZEUS dataset [1]; 6 speakers (3 male and 3 female), each 15 seconds
- Noisy speech: 5 SNRs (-10, -5, 0, 5, 10 dB)
- All combinations of noise, speaker and SNR generate 300 files, about 300 × 15 seconds ≈ 1.25 hours
[1] Loizou, P. (2007), Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, FL.
Comparisons with Classical Algorithms

Baselines:
- KLT: subspace algorithm
- logMMSE: statistical-model-based
- MB: spectral subtraction
- Wiener-as: Wiener filtering

Metrics (higher is better):
- PESQ: an objective speech quality metric that correlates well with human perception
- SDR: a source separation metric that measures the fidelity of the enhanced speech to the uncorrupted speech
Examples (keyboard noise, SNR = 0 dB; larger values indicate better performance):

Method                   | PESQ | SDR (dB)
Spectral subtraction     | 1.41 | 1.82
Wiener filtering         | 1.03 | 0.27
Statistical-model-based  | 1.13 | 0.70
Subspace algorithm       | 0.93 | 0.18
Proposed                 | 2.14 | 9.62

Noise Reduction vs. Speech Distortion

BSS_EVAL: broadly used source separation metrics
- Signal-to-Distortion Ratio (SDR): measures both noise reduction and speech distortion
- Signal-to-Interference Ratio (SIR): measures noise reduction
- Signal-to-Artifacts Ratio (SAR): measures speech distortion
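A simplified, self-contained version of these metrics can be sketched with scalar projections. The official BSS_EVAL toolkit uses time-varying filtered projections, so its numbers differ; this sketch only illustrates the definitions.

```python
import numpy as np

def bss_eval_simple(est, src, noise, eps=1e-12):
    """Simplified BSS_EVAL-style metrics.  Decomposes the estimate into a
    target part (projection onto the clean speech), an interference part
    (projection of the remainder onto the noise) and an artifact part
    (whatever is left), then forms SDR/SIR/SAR in dB."""
    s_target = (est @ src) / (src @ src + eps) * src
    resid = est - s_target
    e_interf = (resid @ noise) / (noise @ noise + eps) * noise
    e_artif = resid - e_interf

    def db(num, den):
        return 10 * np.log10((num @ num + eps) / (den @ den + eps))

    sdr = db(s_target, e_interf + e_artif)      # total distortion
    sir = db(s_target, e_interf)                # noise reduction
    sar = db(s_target + e_interf, e_artif)      # speech distortion
    return sdr, sir, sar
```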
Examples (bird noise, SNR = 10 dB):
SDR measures both noise reduction and speech distortion; SIR measures noise reduction; SAR measures speech distortion.
Larger values indicate better performance.

Conclusions

A novel algorithm for speech enhancement:
- Online algorithm, good for real-time applications
- Does not require clean speech for training (only pre-learns the noise model)
- Deals with non-stationary noise
- Updates the speech dictionary through a Dirichlet prior; the prior strength controls the tradeoff between noise reduction and speech distortion

The proposed method thus combines the advantages of classical algorithms with those of the semi-supervised non-negative spectrogram decomposition algorithm.
Complexity and Latency

Parameters

Buffer Frames

Buffer frames are used to constrain the speech dictionary. They should be neither too many nor too old; we use the 60 most recent frames (about 1 second). They should contain speech signals.
How do we judge whether a mixture frame contains speech (Voice Activity Detection)?

Voice Activity Detection (VAD)

Decompose the mixture frame using only the trained noise dictionary:
- If the reconstruction error is large, the frame probably contains speech: it goes into the buffer, and semi-supervised separation (the proposed algorithm, with the up-to-date speech dictionary and the trained noise dictionary) is applied.
- If the reconstruction error is small, the frame probably contains no speech: it does not go into the buffer, and supervised separation (with the trained noise dictionary only) is applied.
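The VAD-plus-buffer logic above can be sketched as follows. The threshold value and the function names are assumptions (the slides give no threshold), and the reconstruction error here is the generalized KL divergence, matching the decomposition criterion used earlier.

```python
import numpy as np
from collections import deque

def noise_only_error(v, W_noise, n_iter=50, eps=1e-12):
    """Generalized KL reconstruction error of frame v when explained only
    by the noise dictionary (weights fitted with multiplicative updates,
    W_noise fixed)."""
    h = np.full(W_noise.shape[1], 1.0 / W_noise.shape[1])
    for _ in range(n_iter):
        Wh = W_noise @ h + eps
        h *= (W_noise.T @ (v / Wh)) / (W_noise.sum(axis=0) + eps)
    Wh = W_noise @ h + eps
    return float(np.sum(v * np.log((v + eps) / Wh) - v + Wh))

def vad_and_buffer(frames, W_noise, threshold, buffer_len=60):
    """For each incoming frame: if the noise-only reconstruction error is
    large, the frame probably contains speech, joins the buffer, and would
    be separated semi-supervised; otherwise it is treated as noise-only
    (supervised separation).  The buffer keeps the most recent speech
    frames (60 by default, about 1 second, as on the Buffer Frames slide)."""
    buffer = deque(maxlen=buffer_len)
    decisions = []
    for v in frames:
        has_speech = noise_only_error(v, W_noise) > threshold
        if has_speech:
            buffer.append(v)
        decisions.append(bool(has_speech))
    return decisions, buffer
```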