Voice Activity Detection Using Single Frequency Filtering

Voice Activity Detection (VAD) in the Presence of Noise

Presented by: Tejus Adiga M

Department of Electronics and Communications, NMAMIT, Nitte.

Voice Activity Detection (VAD)

• Definition of VAD
  • The task of locating speech segment boundaries in an input signal corrupted by noise.
  • The task of classifying a given frame as a speech frame or a noise frame.

• Problem Statement

  • Given an input frame vector $x$, the VAD problem is to detect the presence of speech in a signal that is corrupted by different kinds of noise. Assuming that the speech signal $S$ and the noise $N$ are additive, the VAD module has to decide between two hypotheses:

$$H_0: x = N, \qquad H_1: x = S + N \tag{1}$$

May 1, 2023, Department of Electronics and Communications, NMAMIT, Nitte.


Artifacts of VAD

• Front End Clipping (FEC): occurs at the transition from noise to speech.

• Mid Speech Clipping (MSC): a speech frame is misclassified as noise.

• Over Clipping: occurs at the transition from speech to noise.

• Noise Detected as Speech (NDS): high-energy noise frames are detected as speech.


Applications of VAD

• Discontinuous transmission in speech communication systems.
  • Encode and transmit only speech frames.
  • Switch off the transmitter during non-speech frames to minimize power consumption.
  • Example: GSM audio codec.

• Automatic speech/speaker recognition.
  • Apply recognition algorithms only to speech segments.
  • Example: Apple’s Siri, Microsoft’s Cortana, Google Voice.

• Speech encoding.
  • Encode speech frames at a high bitrate and non-speech frames at a low bitrate.
  • Increases the compression ratio.
  • Example: ITU G.729 audio codec.

• Speech enhancement and noise reduction systems.
  • Non-stationary noise statistics are computed and used for future audio frames.

Literature Survey - Introduction

• The speech signal is corrupted additively by environmental noise. The resulting signal is given by

$$x = S + N$$

where $S$ is the clean speech signal and $N$ is the additive noise.

• VAD algorithms try to estimate the statistical parameters of the noise and classify a given audio frame as speech or noise.


Time Domain Algorithms
VAD using Short-Term Signal Energy

• The short-term signal energy is given by

$$E = \sum_{n=0}^{N-1} \left[ x(n)\, w(n) \right]^2 \tag{2}$$

where $E$ is the energy of the frame, $x(n)$ is the input audio signal, and $w(n)$ is the window of length $N$ samples.

• Training phase: compute the average energy level of the noise and store it as training data.

• Detection phase: if the energy level of a given frame is greater than the noise energy level, classify the frame as a speech frame; otherwise classify it as a noise frame.

• If a frame is classified as noise, use that frame to update the training-data energy level.

• The typical window duration is 20 ms; within a 20 ms window the speech signal appears stationary. For an audio signal sampled at 16 kHz the window length is 320 samples; at 8 kHz it is 160 samples.
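The training and detection phases described on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the rectangular window, the `margin` factor applied to the noise level, and the smoothing constant `alpha` used for the noise update are all hypothetical choices introduced here.

```python
import math

def frame_energy(frame, window):
    """Short-term energy: sum of squared windowed samples."""
    return sum((x * w) ** 2 for x, w in zip(frame, window))

def energy_vad(frames, noise_energy, margin=2.0, alpha=0.9):
    """Label each frame True (speech) or False (noise).

    noise_energy: average noise energy from the training phase.
    margin, alpha: hypothetical tuning parameters (not from the slides).
    """
    window = [1.0] * len(frames[0])  # rectangular window, for simplicity
    decisions = []
    for frame in frames:
        e = frame_energy(frame, window)
        is_speech = e > margin * noise_energy
        if not is_speech:
            # Noise frame: update the running noise energy estimate.
            noise_energy = alpha * noise_energy + (1 - alpha) * e
        decisions.append(is_speech)
    return decisions

# 20 ms frames at 16 kHz -> 320 samples per frame.
N = 320
quiet = [0.01 * math.sin(0.3 * n) for n in range(N)]
loud = [0.5 * math.sin(0.3 * n) for n in range(N)]
noise_level = frame_energy(quiet, [1.0] * N)
print(energy_vad([quiet, loud, quiet], noise_level))  # [False, True, False]
```

Updating the noise estimate only on frames labelled as noise keeps speech energy from leaking into the noise level.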


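Time Domain Algorithms
VAD using Zero Crossing Detector

• The number of zero crossings per audio frame is lower in speech than in noise.

• The typical number of zero crossings in a 10 ms speech frame is 5 to 15.

• The zero-crossing count is given by

$$Z = \frac{1}{2} \sum_{n=1}^{N-1} \left| \operatorname{sgn}(x(n)) - \operatorname{sgn}(x(n-1)) \right| w(n) \tag{3}$$

$$\operatorname{sgn}(x(n)) = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \tag{4}$$

where $x(n)$ is the input signal frame and $w(n)$ is the window of length $N$.

• A frame is classified as speech if $Z$ is below the threshold, since speech has fewer zero crossings than noise.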
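A zero-crossing counter for one frame can be sketched as below. The sketch assumes a rectangular window and the sign convention sgn(x) = 1 for x >= 0 and -1 otherwise; both are assumptions made for illustration, and the 500 Hz test tone is a made-up input.

```python
import math

def sgn(x):
    """Sign function with sgn(0) treated as +1 (assumed convention)."""
    return 1 if x >= 0 else -1

def zero_crossings(frame):
    """Count sign changes between consecutive samples of a frame."""
    changes = sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
                  for n in range(1, len(frame)))
    return changes // 2  # each sign change contributes |(+1) - (-1)| = 2

fs = 16000
N = 160  # 10 ms at 16 kHz
# 500 Hz tone, sampled half a sample off the zero crossings
tone = [math.sin(2 * math.pi * 500 * (n + 0.5) / fs) for n in range(N)]
print(zero_crossings(tone))  # 9
```

A 500 Hz tone completes five cycles in 10 ms, so the count of 9 falls inside the 5 to 15 range quoted for speech frames.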


Frequency Subband Distance Measure (FSDM) method

• Speech frames have a significant power difference between the low-frequency and high-frequency subbands.

• Non-speech frames have a relatively uniform power distribution.

• The FSDM metric is given by

(5)

where the input is an audio frame of length N.

• The FSDM feature can be improved by weighting it with the power envelope of the input frame.

(6)


Frequency Subband Distance Measure (FSDM) method

• The smoothed FSDM coefficients are further smoothed using a median filter for a better decision.

• Sort the smoothed coefficients over Nf frames in ascending order.

• The adaptive threshold is set to if

where is constant, is the sorted index.

(7)
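The median-filter smoothing step mentioned on this slide can be illustrated with a short Python sketch. The filter width and the sample feature values are hypothetical; the slides do not state them, and the edge handling (shrinking the window at the boundaries) is one possible choice among several.

```python
from statistics import median

def median_smooth(values, width=5):
    """Median-filter a per-frame feature track.

    width: odd filter length in frames (hypothetical; not given in the slides).
    Near the edges the window shrinks to the available samples.
    """
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

# An isolated spike in an otherwise flat feature track is removed.
track = [1, 1, 9, 1, 1]
print(median_smooth(track, width=3))
```

A median filter suppresses isolated outliers without blurring genuine step changes, which is why it suits a feature track that feeds a threshold decision.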


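Long Term Spectral Flatness Measure (LSFM) method

• The LSFM metric is computed over discrete frequencies $\omega_k$ from 500 Hz to 4000 Hz as

$$L(m) = \sum_{k} \log_{10} \frac{GM(m, \omega_k)}{AM(m, \omega_k)} \tag{8}$$

where $GM(m, \omega_k)$ is the geometric mean and $AM(m, \omega_k)$ is the arithmetic mean of the power spectral density (PSD) $S(n, \omega_k)$:

$$GM(m, \omega_k) = \sqrt[R]{\prod_{n=m-R+1}^{m} S(n, \omega_k)} \tag{9}$$

$$AM(m, \omega_k) = \frac{1}{R} \sum_{n=m-R+1}^{m} S(n, \omega_k) \tag{10}$$

where $R$ is the number of frames used to compute the LSFM metric.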


Long Term Spectral Flatness Measure (LSFM) method

• The power spectral density (PSD) of the input is computed from the FFT of the input frame.

(11)

• The Long Term Spectral Flatness Measure (LSFM) is computed over R frames of the input signal.

• The LSFM over R frames appears to be relatively flat for speech frames, whereas non-speech frames averaged over R frames have significant peaks.

Fig 1: LSFM feature for different values of R and m
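The long-term flatness computation can be sketched in Python as follows. The sketch assumes that the measure is the sum over frequency bins of log10(GM/AM), with the geometric and arithmetic means taken across R frames, and it uses a plain single-frame periodogram as the PSD estimate. Both are assumptions for illustration, as are the frame length, the bin range, and the synthetic inputs.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2 (naive DFT, adequate for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def lsfm(frames, k_lo, k_hi):
    """Sum over bins k_lo..k_hi of log10(GM/AM) across the given frames."""
    spectra = [power_spectrum(f) for f in frames]
    r = len(spectra)
    total = 0.0
    for k in range(k_lo, k_hi + 1):
        vals = [s[k] + 1e-12 for s in spectra]  # small floor avoids log(0)
        am = sum(vals) / r
        log_gm = sum(math.log10(v) for v in vals) / r
        total += log_gm - math.log10(am)
    return total

random.seed(0)
base = [random.gauss(0, 1) for _ in range(64)]
steady = [base] * 5                                   # spectrum constant over frames
varying = [[g * x for x in base] for g in (0.1, 1.0, 0.2, 2.0, 0.5)]
# With 64-sample frames at 16 kHz, bins 2..16 cover 500 Hz to 4000 Hz.
print(lsfm(steady, 2, 16))   # approximately 0
print(lsfm(varying, 2, 16))  # negative
```

Because the geometric mean never exceeds the arithmetic mean, the value is at most zero. It equals zero when the spectrum does not change across the R frames and becomes more negative as the spectrum varies from frame to frame.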


Single Frequency Filtering (SFF)

Training Phase:

• Noise provides a floor for speech at discrete frequencies from 300 Hz to 3600 Hz at intervals of 20 Hz.

• Floor weights are computed from the mean of the noise frames.

• A quantity is computed from the mean and the variance of the noise, with M = 64.

(12)

(13)

• Then a threshold is found.

• The dynamic range of the signal is computed for every 300 ms frame with a 10 ms shift.

• Depending on the dynamic range, the computed quantity is smoothed over a corresponding smoothing length.

• The VAD decision is made by comparing the smoothed quantity with the threshold.

(14)

(15)

(16)

(17)


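Single Frequency Filtering (SFF)

• The input signal $s[n]$ is differentiated:

$$x[n] = s[n] - s[n-1] \tag{12}$$

• $x[n]$ is multiplied by a complex sinusoid of normalized frequency $\bar{\omega}_k$:

$$x_k[n] = x[n]\, e^{j \bar{\omega}_k n} \tag{13}$$

where $\bar{\omega}_k = \pi - \frac{2 \pi f_k}{f_s}$ and $f_s$ is the sampling frequency.

• The signal $x_k[n]$ is passed through a single-pole filter whose system function is given by

$$H(z) = \frac{1}{1 + r z^{-1}} \tag{14}$$

where $r < 1$. This ensures that the filter is stable.

• The output of the single-pole filter is

$$y_k[n] = -r\, y_k[n-1] + x_k[n] \tag{15}$$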
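The filtering chain on this slide (differencing, frequency shift, single-pole filter) can be sketched in Python for one analysis frequency. The pole radius `r = 0.99`, the zero initial conditions, and the 1000 Hz test tone are assumptions made for this sketch; the slides do not state them.

```python
import cmath
import math

def sff_envelope(s, f_k, fs, r=0.99):
    """Envelope of signal s at frequency f_k (Hz) via single frequency filtering.

    r: pole radius, must be < 1 for stability (0.99 is an assumed value).
    """
    w_bar = math.pi - 2 * math.pi * f_k / fs   # shifted normalized frequency
    y_prev = 0j
    prev_sample = 0.0                          # assumes s[-1] = 0
    envelope = []
    for n, sample in enumerate(s):
        x = sample - prev_sample               # differenced signal
        prev_sample = sample
        x_k = x * cmath.exp(1j * w_bar * n)    # moves f_k to fs/2
        y = -r * y_prev + x_k                  # single-pole filter, pole at z = -r
        y_prev = y
        envelope.append(abs(y))                # sqrt(real^2 + imag^2)
    return envelope

fs = 16000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]  # 50 ms
env_on = sff_envelope(tone, 1000, fs)    # analysis frequency matches the tone
env_off = sff_envelope(tone, 3000, fs)   # analysis frequency away from the tone
print(env_on[-1] > 10 * env_off[-1])     # True
```

Shifting the frequency of interest to half the sampling rate lets one fixed filter, with its pole on the negative real axis, serve every analysis frequency; only the complex sinusoid changes.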


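Single Frequency Filtering (SFF)

• From the filter output $y_k[n]$, the envelope $v_k[n]$ is found at every frequency $f_k$ from 300 Hz to 3600 Hz:

$$v_k[n] = \sqrt{y_{kr}^2[n] + y_{ki}^2[n]} \tag{16}$$

where $y_{kr}[n]$ and $y_{ki}[n]$ are the real and imaginary components of $y_k[n]$.

• The envelope is multiplied by the noise floor weights to reduce the effect of noise.

(16)

• Depending on the dynamic range, the computed quantity is smoothed over a corresponding smoothing length.

• The VAD decision is made by comparing the smoothed quantity with the threshold.

(17)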



Fig 2: Visualization of the SFF approach. (a) Clean speech. (b) Speech degraded by noise. (c) Envelope of degraded signal. (d) Weighted envelopes. (e) Envelope of clean speech signal.


Adaptive Noise Cancellation/Suppression
Spectral Subtraction

• Time-domain speech degradation by noise is given by

$$x(t) = s(t) + n(t) \tag{18}$$

where $s(t)$ is the clean speech signal and $n(t)$ is the noise frame.

• In the frequency domain this is equivalent to adding the spectra of speech and noise:

$$X(f) = S(f) + N(f) \tag{19}$$

where $X(f)$ is the power spectrum of the input signal and $N(f)$ is the power spectrum of the noise signal.

• The power spectrum estimate of the original signal is obtained by subtracting the estimated $N(f)$ from $X(f)$:

$$\hat{S}(f) = X(f) - \alpha\, N(f) \tag{20}$$

where $\alpha$ is the strength factor for noise subtraction.

• $N(f)$ can be estimated with a-priori knowledge of the surrounding noise.
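Power spectral subtraction for one frame can be sketched as below. Clamping negative results to a floor of zero is a common practical addition and is an assumption here, as are the parameter names and the example values.

```python
def spectral_subtract(noisy_power, noise_power, alpha=1.0, floor=0.0):
    """Estimate the clean power spectrum per frequency bin.

    alpha: strength factor for noise subtraction.
    floor: lower bound applied after subtraction (assumed; keeps power >= 0).
    """
    return [max(x - alpha * n, floor)
            for x, n in zip(noisy_power, noise_power)]

noisy = [4.0, 9.0, 1.0]   # power spectrum of the noisy frame (made-up values)
noise = [1.0, 1.0, 2.0]   # estimated noise power spectrum (made-up values)
print(spectral_subtract(noisy, noise))  # [3.0, 8.0, 0.0]
```

The floor matters in bins where the noise estimate exceeds the observed power, as in the third bin of the example, because a power spectrum cannot be negative.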


Simulation of Single Frequency Filtering

• Generation of test vector for out-of-band noise (speech + noise)
  • Generate white noise in any audio editing tool (e.g. Audacity).
  • Apply a high-pass filter to the white noise with a pass band above 4 kHz.
  • Mix the clean speech and the out-of-band noise.
  • Noise and speech are sampled at 16 kHz, 1 channel, 32-bit floating-point samples.

Fig 3: Spectrum of Generated White Noise from Audacity and Out of band Noise.


Simulation of Single Frequency Filtering

• Simulation environment:
  • System configuration: MacBook Pro Mid 2015, Intel Core i7 2.2 GHz processor, 16 GB RAM, running Mac OS X 10.11 El Capitan.
  • Language: C++ with C++14 support and the Standard Template Library (STL) with C++14 enhancements.
  • Compiler and IDE: LLVM (Clang) compiler on Xcode 7.3.
  • Audio editor tools: Audacity Audio Editor and WavePad editor.

• Windowing
  • Hanning window of duration 20 ms (320 samples at 16 kHz).

• Training phase:
  • Noise frames covering a duration of 30 seconds are used to generate the floor weights and the initial threshold.

• Detection phase:
  • Input frames of duration 20 ms (320 samples) are subjected to SFF VAD.
  • If a frame is declared as noise, that frame is used to update the noise floor weights and the threshold.
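The framing and windowing step can be illustrated with a short Python sketch (the simulation itself was written in C++; Python is used here only for brevity). The non-overlapping frame layout and the helper names are assumptions for illustration.

```python
import math

def samples_per_frame(duration_ms, fs_hz):
    """Number of samples in a frame of the given duration."""
    return int(round(duration_ms * fs_hz / 1000))

def hann_window(length):
    """Symmetric Hann (Hanning) window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def windowed_frames(signal, length):
    """Split a signal into non-overlapping frames and apply the window."""
    w = hann_window(length)
    return [[x * wi for x, wi in zip(signal[i:i + length], w)]
            for i in range(0, len(signal) - length + 1, length)]

N = samples_per_frame(20, 16000)
print(N)  # 320
frames = windowed_frames([1.0] * 16000, N)  # one second of audio
print(len(frames))  # 50
```

Frame length follows directly from duration multiplied by sampling rate, so a 20 ms frame at 16 kHz holds 320 samples and one second of audio yields 50 such frames.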


Simulation Results: 1. Clean Speech Input

Figure panels: (a) Clean speech. (d) Output of SFF VAD (VAD decision): value > 0 => speech frame, value < 0 => noise frame. (e) Clipped signal guided by VAD output.


Simulation Results: 2. Out of Band Noise

Figure panels: (a) Clean speech. (b) Out-of-band noise. (c) Speech + noise. (d) Output of SFF VAD. (e) Clipped signal guided by VAD output.


Simulation Results: 3. Pink Noise Input

Figure panels: (a) Clean speech. (b) Pink noise. (c) Speech + noise. (d) Output of SFF VAD. (e) Clipped signal guided by VAD output.


Simulation Results: 3. Pink Noise Input with Noise Cancellation

Figure panels: (a) Speech + noise. (b) Output of noise canceller. (c) Output of SFF VAD without NC. (d) Output of SFF VAD with NC. (e) Clipped signal guided by VAD output without NC. (f) Clipped signal guided by VAD output with NC.


Simulation Results: 5. Brownian Noise Input

Figure panels: (a) Clean speech. (b) Brownian noise. (c) Speech + noise.


Simulation Results: 5. Brownian Noise Input with Noise Cancellation

Figure panels: (a) Speech + noise.






Simulation Results: 6. White Noise Input

Figure panels: (a) Clean speech. (b) White noise. (c) Speech + noise.


Simulation Results: 6. White Noise Input with Noise Cancellation

Figure panels: (a) Speech + noise.






Conclusion

• Observations with clean speech input
  • High degree of accuracy.
  • A non-stationary noise transient can cause the VAD decision to deviate.
  • FEC and over clipping are observed for unvoiced frames, since the energy in all bands of an unvoiced frame is low.

• Observations with speech with out-of-band noise
  • For out-of-band noise input, the results are close to those for clean speech input.
  • The VAD decision is independent of the noise power.

• Observations with speech mixed with noise
  • For speech mixed with white, pink and Brownian noise, performance is lower than for clean speech input, as significant noise power is present in the speech frequency band.
  • The higher the SNR, the more accurate the VAD decision.

• Observations with speech mixed with noise, with noise suppression
  • Significant improvement in the VAD output is seen for input corrupted with pink, Brownian and white noise.
  • The improvement appears in regions where the noise-detected-as-speech artifact is reduced.


Future Work

• As observed with SFF, noise frames are sometimes detected as speech frames. Hence a detailed study is needed to eliminate such artifacts.

• Compared to simple time-domain methods, the SFF method reduces artifacts such as front end clipping, mid speech clipping and noise detected as speech. However, there is scope for further reducing these artifacts.

• In low-SNR environments, if speech is corrupted by white noise, the autocorrelation of white noise can be utilized to improve the VAD decision.
