
Kanru Hua (IE598 Final Presentation)

Combining Auditory Preprocessing and Bayesian Estimation for Robust Formant Tracking

(Gläser et al., 2010)


Background

● In speech processing, formants are the resonances of the vocal tract.

● Formant frequencies have a close link to vowel quality.

● Applications: speech recognition/synthesis, speech enhancement, hearing aids, language learning tools, ...

[Figure: time vs. frequency plot of the utterance “Author of the danger trail, ...”]


Architecture (simplified)

[Block diagram. Blocks: Speech Signal; Auditory Filterbank; Gender Detection; Enhancement; Bayesian Mixture Filtering; Adaptive Frequency Range Segmentation; Bayesian Smoothing (one per formant); outputs F1, F2, ..., FN]


Bayesian Filtering

● Think of it as a generalization of the Kalman filter

● Define the belief/message as a posterior probability

[Diagram: chain of hidden states x1, x2, x3, x4 (formant freqs.), each emitting an observation y1, y2, y3, y4 (filterbank output); the two recursion steps are labeled (predict) and (update)]
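The predict/update recursion can be sketched on a discretized frequency axis. This is a minimal illustration, not the paper's implementation: the grid, the `transition` matrix and the per-step `likelihoods` are all hypothetical inputs.

```python
def predict(belief, transition):
    """Predict step: push the belief through the state transition model.

    transition[i][j] is the (assumed) probability of moving from bin i to bin j.
    """
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def update(prior, likelihood):
    """Update step: weight the predicted belief by the observation likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bayes_filter(initial, transition, likelihoods):
    """Run predict + update once per observation; return the belief after each step."""
    belief = list(initial)
    history = []
    for likelihood in likelihoods:
        belief = update(predict(belief, transition), likelihood)
        history.append(belief)
    return history


# Toy usage: three frequency bins, two observations.
transition = [[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
history = bayes_filter([1 / 3, 1 / 3, 1 / 3], transition,
                       [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1]])
```

With a linear-Gaussian transition and observation model the same recursion reduces to the Kalman filter; the grid version makes no such assumption.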


Bayesian Filtering

● Formants are not normally distributed, so a Kalman filter won’t work

● Particle filtering (non-parametric) – multi-modality not guaranteed

– To illustrate why, let’s suppose

● This leads us to mixture filtering, a technique borrowed from the computer vision community.
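One way to see why a plain particle filter does not guarantee multi-modality is the toy experiment below. It is an illustrative assumption, not the example from the slide: two modes are given exactly equal likelihood, and plain multinomial resampling is repeated. Sampling noise alone eventually wipes out one of the modes.

```python
import random


def surviving_fraction(n_particles=50, n_steps=5000, seed=0):
    """Fraction of particles on mode 1 after repeated resampling.

    Each particle is represented only by the label (0 or 1) of the mode it
    sits on. Both modes have the same likelihood, so all weights are equal
    and resampling is a uniform draw with replacement.
    """
    rng = random.Random(seed)
    particles = [i % 2 for i in range(n_particles)]  # start with a 50/50 split
    for _ in range(n_steps):
        particles = [rng.choice(particles) for _ in range(n_particles)]
    return sum(particles) / n_particles


fraction = surviving_fraction()
```

With these (hypothetical) settings the result is almost surely 0.0 or 1.0: one mode has died out even though the observations never favored either.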


Bayesian Mixture Filtering

To find the weights:

(Vermaak et al., 2003)

(each corresponds to a formant)


Bayesian Mixture Filtering

A quick summary:

● Propagation on each component (formant) is independent from the others

● Re-weighting step is the only place where mixture components interact

[Figure: target belief and component beliefs (at time t)]
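The summary above can be sketched on a discretized frequency axis. This is a sketch under assumptions: all components share one hypothetical `transition` matrix, and the weight update scales each weight by its component's predictive likelihood, which is the usual mixture-tracking form; the function and variable names are illustrative only.

```python
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def predict(belief, transition):
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def mixture_filter_step(weights, components, transition, likelihood):
    """One filtering step for a mixture: sum_m weights[m] * components[m]."""
    new_components, evidences = [], []
    for belief in components:
        # Each component is predicted and updated on its own.
        prior = predict(belief, transition)
        unnormalized = [p * l for p, l in zip(prior, likelihood)]
        evidences.append(sum(unnormalized))  # predictive likelihood of this component
        new_components.append(normalize(unnormalized))
    # Re-weighting: the only step where the components interact.
    new_weights = normalize([w * e for w, e in zip(weights, evidences)])
    return new_weights, new_components
```

A useful sanity check is that the weighted sum of the new components equals what a single Bayes filter would produce from the weighted sum of the old ones, so the target belief is filtered correctly while each mode keeps its own component.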


Mixture Segmentation

● However, mixture filtering still does not prevent belief diffusion.

(mixture components can become over-general over time as they propagate independently; the order of the formants is unconstrained)

● Solution: introduce hard frequency boundaries R1, R2, … RM between formants/mixture components


Mixture Segmentation

● We need to modify & re-weight component beliefs to implement these hard boundaries

● Concretely, set out-of-range probabilities to zero while making sure the target distribution is kept unchanged

(accumulation)

(truncation & re-weighting)
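The two steps can be sketched on a discretized frequency axis as follows. The representation is an assumption made for illustration: `boundaries` is a hypothetical list of bin indices at which one component's range ends and the next begins, and every range is assumed to carry non-zero probability mass.

```python
def segment_mixture(weights, components, boundaries):
    """Confine each component to its own frequency range.

    Returns new weights and components whose weighted sum equals the
    original target distribution.
    """
    n_bins = len(components[0])
    # Accumulation: rebuild the target belief from all components.
    target = [sum(w * c[k] for w, c in zip(weights, components))
              for k in range(n_bins)]
    edges = [0] + list(boundaries) + [n_bins]
    new_weights, new_components = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Truncation & re-weighting: keep only the target mass inside
        # [lo, hi) and use that mass as the new component weight.
        mass = sum(target[lo:hi])
        new_weights.append(mass)
        new_components.append(
            [target[k] / mass if lo <= k < hi else 0.0 for k in range(n_bins)])
    return new_weights, new_components


# Toy usage: two overlapping components on four bins, boundary at bin 2.
new_w, new_c = segment_mixture(
    [0.5, 0.5],
    [[0.5, 0.3, 0.2, 0.0], [0.0, 0.2, 0.3, 0.5]],
    [2])
```

Because each new component is a slice of the target scaled by its own mass, multiplying back by the new weights and summing reproduces the target exactly.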


Adaptive Segmentation

● The next step is to determine R1, R2, … RM, given component beliefs before segmentation

● We run a Viterbi search (in other words, dynamic programming) to find the most likely segmentation

– State space: assignment from (discretized) frequency to mixture component

Other transitions have zero probability (disabled)


Bayesian Smoothing

● So far, predictions are based on previous observations (y1, y2, …, yt) only, so the re-weighted beliefs may still appear ambiguous.

● We mitigate this by incorporating observations from the reverse direction (if the whole sequence is known in advance), in a fashion similar to the backward pass of a Kalman smoother.
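On a discretized state space, the idea can be sketched as a forward-backward pass. This is a generic smoother under assumed inputs (a hypothetical `transition` matrix and per-step `likelihoods`), not the per-component procedure of the paper.

```python
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def forward_backward(initial, transition, likelihoods):
    """Return (filtered, smoothed) beliefs for every time step."""
    n = len(initial)
    # Forward pass: ordinary Bayesian filtering (past observations only).
    filtered, belief = [], list(initial)
    for lik in likelihoods:
        prior = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
        belief = normalize([p * l for p, l in zip(prior, lik)])
        filtered.append(belief)
    # Backward pass: beta[i] is proportional to the probability of all
    # future observations given that the current state is i.
    smoothed = [None] * len(likelihoods)
    beta = [1.0] * n
    for t in range(len(likelihoods) - 1, -1, -1):
        smoothed[t] = normalize([f * b for f, b in zip(filtered[t], beta)])
        beta = normalize(
            [sum(transition[i][j] * likelihoods[t][j] * beta[j] for j in range(n))
             for i in range(n)])
    return filtered, smoothed


# Toy usage: the first observation is uninformative, the second is not.
filtered, smoothed = forward_backward(
    [0.5, 0.5],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.5, 0.5], [0.9, 0.1]])
```

In the toy run the filtered belief at the first step stays at 50/50, while the smoothed belief at that step already leans toward the first state because the later observation is taken into account. The price is that the estimate for time t is only available once later frames have been observed.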


Results

● Final formant frequency estimate:

● Evaluation:

– Tested on 34 and 56 sentences spoken by male and female speakers, respectively

– Added white/babble/car noise at 7 different signal-to-noise ratios


Results

● Compared against other formant tracking approaches

(percentage of error reduction)

● Time delay for real-time tracking: 120 ms on Intel Q6600 @ 2.4 GHz


Summary

● A two-stage formant tracking method

– First stage: signal processing for feature extraction

– Second stage: Bayesian filtering on features

● Challenge 1 – maintaining multi-modality

– Solution: mixture tracking

● Challenge 2 – belief diffusion

– Solution: adaptive frequency range segmentation

● Post processing: Bayesian smoothing (backward pass)

● Drawbacks:

– Computationally expensive

– Inevitable time delay for real-time tracking

– Formant continuity not guaranteed


References

● Gläser, Claudius, et al. "Combining Auditory Preprocessing and Bayesian Estimation for Robust Formant Tracking." IEEE Transactions on Audio, Speech, and Language Processing 18.2 (2010): 224-236.

● Vermaak, Jaco, Arnaud Doucet, and Patrick Pérez. "Maintaining Multimodality through Mixture Tracking." Proceedings of the Ninth IEEE International Conference on Computer Vision. IEEE, 2003.
