
Speech Recognition
Lecture 12: Acoustic Model Adaptation

Eugene Weinstein
Google, NYU Courant Institute

[email protected]


Administrivia

HW0, HW1 solutions posted. HW2 solution will be posted soon.

Dec 12th class: Neural Networks in ASR.

Project presentations: Dec 19th, 5-7pm, WWH 1314.

• Time slot: 15 minutes (one-person project), 30 minutes (two-person project)

• Please send slides no later than Dec 18th (PDF, Keynote, Google Docs)


Sources of Variability

Speaker variability: accent, gender, vocal tract characteristics.

Environmental variability: noise conditions, competing speakers.

Channel variability: recording microphone electronics, on-board signal processing, compression, lossy transmission channels.

Speaker-specific models can cut error rates by a factor of 2-3 with the same quantity of training data (Woodland, 2001).


Terminology

“Speaker”: often used as a proxy for all variability that can be found in an utterance.

Speaker-independent (SI) system: trained on data from a variety of speakers/acoustic environments.

Speaker-dependent (SD) system: trained on data from a single speaker/acoustic environment.

Speaker-adaptive (SA) system: an SI system adapted using a small amount of SD training data; more practical.

• Should converge to the SD system with lots of data.


Adaptation Types

Supervised/unsupervised: is transcript known before the adaptation algorithm is applied?

• If unsupervised, first recognition pass estimates transcript.

Batch or online/incremental: is the entire utterance available or is it streaming in?

• For offline tasks, speaker information can be grouped across multiple utterances.


(Woodland, 2001)


VTLN

Vocal tract length normalization attempts to compensate for the variability in speech accounted for by physiological variations.

Ideally, a precise estimate of the transfer function of the vocal tract of the speaker would be used to compensate for speaker variability.

This is a difficult problem; the standard approximation assumes the content at all frequencies is transformed linearly by a single warping factor.


(Lee and Rose, 1998)


VTLN Formulation

For speaker $i$, utterance $j$, and time frame $t$, let the output of the short-term Fourier transform be $S_{i,j,t}(\omega)$.

Learn a single warping factor $\alpha$, and use it to compute a warped version of the frequency content: $S^{\alpha}_{i,j,t}(\omega) = S_{i,j,t}(\alpha\omega)$.

We compute MFCCs in the usual way from the warped Fourier transform.

At training time, we find the optimal warping factor for speaker $i$:

$\alpha_i = \arg\max_{\alpha} \sum_{j} p_{\theta}(o^{\alpha}_{j} \mid x_{j})$

where $o^{\alpha}_{j}$ are the warped acoustic features of utterance $j$ and $x_{j}$ is its transcript.
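To make the warping concrete, here is a minimal NumPy sketch (not from the original slides): it resamples an STFT magnitude frame along the frequency axis by a factor $\alpha$, and grid-searches $\alpha$ per speaker. The `frame_log_likelihood` callback is a hypothetical stand-in for scoring warped features against the acoustic model and transcript.

```python
import numpy as np

def warp_spectrum(frame, alpha):
    """Warp one STFT magnitude frame: S_alpha(w) = S(alpha * w).

    `frame` holds magnitudes on a uniform frequency grid; linear
    interpolation approximates evaluation at alpha * w, and bins
    warped past the edge are clipped.
    """
    n = len(frame)
    omega = np.arange(n)
    return np.interp(np.clip(alpha * omega, 0, n - 1), omega, frame)

def best_warp_factor(utterances, frame_log_likelihood,
                     alphas=np.linspace(0.88, 1.12, 13)):
    """Grid-search the per-speaker warp factor alpha_i.

    `utterances` is a list of [frames x bins] spectrograms;
    `frame_log_likelihood(spectrogram)` is an assumed hook that scores
    warped features under the current model (e.g., a forced alignment).
    """
    def score(alpha):
        return sum(
            frame_log_likelihood(
                np.stack([warp_spectrum(f, alpha) for f in utt]))
            for utt in utterances)
    return max(alphas, key=score)
```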


VTLN Details

No known closed-form solution exists for the warping factor; the suggested algorithm sweeps a range of values: e.g., 13 values in the range [0.88, 1.12].

Training algorithm: alternate Baum-Welch re-estimation of the model parameters with maximum-likelihood selection of the warping factors (see the sketch below).

• “Normalized” model is trained from warped features.

• Idea: this model represents features adjusted for some average vocal tract length.
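The sketch below shows the alternating loop, continuing the snippet above; `train_model` and `score_fn` are hypothetical hooks standing in for Baum-Welch re-estimation and model scoring, not a real toolkit API.

```python
def warp_all(speakers, alphas):
    """Apply each speaker's warp factor to all of that speaker's frames."""
    return {spk: [np.stack([warp_spectrum(f, alphas[spk]) for f in utt])
                  for utt in utts]
            for spk, utts in speakers.items()}

def vtln_train(speakers, train_model, score_fn, num_iters=3):
    """Alternate ML warp-factor selection with Baum-Welch re-estimation.

    `speakers` maps speaker id -> list of spectrograms; `train_model`
    runs Baum-Welch on (already warped) features; `score_fn(model)`
    returns a frame_log_likelihood callback for the current model.
    """
    alphas = {spk: 1.0 for spk in speakers}          # start unwarped
    model = train_model(warp_all(speakers, alphas))
    for _ in range(num_iters):
        # Pick each speaker's maximum-likelihood warp under the model.
        ll = score_fn(model)
        alphas = {spk: best_warp_factor(utts, ll)
                  for spk, utts in speakers.items()}
        # Retrain the "normalized" model on the warped features.
        model = train_model(warp_all(speakers, alphas))
    return model, alphas
```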


VTLN at Recognition Time

First pass: estimate a rough transcript using the normalized model with unwarped features.

Select maximum-likelihood warping factor for this utterance.

Apply the warping to the features and re-recognize to produce a higher-quality transcript.
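A compact sketch of this two-pass scheme, reusing the helpers above; `recognize` and the transcript-conditioned `frame_log_likelihood` are assumed hooks into a decoder, not part of the original slides.

```python
def vtln_recognize(utterance, model, recognize, frame_log_likelihood):
    """Two-pass VTLN decoding for a single utterance."""
    rough = recognize(model, utterance)              # pass 1: unwarped
    alpha = best_warp_factor(
        [utterance],
        lambda feats: frame_log_likelihood(feats, rough))
    warped = np.stack([warp_spectrum(f, alpha) for f in utterance])
    return recognize(model, warped)                  # pass 2: warped
```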


MAP Adaptation

Bayesian approach:

• View model parameters as a random variable.

• View unadapted model as the prior probability distribution for the parameters.

• Find the setting of the parameters maximizing the posterior probability of the model given the data:

$\theta_{\mathrm{MAP}} = \arg\max_{\theta} g(\theta \mid x) = \arg\max_{\theta} f(x \mid \theta)\, g(\theta)$

With an uninformative prior, this converges to regular maximum-likelihood estimation.

(Gauvain and Lee, 1994)


MAP Adaptation Details

Formulations are given for means, variances, and mixture weights.

Adaptation dataset feature vectors: $\{x_1, \ldots, x_M\}$.

In practice, often just the mean is updated:

$\mu_j^{\mathrm{MAP}} = \dfrac{\tau \mu_j^0 + \sum_{i=1}^{M} p_{\theta}(z = j \mid x_i)\, x_i}{\tau + \sum_{i=1}^{M} p_{\theta}(z = j \mid x_i)}$

$\tau$ is an interpolation parameter that gives the bias between the prior mean $\mu_j^0$ and the MLE estimate of the mean given the adaptation data.
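A minimal NumPy sketch of this mean update (an illustration, with assumed inputs): the occupancy posteriors $p_{\theta}(z = j \mid x_i)$ are taken as given, e.g., from a GMM E-step or a forced alignment.

```python
import numpy as np

def map_adapt_means(prior_means, data, posteriors, tau=10.0):
    """MAP update of Gaussian means.

    prior_means: [J x D] SI means mu_j^0
    data:        [M x D] adaptation vectors x_i
    posteriors:  [M x J] occupancies p(z = j | x_i), assumed given
    tau:         prior/MLE interpolation weight, typically tuned in [0, 20]
    """
    occ = posteriors.sum(axis=0)            # [J] soft counts
    weighted = posteriors.T @ data          # [J x D] sum_i p(z=j|x_i) x_i
    return (tau * prior_means + weighted) / (tau + occ)[:, None]
```

Note that components with near-zero occupancy stay at their prior means, which is exactly the data-sparsity behavior discussed in the notes below.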


MAP Adaptation Notes

$\tau$ is empirically determined (e.g., using a validation set). Typical settings: $\tau \in [0, 20]$.

The more adaptation data, the closer we get to the MLE of the model parameters (i.e., ignoring the prior setting).

Only parameters of models that get aligned to a sizable number of feature frames get adapted. Data sparsity is a problem.

In practice, MAP adaptation is used more for domain adaptation than for speaker adaptation.


MLLR

Maximum likelihood linear regression: a linear transformation of the Gaussian mean vectors and/or covariances produces a speaker-adapted model:

$\mu_{\mathrm{MLLR}} = A\mu_0 + b, \qquad \Sigma_{\mathrm{MLLR}} = H\Sigma_0 H^{\top}$

Constrained MLLR (CMLLR): $H = A$.

Equivalent to transforming the features (with a scaling factor of $|A|$ when calculating Gaussian likelihoods):

$x_i^{\mathrm{CMLLR}} = A^{-1}x_i - A^{-1}b$

Transformation parameters are estimated with EM to maximize the likelihood of the adaptation data.
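A sketch of the feature-side application of an already-estimated CMLLR transform, including the log-determinant term that the $|A|$ scaling contributes to the log-likelihood; estimating $A$ and $b$ with EM is outside the scope of this snippet.

```python
import numpy as np

def apply_cmllr(features, A, b):
    """Apply x_hat = A^{-1} x - A^{-1} b to a [T x D] feature matrix.

    Returns the transformed features and the per-frame log-likelihood
    correction log|det A^{-1}| implied by the change of variables.
    """
    A_inv = np.linalg.inv(A)
    transformed = features @ A_inv.T - A_inv @ b
    log_det_correction = np.linalg.slogdet(A_inv)[1]
    return transformed, log_det_correction
```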


Regression Classes

Learning a single global transformation matrix is unlikely to yield good accuracy, and learning one for each Gaussian yields data sparsity issues.

Idea: learn a set of regression classes, each with its own transformation matrix.

The set of classes is learned by divisive clustering of the Gaussian components, thresholding each cluster by its size in number of data points.
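A simplified, flat sketch of the idea (real systems typically build a regression-class tree, per Gales, 1996): split the largest class with a 2-means step and keep the split only if both halves retain enough aligned frames.

```python
import numpy as np

def regression_classes(means, counts, min_count=1000, max_classes=8):
    """Divisive clustering of Gaussian components into regression classes.

    means:  [G x D] component mean vectors
    counts: [G] frames aligned to each component
    """
    classes = [np.arange(len(means))]        # start with one global class
    while len(classes) < max_classes:
        # Split the class with the most aligned frames.
        big = max(range(len(classes)), key=lambda c: counts[classes[c]].sum())
        idx = classes[big]
        if len(idx) < 2:
            break
        # Seed 2-means with the two most distant components.
        dists = np.linalg.norm(means[idx][:, None] - means[idx][None, :], axis=-1)
        c1, c2 = np.unravel_index(dists.argmax(), dists.shape)
        centers = means[idx][[c1, c2]].copy()
        for _ in range(10):
            assign = np.linalg.norm(
                means[idx][:, None] - centers[None, :], axis=-1).argmin(axis=1)
            for k in (0, 1):
                if (assign == k).any():
                    centers[k] = means[idx][assign == k].mean(axis=0)
        left, right = idx[assign == 0], idx[assign == 1]
        # Reject splits that would leave a class with too little data.
        if counts[left].sum() < min_count or counts[right].sum() < min_count:
            break
        classes[big:big + 1] = [left, right]
    return classes                           # one MLLR transform per class
```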


Speaker-Adaptive Training (SAT)

Baseline MLLR approach: first train speaker-independent models, then adapt model parameters with MLLR using adaptation data.

SAT idea: learn parameters of the models with MLLR transforms in place.

With CMLLR, this amounts to transforming the source data, which is easier to implement (see the sketch below).

• Also allows for MLLR to be used in conjunction with discriminative training (MMI).

MLLR can be combined with VTLN.
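A schematic SAT loop with CMLLR on the feature side, under the same hypothetical hooks as the earlier sketches (`estimate_cmllr` standing in for the EM transform estimation):

```python
def sat_train(speakers, train_model, estimate_cmllr, num_iters=4):
    """Speaker-adaptive training with feature-space (CMLLR) transforms.

    `speakers` maps speaker id -> list of [T x D] feature matrices;
    `estimate_cmllr(model, utts)` is an assumed EM routine returning
    (A, b); `train_model` re-estimates the HMM on the given features.
    """
    model = train_model(speakers)                    # plain SI start
    for _ in range(num_iters):
        normalized = {}
        for spk, utts in speakers.items():
            A, b = estimate_cmllr(model, utts)       # per-speaker transform
            normalized[spk] = [apply_cmllr(u, A, b)[0] for u in utts]
        model = train_model(normalized)              # train on normalized data
    return model
```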


References

• V. Digalakis, D. Rtischev, L. Neumeyer. Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures. IEEE Transactions on Speech and Audio Processing, 3:357-366, 1995.

• M. J. F. Gales. The Generation and Use of Regression Class Trees for MLLR Adaptation. Technical Report CUED/F-INFENG/TR263, Cambridge University Engineering Department, August 1996.

• M. J. F. Gales. Maximum Likelihood Linear Transformations for HMM-based Speech Recognition. Computer Speech & Language, 12:75-98, 1998.

• J. L. Gauvain, C. H. Lee. Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Transactions on Speech and Audio Processing, 2:291-298, 1994.

• C. H. Lee, C. H. Lin, B. H. Juang. A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models. IEEE Transactions on Signal Processing, 39:806-814, 1991.

• L. Lee, R. Rose. A Frequency Warping Approach to Speaker Normalization. IEEE Transactions on Speech and Audio Processing, 6(1):49-60, January 1998.


• C. J. Leggetter, P. C. Woodland. Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models. Computer Speech & Language, 9:171-185, 1995.

• D. Pye, P. C. Woodland. Experiments in Speaker Normalisation and Adaptation for Large Vocabulary Speech Recognition. In Proceedings, International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Munich, Germany, 1997.

• P. C. Woodland. Speaker Adaptation for Continuous Density HMMs: A Review. In Proceedings of the ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition. Sophia Antipolis, France, August 2001.
