1
Speech recognition and the EM algorithm
Karthik Visweswariah
IBM Research India
2
Speech recognition: The problem
Input: Audio data with a speaker saying a sentence in English
Output: string of words corresponding to the words spoken
Data resources: Large corpus (thousands of hours) of audio recordings with associated text
3
Agenda (for next two lectures)
Overview of the statistical approach to speech recognition
Discuss sub-components, indicating specific problems to be solved
Deeper dive into a couple of areas with general applicability:
EM algorithm: Maximum likelihood estimation, Gaussian mixture models, the EM algorithm itself, application to machine translation
Decision trees (?)
4
Evolution: 1960-present
Isolated digits: Filter bank analysis, time normalisation, dynamic programming
Isolated words, continuous digits: Pattern recognition, LPC analysis, clustering algorithms
Connected words: Statistical approaches, Hidden Markov Models
Continuous speech, large vocabulary: Speaker independence, speaker adaptation, discriminative training, deep learning
5
Early attempts
Trajectory of formant frequencies (resonant frequencies of vocal tract)
Automatic speech recognition: Brief history of the technology development, B. H. Juang et al., 2006
6
Simpler problem: Given the weight of a person, determine the person's gender
Clearly cannot be done deterministically
Model probabilistically
Joint distribution: P(gender, weight)
Bayes: P(gender | weight) = P(gender) P(weight | gender) / P(weight)
P(gender): Just count up the genders of persons in the database
P(weight | gender):
Non-parametric: Histogram of weights for each gender
Parametric: Assume a Normal (Gaussian) distribution; estimate mean/variance separately for each gender
Choose the gender with the higher posterior probability given the weight
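The Bayes rule above can be sketched in a few lines. This is a minimal illustration with made-up, hypothetical statistics (means/variances in kg, equal priors); a real system would estimate them by counting and averaging over a labeled database.

```python
import math

# Hypothetical per-gender statistics "estimated from a database".
# The numbers are illustrative only, not real population data.
params = {
    "male":   {"prior": 0.5, "mean": 80.0, "var": 100.0},
    "female": {"prior": 0.5, "mean": 62.0, "var": 64.0},
}

def gaussian_pdf(x, mean, var):
    # Density of a univariate Normal distribution.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(weight):
    # Bayes: P(gender | weight) is proportional to P(gender) * P(weight | gender);
    # P(weight) is the same for both classes, so it can be dropped.
    scores = {g: p["prior"] * gaussian_pdf(weight, p["mean"], p["var"])
              for g, p in params.items()}
    return max(scores, key=scores.get)

print(classify(85.0))  # male: 85 kg is much more likely under the male Gaussian
print(classify(58.0))  # female: 58 kg is much more likely under the female Gaussian
```

The same prior-times-likelihood structure, with P(x|w) far more elaborate, is exactly what the speech recognizer computes.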
7
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
8
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
9
Feature extraction: How do we represent the data?
Usually the most important step in data science or data analytics
Also a function of the amount of data
Converts data into a vector of real numbers
Represent documents for email classification into spam/non-spam: Counts of various characters (?) Counts of various words (?) Are all words equally important? Special characters used, colors used (?)
Predict attrition of an employee: Performance, salary, … Should we capture salary change as a percentage change rather than an absolute number? Should we look at the performance of the manager? Salaries of team members?
Interacts with the algorithm to be used downstream: Is the algorithm invariant to scale? Can the algorithm handle correlations in features? What other assumptions?
Domain/background knowledge comes in here
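The spam example above amounts to mapping each document to a fixed-length count vector. A minimal sketch, assuming a tiny hand-picked vocabulary (a real system would build the vocabulary from the training corpus):

```python
from collections import Counter

# Hypothetical toy vocabulary; illustrative only.
VOCAB = ["free", "winner", "meeting", "report", "$"]

def featurize(text):
    """Turn a document into a fixed-length vector of word counts."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

print(featurize("FREE $ $ winner claim your prize"))  # [1, 1, 0, 0, 2]
print(featurize("quarterly report meeting"))          # [0, 0, 1, 1, 0]
```

Every downstream choice on the slide (scaling, correlated counts, which words matter) is a choice about this vector.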
10
Signal processing
Input: Raw sampled audio, 11 kHz or 22 kHz on desktop, 8 kHz for telephony
Output: 40-dimensional features, 60-100 vectors per second
Ideally: Different sounds represented differently; unnecessary variations removed (noise, speaker, channel)
Match modeling assumptions
11
Signal processing (contd.)
Windowed FFT: Sounds are easier to distinguish in frequency space
Mel binning: Measure sensitivity to frequencies by listening experiments; sensitivity to a fixed difference in tone decreases with tone frequency
Log scale: Humans perceive volume on roughly a log scale
Decorrelate data (use DCT); the pipeline up to this point is called MFCC
Subtract mean: Scale invariance, channel invariance
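The MFCC steps above can be sketched end to end for a single frame. This is a simplified illustration (textbook mel-scale formula, a direct DCT-II, toy frame sizes), not the exact IBM front end; per-utterance mean subtraction would follow once all frames are computed.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: equal mel steps approximate equal
    # perceived pitch steps (sensitivity drops at higher frequencies).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = 700.0 * (10 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, fbank, n_ceps=13):
    # Windowed FFT -> mel binning -> log -> DCT (decorrelation).
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    mel_energies = np.log(fbank @ spectrum + 1e-10)
    # Type-II DCT computed directly from its definition.
    n = len(mel_energies)
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return dct @ mel_energies

# One 25 ms frame of a synthetic 8 kHz tone, just to exercise the pipeline.
sr, n_fft = 8000, 200
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 440 * t)
ceps = mfcc_frame(frame, mel_filterbank(24, n_fft, sr))
print(ceps.shape)  # (13,)
```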
12
Signal processing (contd.)
Model dynamics: Concatenate the previous and next few feature vectors
Project down to throw away noise/reduce computation (Linear/Fisher Discriminant Analysis)
Linear transform learned to match the diagonal Gaussian modeling assumption
13
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
14
Language modeling
15
Language modeling
16
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
17
Acoustic modeling
Need to model acoustic sequences given words: P(x|w)
Obviously cannot create a model for every word
Need to break words into fundamental sounds: Cat → K AE T; represent the pronunciation using phonemes. At IBM we used 40-50 phonemes for English
Dictionaries: Hand-created lists of words with their alternate pronunciations
Handling new words: Automatic generation of pronunciations from spellings
Clearly a tricky task for English, e.g. foreign names
18
Acoustic modeling (contd)
19
Acoustic modeling (contd.)
20
Acoustic modeling (contd.)
Pronunciations change in continuous speech depending on neighboring words: "Give me" might sound more like "gimme"
Emission probabilities should depend on context
Use a different distribution for each different context? Even with 40 phonemes, looking two phones to either side gives us 2.5 million possibilities => way too many
Learn in which contexts the acoustics differ: Tie together contexts using a decision tree
At each node, allowed to ask questions about (typically) two phones to the left and right, e.g. "Is the first phoneme to the right a glottal stop?"
Use entropy gain to grow the tree
End up with 2,000 to 10,000 context-dependent states from 120 context-independent states
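Growing the tree by entropy gain means: for each candidate context question, measure how much the split reduces label entropy. A minimal sketch on hypothetical toy data (real systems score splits by acoustic likelihood over Gaussian statistics, but the entropy-gain mechanics are the same):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a list of discrete labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_gain(samples, question):
    """Gain from splitting (context, label) samples with a yes/no question
    asked about the phonetic context."""
    yes = [lab for ctx, lab in samples if question(ctx)]
    no = [lab for ctx, lab in samples if not question(ctx)]
    n = len(samples)
    before = entropy([lab for _, lab in samples])
    after = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return before - after

# Hypothetical samples: (right-context phone, acoustic cluster label).
samples = [("T", "a"), ("T", "a"), ("K", "a"),
           ("N", "b"), ("M", "b"), ("N", "b")]
is_nasal = lambda ctx: ctx in {"N", "M"}

# "Is the phone to the right a nasal?" separates the two clusters
# perfectly here, so the gain equals the full initial entropy of 1 bit.
print(entropy_gain(samples, is_nasal))  # 1.0
```

At each node the tree picks the question with the highest gain and recurses, stopping when no split helps enough; the leaves become the tied context-dependent states.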
21
Acoustic modeling
22
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
23
Search
24
Search
Current approach is to precompile the language model, dictionary, phone HMMs and decision tree into a complete graph
Use Weighted Finite State Machine technology heavily
Complications: The space of words is large (five-gram language model); context-dependent acoustic models look across word boundaries
Need to prune to keep search running at reasonable speeds:
Throw away states that are far enough below the best state
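The pruning rule above is beam pruning: keep only partial hypotheses whose score is within a fixed margin of the current best. A minimal sketch over hypothetical (state, log-probability) pairs:

```python
def prune(hypotheses, beam):
    """Keep only hypotheses whose log-probability is within `beam`
    of the best one; everything further behind is discarded."""
    best = max(score for _, score in hypotheses)
    return [(state, score) for state, score in hypotheses
            if score >= best - beam]

# Hypothetical partial hypotheses at one time step.
hyps = [("s1", -10.0), ("s2", -12.5), ("s3", -25.0), ("s4", -11.0)]
print(prune(hyps, beam=5.0))  # s3 is 15 below the best and gets dropped
```

A wider beam means fewer search errors but slower decoding; tuning this trade-off is a standard part of building a real-time system.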
25
Speaker/condition dependent systems
Humans can certainly do better with a little data: "adapt" to an unfamiliar accent or noise
With minutes of data we can certainly do better
Could change our acoustic models (Gaussian Mixture Models) based on the new data
Can change the signal processing
Techniques described work even without supervision: Do a speaker-independent decode, and pretend that the obtained word sequence is the truth
26
Adaptation
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
27
Vocal tract length normalization
Different speakers have different vocal tract lengths: the frequency axis is stretched or squished
At test time, estimate this frequency stretching/squishing and undo it
Just a single parameter, quantized to 10 different values; try each value and pick the one that gives the best likelihood
To get the full benefit, need to retrain in this canonical feature space: Gaussian Mixture Models and decision trees benefit from being trained in this "cleaned up" feature space
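The warp-factor search is just a one-dimensional grid search. A minimal sketch, assuming a hypothetical scoring function `log_likelihood(features, warp)` that re-warps the frequency axis and scores the utterance under the acoustic model (here replaced by a toy stand-in):

```python
def pick_warp(features, log_likelihood):
    # Single parameter, quantized to 10 values: try each, keep the best.
    warps = [0.80 + 0.04 * i for i in range(10)]  # 0.80 .. 1.16
    return max(warps, key=lambda a: log_likelihood(features, a))

# Toy stand-in for the acoustic model: likelihood peaks when the warp
# matches a hypothetical speaker's true stretch factor of 1.08.
toy_ll = lambda feats, a: -(a - 1.08) ** 2
print(pick_warp(None, toy_ll))  # picks the grid value closest to 1.08
```

In a real system each candidate warp requires re-running the front end and rescoring, which is why the grid is kept coarse.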
28
Adaptation of models
29
Adaptation of features
30
Improvements obtained
Conversational telephony data, test set from a call center
Training data: 2000 hours of Fisher data (0.7 billion frames of acoustic data)
Language model built with hundreds of millions of words from various sources, including data from the domain of interest (call center conversations, IT help desk)
Roughly 30 million parameters in the acoustic model
System performance measured by word error rate:
Speaker independent system: 34.0%
Vocal Tract Length Normalized system: 29.0%
Linear transform adaptation: 27.5%
Discriminative feature space: 23.8%
Discriminative training of model: 22.6%
It's hard work to improve on the best systems; no silver bullet!
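Word error rate, the metric used above, is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch using standard Levenshtein dynamic programming:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note WER can exceed 100% when the hypothesis contains many insertions.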
31
Current state of the art
Progress tracked on Switchboard (conversational telephony test set)

System | Word error rate
1995 "high performance HMM recognizer" | 45%
Cambridge Univ. (2000) | 19.3%
2004 IBM system | 15.2%
2015 IBM system (with deep learning) | 8%
Estimate of human performance | 4%

Replace GMMs for acoustic modeling with deep networks
Source: http://arxiv.org/pdf/1505.05899v1.pdf
32
Conclusions
Gave a brief overview of various components in practical state of the art speech recognition systems
Speech recognition technology has relied on generative statistical models with parameters learned from data
Moved away from hand-coded knowledge
Discriminative estimation techniques are more expensive but give significant improvements
Deep learning has shown significant gains for speech recognition
Speech recognition systems are good enough to support several useful applications
But they are still sensitive to variations that humans can handle with ease