1
Speech recognition and the EM algorithm
Karthik Visweswariah
IBM Research India
2
Speech recognition: The problem
Input: Audio data with a speaker saying a sentence in English
Output: string of words corresponding to the words spoken
Data resources: Large corpus (thousands of hours) of audio recordings with associated text
3
Agenda (for next two lectures)
Overview of the statistical approach to speech recognition
Discuss sub-components, indicating specific problems to be solved
Deeper dive into a couple of areas with general applicability:
EM algorithm: Maximum likelihood estimation, Gaussian mixture models, the EM algorithm itself, application to machine translation
Decision trees (?)
4
Evolution: 1960-present
Isolated digits: Filter bank analysis, time normalisation, dynamic programming
Isolated words, continuous digits: Pattern recognition, LPC analysis, clustering algorithms
Connected words: Statistical approaches, Hidden Markov Models
Continuous speech, large vocabulary: Speaker independence, speaker adaptation, discriminative training, deep learning
5
Early attempts
Trajectory of formant frequencies (resonant frequencies of vocal tract)
Automatic speech recognition: Brief history of the technology development, B. H. Juang et al., 2006
6
Simpler problem: Given the weight of a person, determine the person's gender
Clearly cannot be done deterministically
Model probabilistically
Joint distribution: P(gender, weight)
Bayes: P(gender | weight) = P(gender) P(weight | gender) / P(weight)
P(gender): Just count up the genders of persons in the database
P(weight | gender):
Non-parametric: Histogram of weights for each gender
Parametric: Assume a Normal (Gaussian) distribution; estimate mean/variance separately for each gender
Choose the gender with the higher posterior probability given the weight
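The Bayes rule above can be sketched in a few lines. This is a minimal illustration with made-up, hypothetical statistics (means/variances in kg, equal priors); a real system would estimate them by counting and averaging over a labeled database.

```python
import math

# Hypothetical per-gender statistics "estimated from a database".
# The numbers are illustrative only, not real population data.
params = {
    "male":   {"prior": 0.5, "mean": 80.0, "var": 100.0},
    "female": {"prior": 0.5, "mean": 62.0, "var": 64.0},
}

def gaussian_pdf(x, mean, var):
    # Density of a univariate Normal distribution.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(weight):
    # Bayes: P(gender | weight) is proportional to P(gender) * P(weight | gender);
    # P(weight) is the same for both classes, so it can be dropped.
    scores = {g: p["prior"] * gaussian_pdf(weight, p["mean"], p["var"])
              for g, p in params.items()}
    return max(scores, key=scores.get)

print(classify(85.0))  # male: 85 kg is much more likely under the male Gaussian
print(classify(58.0))  # female: 58 kg is much more likely under the female Gaussian
```

The same prior-times-likelihood structure, with P(x|w) far more elaborate, is exactly what the speech recognizer computes.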
7
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
8
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
9
Feature extraction: How do we represent the data?
Usually the most important step in data science or data analytics
Also a function of the amount of data
Converts data into a vector of real numbers
Represent documents for email classification into spam/non-spam: Counts of various characters (?) Counts of various words (?) Are all words equally important? Special characters used, colors used (?)
Predict attrition of an employee: Performance, salary, … Should we capture salary change as a percentage change rather than an absolute number? Should we look at the performance of the manager? Salaries of team members?
Interacts with the algorithm to be used downstream: Is the algorithm invariant to scale? Can the algorithm handle correlations in features? What other assumptions?
Domain/background knowledge comes in here
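The spam example above amounts to mapping each document to a fixed-length count vector. A minimal sketch, assuming a tiny hand-picked vocabulary (a real system would build the vocabulary from the training corpus):

```python
from collections import Counter

# Hypothetical toy vocabulary; illustrative only.
VOCAB = ["free", "winner", "meeting", "report", "$"]

def featurize(text):
    """Turn a document into a fixed-length vector of word counts."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

print(featurize("FREE $ $ winner claim your prize"))  # [1, 1, 0, 0, 2]
print(featurize("quarterly report meeting"))          # [0, 0, 1, 1, 0]
```

Every downstream choice on the slide (scaling, correlated counts, which words matter) is a choice about this vector.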
10
Signal processing
Input: Raw sampled audio, 11 kHz or 22 kHz on desktop, 8 kHz for telephony
Output: 40-dimensional features, 60-100 vectors per second
Ideally: Different sounds represented differently; unnecessary variations removed (noise, speaker, channel)
Match modeling assumptions
11
Signal processing (contd.)
Windowed FFT: Sounds are easier to distinguish in frequency space
Mel binning: Measure sensitivity to frequencies by listening experiments; sensitivity to a fixed difference in tone decreases with tone frequency
Log scale: Humans perceive volume on roughly a log scale
Decorrelate data (use DCT); the pipeline up to this point is called MFCC
Subtract mean: Scale invariance, channel invariance
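The MFCC steps above can be sketched end to end for a single frame. This is a simplified illustration (textbook mel-scale formula, a direct DCT-II, toy frame sizes), not the exact IBM front end; per-utterance mean subtraction would follow once all frames are computed.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: equal mel steps approximate equal
    # perceived pitch steps (sensitivity drops at higher frequencies).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = 700.0 * (10 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, fbank, n_ceps=13):
    # Windowed FFT -> mel binning -> log -> DCT (decorrelation).
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    mel_energies = np.log(fbank @ spectrum + 1e-10)
    # Type-II DCT computed directly from its definition.
    n = len(mel_energies)
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return dct @ mel_energies

# One 25 ms frame of a synthetic 8 kHz tone, just to exercise the pipeline.
sr, n_fft = 8000, 200
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 440 * t)
ceps = mfcc_frame(frame, mel_filterbank(24, n_fft, sr))
print(ceps.shape)  # (13,)
```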
12
Signal processing (contd.)
Model dynamics: Concatenate the previous and next few feature vectors
Project down to throw away noise/reduce computation (Linear/Fisher Discriminant Analysis)
Linear transform learned to match the diagonal Gaussian modeling assumption
13
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
14
Language modeling
15
Language modeling
16
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
17
Acoustic modeling
Need to model acoustic sequences given words: P(x|w)
Obviously cannot create a model for every word
Need to break words into fundamental sounds: Cat → K AE T; represent the pronunciation using phonemes. At IBM we used 40-50 phonemes for English
Dictionaries: Hand-created lists of words with their alternate pronunciations
Handling new words: Automatic generation of pronunciations from spellings
Clearly a tricky task for English, e.g. foreign names
18
Acoustic modeling (contd)
19
Acoustic modeling (contd.)
20
Acoustic modeling (contd.)
Pronunciations change in continuous speech depending on neighboring words: "Give me" might sound more like "gimme"
Emission probabilities should depend on context
Use a different distribution for each different context? Even with 40 phonemes, looking two phones to either side gives us 2.5 million possibilities => way too many
Learn in which contexts the acoustics differ: Tie together contexts using a decision tree
At each node, allowed to ask questions about (typically) two phones to the left and right, e.g. "Is the first phoneme to the right a glottal stop?"
Use entropy gain to grow the tree
End up with 2,000 to 10,000 context-dependent states from 120 context-independent states
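Growing the tree by entropy gain means: for each candidate context question, measure how much the split reduces label entropy. A minimal sketch on hypothetical toy data (real systems score splits by acoustic likelihood over Gaussian statistics, but the entropy-gain mechanics are the same):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a list of discrete labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_gain(samples, question):
    """Gain from splitting (context, label) samples with a yes/no question
    asked about the phonetic context."""
    yes = [lab for ctx, lab in samples if question(ctx)]
    no = [lab for ctx, lab in samples if not question(ctx)]
    n = len(samples)
    before = entropy([lab for _, lab in samples])
    after = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return before - after

# Hypothetical samples: (right-context phone, acoustic cluster label).
samples = [("T", "a"), ("T", "a"), ("K", "a"),
           ("N", "b"), ("M", "b"), ("N", "b")]
is_nasal = lambda ctx: ctx in {"N", "M"}

# "Is the phone to the right a nasal?" separates the two clusters
# perfectly here, so the gain equals the full initial entropy of 1 bit.
print(entropy_gain(samples, is_nasal))  # 1.0
```

At each node the tree picks the question with the highest gain and recurses, stopping when no split helps enough; the leaves become the tied context-dependent states.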
21
Acoustic modeling
22
How a speech recognizer works
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
23
Search
24
Search
Current approach is to precompile the language model, dictionary, phone HMMs and decision tree into a complete graph
Use Weighted Finite State Machine technology heavily
Complications: The space of words is large (five-gram language model); context-dependent acoustic models look across word boundaries
Need to prune to keep search running at reasonable speeds:
Throw away states that are far enough below the best state
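The pruning rule above is beam pruning: keep only partial hypotheses whose score is within a fixed margin of the current best. A minimal sketch over hypothetical (state, log-probability) pairs:

```python
def prune(hypotheses, beam):
    """Keep only hypotheses whose log-probability is within `beam`
    of the best one; everything further behind is discarded."""
    best = max(score for _, score in hypotheses)
    return [(state, score) for state, score in hypotheses
            if score >= best - beam]

# Hypothetical partial hypotheses at one time step.
hyps = [("s1", -10.0), ("s2", -12.5), ("s3", -25.0), ("s4", -11.0)]
print(prune(hyps, beam=5.0))  # s3 is 15 below the best and gets dropped
```

A wider beam means fewer search errors but slower decoding; tuning this trade-off is a standard part of building a real-time system.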
25
Speaker/condition dependent systems
Humans can certainly do better with a little data: "adapt" to an unfamiliar accent or noise
With minutes of data we can certainly do better
Could change our acoustic models (Gaussian Mixture Models) based on the new data
Can change the signal processing
Techniques described work even without supervision: Do a speaker-independent decode, and pretend that the obtained word sequence is the truth
26
Adaptation
Signal processing
Search: argmax_w P(w) P(x|w)
Acoustic model: P(x|w)
Language model: P(w)
Audio in → words out
Feature vectors: x
27
Vocal tract length normalization
Different speakers have different vocal tract lengths: the frequency axis is stretched or squished
At test time, estimate this frequency stretching/squishing and undo it
Just a single parameter, quantized to 10 different values; try each value and pick the one that gives the best likelihood
To get the full benefit, need to retrain in this canonical feature space: Gaussian Mixture Models and decision trees benefit from being trained in this "cleaned up" feature space
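The warp-factor search is just a one-dimensional grid search. A minimal sketch, assuming a hypothetical scoring function `log_likelihood(features, warp)` that re-warps the frequency axis and scores the utterance under the acoustic model (here replaced by a toy stand-in):

```python
def pick_warp(features, log_likelihood):
    # Single parameter, quantized to 10 values: try each, keep the best.
    warps = [0.80 + 0.04 * i for i in range(10)]  # 0.80 .. 1.16
    return max(warps, key=lambda a: log_likelihood(features, a))

# Toy stand-in for the acoustic model: likelihood peaks when the warp
# matches a hypothetical speaker's true stretch factor of 1.08.
toy_ll = lambda feats, a: -(a - 1.08) ** 2
print(pick_warp(None, toy_ll))  # picks the grid value closest to 1.08
```

In a real system each candidate warp requires re-running the front end and rescoring, which is why the grid is kept coarse.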
28
Adaptation of models
29
Adaptation of features
30
Improvements obtained
Conversational telephony data, test set from a call center
Training data: 2000 hours of Fisher data (0.7 billion frames of acoustic data)
Language model built with hundreds of millions of words from various sources, including data from the domain of interest (call center conversations, IT help desk)
Roughly 30 million parameters in the acoustic model
System performance measured by word error rate:
Speaker independent system: 34.0%
Vocal Tract Length Normalized system: 29.0%
Linear transform adaptation: 27.5%
Discriminative feature space: 23.8%
Discriminative training of model: 22.6%
It's hard work to improve on the best systems; no silver bullet!
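Word error rate, the metric used above, is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch using standard Levenshtein dynamic programming:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note WER can exceed 100% when the hypothesis contains many insertions.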
31
Current state of the art
Progress tracked on Switchboard (conversational telephony test set)

System | Word error rate
1995 "high performance HMM recognizer" | 45%
Cambridge Univ. (2000) | 19.3%
2004 IBM system | 15.2%
2015 IBM system (with deep learning) | 8%
Estimate of human performance | 4%

Replace GMMs for acoustic modeling with deep networks
Source: http://arxiv.org/pdf/1505.05899v1.pdf
32
Conclusions
Gave a brief overview of various components in practical state of the art speech recognition systems
Speech recognition technology has relied on generative statistical models with parameters learned from data
Moved away from hand-coded knowledge
Discriminative estimation techniques are more expensive but give significant improvements
Deep learning has shown significant gains for speech recognition
Speech recognition systems are good enough to support several useful applications
But they are still sensitive to variations that humans can handle with ease