Introduction to Automatic Speech Recognition

Upload: duane-oliver

Post on 25-Dec-2015

Page 1: Introduction to Automatic Speech Recognition. Outline Define the problem What is speech? Feature Selection Models  Early methods  Modern statistical

Introduction to Automatic Speech Recognition

Outline

Define the problem

What is speech?

Feature Selection

Models

  Early methods

  Modern statistical models

Current State of ASR

Future Work

The ASR Problem

There is no single ASR problem

The problem depends on many factors:

Microphone: Close-mic, throat-mic, microphone array, audio-visual

Sources: band-limited, background noise, reverberation

Speaker: speaker dependent, speaker independent

Language: open/closed vocabulary, vocabulary size, read/spontaneous speech

Output: Transcription, speaker id, keywords

Performance Evaluation

Accuracy: percentage of tokens correctly recognized

Error Rate: complement of accuracy (100% minus accuracy)

Token Type: Phones, Words*, Sentences, Semantics?
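
As a rough illustration of scoring at the word level, the sketch below computes a word error rate as the edit distance between a reference and a hypothesis transcript, divided by the reference length. The function names are illustrative and not taken from the slides. Because insertions count as errors, this measure can exceed 100%.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Errors (substitutions + deletions + insertions) per reference word."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words.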

What is Speech?

Analog signal produced by humans

The speech signal can be thought of as decomposed into a source and a filter

The source is the vocal folds in voiced speech

The filter is the vocal tract and articulators
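
The source and filter view can be sketched in a few lines. The example below is a toy under stated assumptions: the source is an impulse train standing in for vocal fold pulses, and the filter is a single two-pole resonator standing in for one vocal tract resonance. The sample rate, pitch, and resonance values are arbitrary choices for illustration.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit pulse per pitch period (a crude voiced excitation)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator approximating a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * center_hz / sample_rate
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# 100 Hz pitch at an 8 kHz sample rate, shaped by a 500 Hz resonance.
source = impulse_train(100, 8000, 800)
speech_like = resonator(source, 500, 80, 8000)
```

A real vocal tract has several resonances that change over time; chaining more resonators with time-varying parameters would be the natural next step.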

Speech Production

(Three figure slides; the images are not preserved in this transcript.)

Speech Visualization

(Three figure slides; the images are not preserved in this transcript.)

Feature Selection

As in any data-driven task, the data must be represented in some format

Cepstral features have been found to perform well

They represent the "frequency of the frequencies" (a spectrum of the log spectrum)

Mel-frequency cepstral coefficients (MFCC) are the most common variety
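
To make the cepstrum idea concrete, here is a minimal sketch of a real cepstrum for one frame, computed as the inverse DFT of the log magnitude spectrum. It uses a naive DFT for clarity rather than speed. It is not a full MFCC pipeline, which would also apply a mel filterbank and a discrete cosine transform. The `hz_to_mel` formula is one commonly used variant, included only to show the mel warping.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    n = len(frame)
    # The small floor avoids log(0) for empty spectral bins.
    log_mag = [math.log(abs(s) + floor) for s in dft(frame)]
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def hz_to_mel(freq_hz):
    """A common mel-scale approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)
```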

Where do we stand?

Defined the multiple problems associated with ASR

Described how speech is produced

Illustrated how speech can be represented in an ASR system

Now that we have the data, how do we recognize the speech?

Radio Rex

First known attempt at speech recognition

A toy from 1922

Worked by analyzing the signal strength at 500 Hz

Actual speech recognition systems

Originally thought to be a relatively simple task requiring a few years of concerted effort

1969: "Whither Speech Recognition?" by John Pierce is published

A DARPA project ran from 1971-1976 in response to the statements in the Pierce article

We can examine a few general systems

Template-Based ASR

Originally only worked for isolated words

Performs best when training and testing conditions are best

For each word we want to recognize, we store a template or example based on actual data

Each test utterance is checked against the templates to find the best match

Uses the Dynamic Time Warping (DTW) algorithm

Dynamic Time Warping

Create a similarity matrix for the two utterances

Use dynamic programming to find the lowest cost path
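
The two steps above can be sketched as follows. This is a minimal version under simplifying assumptions: each frame is a single number and the local cost is an absolute difference, whereas a real system would compare feature vectors such as MFCCs. The `recognize` helper and its template dictionary are hypothetical names used for illustration.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

For instance, `recognize([1, 2, 2, 3], {"yes": [1, 2, 3], "no": [3, 2, 1]})` picks `"yes"`, because the repeated frame is absorbed by the warping at no cost.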

Hearsay-II

One of the systems developed during the DARPA program

A blackboard-based system utilizing symbolic problem solvers

Each problem solver was called a knowledge source (KS)

A complex scheduler was used to decide when each KS should be called

(Figure slide; the image is not preserved in this transcript.)

DARPA Results

The Hearsay-II system performed much better than the two other similar competing systems

However, only one system, Harpy, met the performance goals of the project

  The Harpy system was also a CMU-built system

  In many ways it was a predecessor to the modern statistical systems

Modern Statistical ASR

(Two figure slides; the images are not preserved in this transcript.)

Acoustic Model

For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes

Two methods are commonly used

  Multilayer perceptron (MLP) gives the probability of a class given the data (a posterior)

  Gaussian Mixture Model (GMM) gives the likelihood of the data given a class

Gaussian Distribution
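
The figure for this slide is not preserved. As a stand-in, the sketch below evaluates a one-dimensional Gaussian density and uses it to build the GMM likelihood mentioned on the acoustic model slide. It assumes scalar features and hand-picked parameters; a real acoustic model would use multivariate Gaussians over feature vectors with parameters estimated from training data. The class labels in the usage note are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmm_likelihood(x, components):
    """Likelihood of x under a mixture.

    components is a list of (weight, mean, var) tuples whose weights sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def classify(x, class_models):
    """Pick the class whose GMM assigns the frame the highest likelihood."""
    return max(class_models, key=lambda c: gmm_likelihood(x, class_models[c]))
```

With `class_models = {"aa": [(1.0, 0.0, 1.0)], "iy": [(0.5, 5.0, 1.0), (0.5, 6.0, 1.0)]}`, a frame value near zero is assigned to `"aa"`.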

Pronunciation Model

While the pronunciation model can be very complex, it is typically just a dictionary

The dictionary contains the valid pronunciations for each word

Examples:

  Cat: k ae t

  Dog: d ao g

  Fox: f aa k s
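
A dictionary of this kind maps naturally onto a lookup table. The sketch below is illustrative only: it stores a list of pronunciations per word so that variants can be added, and it writes the end of "fox" as `k s`, on the assumption that this is the intended phone sequence.

```python
# Each word maps to a list of valid pronunciations (lists of phone symbols).
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "fox": [["f", "aa", "k", "s"]],
}


def pronunciations(word):
    """All known pronunciations of a word, or an empty list if unknown."""
    return LEXICON.get(word.lower(), [])


def words_to_phones(words):
    """Expand a word sequence into phones using the first listed pronunciation."""
    phones = []
    for word in words:
        options = pronunciations(word)
        if not options:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(options[0])
    return phones
```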

Language Model

Now we need some way of representing the likelihood of any given word sequence

Many methods exist, but n-grams are the most common

N-gram models are trained by simply counting the occurrences of word sequences in a training set

N-grams

A unigram is the probability of any word in isolation

A bigram is the probability of a word given the previous word

Higher-order n-grams continue in a similar fashion

A backoff probability is used for any word sequence not seen in training
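
The counting and backoff ideas can be sketched as below. This is a deliberately simplified scheme: unseen bigrams fall back to a scaled unigram probability with a fixed weight, which is not a properly normalized backoff such as Katz backoff. The sentence markers `<s>` and `</s>` and the weight of 0.4 are assumptions made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, backoff_weight=0.4):
    """P(word | prev), backing off to a scaled unigram estimate if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff_weight * unigrams[word] / total


unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
```

Here `bigram_prob("the", "cat", unigrams, bigrams)` uses the observed count, while `bigram_prob("cat", "dog", unigrams, bigrams)` falls back to the unigram estimate.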

How do we put it together?

We now have models to represent the three parts of our equation

We need a framework to join these models together

The standard framework used is the Hidden Markov Model (HMM)
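
The equation referred to on this slide is not preserved in the transcript. A common formulation, assumed here rather than taken from the slides, writes the recognizer's goal as finding the most probable word sequence given the acoustic observations. The symbols are assumptions for illustration: O is the sequence of acoustic observations, Q a sequence of phone-level units, and W a word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} P(O \mid W)\, P(W)
        = \arg\max_{W} \sum_{Q}
          \underbrace{P(O \mid Q)}_{\text{acoustic model}}\;
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\;
          \underbrace{P(W)}_{\text{language model}}
```

The second step uses Bayes' rule and drops P(O), which does not depend on W. The last step assumes that the observations depend on the words only through the phone-level sequence.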

Markov Model

A state model using the Markov property

The Markov property states that the future depends only on the present state

Models the likelihood of transitions between states in a model

Given the model, we can determine the likelihood of any sequence of states
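
As a small illustration of the last point, the sketch below multiplies an initial probability by one transition probability per step. The two-state weather model and its numbers are invented for the example.

```python
def sequence_probability(states, initial, transitions):
    """Probability of a non-empty state sequence under a first-order Markov model."""
    prob = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        # Markov property: each step depends only on the previous state.
        prob *= transitions[prev].get(curr, 0.0)
    return prob


# Hypothetical toy model.
initial = {"rain": 0.5, "sun": 0.5}
transitions = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun": {"rain": 0.2, "sun": 0.8},
}
```

For example, `sequence_probability(["sun", "sun", "rain"], initial, transitions)` is 0.5 × 0.8 × 0.2.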

Hidden Markov Model

Similar to a Markov model, except the states are hidden

We now have observations tied to the individual states

We no longer know the exact state sequence given the data

Allows for the modeling of an underlying unobservable process
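
Because the state sequence is unknown, the likelihood of a set of observations has to account for every state sequence that could have produced it. A minimal sketch of the standard forward recursion is shown below. The two states, the observation symbols, and all probabilities are invented for the example.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """P(observations) under an HMM, summed over all hidden state sequences."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model with two hidden states and two observation symbols.
states = ["A", "B"]
initial = {"A": 0.6, "B": 0.4}
transitions = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emissions = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
```

Summing `forward_likelihood` over every possible observation sequence of a fixed length gives 1, which is a quick sanity check on the model.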

HMMs for ASR

First we build an HMM for each phone

Next we combine the phone models based on the pronunciation model to create word level models

Finally, the word level models are combined based on the language model

We now have a giant network with potentially thousands or even millions of states
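
The composition steps can be sketched at the level of state names. The example below assumes three left-to-right states per phone, a common convention that the slides do not specify, and it ignores transition probabilities and the language model entirely. It only shows how quickly the state count grows with vocabulary size.

```python
# Hypothetical two-word dictionary.
LEXICON = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}


def phone_states(phone, states_per_phone=3):
    """State names for one left-to-right phone HMM."""
    return [f"{phone}_{k}" for k in range(states_per_phone)]


def word_states(word, lexicon, states_per_phone=3):
    """Concatenate phone models in dictionary order to form a word model."""
    states = []
    for phone in lexicon[word]:
        states.extend(phone_states(phone, states_per_phone))
    return states


def network_size(vocabulary, lexicon, states_per_phone=3):
    """Total number of states if every word gets its own copy of its phones."""
    return sum(len(word_states(w, lexicon, states_per_phone)) for w in vocabulary)
```

With nine states per three-phone word, a vocabulary of tens of thousands of words already yields hundreds of thousands of states before any language model structure is added.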

Decoding

Decoding happens in the same way as the previous example

For each time frame we need to maintain two pieces of information

  The likelihood of the best path ending at each state

  The previous state on that best path, for every state
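
This bookkeeping corresponds to what is commonly called the Viterbi algorithm. A minimal sketch follows, using an invented two-state model. It multiplies raw probabilities for readability; practical decoders usually work with log probabilities to avoid underflow and prune unlikely states.

```python
def viterbi(observations, states, initial, transitions, emissions):
    """Most likely hidden state path and its probability."""
    # Likelihood of the best path ending at each state after the first frame.
    likelihood = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per later frame: state -> best previous state
    for obs in observations[1:]:
        new_likelihood, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: likelihood[p] * transitions[p][s])
            new_likelihood[s] = (likelihood[best_prev] * transitions[best_prev][s]
                                 * emissions[s][obs])
            pointers[s] = best_prev
        likelihood = new_likelihood
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: likelihood[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, likelihood[last]


# Hypothetical toy model.
states = ["A", "B"]
initial = {"A": 0.6, "B": 0.4}
transitions = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emissions = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
```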

State of the Art

What works well

  Constrained vocabulary systems

  Systems adapted to a given speaker

  Systems in anechoic environments without background noise

  Systems expecting read speech

What doesn't work

  Large unconstrained vocabulary

  Noisy environments

  Conversational speech

Future Work

Better representations of audio based on human hearing

Better representation of acoustic elements based on articulatory phonology

Segmental models that do not rely on the simple frame-based approach

Resources

Hidden Markov Model Toolkit (HTK)

  http://htk.eng.cam.ac.uk/

CHiME (a freely available dataset)

  http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html

Machine Learning Lectures

  http://www.stanford.edu/class/cs229/

  http://www.youtube.com/watch?v=UzxYlbK2c7E