discriminative phonetic recognition with conditional random fields jeremy morris & eric...

12
Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler- Lussier The Ohio State University Speech & Language Technologies Lab HLT-NAACL 2006 Computationally Hard Problems and Joint Inferen in Speech and Language Processing Workshop June 9, 2006

Upload: mercy-glenn

Post on 21-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

Discriminative Phonetic Recognition with Conditional

Random Fields

Jeremy Morris & Eric Fosler-Lussier

The Ohio State University

Speech & Language Technologies Lab

HLT-NAACL 2006 Computationally Hard Problems and Joint Inference

in Speech and Language Processing WorkshopJune 9, 2006

Page 2: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

2

Introduction

• Conditional Random Fields (CRFs) offer some benefits over traditional HMM models for sequence labeling– Direct model of the posterior probability of a label

sequence given an observation– Make no assumptions about independence of

observations

• The lack of an independence assumption make CRFs an attractive model for speech recognition

• We are interested in combining together arbitrary speech attributes to build a hypothesis of the observed speech

Page 3: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

3

Speech Attributes• Two different types of speech attributes:

– Phone classes are trained to indicate when a particular timeslice of speech is a particular phone (e.g. /t/, /v/ etc.)

– Phonological feature classes are trained to indicate when a particular timeslice of speech exhibits a particular phonological feature

/t/Manner: stop

Place of articulation: dentalVoicing: unvoiced

Page 4: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

4

Speech Attributes• Two different types of speech attributes:

– Phone classes are trained to indicate when a particular timeslice of speech is a particular phone (e.g. /t/, /v/ etc.)

– Phonological feature classes are trained to indicate when a particular timeslice of speech exhibits a particular phonological feature

/t/Manner: stop

Place of articulation: dentalVoicing: unvoiced

/d/Manner: stop

Place of articulation: dentalVoicing: voiced

Page 5: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

5

Speech Attributes• Two different types of speech attributes:

– Phone classes are trained to indicate when a particular timeslice of speech is a particular phone (e.g. /t/, /v/ etc.)

– Phonological feature classes are trained to indicate when a particular timeslice of speech exhibits a particular phonological feature

/t/Manner: stop

Place of articulation: dentalVoicing: unvoiced

/d/Manner: stop

Place of articulation: dentalVoicing: voiced

/iy/Height: high

Backness: frontRoundness: nonround

Page 6: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

6

Speech Attributes

• Attribute classifiers are trained using MLP neural networks that emit posterior probabilities – P(attribute | acoustics)– These posteriors can also be viewed as

indicator functions for the given classes– Outputs are highly correlated with each other

• We want to combine the observations given by these indicator functions to get a hypothesis for the speech

Page 7: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

7

Tandem Systems

• HMM-based systems using neural network outputs as features (Hermansky and Ellis, 2000)– Neural network output is used

to train an HMM– HMMs assume that the

observed features are independent of each other

– Features are decorrelated through principal components analysis (PCA) before training and testing

/k/ /iy/ /iy/

Pr(attr|X) Pr(attr|X) Pr(attr|X)

X X X

Page 8: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

8

CRF System

• We implement a CRF model using the neural network outputs as state feature functions

e.g:

/k/ /iy/ /iy/

Pr(attr|X) Pr(attr|X) Pr(attr|X)

otherwise ,0

// if ),|//(),,(/d/

dyxdPtys

ttx

• Compare the results to a Tandem system trained on the same features

• No PCA decorrelation is performed on the CRF inputs

Page 9: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

9

Phone Accuracy Results

Tandem

(triphones)

CRF

(monophones)

Phones (all 61) 67.32% 66.89%

Phon. Features (all 43) 66.85% 63.84%

Combined (all 104) 66.80% 67.87%

Combined (top 39) 67.85%

Page 10: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

10

Discussion• The CRF model is much more conservative in its

generation than the Tandem model– Many fewer insertions, many more deletions

• All features CRF: 6500 deletions, 731 insertions• All features Tandem (top 39): 3184 deletions, 2511 insertions

• Label state space of the Tandem model is much larger than the CRF

• Transition information is currently unused– Adding transition feature functions built on observed data may

improve results

• Benefit of this model over traditional Tandem model is that arbitrary features can be easily added– We want to explore adding arbitrary features to the model to see

how performance changes (e.g. speaking rate, stress, pitch, etc.)

Page 11: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

11

Page 12: Discriminative Phonetic Recognition with Conditional Random Fields Jeremy Morris & Eric Fosler-Lussier The Ohio State University Speech & Language Technologies

12

Phone Precision Results

Tandem

(triphones)

CRF

(monophones)

Phones 73.93% 79.22%

Phon. Features 73.62% 76.61%

Combined 73.33% 79.49%

Combined (top 39) 74.43%