
ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent. SPACE Symposium, 06/02/09

Development of the SPACE intelligibility assessment method

Catherine Middag, Gwen Van Nuffelen,

Jean-Pierre Martens, Marc De Bodt


Introduction

• Intelligibility = popular measure for pathological speech assessment
• Perceptual assessment is affected by non-speech information:
– familiarity of the listener with the speaker and the type of disorder → hard to eliminate this subjective bias
– guessing on the basis of linguistic context → test material design must eliminate this bias
• Replacing the human listener by an automatic speech recognizer (ASR) can solve both problems, but is the ASR sufficiently reliable?
– test case: automation of the Dutch Intelligibility Assessment (DIA)


Dutch Intelligibility Assessment (DIA)

• 50 isolated (nonsense) words
• intelligibility = percent phonemes correct

Example from the slide:
– item 1: ".op", with the candidates ø b d f g h j k l m n p r s t v w z
– word list: 1. dop, 2. nuis, 3. top


How to apply ASR in the DIA?

• Two approaches
– let the ASR recognize the words and count the percentage of correct decisions
– let the ASR check how well, on average, the acoustics support the phonetic transcription of the target word (= alignment)
• Our experience
– intelligibility emerging from the first approach is insufficiently reliable
– therefore we developed a system based on alignment


System architecture : flow chart

[Flow chart: acoustic feature sequence Xt and target speech transcription → Speech aligner → speaker features → Intelligibility Prediction Model → objective score]



[Flow chart repeated: Speech aligner → speaker features → Model → objective score]

Two systems:
• complex state-of-the-art HMM-based system (ASR-ESAT)
• simple system with a phonological layer (ASR-ELIS), which points more directly to articulatory problems

Acoustic models trained on speech of normal adult speakers


ASR-ESAT

• Acoustic models
– state-of-the-art semi-continuous HMM
– triphone models trained on normal speech
– states tied using decision trees + phonological questions
• Output
– each frame t is assigned to a state st
– per frame: st, P(st|Xt)



[Flow chart repeated: Speech aligner → speaker features → Model → objective score]

Three feature sets:
• Phonemic features (patient has trouble pronouncing a certain phoneme)
• Phonological features (patient has problems with voicing, manner or place of articulation)
• NEW: context-dependent features (patient has problems with a desired change of voicing, manner or place of articulation)


Extraction of phonemic features (PMF)

Speech aligner = ASR-ESAT

Frame  Phoneme  P(st|Xt)
1      #        0.7
2      #        0.5
3      /p/      0.4
4      /p/      0.8
5      /o/      0.6
6      /o/      0.8
7      /l/      0.6
8      #        0.3

Phonemic features (mean of P(st|Xt) over the frames of each phoneme):
#   : (0.7+0.5+0.3)/3 = 0.5
/p/ : (0.4+0.8)/2 = 0.6
/o/ : (0.6+0.8)/2 = 0.7
/l/ : 0.6
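The averaging on this slide can be sketched in a few lines of Python. This is only an illustration, not the ASR-ESAT code: the function name `phonemic_features` and the list-of-pairs input format are assumptions, while the numbers are the frame posteriors from the slide.

```python
from collections import defaultdict

def phonemic_features(frames):
    """Average the aligned-state posterior P(st|Xt) over the frames of each phoneme."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, posterior in frames:
        totals[phoneme] += posterior
        counts[phoneme] += 1
    return {phoneme: totals[phoneme] / counts[phoneme] for phoneme in totals}

# (phoneme label, P(st|Xt)) for frames 1..8, copied from the slide
frames = [("#", 0.7), ("#", 0.5), ("/p/", 0.4), ("/p/", 0.8),
          ("/o/", 0.6), ("/o/", 0.8), ("/l/", 0.6), ("#", 0.3)]

pmf = phonemic_features(frames)
for phoneme, value in pmf.items():
    print(f"{phoneme}: {value:.2f}")
```

Note that the silence frames at the start and the end of the word are pooled into a single "#" feature, as in the slide's (0.7+0.5+0.3)/3.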


Extraction of phonological features (PLF)

Speech aligner = ASR-ELIS

Frame  Phone  voiced P(K1|Xt)  back P(K2|Xt)  burst P(K3|Xt)
1      #      0.1              0.1            0.2
2      #      0.1              0.1            0.1
3      /pcl/  0.2              0.1            0.1
4      /p/    0.2              0.2            0.6
5      /o/    0.8              0.7            0.2
6      /o/    0.6              0.9            0.0
7      /l/    0.5              0.5            0.1
8      #      0.1              0.1            0.0

Phonological features:
Burst  : 0.6
Back   : (0.7+0.9)/2
Voiced : (0.8+0.6+0.5)/3



Speech aligner = ASR-ELIS

[Same frame table as on the previous slide]

Not burst  : (0.2+0.1+…
Not back   : (0.1+0.1+…
Not voiced : (0.1+0.1+…


Irrelevant features for these phones

Speech aligner = ASR-ELIS

[Same frame table as on the previous slides; the slide points out feature values that are irrelevant for particular phones]
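The three PLF slides can be summarised in a short Python sketch. It is not the ASR-ELIS code. The `CANONICAL` table (which phone should carry which feature) is an assumption chosen so that the "present" averages match the slide; `None` would mark a feature as irrelevant for a phone, but the slides do not let us recover which entries are irrelevant, so none are used here.

```python
# Frame table from the slide: (phone, posteriors of the phonological classes)
frames = [
    ("#",     {"voiced": 0.1, "back": 0.1, "burst": 0.2}),
    ("#",     {"voiced": 0.1, "back": 0.1, "burst": 0.1}),
    ("/pcl/", {"voiced": 0.2, "back": 0.1, "burst": 0.1}),
    ("/p/",   {"voiced": 0.2, "back": 0.2, "burst": 0.6}),
    ("/o/",   {"voiced": 0.8, "back": 0.7, "burst": 0.2}),
    ("/o/",   {"voiced": 0.6, "back": 0.9, "burst": 0.0}),
    ("/l/",   {"voiced": 0.5, "back": 0.5, "burst": 0.1}),
    ("#",     {"voiced": 0.1, "back": 0.1, "burst": 0.0}),
]

# ASSUMPTION: canonical values per phone (True = present, False = absent,
# None = irrelevant). Illustrative only, not taken from the slides.
CANONICAL = {
    "#":     {"voiced": False, "back": False, "burst": False},
    "/pcl/": {"voiced": False, "back": False, "burst": False},
    "/p/":   {"voiced": False, "back": False, "burst": True},
    "/o/":   {"voiced": True,  "back": True,  "burst": False},
    "/l/":   {"voiced": True,  "back": False, "burst": False},
}

def phonological_features(frames, canonical):
    """Mean posterior per (feature, should_be_present); irrelevant frames are skipped."""
    sums = {}
    for phone, posteriors in frames:
        for feature, posterior in posteriors.items():
            target = canonical[phone].get(feature)
            if target is None:  # feature irrelevant for this phone
                continue
            entry = sums.setdefault((feature, target), [0.0, 0])
            entry[0] += posterior
            entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

plf = phonological_features(frames, CANONICAL)
print(plf[("burst", True)], plf[("back", True)], plf[("voiced", True)])
```

The keys with `False` correspond to the "Not burst", "Not back" and "Not voiced" averages; their values depend on the assumed table, so they should not be read as results from the slides.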


Extraction of context-dependent phonological features (CD-PLF)

• How well is a change in PLF realized?
– use the PLF target in the preceding/succeeding phone as context
– binary features → two values for the target (present/absent)
– binary features → restricted number of left & right contexts
• Left or right context can be
– present, absent, not relevant, silence
• Model selection (preliminary)
– maximum 4 * 2 * 4 = 32 CD-PLFs per PLF → 768 in total
– select only those CD-PLFs occurring at least twice in every test → 123 in total
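The counting on this slide can be checked with a small Python sketch. The number of PLFs (24) is not stated on the slide; it is inferred here from 768 / 32. The helper `select_frequent` and its toy counts are hypothetical and only illustrate the "at least twice in every test" rule.

```python
from itertools import product

CONTEXTS = ("present", "absent", "not relevant", "silence")  # left or right context
TARGETS = ("present", "absent")                              # binary target value

# Every (left context, target, right context) combination for one PLF
combinations = list(product(CONTEXTS, TARGETS, CONTEXTS))
n_per_plf = len(combinations)   # 4 * 2 * 4
n_plf = 768 // n_per_plf        # number of PLFs implied by the slide's total

def select_frequent(counts_per_test, min_count=2):
    """Keep the features that occur at least min_count times in every test."""
    candidates = set(counts_per_test[0])
    return {f for f in candidates
            if all(counts.get(f, 0) >= min_count for counts in counts_per_test)}

# Hypothetical occurrence counts of two features in two tests
toy_counts = [{"a": 2, "b": 1}, {"a": 3, "b": 5}]
print(n_per_plf, n_plf, select_frequent(toy_counts))
```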


Extraction of context-dependent phonological features (CD-PLF)

Speech aligner = ASR-ELIS

Segment  Phone  voiced  burst  …
2        #      0.1     0.2
3        /pcl/  0.2     0.2
4        /p/    0.2     0.6
6        /o/    0.6     0.1
7        /s/    0.4     0.3
8        #      0.2     0.1
9        /m/    0.7     0.3
10       /A/    0.8     0.0
11       /l/    0.6     0.1
12       #      0.1     0.1

CD-PLF features (left context, target, right context):

voicing                  burst
Off, on, off : +0.6      Yes, no, no : +0.1
On, on, on   : +0.8      No, no, no  : +0.0
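A minimal Python sketch of this bookkeeping follows. It assumes that each CD-PLF collects the posterior of the feature in the target segment, keyed by the canonical feature value of the left neighbour, the target and the right neighbour, and that the collected values are averaged (the slide only shows the "+" contributions). The `CANONICAL` table is an assumption consistent with the four contributions on the slide; the "not relevant" context is not modelled, and "on"/"off" is used for both features where the slide writes "yes"/"no" for burst.

```python
from collections import defaultdict

SILENCE = "#"

# ASSUMPTION: canonical feature values per phone (illustrative only)
CANONICAL = {
    "/pcl/": {"voiced": "off", "burst": "off"},
    "/p/":   {"voiced": "off", "burst": "on"},
    "/o/":   {"voiced": "on",  "burst": "off"},
    "/s/":   {"voiced": "off", "burst": "off"},
    "/m/":   {"voiced": "on",  "burst": "off"},
    "/A/":   {"voiced": "on",  "burst": "off"},
    "/l/":   {"voiced": "on",  "burst": "off"},
}

# Segment table from the slide, in order: (phone, posteriors)
segments = [
    ("#",     {"voiced": 0.1, "burst": 0.2}),
    ("/pcl/", {"voiced": 0.2, "burst": 0.2}),
    ("/p/",   {"voiced": 0.2, "burst": 0.6}),
    ("/o/",   {"voiced": 0.6, "burst": 0.1}),
    ("/s/",   {"voiced": 0.4, "burst": 0.3}),
    ("#",     {"voiced": 0.2, "burst": 0.1}),
    ("/m/",   {"voiced": 0.7, "burst": 0.3}),
    ("/A/",   {"voiced": 0.8, "burst": 0.0}),
    ("/l/",   {"voiced": 0.6, "burst": 0.1}),
    ("#",     {"voiced": 0.1, "burst": 0.1}),
]

def context_value(phone, feature):
    """Canonical value of a neighbouring phone, or 'silence'."""
    return "silence" if phone == SILENCE else CANONICAL[phone][feature]

def cd_plf(segments):
    """Mean target-segment posterior per (feature, left, target, right) key."""
    values = defaultdict(list)
    for i in range(1, len(segments) - 1):
        phone, posteriors = segments[i]
        if phone == SILENCE:
            continue
        left, right = segments[i - 1][0], segments[i + 1][0]
        for feature, posterior in posteriors.items():
            key = (feature, context_value(left, feature),
                   CANONICAL[phone][feature], context_value(right, feature))
            values[key].append(posterior)
    return {key: sum(v) / len(v) for key, v in values.items()}

features = cd_plf(segments)
print(features[("voiced", "off", "on", "off")])
print(features[("burst", "on", "off", "off")])
```

In this sketch the /o/ segment produces the "off, on, off" voicing entry and the "yes, no, no" burst entry, and the /A/ segment produces "on, on, on" and "no, no, no", matching the four contributions listed on the slide.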



[Flow chart repeated: Speech aligner → speaker features → Model → objective score]


Intelligibility prediction model (IPM)

• Objective: map speaker features (PMF, PLF, CD-PLF or combinations) to a speaker intelligibility score
• Model training
– train on DIA recordings
– pathological speakers (+ some normal control speakers)
• Model type and size
– limited number of pathological speakers → linear regression model
– high number of features → feature selection
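As an illustration of "linear regression + feature selection", here is a minimal NumPy sketch. The slides do not say which selection procedure was used; greedy forward selection on the training error is swapped in here as a simple stand-in, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = w0 + X @ w; returns [w0, w1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    return weights

def rms_error(X, y, weights):
    predictions = weights[0] + X @ weights[1:]
    return float(np.sqrt(np.mean((predictions - y) ** 2)))

def forward_selection(X, y, max_features):
    """Greedily add the feature that lowers the training RMS error most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            weights = fit_linear(X[:, cols], y)
            scores.append((rms_error(X[:, cols], y, weights), j))
        _, best = min(scores)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic stand-in for speaker features and perceptual scores
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 80 + 5 * X[:, 2] - 3 * X[:, 5] + rng.normal(scale=0.1, size=60)

selected = forward_selection(X, y, max_features=2)
print(sorted(selected))
```

In practice the selection criterion would be a held-out error rather than the training error, since the training error always decreases when a feature is added.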


Reference material (DIA)

• 211 speakers:
– 51 normals
– 60 dysarthric
– 12 clefts (children)
– 42 hearing impaired
– 37 with laryngectomy
– 7 with dysphonia
– 2 others
• Pathological speakers: mean of 78.7%
• Normals: mean of 93.3%
• Few with very low score

[Figure: histogram of the human scores; x-axis: human score (0 to 100), y-axis: number of patients]


Solving microphone issues

• Two microphones were used (Shure and Sony)
• The difference can be found in the cepstral means (cepstral mean subtraction was performed)

[Figure: cepstral means of the recordings, labelled by microphone (shure, sony)]
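Cepstral mean subtraction itself is a one-line operation; the sketch below shows why it removes a constant microphone-dependent shift. Whether the mean was taken per recording or per speaker is not stated on the slide; per recording is assumed here, and the data and the offset are synthetic.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames of one recording.

    cepstra: array of shape (n_frames, n_coefficients).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
recording = rng.normal(size=(200, 12))        # synthetic cepstral frames
channel_offset = np.linspace(-3.0, 3.0, 12)   # hypothetical microphone shift

normalized = cepstral_mean_subtraction(recording)
normalized_shifted = cepstral_mean_subtraction(recording + channel_offset)

# A constant offset in the cepstral domain disappears after normalization
print(np.allclose(normalized, normalized_shifted))
```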


Training / validation

• Models chosen with five-fold cross validation
• Measure = standard deviation (STD): in case of normality, about 68% of the computed scores lie within ± STD of the perceptual score
• More features = more chance of overfitting
• Rule of thumb: take 1 feature for every 10 training examples → restrict number of features to maximum 15
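The validation procedure can be sketched as follows. It is assumed here that STD is the root-mean-square difference between computed and perceptual scores over the held-out speakers; the slides do not give the exact formula. Function names and data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = w0 + X @ w; returns [w0, w1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    return weights

def predict(weights, X):
    return weights[0] + X @ weights[1:]

def cross_validated_std(X, y, n_folds=5, seed=0):
    """RMS error of the held-out predictions over all folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    errors = np.empty(len(y))
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        weights = fit_linear(X[train], y[train])
        errors[held_out] = predict(weights, X[held_out]) - y[held_out]
    return float(np.sqrt(np.mean(errors ** 2)))

def max_features(n_training_examples, examples_per_feature=10):
    """Rule of thumb from the slide: one feature per 10 training examples."""
    return n_training_examples // examples_per_feature

# Synthetic, noise-free data: the held-out error should be close to zero
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 70 + X @ np.array([2.0, -1.0, 0.5])
std = cross_validated_std(X, y)
print(std < 1e-8)
```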


Results : individual systems

[Two scatter plots of computed score against perceptual score: PMFelis (STD = 9.52) and PMFesat (STD = 8.57)]


Results : individual systems

[Two scatter plots of computed score against perceptual score: PLF (elis) (STD = 9.35) and CD-PLF (elis) (STD = 8.48)]


Results : all systems

Model                     STD    N (features)
PMFesat                   8.57   15
PMFelis                   9.52   15
PLF                       9.35   15
CD-PLF                    8.48   15
PMFelis + PLF             8.20   15
PMFesat + PLF             8.00   13
PMFelis + CD-PLF          7.63   15
PLF + CD-PLF              8.04   15
PMFesat + CD-PLF          7.34   15
PMFelis + PLF + CD-PLF    7.48   15

• New models with CD-PLF outperform old PLF models
• CD-PLFs form best system with one feature set
• PMFesat + CD-PLF best system with combined feature sets
• Using three ELIS feature sets yields next best result and needs only one recognizer (the simplest one) → less complex system


Results : combined system

CD-PLF + PMFesat: STD = 7.34

[Figure: scatter plot of computed score against perceptual score]


Results : pathology-specific IPM

• Instead of creating one general IPM, one can create IPMs for specific pathologies:
– trained on all speakers (to have enough speakers)
– model selection based on performance on speakers of that pathology (importance of features depends on type of disorder)
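One way to read this recipe is sketched below: every candidate model is trained on all speakers, but candidates are compared only on the held-out speakers of the target pathology. The function names, the candidate feature sets and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    return weights

def predict(weights, X):
    return weights[0] + X @ weights[1:]

def pathology_std(X, y, labels, pathology, feature_set, n_folds=5, seed=0):
    """Cross-validated RMS error, counting only held-out speakers of one pathology."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    squared = []
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        # Train on ALL training speakers, whatever their pathology
        weights = fit_linear(X[np.ix_(train, feature_set)], y[train])
        target = held_out[labels[held_out] == pathology]
        errors = predict(weights, X[np.ix_(target, feature_set)]) - y[target]
        squared.extend(errors ** 2)
    return float(np.sqrt(np.mean(squared)))

def select_for_pathology(X, y, labels, pathology, candidate_sets):
    """Pick the feature set with the lowest error on the target pathology."""
    return min(candidate_sets,
               key=lambda fs: pathology_std(X, y, labels, pathology, fs))

# Synthetic data: feature 0 matters for "DYS" speakers, feature 1 for "LAR"
rng = np.random.default_rng(0)
labels = np.array(["DYS", "LAR"] * 30)
X = rng.normal(size=(60, 2))
y = 80 + np.where(labels == "DYS", 6 * X[:, 0], 6 * X[:, 1])

candidates = [(0,), (1,)]
best_dys = select_for_pathology(X, y, labels, "DYS", candidates)
best_lar = select_for_pathology(X, y, labels, "LAR", candidates)
print(best_dys, best_lar)
```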


Results : pathology-specific IPM (2)

Model                     DYS    LAR    HEAR
PMFesat                   8.44   8.32   7.48
PMFelis                   8.10   5.88   9.73
PLF                       8.27   7.17   8.05
CD-PLF                    6.49   5.70   6.87
PMFelis + PLF             6.97   5.14   6.63
PMFesat + PLF             6.87   6.49   6.20
PMFelis + CD-PLF          6.50   3.54   6.05
PLF + CD-PLF              6.32   5.82   6.17
PMFesat + CD-PLF          6.69   4.86   5.27
PMFelis + PLF + CD-PLF    6.32   3.68   5.73

• Very good match in case CD-PLFs are involved
• New models with CD-PLF outperform old PLF models
• CD-PLFs form best system with one feature set
• Using three ELIS feature sets yields (almost) best result and needs only one recognizer (the simplest one) → less complex system


Results : pathology-specific IPM

• Dysarthria: STD = 6.32 (red circles)
• Dispersion of other speakers is increased
• Largest deviations in low intelligibility area:
– scarce data in that area
– can be solved by adding more weight to patients with very low intelligibility

[Figure: scatter plot of computed score against perceptual score; dysarthric speakers shown as red circles]
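The proposed remedy, giving more weight to patients with very low intelligibility, could be realised with a weighted least-squares fit. The sketch below is only one possible reading: the threshold of 60 and the weight of 3 are arbitrary illustrative values, not taken from the slides, and the data are synthetic.

```python
import numpy as np

def fit_weighted_linear(X, y, sample_weights):
    """Weighted least squares: minimizes sum_i w_i * (y_i - prediction_i)^2."""
    A = np.column_stack([np.ones(len(X)), X])
    root = np.sqrt(np.asarray(sample_weights, dtype=float))
    weights, *_ = np.linalg.lstsq(A * root[:, None], y * root, rcond=None)
    return weights

def low_score_weights(scores, threshold=60.0, boost=3.0):
    """ASSUMPTION: speakers below `threshold` count `boost` times as much."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores < threshold, boost, 1.0)

# Synthetic, noise-free example: any positive weighting recovers the same model
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 75 + 10 * X[:, 0] - 4 * X[:, 1]

model = fit_weighted_linear(X, y, low_score_weights(y))
print(np.round(model, 3))
```

With noisy real data the weighting shifts the fit towards the sparsely populated low-intelligibility region, at the cost of a somewhat larger error elsewhere.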


Conclusions and future work

• PMF, PLF and CD-PLF can predict intelligibility of pathological speech:
– CD-PLFs seem to play an important role:
  • STD = 7.34 for the general model combining CD-PLF and PMFesat
  • STDs of at most 6.32 for the pathology-specific models using the 3 ELIS feature sets
  → not the articulation pattern but the change in the articulation pattern matters?
– More research is needed before adding this feature set to the tool
– Results on the validation set compete with human inter-rater agreements
• Future work:
– more profound articulatory assessment, which is directly related to determination of appropriate therapy
– monitoring of effectiveness of chosen therapy
– using more natural speech (words, phrases) in tests


• Questions?
