ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent | SPACE Symposium - 06/02/09
Development of the SPACE intelligibility assessment method
Catherine Middag, Gwen Van Nuffelen,
Jean-Pierre Martens, Marc De Bodt
Introduction
• Intelligibility = popular measure for pathological speech assessment
• Perceptual assessment affected by non-speech information:
– familiarity of listener with speaker and type of disorder: hard to eliminate this subjective bias
– guessing on the basis of linguistic context: test material design must eliminate this bias
• Replacing the human listener by an automatic speech recognizer (ASR) can solve the two problems, but is the ASR sufficiently reliable?
– test case: automation of the Dutch Intelligibility Assessment (DIA)
Dutch Intelligibility Assessment (DIA)
• 50 isolated (nonsense) words
• intelligibility = percent phonemes correct
[Figure: example test form. Item 1 shows ".op" followed by the list ø b d f g h j k l m n p r s t v w z. Example words: 1. dop, 2. nuis, 3. top]
How to apply ASR in the DIA?
• Two approaches
– let ASR recognize the words and count the percentage of correct decisions
– let ASR check how well on average the acoustics support the phonetic transcription of the target word (= alignment)
• Our experience
– intelligibility emerging from first approach insufficiently reliable
– therefore we developed a system based on alignment
System architecture : flow chart
[Flow chart: acoustic feature sequence Xt + target speech transcription → Speech aligner → speaker features → Intelligibility Prediction Model → objective score]
Two systems:
• complex state-of-the-art HMM-based system (ASR-ESAT)
• simple system with a phonological layer (ASR-ELIS), which points more directly to articulatory problems
Acoustic models trained on speech of normal adult speakers
ASR - ESAT
• Acoustic models
– state-of-the-art Semi-Continuous HMM
– triphone models trained on normal speech
– states tied using decision trees + phonological questions
• Output
– each frame t assigned to state st
– per frame : st, P(st|Xt)
ASR - ELIS
• 24 binary phonological features concerning:
– voicing
– manner of articulation
– place of articulation
[Block diagram: Xt → PLF extractor → P(K1|Xt), …, P(K24|Xt) → Probability product model → P(S1|Xt), …, P(Sn|Xt) → Viterbi decoder (also fed the target speech transcription) → st, P(st|Xt), P(K1|Xt)..P(K24|Xt)]
Three feature sets:
• Phonemic features (patient has trouble pronouncing a certain phoneme)
• Phonological features (patient has problems with voicing, manner or place of articulation)
• NEW : context-dependent features (patient has problems with a desired change of voicing, manner or place of articulation)
Extraction of phonemic features (PMF)
Speech aligner = ASR-ESAT
Frame  Phoneme  P(st|Xt)
1      #        0.7
2      #        0.5
3      /p/      0.4
4      /p/      0.8
5      /o/      0.6
6      /o/      0.8
7      /l/      0.6
8      #        0.3
Phonemic features (mean of P(st|Xt) over the frames aligned to each phoneme):
# : (0.7+0.5+0.3)/3
/p/ : (0.4+0.8)/2
/o/ : (0.6+0.8)/2
/l/ : 0.6
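The averaging on this slide can be sketched in a few lines of Python. This is only an illustration: the frame values are copied from the table, and the helper name `phonemic_features` is hypothetical, not part of the SPACE tool.

```python
from collections import defaultdict

# Frame-level alignment copied from the slide: (phoneme label, P(st|Xt)).
frames = [
    ("#", 0.7), ("#", 0.5),
    ("/p/", 0.4), ("/p/", 0.8),
    ("/o/", 0.6), ("/o/", 0.8),
    ("/l/", 0.6),
    ("#", 0.3),
]

def phonemic_features(aligned_frames):
    """Average the frame posteriors per phoneme (hypothetical helper)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, posterior in aligned_frames:
        sums[phoneme] += posterior
        counts[phoneme] += 1
    return {phoneme: sums[phoneme] / counts[phoneme] for phoneme in sums}

pmf = phonemic_features(frames)
for phoneme, value in pmf.items():
    print(f"{phoneme}: {value:.2f}")
```

Each phoneme ends up with one value per speaker, regardless of how many frames were aligned to it.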
Extraction of phonological features (PLF)
Speech aligner = ASR-ELIS
Frame  Phone  voiced P(K1|Xt)  back P(K2|Xt)  burst P(K3|Xt)
1      #      0.1              0.1            0.2
2      #      0.1              0.1            0.1
3      /pcl/  0.2              0.1            0.1
4      /p/    0.2              0.2            0.6
5      /o/    0.8              0.7            0.2
6      /o/    0.6              0.9            0.0
7      /l/    0.5              0.5            0.1
8      #      0.1              0.1            0.0
Phonological features (mean over the frames in which the feature should be present):
Burst : 0.6
Back : (0.7+0.9)/2
Voiced : (0.8+0.6+0.5)/3
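A minimal sketch of this computation follows, assuming a lookup of which features each phone should carry. The `EXPECTED_ON` table is an assumption, chosen only so that it reproduces the three averages on the slide; the real canonical feature values of ASR-ELIS are not given in the slides.

```python
# Frame table copied from the slide: (phone, P(voiced), P(back), P(burst)).
frames = [
    ("#", 0.1, 0.1, 0.2),
    ("#", 0.1, 0.1, 0.1),
    ("/pcl/", 0.2, 0.1, 0.1),
    ("/p/", 0.2, 0.2, 0.6),
    ("/o/", 0.8, 0.7, 0.2),
    ("/o/", 0.6, 0.9, 0.0),
    ("/l/", 0.5, 0.5, 0.1),
    ("#", 0.1, 0.1, 0.0),
]
FEATURES = ("voiced", "back", "burst")

# ASSUMPTION: features expected to be present for each phone (illustrative only).
EXPECTED_ON = {
    "#": set(),
    "/pcl/": set(),
    "/p/": {"burst"},
    "/o/": {"voiced", "back"},
    "/l/": {"voiced"},
}

def positive_plf(frame_table):
    """Mean posterior of each feature over the frames where it should be present."""
    sums = {feature: 0.0 for feature in FEATURES}
    counts = {feature: 0 for feature in FEATURES}
    for phone, *posteriors in frame_table:
        for feature, posterior in zip(FEATURES, posteriors):
            if feature in EXPECTED_ON[phone]:
                sums[feature] += posterior
                counts[feature] += 1
    return {f: sums[f] / counts[f] for f in FEATURES if counts[f]}

plf = positive_plf(frames)
print(plf)
```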
Extraction of phonological features (PLF)
Phonological features for the frames in which the feature should be absent (same frame table, speech aligner = ASR-ELIS):
Not burst : (0.2+0.1+…
Not back : (0.1+0.1+…
Not voiced : (0.1+0.1+…
Extraction of phonological features (PLF)
Irrelevant features for these phones
[Same frame table, speech aligner = ASR-ELIS; the slide marks the feature values that are irrelevant for certain phones]
Extraction of context-dependent phonological features (CD-PLF)
• How well is change in PLF realized?
– use PLF target in preceding/succeeding phone as context
– binary features, hence two values for target (present/absent)
– binary features, hence restricted number of left & right contexts
• Left or right context can be
– present, absent, not relevant, silence
• Model selection (preliminary)
– maximum 4 * 2 * 4 = 32 CD-PLFs per PLF: 768 in total
– select only those CD-PLFs occurring at least twice in every test: 123 in total
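The counts above can be checked by enumerating the combinations. The sketch below assumes the 24 phonological features of ASR-ELIS mentioned earlier in the talk; the constant names are illustrative.

```python
from itertools import product

CONTEXT_VALUES = ("present", "absent", "not relevant", "silence")  # left or right context
TARGET_VALUES = ("present", "absent")                              # target phone
NUM_PLF = 24  # number of binary phonological features in ASR-ELIS

# Every (left context, target, right context) combination for one PLF.
per_plf = list(product(CONTEXT_VALUES, TARGET_VALUES, CONTEXT_VALUES))
total = NUM_PLF * len(per_plf)

print(len(per_plf), total)  # 32 768
```

Most of these combinations are rare or absent in a 50-word test, which is why the selection step keeps only 123 of them.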
Extraction of context-dependent phonological features (CD-PLF)
Speech aligner = ASR-ELIS
Segment  Phone  voiced  burst  …
2        #      0.1     0.2
3        /pcl/  0.2     0.2
4        /p/    0.2     0.6
6        /o/    0.6     0.1
7        /s/    0.4     0.3
8        #      0.2     0.1
9        /m/    0.7     0.3
10       /A/    0.8     0.0
11       /l/    0.6     0.1
12       #      0.1     0.1
CD-PLF features
voicing                 burst
Off, on, off : +0.6     Yes, no, no : +0.1
On, on, on : +0.8       No, no, no : +0.0
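One way to read this example is that each segment contributes its posterior to the feature keyed by (left context, target, right context). The sketch below follows that reading under stated assumptions: the canonical on/off values per phone are guesses that happen to reproduce the four contributions on the slide, the listed rows are treated as consecutive (segments 1 and 5 do not appear in the transcript), and silence segments are not used as targets.

```python
from collections import defaultdict

# Segment rows copied from the slide: (phone, P(voiced), P(burst)).
segments = [
    ("#", 0.1, 0.2), ("/pcl/", 0.2, 0.2), ("/p/", 0.2, 0.6),
    ("/o/", 0.6, 0.1), ("/s/", 0.4, 0.3), ("#", 0.2, 0.1),
    ("/m/", 0.7, 0.3), ("/A/", 0.8, 0.0), ("/l/", 0.6, 0.1),
    ("#", 0.1, 0.1),
]

# ASSUMPTION: canonical feature values per phone (illustrative only).
VOICED = {"#": "silence", "/pcl/": "off", "/p/": "off", "/o/": "on",
          "/s/": "off", "/m/": "on", "/A/": "on", "/l/": "on"}
BURST = {"#": "silence", "/pcl/": "off", "/p/": "on", "/o/": "off",
         "/s/": "off", "/m/": "off", "/A/": "off", "/l/": "off"}

def cd_plf(rows, canonical, column):
    """Mean posterior per (left, target, right) context; silence is never a target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(1, len(rows) - 1):
        target = canonical[rows[i][0]]
        if target == "silence":
            continue
        key = (canonical[rows[i - 1][0]], target, canonical[rows[i + 1][0]])
        sums[key] += rows[i][column]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

voicing = cd_plf(segments, VOICED, column=1)
burst = cd_plf(segments, BURST, column=2)
print(voicing[("off", "on", "off")], voicing[("on", "on", "on")])
print(burst[("on", "off", "off")], burst[("off", "off", "off")])
```

In this snippet each context occurs only once, so the means equal the single contributions; over a full test, the same context would collect values from many segments.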
Intelligibility prediction model (IPM)
• Objective: map speaker features (PMF, PLF, CD-PLF or combinations) to speaker intelligibility score
• Model training
– train on DIA recordings
– pathological speakers (+ some normal control speakers)
• Model type and size
– limited number of pathological speakers
– high number of features
• Resulting choice: linear regression model + feature selection
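To make "linear regression + feature selection" concrete, here is a small pure-Python sketch. The slides do not say which selection procedure was used, so greedy forward selection is an assumption here, the data are made up, and the error is measured on the training data only for brevity.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(rows, y, cols):
    """Least-squares weights (bias first) using only the feature columns in cols."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    aty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(ata, aty)

def rms_error(rows, y, cols, w):
    """Root-mean-square difference between predicted and target scores."""
    preds = [w[0] + sum(wi * row[c] for wi, c in zip(w[1:], cols)) for row in rows]
    return (sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)) ** 0.5

def forward_selection(rows, y, max_features):
    """Greedily add the feature that lowers the error most (assumed procedure)."""
    chosen, remaining = [], list(range(len(rows[0])))
    while remaining and len(chosen) < max_features:
        best = min(remaining, key=lambda c: rms_error(
            rows, y, chosen + [c], fit(rows, y, chosen + [c])))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Made-up speaker features (3 per speaker); the score depends only on feature 1.
rows = [[0.2, 0.9, 0.5], [0.7, 0.8, 0.1], [0.4, 0.6, 0.9],
        [0.9, 0.5, 0.3], [0.1, 0.3, 0.7], [0.6, 0.2, 0.4]]
scores = [50 + 40 * row[1] for row in rows]
selected = forward_selection(rows, scores, max_features=1)
print(selected)
```

In a realistic setting the error inside the selection loop would be estimated on held-out speakers rather than on the training data.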
Reference material (DIA)
• 211 speakers:
– 51 normals
– 60 dysarthric
– 12 clefts (children)
– 42 hearing impaired
– 37 with laryngectomy
– 7 with dysphonia
– 2 others
• Pathological speakers : mean of 78.7 %
• Normals : mean of 93.3 %
• Few with very low score
[Figure: histogram of the human scores; x-axis: human score (0 to 100), y-axis: number of patients (0 to 60)]
Solving microphone issues
• Two microphones were used.
• Difference can be found in cepstral means (cepstral mean subtraction was performed):
[Figure: plot of the cepstral means of the recordings, with separate markers for the two microphones (shure, sony)]
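Cepstral mean subtraction removes the average of each cepstral coefficient over a recording, which cancels a constant channel offset such as a microphone difference. The sketch below is a generic illustration with made-up numbers, not the front end actually used in the SPACE tool.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean computed over one recording."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(frame[c] for frame in frames) / n_frames for c in range(n_coeffs)]
    return [[frame[c] - means[c] for c in range(n_coeffs)] for frame in frames]

# Made-up cepstral vectors (3 frames, 3 coefficients) for one recording.
recording = [[-12.0, 3.0, 1.5], [-10.0, 1.0, 0.5], [-14.0, 2.0, 1.0]]
normalized = cepstral_mean_subtraction(recording)
print(normalized)
```

After this step every coefficient has zero mean within the recording, so a fixed offset between the two microphones no longer shows up in the features.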
Training / validation
• Models chosen with five-fold cross validation
• Measure = standard deviation (STD) : in case of normality, about 68% of the computed scores lie in an interval of ±STD around the perceptual score
• More features = more chance of overfitting
• Rule of thumb : take 1 feature for every 10 training examples
– restrict number of features to maximum 15
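The pieces of this slide can be sketched as follows. Two assumptions are made: STD is taken to be the root-mean-square difference between computed and perceptual scores, and the folds are formed by a simple modulo split. The example scores are made up.

```python
import math

def std_of_error(computed, perceptual):
    """Root-mean-square difference between computed and perceptual scores (assumed STD)."""
    squared = [(c - p) ** 2 for c, p in zip(computed, perceptual)]
    return math.sqrt(sum(squared) / len(squared))

def five_fold_indices(n_speakers, n_folds=5):
    """Yield (train, validation) index lists; speaker i is validated in fold i % n_folds."""
    for fold in range(n_folds):
        validation = [i for i in range(n_speakers) if i % n_folds == fold]
        train = [i for i in range(n_speakers) if i % n_folds != fold]
        yield train, validation

def max_features(n_training, examples_per_feature=10):
    """Rule of thumb: one feature per ten training examples."""
    return n_training // examples_per_feature

# Made-up scores for four speakers.
perceptual = [90.0, 80.0, 70.0, 60.0]
computed = [93.0, 76.0, 70.0, 65.0]
std = std_of_error(computed, perceptual)

folds = list(five_fold_indices(211))
print(round(std, 2), len(folds[0][0]), max_features(len(folds[0][0])))
```

With 211 speakers, each training fold holds about 168 speakers, so the rule of thumb gives roughly 16 features, in line with the cap of 15 on the slide.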
Results : individual systems
[Figure: two scatter plots of computed score against perceptual score (both axes 20 to 120)]
PMFelis : 9.52    PMFesat : 8.57
Results : individual systems
[Figure: two scatter plots of computed score against perceptual score (both axes 20 to 120)]
PLF (elis) : 9.35    CD-PLF (elis) : 8.48
Results : all systems
Model                     STD    N (features)
PMFesat                   8.57   15
PMFelis                   9.52   15
PLF                       9.35   15
CD-PLF                    8.48   15
PMFelis + PLF             8.20   15
PMFesat + PLF             8.00   13
PMFelis + CD-PLF          7.63   15
PLF + CD-PLF              8.04   15
PMFesat + CD-PLF          7.34   15
PMFelis + PLF + CD-PLF    7.48   15
• New models with CD-PLF outperform old PLF models
• CD-PLFs form best system with one feature set
• PMFesat + CD-PLF best system with combined feature sets
• Using three ELIS feature sets yields next best result and needs only one recognizer (the simplest one): less complex system
Results : combined system
CD-PLF + PMFesat: STD = 7.34
[Figure: scatter plot of computed score against perceptual score (both axes 20 to 120)]
Results : pathology-specific IPM
• Instead of creating one general IPM, one can create IPMs for specific pathologies:
– trained on all speakers (to have enough speakers)
– model selection based on performance on speakers of that pathology (importance of features depends on type of disorder)
Results : pathology-specific IPM (2)
STD per pathology (DYS = dysarthria, LAR = laryngectomy, HEAR = hearing impaired)
Model                     DYS    LAR    HEAR
PMFesat                   8.44   8.32   7.48
PMFelis                   8.10   5.88   9.73
PLF                       8.27   7.17   8.05
CD-PLF                    6.49   5.70   6.87
PMFelis + PLF             6.97   5.14   6.63
PMFesat + PLF             6.87   6.49   6.20
PMFelis + CD-PLF          6.50   3.54   6.05
PLF + CD-PLF              6.32   5.82   6.17
PMFesat + CD-PLF          6.69   4.86   5.27
PMFelis + PLF + CD-PLF    6.32   3.68   5.73
• Very good match in case CD-PLFs are involved
• New models with CD-PLF outperform old PLF models
• CD-PLFs form best system with one feature set
• Using three ELIS feature sets yields (almost) best result and needs only one recognizer (the simplest one): less complex system
Results : pathology-specific IPM
• Dysarthria : 6.32 (red circles)
• Dispersion of other speakers is increased
• Largest deviations in low intelligibility area:
– scarce data in that area
– can be solved by adding more weight to patients with very low intelligibility
[Figure: scatter plot of computed score against perceptual score (both axes 20 to 120); dysarthric speakers shown as red circles]
Conclusions and future work
• PMF, PLF and CD-PLF can predict intelligibility of pathological speech:
– CD-PLFs seem to play an important role:
  • STD = 7.34 for general model combining CD-PLF and PMFesat
  • STDs of 6.32 or lower for pathology-specific models using 3 ELIS feature sets
  • not the articulation pattern but the change in the articulation pattern matters?
– More research is needed before adding this feature set to the tool
– Results on validation set compete with human inter-rater agreements.
• Future work:
– more profound articulatory assessment, which is directly related to determination of appropriate therapy
– monitoring of effectiveness of chosen therapy
– using more natural speech (words, phrases) in tests
• Questions?