From speech signal acoustics to perception
Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language and Communication (ACLC)
NATO-ASI “Dynamics of Speech Production and Perception”
Il Ciocco, Tuscany, Italy, July 4, 2002
July 4, 2002 From speech signal acoustics to perception, Il Ciocco 2
Overview
how do we perceive (speech) dynamics?
The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)
from psychoacoustics to speech perception
(lack of) context; robustness; continuity
V and C reduction; coarticulation
perceptual compensation for artic. undershoot?
speech efficiency
conclusions
Various scientific preferences
several biases have affected the history of (speech &) hearing research (Plomp, 2002):
  dominance of sinusoidal tones as stimuli
  preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility)
  emphasis on psychophysical (rather than cognitive) aspects of hearing
  clean stimuli in the lab rather than the acoustic reality of the outside world (disruptive sounds)
Psychoacoustics - speech perc.
duration, pitch, loudness, timbre, direction
absolute and masked threshold, jnd, discrim.
continuity
complexity (pure - complex tone, voicing)
effect of context, meaning (intell.), freq. occ.
phoneme: more text-guided than perceived
speech perceptual tasks:
  phoneme --> sent.
  identif.; discrim.; matching
Detection thresholds and jnd

simple, stationary signals
phenomenon               | threshold / jnd                | remarks
threshold of hearing     | 0 dB at 1000 Hz                | frequency dependent
threshold of duration    | constant energy at 10 – 300 ms | Energy = Power x Duration
frequency discrimination | 1.5 Hz at 1000 Hz              | more when < 200 ms
intensity discrimination | 0.5 – 1 dB                     | up to 80 dB SL
temporal discrimination  | 5 ms at 50 ms                  | duration dependent
masking                  | psychophysical tuning curve    |
pitch of complex tones   | low pitch                      | many peculiarities
gap detection            | 3 ms for wide-band noise       | more at low freq. for narrow-band noise

multi-harmonic, single-formant-like periodic signals
phenomenon        | threshold / jnd | remarks
formant frequency | 3 - 5 %         | one formant only; < 3 % with more experienced subjects
formant amplitude | 3 dB            | F2 in synthetic vowel
overall intensity | 1.5 dB          | synthetic vowel, mainly F1
formant bandwidth | 20 - 40 %       | one-formant vowel
F0 (pitch)        | 0.3 - 0.5 %     | synthetic vowel
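The constant-energy rule for the threshold of duration (Energy = Power x Duration, for roughly 10 to 300 ms) implies that the threshold power has to rise as a signal gets shorter. The sketch below only works out that arithmetic; the function name and the 300 ms reference duration are illustrative choices, not taken from the slides, and the rule is assumed to hold exactly.

```python
import math


def threshold_shift_db(duration_ms, reference_ms=300.0):
    """Level change (dB) that keeps energy constant when the duration
    changes from reference_ms to duration_ms.

    Assumption: Energy = Power x Duration holds exactly. The slides
    quote 10-300 ms as the range where this applies.
    """
    if duration_ms <= 0 or reference_ms <= 0:
        raise ValueError("durations must be positive")
    # Constant energy: P * T = P_ref * T_ref, so P / P_ref = T_ref / T.
    return 10.0 * math.log10(reference_ms / duration_ms)


for d in (300, 150, 30, 10):
    print(f"{d:>3} ms: {threshold_shift_db(d):+.1f} dB re 300 ms")
```

Under this assumption, halving the duration raises the threshold by about 3 dB, and a tenfold shortening raises it by 10 dB.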
Perceiving speech-like trans.
Ph.D thesis A. van Wieringen (1995) “Perceiving dynamic speechlike sounds. Psychoacoustics and speech perception”
see also vWie & Pols, Acustica 84 (1998) 520-528
stimulus characteristics (segmented and/or reversed)
  natural or synthetic
  tone glide; single- or multi-formant transition
  isolated trans.; initial or final trans. with steady st.
  converg. or diverg. trans. (var. duration or slope)
task: jnd/DL; matching; abs. ident.; classif.
DL for short speech-like transitions
[Figure: DL (axis 0 to 240) against transition duration (20, 30, 40, 50 ms) for tone glides, single (isolated) transitions and complex transitions, in initial and final position. Annotations: simple vs. complex; short vs. longer trans.; initial vs. final.]
Adopted from van Wieringen & Pols (1998), Acta Acustica 84, 520-528, “Discrimination of short and rapid speechlike transitions”
Perceiving (speech) dynamics
vowel perception w/ or w/o transitions?
our claims (vSon, IFA Proc. 17 (1993)):
  only evidence for compensatory processes, i.e. perceptual-overshoot and dynamic-specification, when in an appropriate context
  synthetic isolated dynamic formant tracks lead to perceptual undershoot (=averaging)
  silent center studies are ambiguous
concl.: info in formant dynamics is only used when V’s are heard in appropriate context
[Figure: schematic formant tracks (frequency in Hz against time in ms) of the stimulus tokens. Stationary (reference) tokens: 6.3, 12.5, 25, 50, 100, 150 ms. Dynamic tokens (onglide, offglide, complete): 25, 50, 100, 150 ms. Track labels: F1 = 225 Hz, F1 = -225 Hz, F2 = 375 Hz, F2 = -375 Hz.]
Vowel identification
compare V responses for dynamic stimuli with those for static stimuli
calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC)
result: responses average over the trailing part of the formant track
Perceptual undershoot
[Figure: % net shift (axis -50 to 50) against token duration (25, 50, 100, 150 ms) for the tracks labelled F1 = 225 Hz, F1 = -225 Hz, F2 = 375 Hz and F2 = -375 Hz.]
Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles.
Effect of local context
“Perisegmental speech improves consonant and vowel identification”, vSon & Pols, Speech Comm. 29,1-22 (1999)
also “Phoneme recognition as a function of task and context”, IFA Proc. 24, 27-38 (2001) and Proc. SPRAAC, 25-30 (2001)
also Pols & vSon (1993), “Acoustics and perception of dynamic vowel segments”, Speech Comm. 13, 135-147
V and C identification
gated tokens from 120 CVC speech fragments taken from a long text reading
50 ms V kernel, + V trans., + C part (L/R)
stimuli randomized; V identification (17 Ss) and Ci and Cf identification (15 Ss)
results:
  phoneme identification benefits from extra speech
  left context more beneficial than right context
  better identification when also other member of pair was identified correctly (context effect)
[Figure: construction of the gated tokens on a time axis (0 to 233 ms): a 50 ms vowel kernel, extended with the vowel transitions (steps of 10 ms and 25 ms) and with the left and/or right consonant part; token durations in ms are given in parentheses. Panels: Vowel identification; Consonant identification.]
[Figure: errors (%, axis 0 to 40) and log2 perplexity (bits, axis 0.0 to 1.5) per stimulus type (Kernel, VC, V, CV, CVC), shown for All, + Accent and - Accent tokens (N = 2040, 1003, 1037); significance marked with + and *.]
Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/α-a:, -o:/) are ignored
[Figure: errors (%, axis 0 to 70) for V and C in VC and CV tokens, split by whether the other segment was identified correctly or in error; N = 1680.]
V and C in CV tokens were identified better when the other member of the pair was identified correctly
Effect of (lack of) context
100 Dutch listeners identifying V segments
“Vowel contrast reduction”, K-vBeinum (1980)

3 conditions             | measure | M1   | M2   | F1   | F2   | Av.
isolated V (3)           | %       | 95.2 | 88.9 | 88.0 | 86.4 | 89.6
                         | ASC     | 433  | 404  | 447  | 634  | 480
words (5)                | %       | 88.1 | 78.8 | 84.9 | 85.3 | 84.3
                         | ASC     | 406  | 320  | 374  | 529  | 407
unstr., free conv. (10)  | %       | 31.2 | 28.7 | 33.3 | 38.9 | 33.0
                         | ASC     | 174  | 119  | 209  | 255  | 189
$$\mathrm{ASC} = \frac{1}{n}\sum_{i=1}^{n}\left|LF_i - \overline{LF}\right|^2 \ \text{(total variance)}, \qquad LF_i = 100\,\log_{10} F_i$$
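The ASC measure can be sketched in a few lines. The sketch makes two assumptions that the slide does not spell out: each vowel i is represented by a vector of formant frequencies (here F1 and F2), and the term subtracted from LFi is the mean over all n vowels, which is what "total variance" suggests. The function name and the formant values are hypothetical, for illustration only.

```python
import math


def asc(vowels_hz):
    """Mean squared distance of the vowels to their centroid in a
    100*log10(Hz) formant space.

    vowels_hz: list of tuples of formant frequencies in Hz, one tuple
    per vowel, e.g. (F1, F2).
    Assumption: LF_i is the log-scaled formant vector of vowel i and
    the subtracted term is the mean over all vowels.
    """
    if not vowels_hz:
        raise ValueError("need at least one vowel")
    points = [tuple(100.0 * math.log10(f) for f in v) for v in vowels_hz]
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    # Sum of squared Euclidean distances to the centroid, divided by n.
    total = sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
    )
    return total / n


# Hypothetical (F1, F2) values in Hz; not data from the study.
spread_out = [(300, 2300), (750, 1300), (350, 800)]
reduced = [(450, 1700), (600, 1400), (450, 1100)]
print(asc(spread_out), asc(reduced))
```

A vowel set that is spread widely over the formant space gives a larger value than a set that has collapsed toward its centre, in line with the drop in ASC from isolated vowels to unstressed vowels in free conversation in the table.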
Human word intelligibility vs. noise
from Ph.D thesis H. Steeneken (1992) ‘On measuring and predicting speech intelligibility’
Robustness to degraded speech
speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
  prerequisite for digital hearing aid
  modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
  ca. 4 Hz max. in modulation spectrum: syllable
  LP>4 Hz and HP<8 Hz little effect on intelligibility
spectral envelope smearing
  for BW>1/3 oct masked SRT starts to degrade
(for references, see keynote paper Pols in Proc. ICPhS’99)
Some examples
partly reversed speech (Saberi & Perrott, Nature, 4/99)
  fixed duration segments time reversed or shifted in time
  perfect sentence intelligibility up to 50 ms
  (demo: every 50 ms reversed; original)
  low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
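The "partly reversed speech" manipulation (fixed-duration segments reversed in time) can be sketched as follows. This is only an illustration of the idea, not the stimulus-generation procedure of the cited study; the function name and parameters are hypothetical, and the signal is taken to be a plain list of samples.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms=50.0):
    """Time-reverse each consecutive fixed-duration segment.

    The order of the segments is kept; only the samples inside each
    segment are reversed. A final shorter segment is reversed as well.
    """
    seg_len = int(round(sample_rate_hz * segment_ms / 1000.0))
    if seg_len < 1:
        raise ValueError("segment must span at least one sample")
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out


# Toy signal: 10 samples at 1000 Hz, reversed in 4 ms (4-sample) segments.
toy = list(range(10))
print(reverse_segments(toy, 1000, segment_ms=4))
# -> [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

Applying the function twice with the same settings restores the original signal, which is a quick check that only the within-segment order is changed.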
Continuity, especially while masked
continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)
  also for gliding tones
  also for complex tones
  also for pitch
fission, fusion
segregation, streaming
phonemic restoration
[Figure: schematic of tone frequencies (500, 900, 1200, 2000 Hz) against time.]
V and C reduction, coarticulation
spectral variability is not random but, at least partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
  duration; spectral balance
  intervocalic sound energy difference
  F2 slope difference; locus equation
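A locus equation is a straight-line fit of F2 at the consonant-vowel boundary against F2 in the vowel, with the slope often read as an index of coarticulation. The sketch below fits such a line by ordinary least squares; the function name is hypothetical and the formant values are made up for illustration, not taken from the study.

```python
def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept.

    Returns (slope, intercept). A slope near 1 means the onset follows
    the vowel closely; a slope near 0 means a fixed onset frequency.
    """
    n = len(f2_vowel_hz)
    if n < 2 or n != len(f2_onset_hz):
        raise ValueError("need two equally long lists with >= 2 points")
    mean_x = sum(f2_vowel_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel_hz)
    if sxx == 0:
        raise ValueError("F2 vowel values must not all be equal")
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(f2_vowel_hz, f2_onset_hz)
    )
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical F2 measurements in Hz, for illustration only.
f2_vowel = [900, 1300, 1700, 2100]
f2_onset = [1350, 1550, 1750, 1950]
slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

Comparing the fitted slopes for read and spontaneous speech, or for stressed and unstressed syllables, is one way to express the "F2 slope difference" named in the list above.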
[Figure, two panels, read vs. spontaneous speech for stressed, unstressed and total. C-duration: mean consonant duration (ms, axis 45 to 65), p: 0.001, 0.006, 0.001. C error rate: mean error rate for C identification (%, axis 0 to 35), p: 0.001, 0.001, 0.001.]
Adopted from van Son & Pols (Eurospeech’97)
791 VCV pairs (read & spontan.; stressed & unstr. segments; one male); C-identification by 22 Dutch subjects
Perception of ac. V reduction
Ph.D thesis Dick van Bergem (1995) “Acoustic and lexical vowel reduction”
lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
acoustic V reduction: Du ‘miljoen’ as /mIljun/ or as /m@ljun/
identify the unstressed vowels (as V or @) by 20 listeners (8 M, 12 F) in 47 words (cond. W and S) or 20 words (cond. P), like ‘milJOEN’ or ‘biosCOOP’, spoken by 20 male speakers (2280 stimuli)
[Figure: 4 reduction stages for 20 speakers, with % schwa responses on /I/ by 20 listeners (5%, 36%, 60%, 69%) and the model prediction for schwa in this m-l context; adapted from vBergem (1995).]
Conclusion: Vowel reduction is not centralization but contextual assimilation
Speech efficiency
speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ‘96)
less information needed for more predictable things:
  shorter duration and more spectral reduction for high-frequent syllables and words
  C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)
I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
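The information content I(x) = -log2(Prob(x)) can be estimated from relative frequencies in a corpus. The sketch below assumes that Prob(x) is simply the relative frequency of a syllable or word; the function name and the toy token list are hypothetical and are not the corpus used in the study.

```python
import math
from collections import Counter


def information_bits(tokens):
    """Return {item: I(item)} with I(x) = -log2(Prob(x)) in bits.

    Assumption: Prob(x) is estimated as the relative frequency of x
    in the given token list.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one token")
    return {x: -math.log2(c / total) for x, c in counts.items()}


# Toy syllable tokens, for illustration only.
toy = ["de", "de", "de", "de", "ver", "ver", "schap", "ge"]
for syllable, bits in sorted(information_bits(toy).items(), key=lambda kv: kv[1]):
    print(f"{syllable}: {bits:.1f} bits")
```

Frequent items carry few bits and rare items carry many, which is the sense in which less acoustic information is needed for more predictable syllables and words.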
Correlation between consonant confusion and 4 measures indicated
[Figure: correlation coefficient (axis 0 to -0.40) between consonant confusion and Duration, CoG, I(syllable) and I(word), grouped as Read -, Read +, Spont -, Spont + and All; significance marked with + and *.]
Adopted from van Son et al. (Proc. ICSLP’98)
Dutch male sp.
20 min. R/S
12 k syll.
8k words
791 VCV R/S
- 308 lex. str.
- 483 unstr.
C ident. 22 Ss
Conclusions
perceiving speech (segments) very much depends on speech quality and context
presenting segments in isolation is also a kind of context
only ‘proper’ interpretation of formant transitions (perceptual compensation for spectro-temporal undershoot) when presented in an appropriate context
reduced V are best perceived as schwa if transitions are contextually assimilated