
From speech signal acoustics to perception

Louis C.W. Pols

Institute of Phonetic Sciences (IFA)

Amsterdam Center for Language and Communication (ACLC)

NATO-ASI “Dynamics of Speech Production and Perception”

Il Ciocco, Tuscany, Italy, July 4, 2002


Overview

how do we perceive (speech) dynamics?
The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)
from psychoacoustics to speech perception
(lack of) context; robustness; continuity
V and C reduction; coarticulation
perceptual compensation for artic. undershoot?
speech efficiency
conclusions


Various scientific preferences

several biases have affected the history of (speech &) hearing research (Plomp, 2002):
  dominance of sinusoidal tones as stimuli
  preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility)
  emphasis on psychophysical (rather than cognitive) aspects of hearing
  clean stimuli in the lab rather than the acoustic reality of the outside world (disruptive sounds)


Psychoacoustics - speech perc.

duration, pitch, loudness, timbre, direction
absolute and masked threshold, jnd, discrim.
continuity
complexity (pure - complex tone, voicing)
effect of context, meaning (intell.), freq. occ.
phoneme: more text-guided than perceived
speech perceptual tasks:
  phoneme -> sent.
  identif.; discrim.; matching


phenomenon                 threshold / jnd                  remarks
threshold of hearing       0 dB at 1000 Hz                  frequency dependent
threshold of duration      constant energy at 10 - 300 ms   Energy = Power x Duration
frequency discrimination   1.5 Hz at 1000 Hz                more when < 200 ms
intensity discrimination   0.5 - 1 dB                       up to 80 dB SL
temporal discrimination    5 ms at 50 ms                    duration dependent
masking                    psychophysical tuning curve
pitch of complex tones     low pitch                        many peculiarities
gap detection              3 ms for wide-band noise         more at low freq. for narrow-band noise

phenomenon                 threshold / jnd                  remarks
formant frequency          3 - 5 %                          one formant only; < 3 % with more experienced subjects
formant amplitude          3 dB                             F2 in synthetic vowel
overall intensity          1.5 dB                           synthetic vowel, mainly F1
formant bandwidth          20 - 40 %                        one-formant vowel
F0 (pitch)                 0.3 - 0.5 %                      synthetic vowel
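The remark "Energy = Power x Duration" for the threshold of duration can be made concrete with a small calculation. The sketch below assumes that threshold energy is exactly constant between 10 and 300 ms, so that halving the duration raises the threshold power level by about 3 dB. The function name and the 300 ms reference are illustrative choices, not taken from the slides.

```python
import math


def threshold_level_shift_db(duration_ms, reference_ms=300.0):
    """Threshold power level (dB) relative to the level at reference_ms.

    Assumes constant threshold energy (Energy = Power x Duration),
    which the table states only for durations of 10 to 300 ms.
    """
    if not (10.0 <= duration_ms <= 300.0):
        raise ValueError("constant-energy range is 10 to 300 ms")
    # Constant energy: P(d) * d = P(ref) * ref, so P(d) / P(ref) = ref / d.
    return 10.0 * math.log10(reference_ms / duration_ms)


for d in (300, 150, 75, 30, 10):
    print(f"{d:>4} ms: {threshold_level_shift_db(d):+.1f} dB re 300 ms")
```

Under this assumption each halving of duration costs roughly 3 dB, which is the usual way the trade-off is summarized.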

Detection thresholds and jnd

simple, stationary signals; multi-harmonic, single-formant-like periodic signals

[Figure: schematic signal spectrum annotated with jnd values 3 - 5%, 1.5 Hz, and 20 - 40%; labels: frequency, F2, BW]


Perceiving speech-like trans.

Ph.D thesis A. van Wieringen (1995) “Perceiving dynamic speechlike sounds. Psychoacoustics and speech perception”
see also vWie & Pols, Acustica 84 (1998) 520-528

stimulus characteristics (segmented and/or reversed)
  natural or synthetic
  tone glide; single- or multi-formant transition
  isolated trans.; initial or final trans. with steady st.
  converg. or diverg. trans. (var. duration or slope)

task: jnd/DL; matching; abs. ident.; classif.


DL for short speech-like transitions

[Figure: DL (axis 0 to 240) as a function of transition duration (20, 30, 40, 50 ms) for tone glides, single-isolated, single, and complex transitions; initial vs. final transitions; simple vs. complex; short vs. longer trans.]

Adopted from van Wieringen & Pols (1998), Acta Acustica 84, 520-528, “Discrimination of short and rapid speechlike transitions”


Perceiving (speech) dynamics

vowel perception with or without transitions?
our claims (vSon, IFA Proc. 17 (1993)):
  only evidence for compensatory processes, i.e. perceptual-overshoot and dynamic-specification, when in an appropriate context
  synthetic isolated dynamic formant tracks lead to perceptual undershoot (=averaging)
  silent center studies are ambiguous
concl.: info in formant dynamics is only used when V’s are heard in appropriate context

[Figure: schematic F1 and F2 tracks (frequency in Hz vs. time in ms) for the stationary (reference) tokens and the dynamic tokens (onglide, offglide, complete), with formant excursions of +225 and -225 Hz for F1 and +375 and -375 Hz for F2; token durations 6.3, 12.5, 25, 50, 100, 150 ms and 25, 50, 100, 150 ms]


Vowel identification

compare V responses for dynamic stimuli with those for static stimuli

calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC)

result: responses average over the trailing part of the formant track
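The idea of "averaging over the trailing part of the formant track" can be sketched numerically. The example below is only an illustration: the size of the trailing window (here a fraction of the track) is an assumption, since the slide does not say how long the trailing part is, and the track values are hypothetical.

```python
def trailing_average(track_hz, fraction=0.5):
    """Mean formant frequency over the trailing part of a track.

    `fraction` is a hypothetical parameter: the share of the track,
    counted from the end, that is averaged.
    """
    if not track_hz:
        raise ValueError("empty formant track")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = max(1, round(len(track_hz) * fraction))
    tail = track_hz[-n:]
    return sum(tail) / len(tail)


# Hypothetical onglide: F1 rises from 500 Hz to a 725 Hz target.
onglide_f1 = [500.0, 575.0, 650.0, 725.0]
print(trailing_average(onglide_f1))  # lies below the 725 Hz end point
```

Because the average includes frequencies from before the track reaches its end point, the result falls short of the target, which is what perceptual undershoot amounts to in this simplified picture.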


[Figure: net shift (%, axis -50 to 50) in vowel responses as a function of token duration (25, 50, 100, 150 ms), for F1 excursions of +225 and -225 Hz and F2 excursions of +375 and -375 Hz]

Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles.

Perceptual undershoot


Effect of local context

“Perisegmental speech improves consonant and vowel identification”, vSon & Pols, Speech Comm. 29,1-22 (1999)

also “Phoneme recognition as a function of task and context”, IFA Proc. 24, 27-38 (2001) and Proc. SPRAAC, 25-30 (2001)

also Pols & vSon (1993), “Acoustics and perception of dynamic vowel segments”, Speech Comm. 13, 135-147


V and C identification

gated tokens from 120 CVC speech fragments taken from a long text reading

50 ms V kernel, + V trans., + C part (L/R)
stimuli randomized; V identification (17 Ss) and Ci and Cf identification (15 Ss)
results:
  phoneme identification benefits from extra speech
  left context more beneficial than right context
  better identification when also other member of pair was identified correctly (context effect)

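[Figure: construction of the gated tokens from a CVC fragment on a time axis (0 to 233 ms): a 50 ms vowel kernel, extended with the vowel transitions (margins of +10 and -10 ms, +25 ms) and with the consonant parts, giving token types Kernel, V, CV, VC, and CVC; token durations between 41 and 152 ms are given in parentheses]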

[Figure: vowel identification and consonant identification; errors (%, axis 0 to 40) per stimulus type (Kernel, V, CV, VC, CVC) for All, + Accent, and - Accent tokens (N = 2040, 1003, 1037), with log2 perplexity (bits, axis 0.0 to 1.5); significance marked with + and *]

Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/α-a:, -o:/) are ignored

[Figure: errors (%, axis 0 to 70) for V and C identification in VC and CV tokens, split by whether the other segment is identified correctly or in error; N = 1680]

V and C in CV tokens were identified better when the other member of the pair was identified correctly


Effect of (lack of) context

100 Dutch listeners identifying V segments
“Vowel contrast reduction”, K-vBeinum (1980)

3 conditions                     M1     M2     F1     F2     Av.
isolated V (3)           %       95.2   88.9   88.0   86.4   89.6
                         ASC     433    404    447    634    480
words (5)                %       88.1   78.8   84.9   85.3   84.3
                         ASC     406    320    374    529    407
unstr., free conv. (10)  %       31.2   28.7   33.3   38.9   33.0
                         ASC     174    119    209    255    189

$ASC = \frac{1}{n} \sum_{i=1}^{n} |LF_i - \overline{LF_i}|^2$ (total variance), $LF_i = 100 \cdot {}^{10}\log F_i$
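The ASC measure can be computed in a few lines. The sketch below makes two assumptions that the slide does not spell out: each vowel is treated as a point of log-scaled formant values (here F1 and F2), and the subtracted term is the centroid of all vowel points, as "total variance" suggests. The formant values are hypothetical, not data from the study.

```python
import math


def asc(formants_hz):
    """Mean squared distance of vowel points to their centroid.

    Each vowel is a tuple of formant frequencies in Hz, rescaled as
    LF = 100 * log10(F). Using the centroid as reference is an
    assumption based on the "total variance" remark.
    """
    points = [tuple(100.0 * math.log10(f) for f in vowel) for vowel in formants_hz]
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dims)]
    total = sum(sum((p[k] - centroid[k]) ** 2 for k in range(dims)) for p in points)
    return total / n


# Hypothetical (F1, F2) values for three vowels in two speaking conditions.
clear_vowels = [(280.0, 2250.0), (750.0, 1300.0), (310.0, 800.0)]
reduced_vowels = [(350.0, 1900.0), (600.0, 1350.0), (380.0, 1000.0)]
print(round(asc(clear_vowels)), round(asc(reduced_vowels)))
```

A vowel system that shrinks toward its centroid gives a smaller ASC, which matches the drop in the table from isolated vowels to unstressed vowels in free conversation.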


Human word intelligibility vs. noise

from Ph.D thesis H. Steeneken (1992) ‘On measuring and predicting speech intelligibility’


Robustness to degraded speech

speech = time-modulated signal in frequency bands

relatively insensitive to (spectral) distortions
  prerequisite for digital hearing aid
  modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
  ca. 4 Hz max. in modulation spectrum; syllable
  LP>4 Hz and HP<8 Hz little effect on intelligibility
spectral envelope smearing
  for BW>1/3 oct masked SRT starts to degrade

(for references, see keynote paper Pols in Proc. ICPhS’99)


Some examples

partly reversed speech (Saberi & Perrott, Nature, 4/99)
  fixed duration segments time reversed or shifted in time
  perfect sentence intelligibility up to 50 ms
  (demo: every 50 ms reversed; original)
  low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
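The "fixed duration segments time reversed" manipulation is easy to state in code. The sketch below works on a plain list of samples rather than an audio file, and the function name and parameters are illustrative assumptions, not the procedure used in the cited study.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms):
    """Reverse the samples inside each consecutive fixed-duration segment.

    The order of the segments themselves is kept, so the slow
    envelope of the signal is largely preserved.
    """
    seg_len = max(1, round(sample_rate_hz * segment_ms / 1000.0))
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out


# Toy signal: 10 samples at 1000 Hz, reversed in 4 ms (4-sample) segments.
signal = list(range(10))
print(reverse_segments(signal, 1000, 4))
```

Applying the function twice with the same segment length returns the original signal, which is a convenient sanity check when preparing such stimuli.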


Continuity, especially while masked

continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)

also for gliding tones
also for complex tones
also for pitch

fission, fusion
segregation, streaming

phonemic restoration

[Figure: tones at 500, 900, 1200, and 2000 Hz plotted against time]


V and C reduction, coarticulation

spectral variability is not random but, at least partly, speaker-, style-, and context-specific

read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
  duration; spectral balance
  intervocalic sound energy difference
  F2 slope difference; locus equation

[Figure, two panels comparing read and spontaneous speech for stressed, unstressed, and total segments. Mean consonant duration (ms, axis 45 to 65), p values 0.001, 0.006, 0.001. Mean error rate for C identification (%, axis 0 to 35), p values 0.001, 0.001, 0.001.]

Adopted from van Son & Pols (Eurospeech’97)

791 VCV pairs (read & spontan.; stressed & unstr. segments; one male); C-identification by 22 Dutch subjects
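The locus equation listed among the consonant measures is commonly defined as a straight-line fit of F2 at the consonant-vowel boundary against F2 in the vowel. The slides give no formula, so the sketch below assumes that usual definition; the function name and the measurements are hypothetical.

```python
def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares fit of F2_onset = k * F2_vowel + c; returns (k, c)."""
    n = len(f2_vowel_hz)
    if n != len(f2_onset_hz) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(f2_vowel_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel_hz)
    if sxx == 0:
        raise ValueError("vowel F2 values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(f2_vowel_hz, f2_onset_hz))
    k = sxy / sxx
    c = mean_y - k * mean_x
    return k, c


# Hypothetical F2 values (Hz) for one consonant in four vowel contexts.
f2_vowel = [900.0, 1300.0, 1700.0, 2100.0]
f2_onset = [1350.0, 1550.0, 1750.0, 1950.0]
slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope {slope:.2f}, intercept {intercept:.0f} Hz")
```

A slope near 1 means the consonant onset follows the vowel closely (strong coarticulation), while a slope near 0 means a fixed locus; comparing slopes between speaking styles is one way such a measure can be used.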


Perception of ac. V reduction

Ph.D thesis Dick van Bergem (1995) “Acoustic and lexical vowel reduction”

lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
acoustic V reduction: Du ‘miljoen’ as /mIljun/ or as /m@ljun/
identify the unstressed vowels (as V or @) by 20 listeners (8 M, 12 F) in 47 words (cond. W and S) or 20 words (cond. P), like ‘milJOEN’ or ‘biosCOOP’, spoken by 20 male speakers (2280 stimuli)

[Figure: 4 reduction stages for 20 speakers, with % schwa responses on /I/ by 20 listeners (5%, 36%, 60%, 69%) and the model prediction for schwa in this m-l context; adapted from vBergem (1995)]

Conclusion: Vowel reduction is not centralization but contextual assimilation


Speech efficiency

speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ‘96)

less information needed for more predictable things:
  shorter duration and more spectral reduction for high-frequent syllables and words

C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)
  I(x) = -log2(Prob(x)) in bits

(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
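The information content I(x) = -log2(Prob(x)) can be estimated from frequency counts. The sketch below assumes that Prob(x) is the relative frequency of an item in a corpus; the function name and the counts are hypothetical and do not come from the cited material.

```python
import math


def information_bits(item, counts):
    """I(x) = -log2(Prob(x)), with Prob(x) estimated as a relative frequency."""
    total = sum(counts.values())
    if counts.get(item, 0) <= 0 or total <= 0:
        raise ValueError("item must occur at least once in the counts")
    return -math.log2(counts[item] / total)


# Hypothetical word counts: frequent items carry fewer bits.
word_counts = {"de": 4, "het": 2, "fiets": 1, "miljoen": 1}
for word in word_counts:
    print(f"{word:>8}: {information_bits(word, word_counts):.1f} bits")
```

With these counts the most frequent word carries 1 bit and the rarest words 3 bits, which illustrates why highly predictable syllables and words can be reduced more.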

Correlation between consonant confusion and 4 measures indicated

[Figure: correlation coefficient (axis 0 to -0.40) between consonant confusion and Duration, CoG, I(syllable), and I(word), for the conditions Read -, Read +, Spont -, Spont +, and All; significance marked with + and *]

Adopted from van Son et al. (Proc. ICSLP’98)

Material: Dutch male sp.; 20 min. R/S; 12 k syll.; 8 k words; 791 VCV R/S (308 lex. str., 483 unstr.); C ident. 22 Ss


Conclusions

perceiving speech (segments) very much depends on speech quality and context

presenting segments in isolation is also a kind of context

only ‘proper’ interpretation of formant transitions (perceptual compensation for spectro-temporal undershoot) when presented in an appropriate context

reduced V are best perceived as schwa if transitions are contextually assimilated
