
From speech signal acoustics to perception

Louis C.W. Pols

Institute of Phonetic Sciences (IFA)

Amsterdam Center for Language and Communication (ACLC)

NATO-ASI “Dynamics of Speech Production and Perception”

Il Ciocco, Tuscany, Italy, July 4, 2002


Overview

how do we perceive (speech) dynamics?

The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)

from psychoacoustics to speech perception

(lack of) context; robustness; continuity

V and C reduction; coarticulation

perceptual compensation for artic. undershoot?

speech efficiency

conclusions


Various scientific preferences

several biases have affected the history of (speech &) hearing research (Plomp, 2002):

dominance of sinusoidal tones as stimuli

preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility)

emphasis on psychophysical (rather than cognitive) aspects of hearing

clean stimuli in the lab rather than the acoustic reality of the outside world (disruptive sounds)


Psychoacoustics - speech perc.

duration, pitch, loudness, timbre, direction

absolute and masked threshold, jnd, discrim.

continuity

complexity (pure - complex tone, voicing)

effect of context, meaning (intell.), freq. occ.

phoneme: more text-guided than perceived

speech perceptual tasks: phoneme -> sent.; identif.; discrim.; matching


Detection thresholds and jnd

simple, stationary signals

phenomenon | threshold / jnd | remarks
threshold of hearing | 0 dB at 1000 Hz | frequency dependent
threshold of duration | constant energy at 10 – 300 ms | Energy = Power x Duration
frequency discrimination | 1.5 Hz at 1000 Hz | more when < 200 ms
intensity discrimination | 0.5 – 1 dB | up to 80 dB SL
temporal discrimination | 5 ms at 50 ms | duration dependent
masking | psychophysical tuning curve |
pitch of complex tones | low pitch | many peculiarities
gap detection | 3 ms for wide-band noise | more at low freq. for narrow-band noise

multi-harmonic, single-formant-like periodic signals

phenomenon | threshold / jnd | remarks
formant frequency | 3 - 5 % | one formant only; < 3 % with more experienced subjects
formant amplitude | 3 dB | F2 in synthetic vowel
overall intensity | 1.5 dB | synthetic vowel, mainly F1
formant bandwidth | 20 - 40 % | one-formant vowel
F0 (pitch) | 0.3 - 0.5 % | synthetic vowel

[Figure: sketch marking the jnd values for frequency (1.5 Hz), F2 (3 - 5 %), and bandwidth BW (20 - 40 %)]
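The "threshold of duration" entry states that between about 10 and 300 ms detection happens at constant energy, with Energy = Power x Duration. A minimal Python sketch of that trade-off is given here. The function name, the 300 ms reference, and the clamping of durations outside the 10 to 300 ms range are illustrative assumptions, not values or procedures taken from the slide.

```python
import math

def threshold_shift_db(duration_ms, reference_ms=300.0):
    """Change in threshold power (dB) relative to a tone of reference_ms,
    assuming detection at constant energy (Energy = Power x Duration).

    The constant-energy rule is applied only inside the 10-300 ms range
    quoted in the table; durations outside it are clamped (an assumption).
    """
    d = min(max(duration_ms, 10.0), 300.0)
    ref = min(max(reference_ms, 10.0), 300.0)
    # Constant energy: P * d = P_ref * ref  ->  P / P_ref = ref / d
    return 10.0 * math.log10(ref / d)

for d in (300, 150, 75, 30, 10):
    print(f"{d:>4} ms: threshold power {threshold_shift_db(d):+.1f} dB re 300 ms")
```

Under this rule each halving of the duration raises the threshold power by about 3 dB, and a tenfold shorter tone needs 10 dB more power.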


Perceiving speech-like transitions

Ph.D thesis A. van Wieringen (1995), “Perceiving dynamic speechlike sounds. Psychoacoustics and speech perception”

see also van Wieringen & Pols, Acustica 84 (1998), 520-528

stimulus characteristics:

(segmented and/or reversed) natural or synthetic

tone glide; single- or multi-formant transition

isolated trans.; initial or final trans. with steady state

converging or diverging trans. (variable duration or slope)

task: jnd/DL; matching; abs. ident.; classif.


DL for short speech-like transitions

[Figure: DL (axis 0 to 240) versus transition duration (20 to 50 ms) for tone glides and for single and complex transitions, including single isolated transitions; annotations contrast complex vs. simple stimuli, short vs. longer transitions, and initial vs. final position]

Adapted from van Wieringen & Pols (1998), Acta Acustica 84, 520-528, “Discrimination of short and rapid speechlike transitions”


Perceiving (speech) dynamics

vowel perception with or without transitions?

our claims (van Son, IFA Proc. 17 (1993)):

only evidence for compensatory processes, i.e. perceptual overshoot and dynamic specification, when in an appropriate context

synthetic isolated dynamic formant tracks lead to perceptual undershoot (= averaging)

silent-center studies are ambiguous

conclusion: information in formant dynamics is only used when vowels are heard in an appropriate context


[Figure: schematic F1 and F2 tracks (frequency in Hz versus time in ms) for the stationary (reference) tokens and the dynamic tokens (onglide, offglide, complete), labeled F1 = 225 Hz, F1 = -225 Hz, F2 = 375 Hz, and F2 = -375 Hz; duration labels 6.3, 12.5, 25, 50, 100, 150 ms and 25, 50, 100, 150 ms]


Vowel identification

compare V responses for dynamic stimuli with those for static stimuli

calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC)

result: responses average over the trailing part of the formant track
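One way to read "net shift" is as a signed percentage: the share of responses that moved in one direction relative to the matching stationary token, minus the share that moved the other way. The sketch below follows that reading only. The +1 / -1 / 0 coding of each response and the example numbers are assumptions made for illustration; they are not the scoring procedure or the data of the study.

```python
def net_shift_percent(shifts):
    """Signed net shift in percent.

    shifts: one entry per response to a dynamic token, scored against the
    response to the matching stationary token (hypothetical coding):
      +1  response moved in one chosen direction
      -1  response moved in the opposite direction
       0  same vowel response as for the stationary token
    """
    if not shifts:
        raise ValueError("no responses")
    plus = sum(1 for s in shifts if s > 0)
    minus = sum(1 for s in shifts if s < 0)
    return 100.0 * (plus - minus) / len(shifts)

# Made-up example: 20 responses, 3 coded +1, 9 coded -1, 8 unchanged
example = [+1] * 3 + [-1] * 9 + [0] * 8
print(net_shift_percent(example))  # -30.0
```

Computing this value separately for onglide, complete, and offglide tokens gives one signed number per token type and duration.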


Perceptual undershoot

[Figure: % net shift (axis -50 to 50) versus token duration (25, 50, 100, 150 ms), with curves labeled F1 = 225 Hz, F1 = -225 Hz, F2 = 375 Hz, and F2 = -375 Hz]

Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles.


Effect of local context

“Perisegmental speech improves consonant and vowel identification”, van Son & Pols, Speech Comm. 29, 1-22 (1999)

also “Phoneme recognition as a function of task and context”, IFA Proc. 24, 27-38 (2001) and Proc. SPRAAC, 25-30 (2001)

also Pols & vSon (1993), “Acoustics and perception of dynamic vowel segments”, Speech Comm. 13, 135-147


V and C identification

gated tokens from 120 CVC speech fragments taken from a long text reading

50 ms V kernel, + V trans., + C part (L/R)

stimuli randomized; V identification (17 Ss) and Ci and Cf identification (15 Ss)

results:

phoneme identification benefits from extra speech

left context more beneficial than right context

better identification when the other member of the pair was also identified correctly (context effect)


[Figure: gating scheme for the stimulus tokens on a time axis (0 to 233 ms), built around the 50 ms vowel kernel with the transitions and consonant parts added on either side (Kernel, V, CV, VC, CVC, and consonant token types), with offsets of ±10 ms and +25 ms marked; separate panels for vowel identification and consonant identification]


[Figure: vowel identification errors (%, axis 0 to 40) and log2 perplexity (bits, axis 0.0 to 1.5) per stimulus type (Kernel, V, CV, VC, CVC), for all tokens (N = 2040), + Accent (N = 1003), and – Accent (N = 1037); significance marked with + and *]

Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/ɑ-a:, ɔ-o:/) are ignored.


[Figure: errors (%, axis 0 to 70) for V and C in VC and CV tokens, split by whether the other segment was identified in error or correctly; N = 1680]

V and C in CV tokens were identified better when the other member of the pair was identified correctly


Effect of (lack of) context

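100 Dutch listeners identifying V segments

“Vowel contrast reduction”, Koopmans-van Beinum (1980)

3 conditions | measure | M1 | M2 | F1 | F2 | Av.
isolated V (3) | % | 95.2 | 88.9 | 88.0 | 86.4 | 89.6
isolated V (3) | ASC | 433 | 404 | 447 | 634 | 480
words (5) | % | 88.1 | 78.8 | 84.9 | 85.3 | 84.3
words (5) | ASC | 406 | 320 | 374 | 529 | 407
unstr., free conv. (10) | % | 31.2 | 28.7 | 33.3 | 38.9 | 33.0
unstr., free conv. (10) | ASC | 174 | 119 | 209 | 255 | 189

$$\mathrm{ASC} = \frac{1}{n} \sum_{i=1}^{n} \left| LF_i - \overline{LF} \right|^2 \quad \text{(total variance)}, \qquad LF_i = 100 \cdot {}^{10}\!\log F_i$$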
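The ASC measure on this slide is a total variance of log-scaled formant values, with LF = 100 times the base-10 logarithm of F. A minimal Python sketch of that computation follows. Treating each LF_i as a vector over formants (for example F1 and F2) and the example formant values are assumptions for illustration; they are not data from the study.

```python
import math

def asc(formants_hz):
    """Mean squared distance of the vowels from their centroid, with each
    formant expressed as LF = 100 * log10(F).

    formants_hz: list of per-vowel formant tuples, e.g. (F1, F2) in Hz.
    Treating LF_i as a vector over formants is an assumption of this sketch.
    """
    if not formants_hz:
        raise ValueError("need at least one vowel")
    lf = [[100.0 * math.log10(f) for f in vowel] for vowel in formants_hz]
    n = len(lf)
    dims = len(lf[0])
    centroid = [sum(v[k] for v in lf) / n for k in range(dims)]
    return sum(
        sum((v[k] - centroid[k]) ** 2 for k in range(dims)) for v in lf
    ) / n

# Hypothetical (F1, F2) values in Hz for a spread-out and a reduced vowel set
clear = [(300, 2300), (750, 1300), (320, 800)]
reduced = [(400, 1800), (600, 1400), (420, 1100)]
print(round(asc(clear)), round(asc(reduced)))
```

A vowel set that is pulled toward the middle of the formant space gives a smaller value, which is the direction of the change from isolated vowels to unstressed vowels in free conversation in the table.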


Human word intelligibility vs. noise

from Ph.D thesis H. Steeneken (1992), ‘On measuring and predicting speech intelligibility’


Robustness to degraded speech

speech = time-modulated signal in frequency bands

relatively insensitive to (spectral) distortions

prerequisite for digital hearing aid

modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz

temporal smearing of envelope modulation

ca. 4 Hz max. in modulation spectrum (syllable rate)

LP > 4 Hz and HP < 8 Hz: little effect on intelligibility

spectral envelope smearing

for BW > 1/3 oct masked SRT starts to degrade

(for references, see keynote paper Pols in Proc. ICPhS’99)


Some examples

partly reversed speech (Saberi & Perrott, Nature, 4/99)

fixed duration segments time reversed or shifted in time

perfect sentence intelligibility up to 50 ms

(demo: every 50 ms reversed; original)

low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum

syllable as information unit? (S. Greenberg)

gap and click restoration (Warren)

gating experiments
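The partly reversed speech manipulation (fixed-duration segments reversed in time) can be sketched in a few lines of Python. This is a minimal illustration on a plain list of samples: the function name, the handling of the final shorter segment, and the toy signal are assumptions, and no audio file input or output is included.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms=50.0):
    """Return a copy of samples in which every consecutive segment of
    segment_ms is played backwards; the segment order is unchanged.
    A final, shorter segment is reversed as well (an assumption).
    """
    seg_len = int(round(sample_rate_hz * segment_ms / 1000.0))
    if seg_len < 1:
        raise ValueError("segment shorter than one sample")
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy signal: 10 samples at 100 Hz, reversed in 50 ms (5-sample) segments
print(reverse_segments(list(range(10)), 100, 50))
# [4, 3, 2, 1, 0, 9, 8, 7, 6, 5]
```

Applying the function twice with the same settings restores the original sample order, which is a quick sanity check on the segmentation.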


Continuity, especially while masked

continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)

also for gliding tones

also for complex tones

also for pitch

fission, fusion

segregation, streaming

phonemic restoration

[Figure: schematic with a frequency axis marked at 500, 900, 1200, and 2000 Hz, plotted against time]


V and C reduction, coarticulation

spectral variability is not random but, at least partly, speaker-, style-, and context-specific

read - spontaneous; stressed - unstressed

not just for vowels, but also for consonants

duration; spectral balance

intervocalic sound energy difference

F2 slope difference; locus equation
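A locus equation is a straight-line fit of F2 at the consonant-vowel boundary against F2 in the vowel nucleus, with the slope commonly read as an index of coarticulation. The Python sketch below fits such a line by ordinary least squares. The function name and the example measurements are hypothetical and serve only to show the calculation.

```python
def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept."""
    n = len(f2_vowel_hz)
    if n != len(f2_onset_hz) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(f2_vowel_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel_hz)
    if sxx == 0:
        raise ValueError("F2 vowel values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f2_vowel_hz, f2_onset_hz))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements (Hz) for one consonant in several vowel contexts
f2_vowel = [800, 1200, 1600, 2000, 2300]
f2_onset = [1100, 1350, 1580, 1800, 2010]
slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

A slope near 1 means the F2 onset follows the vowel closely, while a slope near 0 means the onset stays near a fixed value regardless of the vowel.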


[Figure, two panels comparing read and spontaneous speech for stressed, unstressed, and total segments. Mean consonant duration (ms, axis 45 to 65), p: 0.001, 0.006, 0.001. Mean error rate for C identification (%, axis 0 to 35), p: 0.001, 0.001, 0.001.]

Adapted from van Son & Pols (Eurospeech’97)

791 VCV pairs (read & spontan.; stressed & unstr. segments; one male); C-identification by 22 Dutch subjects


Perception of ac. V reduction

Ph.D thesis Dick van Bergem (1995) “Acoustic and lexical vowel reduction”

lexical V reduction: Fr /betõ/ vs. Du /b@tOn/

acoustic V reduction: Du ‘miljoen’ as /mIljun/ or as /m@ljun/

identify the unstressed vowels (as V or @)

by 20 listeners (8 M, 12 F)

in 47 words (cond. W and S) or 20 words (cond. P), like ‘milJOEN’ or ‘biosCOOP’

spoken by 20 male speakers (2280 stimuli)


[Figure: 4 reduction stages for 20 speakers, with % schwa responses on /I/ by 20 listeners (5%, 36%, 60%, 69%) and the model prediction for schwa in this m-l context; adapted from van Bergem (1995)]

Conclusion: Vowel reduction is not centralization but contextual assimilation


Speech efficiency

speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ‘96)

less information needed for more predictable things:

shorter duration and more spectral reduction for high-frequency syllables and words

C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)

I(x) = -log2(Prob(x)) in bits

(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
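The information content I(x) = -log2(Prob(x)) can be computed directly from frequency counts. The Python sketch below estimates Prob(x) as a relative frequency, which is a simplifying assumption, and the syllable counts are made up for illustration.

```python
import math
from collections import Counter

def information_bits(counts):
    """Map each item to I(x) = -log2(Prob(x)), with Prob(x) estimated as
    the item's relative frequency in counts (a simple assumption)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("empty counts")
    return {x: -math.log2(c / total) for x, c in counts.items() if c > 0}

# Hypothetical syllable counts from a small corpus
counts = Counter({"de": 8, "van": 4, "spraak": 2, "klank": 2})
for syllable, bits in information_bits(counts).items():
    print(f"{syllable}: {bits:.1f} bits")
```

Frequent items carry few bits and rare items carry many, which is the sense in which more predictable syllables and words need less acoustic information.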


Correlation between consonant confusion and 4 measures indicated

[Figure: correlation coefficient (axis 0 to -0.40) between consonant confusion and Duration, CoG, I(syllable), and I(word), for the conditions Read -, Read +, Spont -, Spont +, and All; significance marked with + and *]

Adapted from van Son et al. (Proc. ICSLP’98)

Dutch male speaker; 20 min. read/spontaneous speech; 12 k syllables; 8 k words

791 VCV (read/spontaneous): 308 lexically stressed, 483 unstressed

C identification by 22 Ss


Conclusions

perceiving speech (segments) very much depends on speech quality and context

presenting segments in isolation is also a kind of context

only ‘proper’ interpretation of formant transitions (perceptual compensation for spectro-temporal undershoot) when presented in an appropriate context

reduced V are best perceived as schwa if transitions are contextually assimilated