pitch perception in auditory scenes 2 papers on pitch perception… of a single sound source of more...
TRANSCRIPT
Pitch perception in auditory scenes
2
Papers on pitch perception…
• of a single sound source
• of more than one sound source
LOTS - too many?
Almost none ± a few
4
Bach: Musical Offering (strings)
5
Talk outline
• Low-level grouping cues in pitch perception– harmonic structure
• Conjoint or disjoint allocation
– onset-time– context– spatial cues.
• Use of a difference in harmonic structure between two sound sources to help identify, track over time, and localise simultaneous sound sources.
6
Excitation pattern of complex tone on basilar membrane
-5.0
0.0
5.0
10.0
15.0
20.0
25.0
base apexlog (ish) frequency
2004001600
resolvedunresolved
600800
Output of 1600 Hz filter Output of 200 Hz filter
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
0 0.2 0.4 0.6 0.8 1
1/200s = 5ms
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
1/200s = 5ms
7
Mistuned harmonic’s contribution to pitch declines as Gaussian function of mistuning
• Experimental evidencefrom mistuning expts: Moore, Glasberg & Peters JASA (1985).
Darwin (1992). In M. E. H. Schouten (Ed). The auditory processing of speech: from sounds to words Berlin: Mouton de Gruyter
frequency (Hz)
400 800 1200 1600 2000 24000
Match low pitch
-1.5
-1
-0.5
0
0.5
1
1.5
540 560 580 600 620 640 660 680
∆F
0
(Hz)
Frequency of mistuned harmonic f (Hz)
∆Fo = a - k ∆f exp(-∆f
2 / 2 s
2)
s = 19.9
k = 0.073
8
Harmonic Sieve
• Only consider frequencies that are close enough to harmonic. Useful as front-end to a Goldstein-type model of pitch perception.
Duifhuis, Willems & Sluyter JASA (1982).
frequency (Hz)
400 800 1200 1600 2000 2400
200 Hz sieve spacing
0
blocked
9
Is “harmonic sieve” necessary with autocorrelation models?
• Autocorrelation could in principle explain mistuning effect– mistuned harmonic
initially shifts autocorrelation peak
– then produces its own peak
• But the numbers do not work out. – Meddis & Hewitt model
is too tolerant of mistuning.
10
Disjoint allocation ?
Will the 200-Hz series grab the 600-Hz component, and so make it give less of a pitch shift to the 155-Hz series ?
Double Complex, Ipsilateral
TARGET
MATCH
600Hz
155 465 775
155 465 775
200
Darwin, Buffa, Williams and Ciocca (1992). in Auditory physiology and perception edited by Cazals, Horner and Demany (Pergamon, Oxford)
11
Disjoint allocation… (2)
• No evidence of 600-Hz component being grabbed by "in-tune" 200-Hz complex
-1.5
-1
-0.5
0
0.5
1
1.5
550 570 590 610 630 650 670 690
ipsi doubleipsi singleipsi double-single
Match: 155 HzLeft: 155 Hz F0Right: 200 Hz F05 subjects
12
Pitch determined by conjoint allocation
• Decision on each simultaneous pitch taken independently.
13
Mistuning & pitch
-0.2
0
0.2
0.4
0.6
0.8
1
0 1 2 3 5 8
vowelcomplex
Mea
n pi
tch
shif
t (H
z)
% Mistuning of 4th Harmonic
8 subjects
90 ms
14
Onset asynchrony & pitch
-0.2
0
0.2
0.4
0.6
0.8
1
0 80 160 240 320
vowelcomplex
Onset Asynchrony T (ms)
Mea
n pi
tch
shif
t (H
z) 8 subjects±3% mistuning
90 ms
T
15
Adaptation ?0, 30, 300 ms 90 ms
155
465
775
1085
1705
1395
640 640
Increases effect of mistuning
16
Use of onset-tme
• A long onset-time removes a harmonic from pitch calculation.
• Not just due to adaptation because of regrouping experiment
• NB Onset-time much longer for pitch (100s of ms) than for hearing sound out as separate harmonic or for timbre judgements of complex (10s of ms).
17
Repeated context and pitch50 ms 90 ms
155 Hz
1085
1500
620 ± ∆
1810 Hz
0
0.4
0.8
1.2
1.6
0 2 4 6 8
IsolatedIn sequence
Absolute mistuning of 4th harmonic (%)
Darwin, Hukin, & Al-Khatib, B. Y. (1995). "Grouping in pitch perception: evidence for sequential constraints," J. Acoust. Soc. Am. 98, 880-885.
18
Pitch perception is not a bacon slicer
• Pitch perception does not operate on independent time-slices
• It uses, for example, the “Old + New” heuristic to parse which components are temporally relevant
Speckschneidermaschine
19
These conclusions apply to resolved harmonics
Separate simultaneous pitches only audible with unresolved harmonics when each pitch comes predominantly from a separate frequency region.
Problem for autocorrelation models
And the Old+New Heuristic breaks down. No advantage for onset-time.
20
Superimposing unresolved harmonics
Time (s)0 0.1
ミ 0.9636
0.9636
0
Time (s)0 0.1
ミ 0.9696
0.9696
0
Time (s)0 0.1
ミ 1.929
1.929
0
2kHz - 3kHz
Fo = 100 Hz
Fo = 125 Hz
sum
Carlyon, R. P. (1996). "Encoding the fundamental frequency of a complex tone in the presence of a spectrally overlapping masker," J. Acoust. Soc. Am. 99, 517-24.
21
Spatial cues ?
• Contralateral mistuned component contributes almost as much as ipsilateral (k= .073 vs .061)Cf Beerends and
Houtsma JASA (1989). -1.5
-1
-0.5
0
0.5
1
1.5
540 560 580 600 620 640 660 680
ipsilateral
contralateral
∆F
0 (H
z)
Frequency of mistuned harmonic f (Hz)
∆Fo = a - k ∆f exp(-∆f
2 / 2 s
2)
s = 17.4
k = 0.061
s = 19.9
k = 0.073
22
Summary of constraints for pitch
• Harmonic (Gaussian s.d. 3%)• Onset time (~100-200 ms)• Repeated Context• V. weak spatial constraint• Independent estimates of different
simultaneous F0s (conjoint allocation)
• N.B. These constraints different for unresolved harmonics and for e.g. timbre / vowel perception
Old +New
23
Using pitch differences to separate different objects
• Helps intelligibility of simultaneous speech
• Helps to track one voice in presence of others
• Helps to separate objects for independent
localisation
24
Fo between two sentences(Bird & Darwin 1998; after Brokx & Nooteboom, 1982)
0
20
40
60
80
100
0 2 4 6 8 10
Normal
Fo difference (semitones)
40 Subjects40 Sentence Pairs
Perfect Fourth ~4:3
Two sentences (same talker)• only voiced consonants • (with very few stops)Thus maximising Fo effect
Task: write down target sentence
Replicates & extends Brokx & Nooteboom
Target sentence Fo = 140 Hz
Masking sentence = 140 Hz ± 0,1,2,5,10 semitones
25
Example of using harmonicity for grouping for another property
How do we localise complex sounds when more than one sound source is present?
Maybe we group sounds first and then pool the localisation estimates of the component frequencies for that sound. More stable than grouping by location?
26
Localisation by ITD:Jeffress / Trahiotis & Stern
500 Hz Narrow-band noise
Right ear leads by 1.5 ms
500 Hz Wider-band noise
Right ear leads by 1.5 ms
Hear on Left
(phase ambiguity)
Hear on Right(consistent ITD)
27
Resolution of phase ambiguityResolution of phase ambiguity by across-frequency comparison
500 Hz: period = 2ms
L lags by 1.5 ms or L leads by 0.5 ms ?
-2.5200
800
600
400
-0.5 1.5 3.5
Delay of cross-correlator ms
Frequency of auditory filter Hz
Cross-correlation peaks for sound delayed in one ear by 1.5 ms
300 Hz: period = 3.3ms
R R LL R
Actual delay
Left ear actually lags by 1.5 ms
L lags by 1.5 ms or L leads by 1.8 ms ?
R
29
Mistuning & localisation
-15
-10
-5
0
5
10
15
Poin
ter
IID
(d
B)
0 1 3 6
6 Ss
percent mistuning