TRANSCRIPT
Primitive probabilistic auditory scene analysis
Richard E. Turner ([email protected])
Computational and Biological Learning Lab, University of Cambridge
Maneesh Sahani ([email protected])
Gatsby Computational Neuroscience Unit, UCL, London
Motivation
object → parts of objects → structural primitives → sensory input
Vision: trees → bark → edges → photoreceptor activities
Audition: bird song → motif → AM tones and noise → auditory filter activities
Auditory Scene Analysis
object → parts of objects → structural primitives → sensory input
(vision: trees → bark → edges → photoreceptor activities; audition: bird song → motif → AM tones and noise → auditory filter activities)
Grouping processes operate between adjacent levels of the hierarchy: super-schema-based grouping, schema-based grouping, and primitive grouping (closest to the sensory input).
Generative model: Probabilistic Auditory Scene Synthesis
The hierarchy defines a generative model running from sources down to the sound:
p(source), p(parts|source), p(primitive|part), p(sound|primitives)
Recognition model: Probabilistic Auditory Scene Analysis
Inverting the generative model yields posteriors over every level of the hierarchy given the sound waveform:
p(source|sound), p(part|sound), p(primitive|sound)
Primitive Probabilistic Auditory Scene Analysis
Part 1: Probabilistic generative model: primitive auditory scene synthesis
What are the important low-level statistics of natural sounds?
Part 2: Probabilistic recognition model: primitive auditory scene analysis
Provocative computational theory: Grouping rules arise from inferences based on the statistics of natural sounds.
Heuristic Analysis: Fire sound
[Figure: fire waveform y passed through a filter bank (sub-bands centred at 0.4, 1, 2.4 and 4.1 kHz) and then demodulated; shown over 0.5–0.62 s]
Heuristic Analysis: Water
Heuristic Analysis: Rain
Heuristic Analysis: Speech
[Analogous filter-and-demodulate analyses for water, rain and speech sounds]
Summary
sound → filter bank → demodulate → transformed envelopes
• Important statistics include
– energy in sub-bands (power-spectrum)
– patterns of co-modulation (local power-sharing)
– time-scale of the modulation
– depth of the modulation (sparsity)
• Josh’s texture synthesis work tomorrow
• Formulate a probabilistic model to capture these statistics:
sound ← modulate carriers ← modulators ← transformed envelopes
Structural Primitives = Co-modulated narrow-band processes
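The filter-bank-and-demodulate pipeline above can be sketched in a few lines. This is a minimal illustration, not the analysis used in the talk: the sampling rate, band edges and the ideal FFT brick-wall filters are assumptions made for the sketch.

```python
import numpy as np

def analytic_signal(x):
    # Hilbert transform via the FFT: zero out negative frequencies
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0
    return np.fft.ifft(X * h)

def subband_envelopes(y, fs, bands):
    """Filter y into sub-bands (ideal FFT brick-wall filters) and
    demodulate each by taking the magnitude of its analytic signal."""
    N = len(y)
    freqs = np.abs(np.fft.fftfreq(N, d=1.0 / fs))
    Y = np.fft.fft(y)
    envs = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        sub = np.real(np.fft.ifft(Y * mask))
        envs.append(np.abs(analytic_signal(sub)))
    return np.array(envs)

# toy example: an amplitude-modulated tone
fs = 8000
t = np.arange(fs) / fs
env = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)   # 4 Hz modulator
y = env * np.sin(2 * np.pi * 1000 * t)        # 1 kHz carrier
E = subband_envelopes(y, fs, [(400, 1600), (2000, 4000)])
# the band containing the carrier recovers the slow modulator;
# the other band carries essentially no modulation energy
```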
Model Schematic
[Figure: the waveform is assembled as y_t = Σ_d a_{d,t} × c_{d,t}: slowly varying transformed envelopes x_{d,t} are mapped to positive envelopes a_{d,t}, multiplied by narrow-band carriers c_{d,t}, and summed across sub-bands]
Probabilistic Model
• Transformed envelopes: slow and +/–
p(x_{e,1:T} | \Gamma_{e,1:T,1:T}) = \mathrm{Norm}(x_{e,1:T}; 0, \Gamma_{e,1:T,1:T}),
\Gamma_{e,t-t'} = \sigma_e^2 \exp\left(-\frac{(t-t')^2}{2\tau_e^2}\right)
• Envelopes: positive mixture of transformed envelopes with controllable sparsity
a_{d,t} = a\left(\sum_{e=1}^E g_{d,e} x_{e,t} + \mu_d\right), e.g. a(z) = \log(1 + \exp(z)).
• Carriers: band-limited Gaussian noise
p(c_{d,t} | c_{d,t-1:t-2}, \theta) = \mathrm{Norm}\left(c_{d,t}; \sum_{t'=1}^{2} \lambda_{d,t'} c_{d,t-t'}, \sigma_d^2\right)
• Observation:
y_t = \sum_{d=1}^{D} c_{d,t} a_{d,t} + \sigma_y \varepsilon_t
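As a concrete sketch, samples from the model can be drawn as follows. This is not the authors' code: the GP sampler (white noise convolved with a Gaussian kernel, which yields a squared-exponential covariance), the AR(2) pole placement and all parameter values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp_se(T, tau, sigma, rng):
    """Sample a GP with squared-exponential covariance
    sigma^2 exp(-(t-t')^2 / (2 tau^2)) by convolving white noise
    with a Gaussian kernel (an equivalent construction)."""
    w = rng.standard_normal(T + 8 * int(tau))
    lags = np.arange(-4 * int(tau), 4 * int(tau) + 1)
    k = np.exp(-lags**2 / tau**2)               # self-convolution gives the SE covariance
    k *= sigma / np.sqrt(np.sum(k**2))          # normalise so the marginal variance is sigma^2
    return np.convolve(w, k, mode="valid")[:T]

def softplus(z):
    # the positive nonlinearity a(z) = log(1 + exp(z)) from the model
    return np.log1p(np.exp(z))

def sample_ar2_carrier(T, f0, bw, fs, rng):
    """AR(2) process with poles at radius exp(-pi*bw/fs) and angle
    2*pi*f0/fs: band-limited Gaussian noise centred on f0 (assumed
    parameterisation, chosen for the sketch)."""
    r = np.exp(-np.pi * bw / fs)
    th = 2 * np.pi * f0 / fs
    lam1, lam2 = 2 * r * np.cos(th), -r**2
    c = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(2, T):
        c[t] = lam1 * c[t - 1] + lam2 * c[t - 2] + eps[t]
    return c

# generate: D = 2 sub-bands sharing E = 1 modulator (full co-modulation)
fs, T = 8000, 8000
x = sample_gp_se(T, tau=400, sigma=1.0, rng=rng)           # slow transformed envelope
a = softplus(2.0 * x - 1.0)                                # positive envelope, mu_d = -1
c1 = sample_ar2_carrier(T, f0=500, bw=50, fs=fs, rng=rng)
c2 = sample_ar2_carrier(T, f0=2000, bw=50, fs=fs, rng=rng)
y = a * c1 + a * c2 + 0.01 * rng.standard_normal(T)        # observation noise sigma_y = 0.01
```

Because both sub-bands share the single modulator a, the sample is fully co-modulated; adding a second modulator with its own mixing weights g_{d,e} would break that coupling.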
What do the parameters control?
λ_d control the centre frequency and bandwidth of each sub-band
σ²_d control the power in each sub-band
τ_e controls the time-scale of the modulators
µ_d and σ²_e control the modulation depth, skew and sparsity
g_{d,e} control the patterns of modulation across sub-bands
[Figure: the sub-bands plotted by centre frequency against bandwidth]
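To make the role of λ_d concrete: one illustrative way to choose AR(2) coefficients for a desired centre frequency and bandwidth is to place the complex pole pair accordingly. This parameterisation is an assumption made for the sketch, not taken from the talk.

```python
import numpy as np

def ar2_coeffs(f0, bw, fs):
    """Map a desired centre frequency f0 and bandwidth bw (both Hz)
    to AR(2) coefficients (lambda_1, lambda_2) by placing the poles
    at radius r = exp(-pi*bw/fs) and angle 2*pi*f0/fs."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f0 / fs
    return 2 * r * np.cos(theta), -r**2

# e.g. a sub-band centred at 1 kHz with ~100 Hz bandwidth at fs = 8 kHz
lam1, lam2 = ar2_coeffs(f0=1000, bw=100, fs=8000)
```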
Intuitions for role of model parameters: Generation
Generation: Increasing sparsity
Generation: Increasing co-modulation
Generation: Decreasing time-scale
[Audio demos: samples from the model as each parameter is varied in turn]
Inference and Learning
• Key observation: with the envelopes fixed, the posterior over the carriers is Gaussian.
• This leads to MAP estimation of the envelopes (as in HMMs, ICA, NMF):
X_{\mathrm{MAP}} = \arg\max_X p(X|Y)
p(X|Y) = \frac{1}{Z} p(X,Y) = \frac{1}{Z} \int dC\, p(X,Y,C) = \frac{1}{Z} p(X) \int dC\, p(Y|A,C)\, p(C)
• Compute the integral efficiently using Kalman smoothing
• Leads to gradient-based optimisation for the transformed amplitudes
• Learning: approximate maximum likelihood, \theta = \arg\max_\theta p(Y|\theta)
• See Turner and Sahani, ICASSP 2010.
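The key observation — with the envelopes fixed, the model is linear-Gaussian in the carriers — can be illustrated with a single-sub-band Kalman filter that evaluates log p(y|a) exactly. This is a minimal sketch: the state-space construction and the broad initialisation are assumptions made for the illustration, not the authors' implementation.

```python
import numpy as np

def kalman_loglik(y, a, lam1, lam2, sig_c, sig_y):
    """Exact log-likelihood log p(y | a) for a single AR(2) carrier
    with a fixed, time-varying envelope a_t:
        state  s_t = [c_t, c_{t-1}],
               c_t = lam1 c_{t-1} + lam2 c_{t-2} + Norm(0, sig_c^2)
        obs    y_t = a_t c_t + Norm(0, sig_y^2)
    With a fixed, this is linear-Gaussian, so the carrier posterior is
    Gaussian and the likelihood is computed exactly by Kalman filtering."""
    A = np.array([[lam1, lam2], [1.0, 0.0]])
    Q = np.array([[sig_c**2, 0.0], [0.0, 0.0]])
    m = np.zeros(2)
    P = 10.0 * sig_c**2 * np.eye(2)            # broad initial uncertainty
    ll = 0.0
    for t in range(len(y)):
        m, P = A @ m, A @ P @ A.T + Q          # predict
        h = np.array([a[t], 0.0])              # time-varying observation weights
        s = h @ P @ h + sig_y**2               # innovation variance
        e = y[t] - h @ m                       # innovation
        ll += -0.5 * (np.log(2 * np.pi * s) + e**2 / s)
        k = P @ h / s                          # Kalman gain
        m = m + k * e
        P = P - np.outer(k, h @ P)
    return ll
```

In the full algorithm this exact marginalisation over the carriers is what makes gradient-based MAP optimisation of the transformed envelopes tractable.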
Synthetic Sounds
[Audio demos: original vs. model-synthesised versions of stream, wind, fire, rain, twigs and speech sounds]
Primitive Probabilistic Auditory Scene Analysis
Part 1: Probabilistic generative model: primitive auditory scene synthesis
What are the important low-level statistics of natural sounds?
Part 2: Probabilistic recognition model: primitive auditory scene analysis
Provocative computational theory: Grouping rules arise from inferences based on the statistics of natural sounds.
Introduction to the psychophysics
• Probe the model with stimuli used to study the psychophysics of auditory scene analysis
• Fix the carriers to be ‘auditory filter’-like
• Assume perceptual access to:
– the instantaneous frequency of the carriers (not the phase)
– the modulators
How does the model ‘hear’ these stimuli?
Continuity Illusion
y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}
[Figure: a tone interrupted by a noise burst (signal and frequency track); over 0–1.5 s the inferred decomposition separates y into a tone component y_tone whose envelope continues through the interruption and a noise component y_noise]
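In the model's terms, the interrupted-tone stimulus can be written directly as envelopes multiplying carriers. A toy construction follows; the 1 kHz tone frequency, the timings and the noise gain are arbitrary choices for illustration (and the on/off envelopes here are hard-edged, whereas the model's envelopes are smooth).

```python
import numpy as np

fs = 8000
t = np.arange(int(1.5 * fs)) / fs

# tone envelope: on for 0-0.5 s and 1.0-1.5 s, silent in the gap
a_tone = ((t < 0.5) | (t >= 1.0)).astype(float)
c_tone = np.sin(2 * np.pi * 1000 * t)          # tonal carrier

# noise envelope: a loud burst filling the gap
rng = np.random.default_rng(0)
a_noise = ((t >= 0.5) & (t < 1.0)).astype(float)
c_noise = rng.standard_normal(len(t))          # broadband carrier

# y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}
y = a_tone * c_tone + 3.0 * a_noise * c_noise
```

Given such a y, inference under the model favours a tone envelope that stays on through the burst, because the slow GP prior on the transformed envelopes penalises an abrupt off-and-on excursion: this is the continuity illusion.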
Common Amplitude Modulation
y_t = (a_{1,t} + a_{3,t}) c_{1,t} + (a_{2,t} + a_{3,t}) c_{2,t}
[Figure: two tones (100–200 Hz) sharing a common modulator a_3; signal and frequency tracks over 0–4 s]
Old Plus New Heuristic
y_t = a_{1,t}(c_{1,t} + c_{2,t} + c_{3,t}) + a_{2,t}(c_{2,t} + c_{4,t}) + a_{3,t} c_{2,t} + a_{4,t}(c_{1,t} + c_{3,t})
[Figure: signal and frequency tracks (up to ~400 Hz), with the four inferred envelopes a_1–a_4 over 0–4 s]
Comodulation Masking Release
y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}
[Figure: two signals (80–120 Hz) and, over 0–2 s, the inferred decomposition into y_noise and y_tone]
Good Continuation
y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}
[Figure: two signals with smoothly varying frequency tracks (spanning roughly 50–320 Hz) over 0–5 s]
Proximity
y_t = a_{1,t} c_{1,t} + a_{2,t} c_{2,t}
[Figure: two signals whose frequency tracks (0–100 Hz) vary in their proximity, shown over 0–5 s]
Conclusions
• Developed a model for natural sounds comprising quickly varying carriers and slowly varying modulators
• Inspired by natural sound statistics
• Inference replicates many of the characteristics of auditory scene analysis
Future work
• Train on a large corpus of natural sounds: quantitatively predict psychophysical results
• Trading between grouping principles learned from natural sounds
• Psychophysics: prediction of grouping on new stimuli
• Experimental stimulus generation: samples from the model are controlled, but natural-sounding
• Model extensions: hierarchical (cascades of modulators), asymmetries, more general filter shapes
Additional Slides
Inference with fixed amplitudes, a_{d,t} = 1
[Figure: carrier spectra for three filters spanning 700–1600 Hz, with the corresponding inferred carriers and the waveform y over 5.35–5.40 s]