Robust Direction Estimation with Convolutional Neural Networks-based Steered Response Power
Pasi Pertila, Emre Cakir
Laboratory of Signal Processing, Audio Research Group
Tampere University of Technology, Tampere, Finland
The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, 2017
Outline
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Robust DOA estimation using CNN-based SRP – Pertila, Cakir, ICASSP'17
Introduction
Uses of sound DOA estimation:
Spatial filtering: beamforming, speech enhancement
Surveillance, automatic camera management
Speaker DOA estimates can be degraded by reverberation and everyday, time-varying noise (e.g. household, cafeteria).
Time-Frequency (TF) masking
Removes undesired TF components from the observation:
successfully applied for speech enhancement using deep learning [1][2]
applied in Time Difference of Arrival estimation:
equations based on insight to deal with static noise [3]
regression to deal with reverberation [4]
This work proposes deep learning-based (CNN) TF-masking for DOA estimation, to deal with everyday noise and reverberation.

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," ICASSP 2013.
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849–1858, 2014.
[3] Grondin and Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots," IROS 2015.
[4] Wilson and Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework," TASLP 2006.
Why use convolutional neural networks (CNNs) for speech?
Speech signals contain significant local information in the spectral domain.
Changing speaker and environmental conditions → shifts in spectral position.
CNNs have a translational shift-invariance property → a suitable option for our task.
Convolutional neural networks (CNNs):
are discriminative classifiers
compute activations through shared weights over local receptive fields
Array signal model
The ith microphone signal is modeled as the mixture of reverberated signals in the presence of noise:

x_i(t, f) = Σ_n h_{m_i, r_n}(f) · s_n(t, f) + e_i(t, f),   (1)

where x_i(t, f) is the observation, h_{m_i, r_n}(f) · s_n(t, f) is the reverberated nth source signal, and e_i(t, f) is noise;
f = 0, …, K − 1 is the discrete frequency index,
t is the processing frame index,
h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R³ and microphone position m_i ∈ R³ in Cartesian coordinates,
i = 0, …, M − 1, where M is the number of microphones.
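The mixture model of Eq. (1) can be sketched numerically. In this minimal NumPy version, random complex arrays stand in for the source STFTs s_n(t, f) and for the RIR frequency responses h_{m_i, r_n}(f); representing the RIR by a single multiplicative response per frequency is the usual narrowband approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T, K = 4, 2, 32, 172  # microphones, sources, frames, frequency bins

# Placeholder source STFTs, per-pair frequency responses, and sensor noise.
S = rng.standard_normal((N, T, K)) + 1j * rng.standard_normal((N, T, K))
H = rng.standard_normal((M, N, K)) + 1j * rng.standard_normal((M, N, K))
E = 0.1 * (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K)))

# Eq. (1): each microphone observes the sum of filtered sources plus noise.
X = np.einsum('mnk,ntk->mtk', H, S) + E
```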
TF masking
Enhances by applying a mask η_i(t, f) to the observed signal:

ŝ_i(t, f) = x_i(t, f) · η_i(t, f),   (2)

using the Wiener filter

η(t, f) = |s(t, f)|² / (|s(t, f)|² + |e(t, f)|²),   (3)

to enhance the direct-path signal in the presence of interference:
s(t, f) contains the direct-path component of the source,
e(t, f) is a mixture of source reverberation, interference with reverberation, and noise.
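The oracle Wiener mask of Eqs. (2)–(3) is easy to compute when the direct-path signal and the residual are available separately, as they are during training-data generation. A minimal sketch, with random spectra as placeholders for real STFT components:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 32, 172
S = rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))  # direct path
E = rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))  # reverb + noise
X = S + E                                                           # observation

# Eq. (3): oracle Wiener mask (small epsilon guards empty TF bins).
eta = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(E) ** 2 + 1e-12)

# Eq. (2): masking the observation approximates the direct-path signal.
S_hat = X * eta
```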
CNN training data generation process
TIMIT speech convolved with RWCP RIRs → array observations:
16-channel circular array, radius 15 cm.
Seven rooms, RT60 ∈ [0.3, 1.3] s.
Mixed with interference and ambient noise:
three classes of interference[1]: Household, Interior background, Printer;
speech-to-interference ratios (SIRs) {+12, +6, 0, −6} dB;
ambient noise SNR ∼ U[6, 12] dB.
22440 training, 3000 validation, and 5160 test mixtures.

[1] BBC sound effects library
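The SIR/SNR mixing above amounts to a simple gain computation. In this sketch, `mix_at_snr` is an illustrative helper (not from the paper), and random signals stand in for the reverberated TIMIT speech and the BBC interference and ambient clips; the same gain rule serves both the SIR levels and the ambient-noise SNR.

```python
import numpy as np

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise power ratio is `snr_db` dB."""
    gain = np.sqrt(np.mean(target ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return target + gain * noise

rng = np.random.default_rng(2)
speech = rng.standard_normal(16000)        # stand-in for reverberated TIMIT speech
interference = rng.standard_normal(16000)  # stand-in for a BBC interference clip
ambient = rng.standard_normal(16000)       # stand-in for ambient noise

sir_db = rng.choice([12, 6, 0, -6])  # per-mixture SIR from the slide
snr_db = rng.uniform(6, 12)          # ambient SNR ~ U[6, 12] dB
mix = mix_at_snr(mix_at_snr(speech, interference, sir_db), ambient, snr_db)
```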
CNN training data generation process
CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f).
Input/output block size is 32 frames × 172 frequency bins.
Window length 21.4 ms, 50% overlap, 16 kHz data.
Masks between microphones are highly similar
→ features and targets are averaged over microphones
→ a single mask: computationally efficient.

CNN design
Four convolutional layers:
96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolution;
each followed by batch normalization and dropout (0.25);
first three followed by max-pooling over 4 frequency bins.
Hyper-parameters obtained with grid search.
Output layer: feed-forward (sigmoid).
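The slide's architecture can be sketched roughly as follows. PyTorch is an assumption (the framework is not stated here), as are the 'same' padding, the ordering inside each block, and the size of the feed-forward output layer; only the layer counts, kernel shape, feature maps, dropout rate, and pooling factor come from the slide.

```python
import torch
import torch.nn as nn

T, F = 32, 172  # input block: 32 frames x 172 frequency bins

def conv_block(in_ch, pool):
    # Conv (11x5, time x freq) + ReLU + batch-norm + dropout, per the slide;
    # the first three blocks also max-pool over 4 frequency bins.
    layers = [
        nn.Conv2d(in_ch, 96, kernel_size=(11, 5), padding=(5, 2)),  # 'same' (assumed)
        nn.ReLU(),
        nn.BatchNorm2d(96),
        nn.Dropout(0.25),
    ]
    if pool:
        layers.append(nn.MaxPool2d((1, 4)))  # downsample frequency by 4
    return layers

net = nn.Sequential(
    *conv_block(1, True), *conv_block(96, True),
    *conv_block(96, True), *conv_block(96, False),
    nn.Flatten(),
    nn.Linear(96 * T * (F // 4 // 4 // 4), T * F),  # freq: 172 -> 43 -> 10 -> 2
    nn.Sigmoid(),  # mask values in [0, 1]
)
net.eval()

# One 32x172 log-magnitude block in, one 32x172 mask out.
mask = net(torch.zeros(1, 1, T, F)).reshape(1, T, F)
```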
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequencymasking for speechenhancementArray signal model
TF-mask learning with CNNs
CNN Training data generationprocess
3 Sound direction ofarrival (DOA)estimation
4 Results
5 Conclusions andDiscussion
CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).
Input/output block size is 32 frames × 172 freq.bins.
Window length 21.4 ms, 50 % overlap, 16 kHz data.
Masks between microphones are highly similar
Features and targets are averaged over microphones→ Single mask: Computationally e�icient
CNN design
Four convolutional layers
96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution
Each followed by batch-norm. and dropout (0.25)
First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search
Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –
Pertila, Cakir ICASSP’17 11/28
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 12/28
SRP-PHAT
Steered Response Power with PHAse Transform (SRP-PHAT):

L(k, t) = Σ_{i,j} Σ_{f=0}^{K−1} [ x_i(t, f) · (x_j(t, f))* / ( |x_i(t, f)| · |x_j(t, f)| ) ] · exp(j · τ_{i,j} · ω_f),   (4)

where k ∈ R³ is the sound direction (unit vector),
τ_{i,j} = kᵀ(m_i − m_j)/c is the time difference between microphones i and j,
(·)* denotes complex conjugate, ω_f = 2πf/K, and
c is the speed of sound.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 13/28
SRP-PHAT with TF masking
Based on weighted GCC-PHAT [1].

Weighted SRP-PHAT:

L(k, t) = Σ_{i,j} Σ_{f=0}^{K−1} [ η(t, f)² · x_i(t, f) · (x_j(t, f))* / ( |x_i(t, f)| · |x_j(t, f)| ) ] · exp(j · τ_{i,j} · ω_f),   (4)

where k ∈ R³ is the sound direction (unit vector),
τ_{i,j} = kᵀ(m_i − m_j)/c is the time difference between microphones i and j,
(·)* denotes complex conjugate, ω_f = 2πf/K, and
c is the speed of sound.

Due to feature and target averaging, η_i(t, f) = η_j(t, f) = η(t, f).

[1] F. Grondin and F. Michaud, “Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots”, IROS, 2015.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 13/28
SRP-PHAT with TF masking
The DOA point estimate:

k̂(t) = argmax_k L(k, t).   (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(k, t).

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 14/28
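The mask-weighted SRP-PHAT of Eq. (4) and the argmax of Eq. (5) can be sketched in NumPy for a single frame. This is an illustrative implementation, not the authors' code: masked_srp_phat is a hypothetical name, the microphone sum runs over pairs i < j (a common convention that only rescales L), and the imaginary unit in the exponential is made explicit.

```python
import numpy as np

def masked_srp_phat(X, mics, dirs, mask, fs, c=343.0):
    """Mask-weighted SRP-PHAT for one frame.

    X:    (M, K) complex STFT of one frame, M microphones, K freq. bins
    mics: (M, 3) microphone positions m_i [m]
    dirs: (D, 3) candidate unit DOA vectors k
    mask: (K,)  TF mask eta(t, f), shared across mics due to averaging
    """
    M, K = X.shape
    w = 2 * np.pi * fs * np.arange(K) / K          # omega_f in rad/s
    phat = X / np.maximum(np.abs(X), 1e-12)        # phase transform (unit magnitude)
    L = np.zeros(len(dirs))
    for d, k in enumerate(dirs):
        acc = 0.0
        for i in range(M):
            for j in range(i + 1, M):              # each mic pair once
                tau = k @ (mics[i] - mics[j]) / c  # candidate TDOA for pair (i, j)
                cc = mask**2 * phat[i] * np.conj(phat[j]) * np.exp(1j * tau * w)
                acc += cc.sum().real
        L[d] = acc
    return L, dirs[int(np.argmax(L))]              # response and DOA point estimate
```

In practice dirs would be a dense grid of unit vectors over the azimuth (and possibly elevation) range, evaluated once per frame.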
DOA performance analysis
Speech: recorded Japanese sentences (RWCP)
Mixed with reverberated interference signals
Same SIR levels as in training
Different interference instances, different RIRs
1 Speech played back from a moving loudspeaker
4 different rooms (with different numbers of RIRs)
Total of 1440 mixtures
2 Speech played back from a static loudspeaker
5 different rooms (with different numbers of RIRs)
Total of 1650 mixtures

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 15/28
DOA performance analysis
Used performance measures:
1 Portion of the SRP-PHAT likelihood in the ground-truth direction (±10°) w.r.t. all directions
2 Percentage of DOA estimates in the ground-truth direction (±10°)
Only frames with detected speech are included.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 16/28
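The second measure, the fraction of frames whose DOA estimate falls within ±10° of the ground truth, can be sketched for azimuth angles as follows. The helper name is illustrative, and the wraparound handling at 360° is an assumption about how angular error is computed.

```python
import numpy as np

def doa_accuracy(est_deg, ref_deg, tol=10.0):
    """Fraction of frames whose azimuth estimate lies within +/- tol degrees
    of the ground truth, with angular error wrapped to [-180, 180)."""
    est = np.asarray(est_deg, dtype=float)
    ref = np.asarray(ref_deg, dtype=float)
    err = np.abs((est - ref + 180.0) % 360.0 - 180.0)  # wrapped angular error
    return float(np.mean(err <= tol))
```

For example, an estimate of 355° against a ground truth of 0° counts as a 5° error, not 355°.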
DOA performance analysis
Compared methods:
1 SRP-PHAT (traditional approach)
2 SRP-PHAT weighted with the CNN-predicted mask
3 SRP-PHAT weighted with an interference-canceling mask (ICM)
ICM ← Wiener filter with access to the added interference and the original reverberated speech
Will not remove the reverberation of the target speaker.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 17/28
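A minimal sketch of such an oracle Wiener-style mask, assuming access to the STFTs of the reverberated target speech and of the added interference (interference_canceling_mask is an illustrative name, not from the talk):

```python
import numpy as np

def interference_canceling_mask(S, N, eps=1e-12):
    """Oracle mask eta = |S|^2 / (|S|^2 + |N|^2) from the reverberated
    target speech STFT S and the added interference STFT N (same shape).
    Target reverberation stays in place, since S is itself reverberated."""
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return ps / (ps + pn + eps)  # -> 1 where speech dominates, -> 0 otherwise
```

This is exactly why the ICM serves as a strong reference: it suppresses the added interference perfectly in expectation, but by construction cannot de-reverberate the target.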
DOA performance analysis
[Figure: azimuth angle (deg) vs. time (frame), four panels; DOA estimates for active and inactive frames shown against the ground truth ±10° band]
a) SRP-PHAT without interference
b) SRP-PHAT with added printer interference at +6 dB SIR
c) SRP-PHAT with ICM weight for the mixture
d) SRP-PHAT with CNN-predicted weight for the mixture

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 18/28
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 19/28
Results
SRP-PHAT results[1], static source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (static source); performance (%) of SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT for the House, Inter., bg., and Avg. conditions at SIR +12, +6, 0, and −6 dB]
CNN weighting improves SRP-PHAT.
At SIR +12 and +6 dB, CNN outperforms ICM.
Most likely caused by reduction of reverberation.
[1] Normalized with the SRP-PHAT result without interference.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 20/28
Results
SRP-PHAT results, moving source
[Bar chart: relative SRP-PHAT mass within ±10° of ground truth (moving source); performance (%) of SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT for the House, Inter., bg., and Avg. conditions at SIR +12, +6, 0, and −6 dB]

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 21/28
Results
DOA point estimate, static source
[Bar chart: correct DOA estimates within ±10° of ground truth (static source); relative performance (%) of SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT for the House, Inter., bg., and Avg. conditions at SIR +12, +6, 0, and −6 dB]
CNN weighting improves SRP-PHAT.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 22/28
Results
DOA point estimate, moving source
[Bar chart: correct DOA estimates within ±10° of ground truth (moving source); relative performance (%) of SRP-PHAT, CNN-W-SRP-PHAT, and ICM-W-SRP-PHAT for the House, Inter., bg., and Avg. conditions at SIR +12, +6, 0, and −6 dB]

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 23/28
1 Introduction
2 Time-Frequency masking for speech enhancement
3 Sound direction of arrival (DOA) estimation
4 Results
5 Conclusions and Discussion
Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 24/28
Conclusions and Discussion
A CNN-based time-frequency-mask-weighted SRP-PHAT function was proposed.
Results showed a reduction in the detrimental effects caused by reverberation and time-varying interference.
Relative performance was unchanged by source motion.
The CNN generalized to new speakers and to new instances of interference from unseen angles.
The CNN mask is obtained from log-magnitudes → should work on other arrays (without re-training).
SRP-PHAT weighting does not fully remove the effects of interference → phase errors are not affected by (real-valued) masking.

Robust DOA estimation using CNN-based SRP – Pertila, Cakir ICASSP’17 25/28