TRANSCRIPT
deeploria – 21/04/2016
IS AUDIO SIGNAL PROCESSING STILL USEFUL IN THE ERA OF MACHINE LEARNING?
Emmanuel Vincent, Inria, France
A silent animation
Disclaimer: All characters appearing in this animation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
Credits: Icons made by Freepik, Anton Saputro and Robin Kylander from www.flaticon.com and licensed under Creative Commons BY 3.0. Music: Modern Times, by Charlie Chaplin & David Raksin.
A long time ago, God created audio signal processing
"Hello my dear friends!"
Researchers made some progress every year
"All is for the best in the best of all possible worlds."
New technologies would sometimes raise skepticism…
NMF: "Hello, I'm coming from the image community. Can I help you model spectrograms?"
"You can't be a good model. Speech follows my own model."
…but would finally be adopted
"All is for the best in the best of all possible worlds."
One day a disruptive technology appeared
DNN: "Hey, pals! I'm the new state of the art for automatic speech recognition."
NMF: "Good for you! But we're not concerned…"
It quickly spread to other fields. . .
DNN: "I also speak 20 languages, paint like Van Gogh, compose music like Bach, cook like Bocuse…"
NMF: "Well, we're impressed! But we're still not concerned…"
. . . until it reached the core of signal processing
DNN: "I just tried to perform speech enhancement and, guess what, I can also do it!"
Some researchers felt depressed. . .
DNN: "Every time I fire a signal processing researcher, my output SNR goes up!"
NMF: "This guy is going to kill us and take our jobs!"
. . . some felt angry. . .
DNN: "So what? My SNR outperforms yours and I can run on embedded platforms too!"
NMF: "You're a brute-force black box and you train slowly!"
. . . and some saw a great opportunity!
DNN: "I need more data and domain knowledge to progress. Would you help me?"
NMF: "Let's work together! We need your modeling power to progress too."
Outline
Focus of today's talk: use of deep neural networks (DNNs) for speech enhancement.
Goal:
- convince (if needed) that DNNs have greater modeling power,
- report evidence that domain knowledge is still useful,
- introduce hybrid DNN / signal processing architectures.
SOURCE SEPARATION BASICS
Source separation: what is it?
Goal: extract the signals corresponding to several sound sources which are simultaneously active in a recording.
Applications:
- speech enhancement in phones, hearing aids
- robust speech and speaker recognition
- remixing of audio contents
- audio monitoring/surveillance
General model
Linear mixing equation (t time, f frequency):
$x_{tf} = \sum_j y_{jtf}$
where $x_{tf}$ is the mixture and $y_{jtf}$ the spatial image of the $j$-th source.
Gaussian source model:
$y_{jtf} \sim \mathcal{N}(0, v_{jtf} R_{jf})$
where $v_{jtf}$ is the power spectrum and $R_{jf}$ the spatial covariance matrix.
General two-step algorithm
Estimate the model parameters (maximum a posteriori):
$\max_\theta \sum_{t,f} \log p(\theta \mid x_{tf})$
where $\theta = \{R_{jf}, v_{jtf}\}$.
Estimate the sources (minimum mean square error):
$\hat{y}_{jtf} = \Omega_{jtf} x_{tf}$ where $\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$
$\Omega_{jtf}$ is called the Wiener filter.
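To make the two steps concrete, here is a minimal NumPy sketch of the filtering step at a single time-frequency bin, assuming the parameters $v_{jtf}$ and $R_{jf}$ have already been estimated (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def multichannel_wiener(x_tf, v, R):
    """Filter one time-frequency bin of an I-channel mixture.

    x_tf : (I,) complex STFT vector of the mixture
    v    : (J,) estimated source power spectra v_{jtf}
    R    : (J, I, I) estimated spatial covariance matrices R_{jf}
    Returns a (J, I) array of source spatial image estimates y_{jtf}.
    """
    # Mixture covariance: sum_j' v_{j'tf} R_{j'f}
    Sigma_x = np.einsum('j,jab->ab', v, R)
    Sigma_x_inv = np.linalg.inv(Sigma_x)
    y = np.empty((len(v), len(x_tf)), dtype=complex)
    for j in range(len(v)):
        Omega_j = v[j] * R[j] @ Sigma_x_inv  # Wiener filter Omega_{jtf}
        y[j] = Omega_j @ x_tf
    return y
```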
Single-channel Wiener filter
On single-mic data, the filter operates as a time-frequency mask.
[Figure: spectrograms (frequency in Hz vs. time in s) of the speech source, the speech + noise mixture, the Wiener filter (gain in dB), and the filtered signal.]
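As an illustration of masking in practice, a short SciPy-based sketch of single-channel enhancement: compute the STFT, apply a precomputed mask, resynthesize. The sampling rate and window size are placeholder values, not from the talk.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(x, mask, sr=16000, nperseg=1024):
    """Apply a time-frequency mask (e.g. a Wiener mask) to a single-channel
    mixture x and resynthesize the enhanced signal.

    mask must match the STFT shape (frequencies x frames), values in [0, 1].
    """
    _, _, X = stft(x, fs=sr, nperseg=nperseg)
    _, y = istft(mask * X, fs=sr, nperseg=nperseg)
    return y
```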
Multichannel Wiener filter
On multi-mic data, the Wiener filter also achieves spatial filtering.
[Figure: beampatterns of the Wiener filter (frequency in Hz vs. angle in degrees, gain in dB) in anechoic and reverberant conditions.]
WHY SHOULD I USE DNNs?
Using DNNs for source separation
Inputs: magnitude spectra of the mixture in the current time frame + left and right context frames.
Outputs:
- magnitude spectra of speech and noise in the current frame,
- or a time-frequency mask.
Training data: simulated mixtures of speech and noise.
Test data: real recordings.
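A minimal PyTorch sketch of such a network; PyTorch and all layer sizes are illustrative choices, not what was used in the talk. The input stacks the magnitude spectra of the current frame and its context frames; the output is a mask squashed to [0, 1] by a sigmoid.

```python
import torch
import torch.nn as nn

F_BINS, CONTEXT = 513, 3                    # assumed STFT size and context
IN_DIM = F_BINS * (2 * CONTEXT + 1)         # current frame + left/right context

mask_net = nn.Sequential(
    nn.Linear(IN_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, F_BINS), nn.Sigmoid(),  # time-frequency mask in [0, 1]
)

def estimate_mask(stacked_frames: torch.Tensor) -> torch.Tensor:
    """stacked_frames: (batch, IN_DIM) magnitude spectra of the current
    frame plus its context frames; returns the mask for the current frame."""
    return mask_net(stacked_frames)
```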
Theoretical benefits compared to previous models
- Explanatory power: DNNs can account for complex nonlinear dependencies between frames/frequencies.
- Scalability w.r.t. data: DNN performance increases more quickly with the data size due to efficient parameter "sharing".
- Invariance/robustness to outliers: thanks to the "shrinkage" effect of nonlinear activations, DNNs become more invariant and robust with more layers.
- Discriminative training: discriminative training is hard for generative models (it requires interleaved inference and training) but easy for DNNs.
Impact on time-frequency mask estimation
[Figure: input spectrogram and the time-frequency masks estimated by IMCRA, NMF and DNN.]
CHiME-3: speech recorded in a cafe. Single-channel enhancement by Wiener mask. NMF training: noise context. DNN training: bus + cafe + pedestrian area + street.
Impact on spectral enhancement
Enhancement        SDR        WER
Delay-and-sum      1.74 dB    21.12%
Multichannel NMF   5.82 dB    19.46%
DNN post-filter    14.17 dB   14.82%

CHiME-2: speech mixed with real domestic noise. DNN = LSTM-SA (see below). WER reported for a multicondition-trained DNN-HMM acoustic model on test data.

F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR", in Proc. LVA/ICA, 2015.
Impact on spatial enhancement
Enhancement   SDR
DUET          0.14 dB
MESSL         2.73 dB
DNN           6.32 dB

Note: time-frequency masking only (no multichannel filter).

Speech mixed with one babble noise source, RT60 = 300 ms, fixed target RIR and multicondition noise RIR. DNN input = GFCC+CCF+ILD (see below). SDR measured w.r.t. the IBM target.

Y. Jiang, D.L. Wang, R.S. Liu, Z.M. Feng, "Binaural classification for reverberant speech segregation using deep neural networks", IEEE/ACM TASL, 2014.
HOW CAN I USE MY DOMAIN KNOWLEDGE?
Design choices
DNN-based enhancement relies on a number of design choices, for which domain knowledge is useful:
- preprocessing/choice of inputs,
- generation/choice of training data,
- choice of outputs and cost function.
Input and output features
DNNs can learn suitable features… only when given enough data!
Tradeoff between:
- designed features: more invariant but information loss,
- raw data: no information loss but more data required.
Feature design is required for speech enhancement in practice.
Examples:
- speech enhancement increases input invariance and improves DNN-based automatic speech recognition,
- masks increase output invariance and often improve speech enhancement compared to source spectra as targets.
Spectral input features
Inputs                                             HIT-FA
Multi-resolution cochleagram (MRCG)                70%
Single-resolution cochleagram                      68%
Gammatone frequency cepstral coefficients (GFCC)   67%
Mel frequency cepstral coefficients (MFCC)         64%
Perceptual linear prediction (PLP)                 63%
Gabor filterbank (GFB)                             63%
Pitch                                              48%

Speech mixed with NOISEX at -5 dB SNR, matched speaker and noise conditions. HIT: correctly classified target-dominant TF bins. FA: wrongly classified noise-dominant TF bins. HIT-FA correlates with speech intelligibility.

J. Chen, Y. Wang, D.L. Wang, "A feature study for classification-based speech separation at low signal-to-noise ratios", IEEE/ACM TASL, 2014.
Spatial input features
Inputs                                  HIT-FA
Cross-correlation + level diff. (ILD)   87%
Time diff. (ITD) + level diff. (ILD)    82%

Speech mixed with one babble noise source at -5 dB SNR, RT60 = 300 ms, fixed target RIR and multicondition noise RIR. HIT: correctly classified target-dominant TF bins. FA: wrongly classified noise-dominant TF bins. HIT-FA correlates with speech intelligibility.

Y. Jiang, D.L. Wang, R.S. Liu, Z.M. Feng, "Binaural classification for reverberant speech segregation using deep neural networks", IEEE/ACM TASL, 2014.
Results of other algorithms as inputs
Idea: concatenate/replace input features with:
- estimated speech or noise spectra obtained from other enhancement techniques,
- or any feature embedding them.

Inputs                             SDR
Magnitude spectrum                 9.01 dB
Results of several NMFs            8.15 dB
Results of several DNNs            9.31 dB
Results of several NMFs and DNNs   9.50 dB

CHiME-2: speech mixed with real domestic noise. DNN = MLP-SA (see below).

X. Jaureguiberry, E. Vincent, G. Richard, "Fusion methods for speech enhancement and audio source separation", Research Report, Telecom ParisTech, 2015.
Training data augmentation
Idea: increase training data amount and coverage by:
- simulating a range of RT60 and SNR,
- perturbing the signals (below).

Noise data         HIT-FA
Original           62%
Time warped        70%
Frequency warped   73%
Combined           72%

Speech mixed with DEMAND noise at -5 dB SNR. Fixed training set size.

J. Chen, Y. Wang, D.L. Wang, "Noise perturbation improves supervised speech separation", in Proc. LVA/ICA, 2015.
Cost function
Idea: make the cost as close as possible to the target evaluation metric.

Cost function                                                SDR
Error on time-frequency mask $(\hat{M} - M)^2$ (MA)          11.43 dB
Error on magnitude spectrogram $(|\hat{Y}| - |Y|)^2$ (SA)    12.16 dB

Further improvement using the squared error on the complex spectrogram.

CHiME-2: speech mixed with real domestic noise. DNN = MLP post-filter.

F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation", in Proc. GlobalSIP, 2014.
TOWARDS NEW DNN ARCHITECTURES
Hybrid DNN / signal processing architectures
In addition to using existing DNN architectures, can we:
- combine them with signal processing algorithms?
- derive better architectures?

Two completely different examples below:
- multichannel DNN,
- deep NMF.
Multichannel NMF model
$x_{tf} = \sum_j y_{jtf}$  (mixture $x_{tf}$, spatial image $y_{jtf}$ of the $j$-th source)
$y_{jtf} \sim \mathcal{N}(0, v_{jtf} R_{jf})$  (power spectrum $v_{jtf}$, spatial covariance matrix $R_{jf}$)
$v_{jtf} = \sum_k h_{kt} w_{kf}$  (time activations $h_{kt}$, basis spectra $w_{kf}$)
Example
[Figure: NMF toy example on a piano + violin mixture. Panels: piano source $C_{1nf}$, violin source $C_{2nf}$, mixture $X_{nf}$, basis spectra $w_{jkf}$ (3 per source), estimated scale factors $h_{jkn}$, estimated piano variance $\Sigma_{1nf}$, estimated violin variance $\Sigma_{2nf}$, estimated mixture variance $\Sigma_{1nf} + \Sigma_{2nf}$, estimated piano source $\hat{C}_{1nf}$, estimated violin source $\hat{C}_{2nf}$; axes: frequency (kHz) vs. time (s).]
Multichannel NMF algorithm
E-step: estimate the second-order statistics of the sources
$\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$  (multichannel Wiener filter)
$\hat{R}_{y,jtf} = \Omega_{jtf} \hat{R}_{x,tf} \Omega_{jtf}^H + (I - \Omega_{jtf}) v_{jtf} R_{jf}$

M-step: update the parameters
$R_{jf} \leftarrow \frac{1}{T} \sum_t \frac{\hat{R}_{y,jtf}}{v_{jtf}}$
$\xi_{jtf} \leftarrow \mathrm{tr}(R_{jf}^{-1} \hat{R}_{y,jtf}) / I$  (power spectrum estimate)
$h_{kt} \leftarrow h_{kt} \frac{\sum_f w_{kf} v_{jtf}^{-2} \xi_{jtf}}{\sum_f w_{kf} v_{jtf}^{-1}}$  (improve the estimate by NMF)
Multichannel DNN algorithm
E-step: estimate the second-order statistics of the sources
$\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$  (multichannel Wiener filter)
$\hat{R}_{y,jtf} = \Omega_{jtf} \hat{R}_{x,tf} \Omega_{jtf}^H + (I - \Omega_{jtf}) v_{jtf} R_{jf}$

M-step: update the parameters
$R_{jf} \leftarrow \frac{1}{T} \sum_t \frac{\hat{R}_{y,jtf}}{v_{jtf}}$
$\xi_{jtf} \leftarrow \mathrm{tr}(R_{jf}^{-1} \hat{R}_{y,jtf}) / I$  (power spectrum estimate)
$v_{jtf} \leftarrow \mathrm{DNN}(\xi_{jtf}^{1/2})^2$  (improve the estimate by DNN)
Comparison with post-filtering
[Diagram: the 6-channel input is realigned (based on TDOA) and a DNN estimates the speech and noise PSDs, which drive either multichannel speech enhancement (6-ch output, then averaging over channels to 1-ch) or, after averaging over channels, single-channel speech enhancement (1-ch output).]
Results
Enhancement          WER
Noisy                33.23%
Single-channel DNN   36.92%
Delay-and-sum        26.30%
DNN post-filter      26.54%
Multichannel DNN     20.17%

CHiME-3: speech recorded in a bus. Single DNN iteration, no post-processing. WER for a multicondition-trained GMM-HMM acoustic model on real test data.

S. Sivasankaran, A.A. Nugraha, E. Vincent, J.A. Morales Cordovilla, S. Dalmia, I. Illina, A. Liutkus, "Robust ASR using neural network based speech enhancement and feature simulation", in Proc. ASRU, 2015.
A.A. Nugraha, A. Liutkus, E. Vincent, "Multichannel audio source separation with deep neural networks", Research Report RR-8740, Inria, 2015.
Deep NMF
Create a totally new DNN architecture:
- consider the NMF multiplicative update for $h_{kt}$ as a "simple" nonlinear function with parameters $w_{kf}$,
- unfold the iterative updates into a stack of several such nonlinear functions,
- untie the parameters $w_{kf}$ across layers,
- train the last layers discriminatively.
For $\beta = 1$, $D_\beta$ is the generalized KL divergence, and $\beta = 2$ yields the squared error. An L1 sparsity constraint with weight $\mu$ is added to favor solutions where only few basis vectors are active at a time.
The following multiplicative updates for iteration $k \in \{1, \dots, K\}$ minimize (2) subject to non-negativity constraints [9]:
$$H^k = H^{k-1} \odot \frac{W^T \bigl(M \odot (W H^{k-1})^{\beta-2}\bigr)}{W^T (W H^{k-1})^{\beta-1} + \mu}, \qquad (3)$$
where $\odot$ denotes element-wise multiplication, the matrix quotient is element-wise, and $H^0$ is initialized randomly.
After $K$ iterations, to reconstruct each source, a Wiener-filtering-like approach is typically used, which enforces the constraint that all the source estimates $\tilde{S}_{l,K}$ sum up to the mixture:
$$\tilde{S}_{l,K} = \frac{W_l H_{l,K}}{\sum_{l'} W_{l'} H_{l',K}} \odot M. \qquad (4)$$
A commonly used approach has been to train NMF bases independently on each source, before combining them. However, the combination was generally not trained for good separation performance from a mixture. Recently, discriminative methods have been applied to sparse dictionary based methods to achieve better performance in particular tasks [10]. In a similar way, we can discriminatively train NMF bases for source separation. The following optimization problem for training bases, termed discriminative NMF (DNMF), was proposed in [4, 5]:
$$\bar{W} = \arg\min_W \sum_l \gamma_l D_{\beta_2}\bigl(S_l \mid W_l \hat{H}_l(M, W)\bigr), \qquad (5)$$
$$\hat{H}(M, W) = \arg\min_H D_{\beta_1}(M \mid W H) + \mu |H|_1, \qquad (6)$$
where $\beta_1$ controls the divergence used in the bottom-level analysis objective, and $\beta_2$ controls the divergence used in the top-level reconstruction objective. The weights $\gamma_l$ account for the application-dependent importance of source $l$; for example, in speech denoising, we focus on reconstructing the speech signal. The first part (5) minimizes the reconstruction error given $\hat{H}$. The second part ensures that $\hat{H}$ are the activations that arise from the test-time inference objective. Given the bases $W$, the activations $\hat{H}(M, W)$ are uniquely determined, due to the convexity of (6). Nonetheless, the above remains a difficult bi-level optimization problem, since the bases $W$ occur in both levels.
In [5], the bi-level problem was approached by directly solving for the derivatives of the lower-level problem after convergence. In [4], the problem was approached by untying the bases used for reconstruction in (5) from the analysis bases used in (6), and discriminatively training only the reconstruction bases, while the analysis bases are classically trained separately on each source type. In addition, (4) was incorporated into the discriminative criterion as
$$\bar{W} = \arg\min_W \sum_l \gamma_l D_{\beta_2}\bigl(S_l \mid \tilde{S}_{l,K}(M, W)\bigr). \qquad (7)$$
Here, we propose to take this further by unfolding the entire model as a deep non-negative neural network, and untying the parameters across layers as $W^k$ for $k = 0, \dots, K$. This leads to the following architecture with $K + 1$ layers, illustrated in Fig. 1:
$$H^k_t = f_{W^{k-1}}(m_t, H^{k-1}_t) = H^{k-1}_t \odot \frac{(W^{k-1})^T \bigl(m_t \odot (W^{k-1} H^{k-1}_t)^{\beta-2}\bigr)}{(W^{k-1})^T (W^{k-1} H^{k-1}_t)^{\beta-1} + \mu}, \qquad (8)$$
$$\tilde{S}_{l,K,t} = g_{W^K}(m_t, H^K_t) = \frac{W_{l,K} H_{l,K,t}}{\sum_{l'} W_{l',K} H_{l',K,t}} \odot m_t. \qquad (9)$$
[Fig. 1: Illustration of the proposed deep NMF neural network.]
We call this new model deep NMF.
In order to train this network while enforcing the non-negativity constraints, we derive recursively-defined multiplicative update equations by back-propagating a split between positive and negative parts of the gradient. Multiplicative updates are often derived using a heuristic approach which uses the ratio of the negative part to the positive part as a multiplication factor to update the value of that variable of interest. Here we do the same for each $W^k$ matrix in the unfolded network:
$$W^k \leftarrow W^k \odot \frac{[\nabla_{W^k} E]_-}{[\nabla_{W^k} E]_+}. \qquad (10)$$
To propagate the positive and negative parts, we use:
$$\Bigl[\frac{\partial E}{\partial h^k_{r,t}}\Bigr]_+ = \sum_{r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_+ + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_- \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial h^k_{r,t}}\Bigr]_- = \sum_{r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_- + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_+ \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial w^k_{f,r}}\Bigr]_+ = \sum_{t,r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_+ + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_- \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial w^k_{f,r}}\Bigr]_- = \sum_{t,r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_- + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_+ \Bigr)$$
where $h^k_{r,t}$ are the activation coefficients at time $t$ for the $r$-th basis set in the $k$-th layer, and $w^k_{f,r}$ are the values of the $r$-th basis vector in the $f$-th feature dimension in the $k$-th layer.
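To make the unfolding concrete, a minimal NumPy sketch of the forward pass, eqs. (8) and (9), with per-layer bases and illustrative names; the discriminative training via the split gradients of eq. (10) is not shown.

```python
import numpy as np

def deep_nmf_layer(H, W, M, beta=1.0, mu=0.1, eps=1e-12):
    """One unfolded layer, eq. (8): a multiplicative NMF update seen as a
    non-negative nonlinearity with layer-specific parameters W.
    H: (R, T) activations, W: (F, R) bases, M: (F, T) mixture magnitude."""
    WH = W @ H + eps
    num = W.T @ (M * WH ** (beta - 2))
    den = W.T @ WH ** (beta - 1) + mu
    return H * num / den

def deep_nmf_forward(M, Ws, source_slices, seed=0):
    """Run K unfolded layers, then the Wiener-like reconstruction, eq. (9).
    Ws: per-layer bases [W^0, ..., W^K]; source_slices: slices selecting
    each source's basis columns."""
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.1, 1.0, (Ws[0].shape[1], M.shape[1]))  # random H^0
    for W in Ws[:-1]:
        H = deep_nmf_layer(H, W, M)
    WK = Ws[-1]
    V = WK @ H + 1e-12                       # full mixture reconstruction
    return [(WK[:, s] @ H[s]) / V * M for s in source_slices]
```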
Deep NMF
Discriminative layers   SDR
0                       9.5 dB
1                       10.3 dB
2                       10.5 dB
3                       10.8 dB
4                       10.9 dB

9 input frames, 25 layers, 1000 basis vectors per layer.

J. Le Roux, J.R. Hershey, F. Weninger, "Deep NMF for speech separation", in Proc. ICASSP, 2015.
CONCLUSION AND PERSPECTIVES
Take-home message
Big data won’t stop. It will grow even bigger!
Using this data brings a competitive advantage compared to simpler models currently in use.
This applies to almost any audio task, including but not limited to:
- music separation,
- music classification,
- source localization,
- room parameter estimation,
- speech synthesis,
- voice conversion…
Using more domain knowledge
- Enable DNNs to exploit prior information.
- Simulate the huge amounts of realistic data required for DNN training, especially multichannel.
- Achieve good perceptual quality by post-processing.
Inventing better networks
- Adapt the model to new conditions (not just more features).
- Propagate probability distributions through a DNN.
- Extend deep unfolding to other signal models than NMF.

[Diagram: Venn diagram placing existing NNs (e.g. MLP) within combinations of NNs, within all possible NNs, overlapping with graphical models (e.g. MRF, NMF); deep NMF lies in the overlap.]

M. Kim, P. Smaragdis, "Adaptive denoising autoencoders: A fine-tuning scheme to learn from test mixtures", in Proc. LVA/ICA, 2015.
A.H. Abdelaziz, S. Watanabe, J.R. Hershey, E. Vincent, D. Kolossa, "Uncertainty propagation through deep neural networks", in Proc. Interspeech, 2015.
J.R. Hershey, J. Le Roux, F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures", Tech. Rep. TR2014-117, MERL, 2014.