TRANSCRIPT
deeploria – 21/04/2016
IS AUDIO SIGNAL PROCESSING STILL USEFUL IN THE ERA OF MACHINE LEARNING?
Emmanuel Vincent, Inria, France
A silent animation
Disclaimer: All characters appearing in this animation are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
Credits: Icons made by Freepik, Anton Saputro and Robin Kylander from www.flaticon.com and licensed under Creative Commons BY 3.0. Music: Modern Times, by Charlie Chaplin & David Raksin.
A long time ago, God created audio signal processing
"Hello my dear friends!"
Researchers made some progress every year
"All is for the best in the best of all possible worlds."
New technologies would sometimes raise skepticism…
NMF: "Hello, I'm coming from the image community. Can I help you model spectrograms?"
"You can't be a good model. Speech follows my own model."
…but would finally be adopted
"All is for the best in the best of all possible worlds."
One day a disruptive technology appeared
DNN: "Hey, pals! I'm the new state of the art for automatic speech recognition."
NMF: "Good for you! But we're not concerned…"
It quickly spread to other fields. . .
DNN: "I also speak 20 languages, paint like Van Gogh, compose music like Bach, cook like Bocuse…"
NMF: "Well, we're impressed! But we're still not concerned…"
. . . until it reached the core of signal processing
DNN: "I just tried to perform speech enhancement and, guess what, I can also do it!"
Some researchers felt depressed. . .
DNN: "Every time I fire a signal processing researcher, my output SNR goes up!"
NMF: "This guy is going to kill us and take our jobs!"
. . . some felt angry. . .
DNN: "So what? My SNR outperforms yours and I can run on embedded platforms too!"
NMF: "You're a brute-force black box and you train slowly!"
. . . and some saw a great opportunity!
DNN: "I need more data and domain knowledge to progress. Would you help me?"
NMF: "Let's work together! We need your modeling power to progress too."
Outline
Focus of today's talk: use of deep neural networks (DNNs) for speech enhancement.
Goal:
- convince (if needed) that DNNs have greater modeling power,
- report evidence that domain knowledge is still useful,
- introduce hybrid DNN / signal processing architectures.
SOURCE SEPARATION BASICS
Source separation: what is it?
Goal: extract the signals corresponding to several sound sources which are simultaneously active in a recording.
Applications:
- speech enhancement in phones, hearing aids
- robust speech and speaker recognition
- remixing of audio contents
- audio monitoring/surveillance
General model
Linear mixing equation (t time, f frequency):
$x_{tf} = \sum_j y_{jtf}$
where $x_{tf}$ is the mixture and $y_{jtf}$ the spatial image of the $j$-th source.
Gaussian source model:
$y_{jtf} \sim \mathcal{N}(0, v_{jtf} R_{jf})$
where $v_{jtf}$ is the power spectrum and $R_{jf}$ the spatial covariance matrix.
General two-step algorithm
Estimate the model parameters (maximum a posteriori):
$\max_\theta \sum_{t,f} \log p(\theta \mid x_{tf})$
where $\theta = \{R_{jf}, v_{jtf}\}$.
Estimate the sources (minimum mean square error):
$\hat{y}_{jtf} = \Omega_{jtf} x_{tf}$ where $\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$
$\Omega_{jtf}$ is called the Wiener filter.
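To make the two steps concrete, here is a minimal NumPy sketch of the filtering step at a single time-frequency bin, assuming the parameters $v_{jtf}$ and $R_{jf}$ have already been estimated (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def multichannel_wiener(x_tf, v, R):
    """Filter one time-frequency bin of an I-channel mixture.

    x_tf : (I,) complex STFT vector of the mixture
    v    : (J,) estimated source power spectra v_{jtf}
    R    : (J, I, I) estimated spatial covariance matrices R_{jf}
    Returns a (J, I) array of source spatial image estimates y_{jtf}.
    """
    # Mixture covariance: sum_j' v_{j'tf} R_{j'f}
    Sigma_x = np.einsum('j,jab->ab', v, R)
    Sigma_x_inv = np.linalg.inv(Sigma_x)
    y = np.empty((len(v), len(x_tf)), dtype=complex)
    for j in range(len(v)):
        Omega_j = v[j] * R[j] @ Sigma_x_inv  # Wiener filter Omega_{jtf}
        y[j] = Omega_j @ x_tf
    return y
```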
Single-channel Wiener filter
On single-mic data, the filter operates as a time-frequency mask.
[Figure: spectrograms (frequency in Hz vs. time in s) of the speech source, the speech + noise mixture, the Wiener filter (gain in dB), and the filtered signal.]
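As an illustration of masking in practice, a short SciPy-based sketch of single-channel enhancement: compute the STFT, apply a precomputed mask, resynthesize. The sampling rate and window size are placeholder values, not from the talk.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(x, mask, sr=16000, nperseg=1024):
    """Apply a time-frequency mask (e.g. a Wiener mask) to a single-channel
    mixture x and resynthesize the enhanced signal.

    mask must match the STFT shape (frequencies x frames), values in [0, 1].
    """
    _, _, X = stft(x, fs=sr, nperseg=nperseg)
    _, y = istft(mask * X, fs=sr, nperseg=nperseg)
    return y
```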
Multichannel Wiener filter
On multi-mic data, the Wiener filter also achieves spatial filtering.
[Figure: beampatterns of the Wiener filter (frequency in Hz vs. angle in degrees, gain in dB) in anechoic and reverberant conditions.]
WHY SHOULD I USE DNNs?
Using DNNs for source separation
Inputs: magnitude spectra of the mixture in the current time frame + left and right context frames.
Outputs:
- magnitude spectra of speech and noise in the current frame,
- or a time-frequency mask.
Training data: simulated mixtures of speech and noise.
Test data: real recordings.
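A minimal PyTorch sketch of such a network; PyTorch and all layer sizes are illustrative choices, not what was used in the talk. The input stacks the magnitude spectra of the current frame and its context frames; the output is a mask squashed to [0, 1] by a sigmoid.

```python
import torch
import torch.nn as nn

F_BINS, CONTEXT = 513, 3                    # assumed STFT size and context
IN_DIM = F_BINS * (2 * CONTEXT + 1)         # current frame + left/right context

mask_net = nn.Sequential(
    nn.Linear(IN_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, F_BINS), nn.Sigmoid(),  # time-frequency mask in [0, 1]
)

def estimate_mask(stacked_frames: torch.Tensor) -> torch.Tensor:
    """stacked_frames: (batch, IN_DIM) magnitude spectra of the current
    frame plus its context frames; returns the mask for the current frame."""
    return mask_net(stacked_frames)
```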
Theoretical benefits compared to previous models
- Explanatory power: DNNs can account for complex nonlinear dependencies between frames/frequencies.
- Scalability w.r.t. data: DNN performance increases more quickly with the data size due to efficient parameter "sharing".
- Invariance/robustness to outliers: thanks to the "shrinkage" effect of nonlinear activations, DNNs become more invariant and robust with more layers.
- Discriminative training: discriminative training is hard for generative models (it requires interleaved inference and training) but easy for DNNs.
Impact on time-frequency mask estimation
[Figure: input spectrogram and the time-frequency masks estimated by IMCRA, NMF and DNN.]
CHiME-3: speech recorded in a cafe. Single-channel enhancement by Wiener mask. NMF training: noise context. DNN training: bus + cafe + pedestrian area + street.
Impact on spectral enhancement
Enhancement        SDR        WER
Delay-and-sum      1.74 dB    21.12%
Multichannel NMF   5.82 dB    19.46%
DNN post-filter    14.17 dB   14.82%

CHiME-2: speech mixed with real domestic noise. DNN = LSTM-SA (see below). WER reported for a multicondition-trained DNN-HMM acoustic model on test data.

F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR", in Proc. LVA/ICA, 2015.
Impact on spatial enhancement
Enhancement   SDR
DUET          0.14 dB
MESSL         2.73 dB
DNN           6.32 dB

Note: time-frequency masking only (no multichannel filter).

Speech mixed with one babble noise source, RT60 = 300 ms, fixed target RIR and multicondition noise RIR. DNN input = GFCC+CCF+ILD (see below). SDR measured w.r.t. the IBM target.

Y. Jiang, D.L. Wang, R.S. Liu, Z.M. Feng, "Binaural classification for reverberant speech segregation using deep neural networks", IEEE/ACM TASL, 2014.
HOW CAN I USE MY DOMAIN KNOWLEDGE?
Design choices
DNN-based enhancement relies on a number of design choices, for which domain knowledge is useful:
- preprocessing/choice of inputs,
- generation/choice of training data,
- choice of outputs and cost function.
Input and output features
DNNs can learn suitable features… only when given enough data!
Tradeoff between:
- designed features: more invariant but information loss,
- raw data: no information loss but more data required.
Feature design is required for speech enhancement in practice.
Examples:
- speech enhancement increases input invariance and improves DNN-based automatic speech recognition,
- masks increase output invariance and often improve speech enhancement compared to source spectra as targets.
Spectral input features
Inputs                                             HIT-FA
Multi-resolution cochleagram (MRCG)                70%
Single-resolution cochleagram                      68%
Gammatone frequency cepstral coefficients (GFCC)   67%
Mel frequency cepstral coefficients (MFCC)         64%
Perceptual linear prediction (PLP)                 63%
Gabor filterbank (GFB)                             63%
Pitch                                              48%

Speech mixed with NOISEX at -5 dB SNR, matched speaker and noise conditions. HIT: correctly classified target-dominant TF bins. FA: wrongly classified noise-dominant TF bins. HIT-FA correlates with speech intelligibility.

J. Chen, Y. Wang, D.L. Wang, "A feature study for classification-based speech separation at low signal-to-noise ratios", IEEE/ACM TASL, 2014.
Spatial input features
Inputs                                  HIT-FA
Cross-correlation + level diff. (ILD)   87%
Time diff. (ITD) + level diff. (ILD)    82%

Speech mixed with one babble noise source at -5 dB SNR, RT60 = 300 ms, fixed target RIR and multicondition noise RIR. HIT: correctly classified target-dominant TF bins. FA: wrongly classified noise-dominant TF bins. HIT-FA correlates with speech intelligibility.

Y. Jiang, D.L. Wang, R.S. Liu, Z.M. Feng, "Binaural classification for reverberant speech segregation using deep neural networks", IEEE/ACM TASL, 2014.
Results of other algorithms as inputs
Idea: concatenate/replace input features with:
- estimated speech or noise spectra obtained from other enhancement techniques,
- or any feature embedding them.

Inputs                             SDR
Magnitude spectrum                 9.01 dB
Results of several NMFs            8.15 dB
Results of several DNNs            9.31 dB
Results of several NMFs and DNNs   9.50 dB

CHiME-2: speech mixed with real domestic noise. DNN = MLP-SA (see below).

X. Jaureguiberry, E. Vincent, G. Richard, "Fusion methods for speech enhancement and audio source separation", Research Report, Telecom ParisTech, 2015.
Training data augmentation
Idea: increase training data amount and coverage by:
- simulating a range of RT60 and SNR,
- perturbing the signals (below).

Noise data         HIT-FA
Original           62%
Time warped        70%
Frequency warped   73%
Combined           72%

Speech mixed with DEMAND noise at -5 dB SNR. Fixed training set size.

J. Chen, Y. Wang, D.L. Wang, "Noise perturbation improves supervised speech separation", in Proc. LVA/ICA, 2015.
Cost function
Idea: make the cost as close as possible to the target evaluation metric.

Cost function                                                SDR
Error on time-frequency mask $(\hat{M} - M)^2$ (MA)          11.43 dB
Error on magnitude spectrogram $(|\hat{Y}| - |Y|)^2$ (SA)    12.16 dB

Further improvement using the squared error on the complex spectrogram.

CHiME-2: speech mixed with real domestic noise. DNN = MLP post-filter.

F. Weninger, J.R. Hershey, J. Le Roux, B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation", in Proc. GlobalSIP, 2014.
TOWARDS NEW DNN ARCHITECTURES
Hybrid DNN / signal processing architectures
In addition to using existing DNN architectures, can we:
- combine them with signal processing algorithms?
- derive better architectures?

Two completely different examples below:
- multichannel DNN,
- deep NMF.
Multichannel NMF model
$x_{tf} = \sum_j y_{jtf}$  (mixture $x_{tf}$, spatial image $y_{jtf}$ of the $j$-th source)
$y_{jtf} \sim \mathcal{N}(0, v_{jtf} R_{jf})$  (power spectrum $v_{jtf}$, spatial covariance matrix $R_{jf}$)
$v_{jtf} = \sum_k h_{kt} w_{kf}$  (time activations $h_{kt}$, basis spectra $w_{kf}$)
Example
[Figure: NMF toy example on a piano + violin mixture. Panels: piano source $C_{1nf}$, violin source $C_{2nf}$, mixture $X_{nf}$, basis spectra $w_{jkf}$ (3 per source), estimated scale factors $h_{jkn}$, estimated piano variance $\Sigma_{1nf}$, estimated violin variance $\Sigma_{2nf}$, estimated mixture variance $\Sigma_{1nf} + \Sigma_{2nf}$, estimated piano source $\hat{C}_{1nf}$, estimated violin source $\hat{C}_{2nf}$; axes: frequency (kHz) vs. time (s).]
Multichannel NMF algorithm
E-step: estimate the second-order statistics of the sources
$\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$  (multichannel Wiener filter)
$\hat{R}_{y,jtf} = \Omega_{jtf} \hat{R}_{x,tf} \Omega_{jtf}^H + (I - \Omega_{jtf}) v_{jtf} R_{jf}$

M-step: update the parameters
$R_{jf} \leftarrow \frac{1}{T} \sum_t \frac{\hat{R}_{y,jtf}}{v_{jtf}}$
$\xi_{jtf} \leftarrow \mathrm{tr}(R_{jf}^{-1} \hat{R}_{y,jtf}) / I$  (power spectrum estimate)
$h_{kt} \leftarrow h_{kt} \frac{\sum_f w_{kf} v_{jtf}^{-2} \xi_{jtf}}{\sum_f w_{kf} v_{jtf}^{-1}}$  (improve the estimate by NMF)
Multichannel DNN algorithm
E-step: estimate the second-order statistics of the sources
$\Omega_{jtf} = v_{jtf} R_{jf} \bigl(\sum_{j'} v_{j'tf} R_{j'f}\bigr)^{-1}$  (multichannel Wiener filter)
$\hat{R}_{y,jtf} = \Omega_{jtf} \hat{R}_{x,tf} \Omega_{jtf}^H + (I - \Omega_{jtf}) v_{jtf} R_{jf}$

M-step: update the parameters
$R_{jf} \leftarrow \frac{1}{T} \sum_t \frac{\hat{R}_{y,jtf}}{v_{jtf}}$
$\xi_{jtf} \leftarrow \mathrm{tr}(R_{jf}^{-1} \hat{R}_{y,jtf}) / I$  (power spectrum estimate)
$v_{jtf} \leftarrow \mathrm{DNN}(\xi_{jtf}^{1/2})^2$  (improve the estimate by DNN)
Comparison with post-filtering
[Diagram: the 6-channel input is realigned (based on TDOA) and a DNN estimates the speech and noise PSDs, which drive either multichannel speech enhancement (6-ch output, then averaging over channels to 1-ch) or, after averaging over channels, single-channel speech enhancement (1-ch output).]
Results
Enhancement          WER
Noisy                33.23%
Single-channel DNN   36.92%
Delay-and-sum        26.30%
DNN post-filter      26.54%
Multichannel DNN     20.17%

CHiME-3: speech recorded in a bus. Single DNN iteration, no post-processing. WER for a multicondition-trained GMM-HMM acoustic model on real test data.

S. Sivasankaran, A.A. Nugraha, E. Vincent, J.A. Morales Cordovilla, S. Dalmia, I. Illina, A. Liutkus, "Robust ASR using neural network based speech enhancement and feature simulation", in Proc. ASRU, 2015.
A.A. Nugraha, A. Liutkus, E. Vincent, "Multichannel audio source separation with deep neural networks", Research Report RR-8740, Inria, 2015.
Deep NMF
Create a totally new DNN architecture:
- consider the NMF multiplicative update for $h_{kt}$ as a "simple" nonlinear function with parameters $w_{kf}$,
- unfold the iterative updates into a stack of several such nonlinear functions,
- untie the parameters $w_{kf}$ across layers,
- train the last layers discriminatively.
For $\beta = 1$, $D_\beta$ is the generalized KL divergence, and $\beta = 2$ yields the squared error. An L1 sparsity constraint with weight $\mu$ is added to favor solutions where only few basis vectors are active at a time.
The following multiplicative updates for iteration $k \in \{1, \dots, K\}$ minimize (2) subject to non-negativity constraints [9]:
$$H^k = H^{k-1} \odot \frac{W^T \bigl(M \odot (W H^{k-1})^{\beta-2}\bigr)}{W^T (W H^{k-1})^{\beta-1} + \mu}, \qquad (3)$$
where $\odot$ denotes element-wise multiplication, the matrix quotient is element-wise, and $H^0$ is initialized randomly.
After $K$ iterations, to reconstruct each source, a Wiener-filtering-like approach is typically used, which enforces the constraint that all the source estimates $\tilde{S}_{l,K}$ sum up to the mixture:
$$\tilde{S}_{l,K} = \frac{W_l H_{l,K}}{\sum_{l'} W_{l'} H_{l',K}} \odot M. \qquad (4)$$
A commonly used approach has been to train NMF bases independently on each source, before combining them. However, the combination was generally not trained for good separation performance from a mixture. Recently, discriminative methods have been applied to sparse dictionary based methods to achieve better performance in particular tasks [10]. In a similar way, we can discriminatively train NMF bases for source separation. The following optimization problem for training bases, termed discriminative NMF (DNMF), was proposed in [4, 5]:
$$\bar{W} = \arg\min_W \sum_l \gamma_l D_{\beta_2}\bigl(S_l \mid W_l \hat{H}_l(M, W)\bigr), \qquad (5)$$
$$\hat{H}(M, W) = \arg\min_H D_{\beta_1}(M \mid W H) + \mu |H|_1, \qquad (6)$$
where $\beta_1$ controls the divergence used in the bottom-level analysis objective, and $\beta_2$ controls the divergence used in the top-level reconstruction objective. The weights $\gamma_l$ account for the application-dependent importance of source $l$; for example, in speech denoising, we focus on reconstructing the speech signal. The first part (5) minimizes the reconstruction error given $\hat{H}$. The second part ensures that $\hat{H}$ are the activations that arise from the test-time inference objective. Given the bases $W$, the activations $\hat{H}(M, W)$ are uniquely determined, due to the convexity of (6). Nonetheless, the above remains a difficult bi-level optimization problem, since the bases $W$ occur in both levels.
In [5], the bi-level problem was approached by directly solving for the derivatives of the lower-level problem after convergence. In [4], the problem was approached by untying the bases used for reconstruction in (5) from the analysis bases used in (6), and discriminatively training only the reconstruction bases, while the analysis bases are classically trained separately on each source type. In addition, (4) was incorporated into the discriminative criterion as
$$\bar{W} = \arg\min_W \sum_l \gamma_l D_{\beta_2}\bigl(S_l \mid \tilde{S}_{l,K}(M, W)\bigr). \qquad (7)$$
Here, we propose to take this further by unfolding the entire model as a deep non-negative neural network, and untying the parameters across layers as $W^k$ for $k = 0, \dots, K$. This leads to the following architecture with $K + 1$ layers, illustrated in Fig. 1:
$$H^k_t = f_{W^{k-1}}(m_t, H^{k-1}_t) = H^{k-1}_t \odot \frac{(W^{k-1})^T \bigl(m_t \odot (W^{k-1} H^{k-1}_t)^{\beta-2}\bigr)}{(W^{k-1})^T (W^{k-1} H^{k-1}_t)^{\beta-1} + \mu}, \qquad (8)$$
$$\tilde{S}_{l,K,t} = g_{W^K}(m_t, H^K_t) = \frac{W_{l,K} H_{l,K,t}}{\sum_{l'} W_{l',K} H_{l',K,t}} \odot m_t. \qquad (9)$$
[Fig. 1: Illustration of the proposed deep NMF neural network.]
We call this new model deep NMF.
In order to train this network while enforcing the non-negativity constraints, we derive recursively-defined multiplicative update equations by back-propagating a split between positive and negative parts of the gradient. Multiplicative updates are often derived using a heuristic approach which uses the ratio of the negative part to the positive part as a multiplication factor to update the value of that variable of interest. Here we do the same for each $W^k$ matrix in the unfolded network:
$$W^k \leftarrow W^k \odot \frac{[\nabla_{W^k} E]_-}{[\nabla_{W^k} E]_+}. \qquad (10)$$
To propagate the positive and negative parts, we use:
$$\Bigl[\frac{\partial E}{\partial h^k_{r,t}}\Bigr]_+ = \sum_{r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_+ + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_- \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial h^k_{r,t}}\Bigr]_- = \sum_{r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_- + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial h^k_{r,t}}\Bigr]_+ \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial w^k_{f,r}}\Bigr]_+ = \sum_{t,r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_+ + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_- \Bigr)$$
$$\Bigl[\frac{\partial E}{\partial w^k_{f,r}}\Bigr]_- = \sum_{t,r'} \Bigl( \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_+ \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_- + \Bigl[\frac{\partial E}{\partial h^{k+1}_{r',t}}\Bigr]_- \Bigl[\frac{\partial h^{k+1}_{r',t}}{\partial w^k_{f,r}}\Bigr]_+ \Bigr)$$
where $h^k_{r,t}$ are the activation coefficients at time $t$ for the $r$-th basis set in the $k$-th layer, and $w^k_{f,r}$ are the values of the $r$-th basis vector in the $f$-th feature dimension in the $k$-th layer.
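To make the unfolding concrete, a minimal NumPy sketch of the forward pass, eqs. (8) and (9), with per-layer bases and illustrative names; the discriminative training via the split gradients of eq. (10) is not shown.

```python
import numpy as np

def deep_nmf_layer(H, W, M, beta=1.0, mu=0.1, eps=1e-12):
    """One unfolded layer, eq. (8): a multiplicative NMF update seen as a
    non-negative nonlinearity with layer-specific parameters W.
    H: (R, T) activations, W: (F, R) bases, M: (F, T) mixture magnitude."""
    WH = W @ H + eps
    num = W.T @ (M * WH ** (beta - 2))
    den = W.T @ WH ** (beta - 1) + mu
    return H * num / den

def deep_nmf_forward(M, Ws, source_slices, seed=0):
    """Run K unfolded layers, then the Wiener-like reconstruction, eq. (9).
    Ws: per-layer bases [W^0, ..., W^K]; source_slices: slices selecting
    each source's basis columns."""
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.1, 1.0, (Ws[0].shape[1], M.shape[1]))  # random H^0
    for W in Ws[:-1]:
        H = deep_nmf_layer(H, W, M)
    WK = Ws[-1]
    V = WK @ H + 1e-12                       # full mixture reconstruction
    return [(WK[:, s] @ H[s]) / V * M for s in source_slices]
```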
Deep NMF
Discriminative layers   SDR
0                       9.5 dB
1                       10.3 dB
2                       10.5 dB
3                       10.8 dB
4                       10.9 dB

9 input frames, 25 layers, 1000 basis vectors per layer.

J. Le Roux, J.R. Hershey, F. Weninger, "Deep NMF for speech separation", in Proc. ICASSP, 2015.
CONCLUSION AND PERSPECTIVES
Take-home message
Big data won’t stop. It will grow even bigger!
Using this data brings a competitive advantage compared to simpler models currently in use.
This applies to almost any audio task, including but not limited to:
- music separation,
- music classification,
- source localization,
- room parameter estimation,
- speech synthesis,
- voice conversion…
Using more domain knowledge
- Enable DNNs to exploit prior information.
- Simulate the huge amounts of realistic data required for DNN training, especially multichannel.
- Achieve good perceptual quality by post-processing.
Inventing better networks
- Adapt the model to new conditions (not just more features).
- Propagate probability distributions through a DNN.
- Extend deep unfolding to other signal models than NMF.

[Diagram: Venn diagram placing existing NNs (e.g. MLP) within combinations of NNs, within all possible NNs, overlapping with graphical models (e.g. MRF, NMF); deep NMF lies in the overlap.]

M. Kim, P. Smaragdis, "Adaptive denoising autoencoders: A fine-tuning scheme to learn from test mixtures", in Proc. LVA/ICA, 2015.
A.H. Abdelaziz, S. Watanabe, J.R. Hershey, E. Vincent, D. Kolossa, "Uncertainty propagation through deep neural networks", in Proc. Interspeech, 2015.
J.R. Hershey, J. Le Roux, F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures", Tech. Rep. TR2014-117, MERL, 2014.