Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Email: {yong.xu, q.kong, q.huang, w.wang, m.plumbley}@surrey.ac.uk

Abstract—Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.


Audio tagging (AT) aims at putting one or several tags on a sound clip. The tags are the sound events that occur in the audio clip, for example, speech, television, percussion, bird singing, and so on. Audio tagging has many applications in areas such as information retrieval [1], sound classification [2] and recommendation systems [3].

Many frequency domain audio features such as mel-frequency cepstral coefficients (MFCCs) [4], mel filter bank features (MFBs) [5] and spectrograms [6] have been used for speech recognition related tasks [7] for many years. However, it is unclear how these features perform on non-speech audio processing tasks. Recently, MFCCs and MFBs were compared on the audio tagging task [8], and MFBs gave better performance in the framework of deep neural networks. The spectrogram has been suggested to be better than MFBs in the sound event detection task [9], but has not yet been investigated in the audio tagging task.

Besides the frequency domain audio features, processing sound from raw time domain waveforms has attracted a lot of attention recently [10], [11], [12]. However, most of these works are for speech recognition related tasks; there are few works investigating raw waveforms for environmental audio analysis. In common signal processing pipelines, the short-time Fourier transform (STFT) is typically adopted to transform raw waveforms into frequency domain features using a set of Fourier bases. Recent research [10] suggests that the Fourier basis sets may not be optimal and that better basis sets can be learned from raw waveforms directly using a large set of audio data. To learn the bases automatically, a convolutional neural network (CNN) is applied on the raw waveforms, which is similar to CNN processing on the pixels of an image [13]. Processing raw waveforms has shown many promising results in speech recognition [10] and in generating speech and music [14], with less research in non-speech sound processing.

Most audio tagging systems [8], [15], [16], [17] use mono-channel recordings, or simply average the multiple channels as the input signal. However, this kind of merging strategy disregards the spatial information of the stereo audio. This is likely to decrease recognition accuracy because the intensity and phase of the sound received from different channels are different. For example, kitchen sounds and television sounds coming from different directions will have different intensities on different channels, depending on the direction of the sources. Multi-channel signals contain spatial information which could be used to help distinguish different sound sources. Spatial features have been demonstrated to improve results in scene classification [18] and sound event detection [19]. However, there is little work using multi-channel information for the audio tagging task.
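To make the intensity argument concrete, the following toy NumPy example (a hedged illustration only; the paper's actual spatial features are defined later, and the signals here are synthetic placeholders) shows how a per-channel representation keeps a level difference between the two channels of a stereo frame that a mono average would blur:

```python
import numpy as np

def channel_energies(stereo_frame):
    """stereo_frame: (2, T) left/right samples for one frame.
    Returns the per-channel frame energies."""
    return np.sum(stereo_frame ** 2, axis=1)

# A synthetic source that is closer to the left microphone: the right
# channel receives the same waveform attenuated by half.
t = np.linspace(0, 1, 1000, endpoint=False)
left = np.sin(2 * np.pi * 5 * t)
right = 0.5 * np.sin(2 * np.pi * 5 * t)

eL, eR = channel_energies(np.stack([left, right]))
ild = 10 * np.log10(eL / eR)   # interaural level difference, in dB
```

Here the two channels differ by about 6 dB, a cue about source direction that is available to a multi-channel model but invisible after channel averaging.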

Our main contributions in this paper include three parts. First, we show experimental results on different features, including MFBs and spectrograms as well as raw waveforms, on the audio tagging task of the DCASE 2016 challenge. Second, we propose a convolutional gated recurrent neural network (CGRNN), a combination of a CNN and gated recurrent units (GRUs), to process non-speech sounds. Third, spatial features are incorporated in the hidden layer to utilize the location information. The work is organized as follows: in Section II, the proposed CGRNN is presented for audio tagging. In Section III, the spatial features are illustrated and incorporated into the proposed method. The experimental setup and results are shown in Sections IV and V. Section VI summarizes the work and discusses future work.


Neural networks have several types of structures: the most common one is the deep feed-forward neural network. Another popular structure is the convolutional neural network (CNN),








Fig. 1. The structure of the one-dimensional CNN, which consists of one convolutional layer and one max-pooling layer. N filters with a fixed size F are convolved with the one-dimensional signal to get outputs p_i^t (i = 0, ..., N-1). x_i^t denotes the i-th dimension of the input feature of the current frame.

which is widely used in image classification [20], [13]. CNNs can extract robust features from pixel-level values for images [13] or raw waveforms for speech signals [10]. The recurrent neural network is a third structure, often used for sequence modeling, such as language models [21] and speech recognition [22]. In this section, we will introduce the convolutional neural network and the recurrent neural network with gated recurrent units.

A. One-dimensional convolutional neural network

Audio or speech signals are one dimensional. Fig. 1 shows the structure of a one-dimensional CNN which consists of one convolutional layer and one max-pooling layer. N filters with a fixed size F are convolved with the one-dimensional signal to get outputs p_i^t (i = 0, ..., N-1). Given that the dimension of the input features is M, the activation h of the convolutional layer has (M - F + 1) values. The max-pooling size is also (M - F + 1), which means each filter gives one output value. This is similar to speech recognition work [10] where a CNN has been used to extract features from the raw waveform signal. The way each filter produces one value can also be explained as a global pooling layer, which is a structural regularizer that explicitly enforces feature maps to be confidence maps of meaningful feature channels [23]. So N activations are obtained as the robust features from the basic features. As for the input feature size M, a short time window, e.g., 32 ms, is fed into the CNN each time. The long-term pattern will be learned by the following recurrent neural network. As for the filter size or kernel size, a large receptive field is set, considering that only one convolutional layer is designed in this work. For example, F = 400 and M = 512 are set in [10]. If the input features are raw waveforms, each filter of the CNN is actually learned as a finite impulse response (FIR) filter [12]. If spectrograms or mel-filter banks are fed into the CNN, the filtering operates on the frequency domain [24] to reduce frequency variations, such as the same audio pattern but with different pitches.
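The convolution-plus-global-max-pooling step above can be sketched in a few lines of NumPy (a hedged sketch: the filter values here are random placeholders standing in for learned weights, and `conv1d_global_maxpool` is a name chosen for illustration):

```python
import numpy as np

def conv1d_global_maxpool(x, filters):
    """x: (M,) input frame; filters: (N, F) filter bank.
    Valid 1-D convolution followed by global max-pooling,
    returning (N,) activations, one value per filter."""
    M = x.shape[0]
    N, F = filters.shape
    L = M - F + 1                          # valid-convolution output length
    conv = np.empty((N, L))
    for i in range(L):
        conv[:, i] = filters @ x[i:i + F]  # correlate each filter with one window
    return conv.max(axis=1)                # global max-pool: one value per filter

rng = np.random.default_rng(0)
x = rng.standard_normal(512)               # M = 512, as in [10]
filters = rng.standard_normal((8, 400))    # N = 8 filters of size F = 400
h = conv1d_global_maxpool(x, filters)      # h.shape == (8,)
```

Note that the pooling window spans all M - F + 1 convolution outputs, which is exactly why each filter contributes a single robust activation to the following recurrent layers.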

    B. Gated recurrent unit based RNN

Recurrent neural networks have recently shown promising results in speech recognition [22]. Fig. 2 shows the basic idea of the RNN. The current activation h_t is determined by the


Fig. 2. The structure of the simple recurrent neural network and its unfolded version. The current activation h_t is determined by the current input x_t and the previous activation h_{t-1}.

current input x_t and the previous activation h_{t-1}. An RNN, with its capability to learn long-term patterns, is superior to a feed-forward DNN, because a feed-forward DNN assumes that the input contextual features at each time step are independent. The hidden activations of the RNN are formulated as:

h_t = σ(W_h x_t + R_h h_{t-1} + b_h)  (1)
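Eq. (1) amounts to a single matrix-vector step per frame; a minimal NumPy sketch, assuming the nonlinearity σ is tanh and using random placeholder weights (not trained parameters):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, R_h, b_h):
    """One simple-RNN step following Eq. (1), with sigma = tanh."""
    return np.tanh(W_h @ x_t + R_h @ h_prev + b_h)

rng = np.random.default_rng(0)
W_h = rng.standard_normal((4, 3))          # input-to-hidden weights
R_h = rng.standard_normal((4, 4))          # recurrent (hidden-to-hidden) weights
b_h = np.zeros(4)

h = np.zeros(4)
for x_t in rng.standard_normal((10, 3)):   # unroll the recurrence over 10 frames
    h = rnn_step(x_t, h, W_h, R_h, b_h)
```

The repeated multiplication by R_h inside this loop is what makes gradients shrink or blow up over long sequences, motivating the gated units discussed next.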

However, a simple recurrent neural network with the recurrent connection only on the hidden layer is difficult to train due to the well-known vanishing gradient or exploding gradient problems [25]. The long short-term memory (LSTM) structure [26] was proposed to overcome this problem by introducing an input gate, forget gate, output gate and cell state to control the information flow through time. The fundamental idea of the LSTM is the memory cell, which maintains its state through time [27].

As an alternative structure to the LSTM, the gated recurrent unit (GRU) was proposed in [28]. The GRU was demonstrated to be better than the LSTM in some tasks [29], and is formulated as follows [27]:

r_t = σ(W_r x_t + R_r h_{t-1} + b_r)  (2)

z_t = σ(W_z x_t + R_z h_{t-1} + b_z)  (3)

h̃_t = tanh(W_h x_t + r_t ⊙ (R_h h_{t-1}) + b_h)  (4)

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t  (5)

    where ht,
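Equations (2)-(5) map directly onto code. Below is a minimal NumPy sketch of one GRU step, assuming σ is the logistic sigmoid, the candidate nonlinearity is tanh, and ⊙ is element-wise multiplication; the parameter dictionary layout and the random weights are illustrative placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step following Eqs. (2)-(5); p holds the W, R, b
    parameters of the reset gate r, update gate z and candidate state."""
    r = sigmoid(p["Wr"] @ x_t + p["Rr"] @ h_prev + p["br"])             # Eq. (2)
    z = sigmoid(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])             # Eq. (3)
    h_cand = np.tanh(p["Wh"] @ x_t + r * (p["Rh"] @ h_prev) + p["bh"])  # Eq. (4)
    return z * h_prev + (1.0 - z) * h_cand                              # Eq. (5)

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
p = {f"W{g}": rng.standard_normal((n_h, n_in)) for g in "rzh"}
p.update({f"R{g}": rng.standard_normal((n_h, n_h)) for g in "rzh"})
p.update({f"b{g}": np.zeros(n_h) for g in "rzh"})
h = gru_step(rng.standard_normal(n_in), np.zeros(n_h), p)
```

The reset gate r scales only the recurrent term inside the candidate, while the update gate z interpolates between the previous state and the candidate, which is how the GRU preserves information across many steps without a separate memory cell.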
