
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 187–199, Brussels, Belgium, November 1, 2018. © 2018 Association for Computational Linguistics


Introspection for convolutional automatic speech recognition

Andreas Krug and Sebastian Stober
University of Potsdam, Research Focus Cognitive Sciences

Karl-Liebknecht-Str. 24/25, 14476 Potsdam, Germany
{ankrug, sstober}@uni-potsdam.de

Abstract

Artificial Neural Networks (ANNs) have experienced great success in the past few years. The increasing complexity of these models leads to less understanding about their decision processes. Therefore, introspection techniques have been proposed, mostly for images as input data. Patterns or relevant regions in images can be intuitively interpreted by a human observer. This is not the case for more complex data like speech recordings. In this work, we investigate the application of common introspection techniques from computer vision to an Automatic Speech Recognition (ASR) task. To this end, we use a model similar to image classification, which predicts letters from spectrograms. We show difficulties in applying image introspection to ASR. To tackle these problems, we propose normalized averaging of aligned inputs (NAvAI): a data-driven method to reveal learned patterns for prediction of specific classes. Our method integrates information from many data examples through local introspection techniques for Convolutional Neural Networks (CNNs). We demonstrate that our method provides better interpretability of letter-specific patterns than existing methods.

1 Introduction

Artificial Neural Networks (ANNs) perform incredibly well in many fields of application, even outperforming humans. In particular, deep learning (DL) has been used with great success in a variety of tasks. The most successful applications of DL are in computer vision, like image classification (Krizhevsky et al., 2012) or segmentation (Chen et al., 2014). Moreover, DL performs well in audio processing, like automatic speech recognition (Bahdanau et al., 2016) or machine translation (Wu et al., 2016). One reason for the success of these models is the increase in their complexity by implementing deeper or wider network layers (Szegedy et al., 2015). While this allows the model to learn more complex patterns for solving its task, it becomes more difficult to interpret how the model accomplishes it (Yosinski et al., 2015). Several introspection techniques have been proposed to shed light on the decision processes in ANNs (Zeiler and Fergus 2014, Springenberg et al. 2014, Selvaraju et al. 2016). However, most of them come with restrictions on the network architecture or the type of task that is solved. In particular, most methods focus on interpretability of ANNs in computer vision. The reason for this is that evaluating the results from introspection techniques on images is intuitive for a person. This is not the case for more complex data like audio waveforms or multi-channel data like Electroencephalography (EEG) recordings. Applying introspection techniques from computer vision to this kind of data is possible, but evaluating the results is hard, as a human expert cannot easily interpret the input data in the first place.

In this work, we investigate the application of several introspection techniques to the domain of Automatic Speech Recognition (ASR). To make the task of ASR similar to image classification, we use a fully-convolutional ANN for letter-wise prediction from audio spectrograms. We identify problems in applying introspection techniques from computer vision to the ASR domain. To overcome these difficulties, we propose normalized averaging of aligned inputs (NAvAI): a data-driven introspection method for interpreting speech recognition.

2 Related Work

In computer vision, Convolutional Neural Networks (CNNs) are the most common choice of network architecture (Szegedy et al., 2015). As we want to adapt techniques from this domain, we focus on introspection methods developed for CNNs.

Introspection techniques for classification tasks in deep learning can roughly be divided into two categories. Firstly, there are local introspection methods, which trace the classification result back to the original input, for example the deconvolutional network approach by Zeiler and Fergus 2014 or layer-wise relevance propagation (Bach et al., 2015). The second category consists of global introspection techniques that infer input patterns or characteristics which activate particular neurons, like activation maximization (Erhan et al. 2009, Yosinski et al. 2015).

2.1 Local introspection

Local introspection traces back signals to a particular input source. This means inferring which parts of an input sample were important for the prediction. The backpropagated signal comes either from pre-softmax activations or the softmax logits of an ANN classifier's output layer. The common way is to trace back the result of the output layer as a one-hot vector, so that only class-specific information is retained (Springenberg et al., 2014). This means that the position of highest activation is set to 1, while all other positions are set to 0.

A simple and fast way to infer the contribution of input values to the classification score is to perform sensitivity analysis. This method computes the (squared) partial derivatives of output scores with respect to the values of a particular input sample (Gevrey et al., 2003). Another method is to use deconvolutional networks, which invert the data flow of a convolutional classifier network to reconstruct the input (Zeiler and Fergus, 2014). The backward pass also includes units that revert max-pooling operations. This is done by storing the maximum positions before pooling in so-called switches (Zeiler and Fergus, 2014).

Another local introspection method is guided backpropagation (Springenberg et al., 2014). This technique is based on gradient backpropagation but integrates information about the forward pass. For a network which uses Rectified Linear Unit (ReLU) activation, the authors propose to only backpropagate positive gradients where the corresponding forward activation is positive as well (Springenberg et al., 2014). The authors also report that introspection using the deconvolutional network approach by Zeiler and Fergus 2014 performs poorly for higher layers, where neurons can be maximally activated by a wider variety of input signals. Their method does not show this drop in performance for higher layers. Guided backpropagation reveals detailed features in the input which are important for the prediction, but is not strongly class-discriminative.

Selvaraju et al. 2016 introduced the class-discriminative Gradient-weighted Class Activation Mapping (Grad-CAM), which identifies low-resolution regions of importance in the input. Grad-CAM first computes importance weights for each feature map in one layer. This is done by global average pooling of the gradients of the prediction score with respect to the feature maps. These importance weights are used to compute a weighted sum of forward activations, which represents the influence on the predicted class. The authors use a ReLU on the weighted sums, to only show positive influences on the prediction. By using their method to mask the result of guided backpropagation, which they call guided Grad-CAM, they get both class-specificity and high resolution in relevant input values (Selvaraju et al., 2016).

All of those local introspection methods only reveal information about a single input sample. This can help in understanding particular decisions, for example wrong classifications. For revealing the decision processes of an ANN as a whole, global introspection is essential.

2.2 Global introspection

The most common global introspection technique is activation maximization (AM) (Erhan et al., 2009). AM is independent of the input and can be used to find patterns which activate particular features. This method optimizes the input such that the activation of a particular feature is maximized. Such a feature could be a single neuron at any position of the network. For classifiers, the most interesting feature is the output neuron of the predicted class. The optimization target can be the corresponding activation either before or after applying the softmax. It is also possible to visualize optimal inputs for a whole layer, as in Google DeepDream (Mordvintsev et al., 2015). However, the input optimization approach has some drawbacks. Optimal inputs tend to be unnatural and noisy, and thus cannot be interpreted (Nguyen et al., 2015). Therefore, it is crucial to regularize, for example by total variation (Mahendran and Vedaldi, 2015) or a Generative Adversarial Network (GAN) objective (Nguyen et al., 2016), to penalize unnatural data. Even with regularized optimization, the optimal input needs to be interpretable. This means a human has to be able to assess whether patterns in the optimized input are related to a certain class.

2.3 Introspection for audio

The aforementioned local and global introspection techniques are almost exclusively applied to tasks which use images as input. This is due to the intuitive interpretability of relevance mappings onto images for a human observer. Whether an introspection technique performs well is mostly measured by how plausible the result is for a person. This is not an objective quantification, but it indicates how similar the ANN decisions are to human perception. However, this is not possible for all types of data. For example, when using waveforms as input to an audio-classification task like ASR, it is not intuitive to assess the meaningfulness of important regions or optimal inputs. To our knowledge, there are no comparable introspection techniques for ANNs in speech recognition tasks. However, this is not the first attempt to understand ANNs for speech recognition. Several studies explored representations of speech in ANNs for acoustic modelling, for example multi-layer perceptrons (Nagamine et al. 2015, Nagamine et al. 2016, Nagamine and Mesgarani 2017) or Deep Belief Networks (Mohamed et al., 2012).

3 Methods

3.1 Automatic Speech Recognition

The use of CNNs for speech is not uncommon. However, they are often used as part of complex hybrid models, for example involving Hidden Markov Models (Abdel-Hamid et al., 2014) or Recurrent Neural Networks (Trigeorgis et al., 2016). Such complex models are much harder to introspect than fully-convolutional ones. CNNs are also used for speech-related tasks different from ASR, like learning spectrum feature representations (Cummins et al., 2017).

For ASR, we implement a fully-convolutional architecture to apply introspection techniques from computer vision. To this end, we are using an architecture based on Wav2Letter (Collobert et al., 2016). This model is a fully-convolutional neural network which predicts letters from spectrograms. We train the network on z-normalized spectrograms, scaled to 128 mel-frequency bins. Each letter prediction can use 206 time steps due to the receptive field of the convolutions. We use whole-sequence audio recordings from the LibriSpeech corpus (Panayotov et al., 2015). Training and architecture are described in detail in Kunze et al. (2017). Different from Kunze et al., we slightly changed the number of neurons per layer to powers of two (250 to 256 neurons and 2000 to 2048 neurons). Moreover, we used a vocabulary with repetition characters, as Collobert et al. 2016 used with their Auto Segmentation Criterion (ASG) loss.
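A minimal sketch of this kind of input preparation follows: a z-normalized mel spectrogram with 128 frequency bins. Only the number of mel bins and the z-normalization are taken from the text; the sample rate, FFT size, hop length, log compression, and function name are assumptions for illustration.

```python
import numpy as np
import librosa

def z_normalized_mel_spectrogram(wav_path, sr=16000, n_mels=128,
                                 n_fft=400, hop_length=160):
    """Compute a z-normalized (log-)mel spectrogram of shape (n_mels, time).

    Assumed preprocessing sketch; only 128 mel bins and z-normalization
    are specified in the paper.
    """
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)                                 # log compression (assumption)
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-10)  # z-normalize per recording
```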

3.2 Activation Maximization

We visualize important features by computing the optimal input for activating a particular neuron. We used L1- and L2-regularization to avoid unnatural noisy results, both with a scale of 0.001. The optimization was initialized with a 206×128 input (the receptive field size) using a Xavier uniform initializer (Glorot and Bengio, 2010). Training was performed to maximize the activation of a particular neuron, using an Adam optimizer (Kingma and Ba, 2014) with learning rate 0.05 for 250 steps. We applied AM for neurons of different layers to show differences in the complexity of optimal patterns.
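A minimal sketch of this optimization loop in PyTorch, assuming a callable `neuron_activation` that maps a (1, 128, 206) input frame to the scalar activation of the neuron of interest (a hypothetical interface; the actual model is the Wav2Letter-style network from Section 3.1). The hyperparameters follow the values given above.

```python
import torch

def activation_maximization(neuron_activation, steps=250, lr=0.05,
                            l1_scale=1e-3, l2_scale=1e-3):
    """Optimize an input frame to maximize one neuron's activation (Section 3.2)."""
    x = torch.empty(1, 128, 206)
    torch.nn.init.xavier_uniform_(x)              # Xavier uniform initialization
    x.requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = neuron_activation(x)
        # maximize the activation, penalize unnatural inputs with L1/L2 terms
        loss = -activation + l1_scale * x.abs().sum() + l2_scale * (x ** 2).sum()
        loss.backward()
        optimizer.step()
    return x.detach()
```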

3.3 Preparing the data for introspection

Our ASR model predicts all letters for a given speech recording at once, but we are interested in determining the important regions for single predicted letters. Therefore, we perform our analyses on spectrogram frames which are predicted as only one letter. Based on the receptive field size, we perform introspection on spectrogram frames of width 206. Moreover, we only investigate spectrogram frames predicted as letters 'a' to 'z', because blank and repetition characters would not be interpretable for a human observer. For training and evaluation, our neural network uses same-padding with zeros. To avoid biasing our introspection results due to padding, we only analyze spectrogram frames without padding. Because we are training with whole sentences, most of the letters are predicted from spectrogram frames without padding.
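A sketch of this frame extraction, under assumptions: the network emits one character per output time step, its overall time stride is a known constant, and the 206-step receptive field is centered on the prediction. All names and the stride value are hypothetical; only the receptive field width and the restriction to letters 'a' to 'z' without padding come from the text.

```python
import numpy as np

LETTERS = set("abcdefghijklmnopqrstuvwxyz")

def extract_letter_frames(spectrogram, predicted_chars,
                          receptive_field=206, time_stride=2):
    """Collect (letter, frame) pairs for introspection (Section 3.3).

    spectrogram:     array of shape (128, T)
    predicted_chars: one character per output time step of the network
    """
    frames = []
    half = receptive_field // 2
    for step, char in enumerate(predicted_chars):
        if char not in LETTERS:                 # skip blank / repetition characters
            continue
        center = step * time_stride + half      # assumed mapping to input positions
        start, end = center - half, center - half + receptive_field
        if start < 0 or end > spectrogram.shape[1]:
            continue                            # frame would overlap zero-padding
        frames.append((char, spectrogram[:, start:end]))
    return frames
```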

3.4 Local introspection

For a spectrogram frame of interest, we first perform a forward pass through the network, while storing all layers' activations and the output scores. To find important positions in the input data, we perform different methods for propagating back the prediction score. In particular, we are using sensitivity analysis (Gevrey et al., 2003) and layer-wise relevance propagation (LRP) (Montavon et al., 2017). As the initial value for the backward pass, we use a vector which is set to 1 for the predicted class and 0 for all other positions. We call this vector R(out).

We did not investigate guided backpropagation, because we rely on getting class-discriminative introspection results. We also did not use Grad-CAM, because of the 1D-convolutions in our network. As our input data is treated as 128 one-dimensional channels, applying Grad-CAM to our network would only identify important regions over the time dimension. This means we would not be able to identify which frequencies are important.

Sensitivity analysis was performed by computing the partial derivative of R(out) with respect to the input spectrogram frame, as shown in Equation 1. The resulting gradient-based relevances R(0) can be interpreted as positions in the input x which increase or decrease the prediction score upon change.

$$R^{(0)}_i = \frac{\partial R^{(\mathrm{out})}}{\partial x_i} \qquad (1)$$
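A minimal PyTorch sketch of Equation 1, assuming a hypothetical `model` that maps a (1, 128, 206) spectrogram frame to one pre-softmax score per letter; the one-hot backward signal R(out) is realized by backpropagating only the predicted class's score.

```python
import torch

def sensitivity_map(model, frame, class_idx):
    """Gradient of the predicted-class score w.r.t. the input frame (Equation 1)."""
    x = frame.clone().detach().requires_grad_(True)   # (1, 128, 206)
    scores = model(x)                                  # (1, num_classes), pre-softmax
    scores[0, class_idx].backward()                    # backpropagate R(out) as a one-hot vector
    return x.grad.detach().squeeze(0)                  # R(0): sensitivity per input position
```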

Different to sensitivity, LRP aims to map high relevances to input positions that have caused R(out). We performed non-Taylor-type LRP, which we adapted from Equation 56 in Bach et al. 2015:

$$R^{(l,l+1)}_{i \leftarrow j} = \frac{z_{ij}}{z_j} \cdot R^{(l+1)}_j \qquad (2)$$

where i refers to the neuron in the lower layer l and j to the neuron in the higher layer l+1. This original rule means that relevances are propagated back based on the ratio of local (z_ij) and global (z_j) pre-activations. The pre-activations are outputs of the convolution for a neuron j, either masking all lower-layer neurons but neuron i (local) or not masking any neurons (global). Hence, the ratio z_ij/z_j is the relative influence of a neuron i on the pre-activation of neuron j. This allows distributing the relevance from neuron j to the lower-layer neurons while conserving the sum of relevance values (compare Equation 5).

Computing these ratios is computationally expensive for more complex neural networks, as it is necessary to compute the contribution of every value i of the lower layer to every value j of the higher layer through the convolutions. For example, in the input layer of our network, it is necessary to compute local pre-activations from 128×206 input values to 256×80 output values. This corresponds to 500 million local pre-activations in the first layer.

Applying Equation 2 is not straightforward in our speech recognizer network. This is due to two major differences to the image classification networks that Bach et al. 2015 used. Firstly, our network involves negative input values from z-normalized mel-spectrograms. Secondly, after each convolution, batch normalization is applied before the ReLU activation. This allows convolution outputs to change their sign before entering the ReLU activation. In order to account for negative values and the effects of batch normalization, we adapt Equation 2 as follows. We compute the ratio between local and global pre-activations using the absolute value of the global pre-activation. This preserves the sign of local pre-activations for comparison to the convolution output after applying batch normalization. The magnitude of the neuron influence is not changed. To avoid division by zero, we add a small value ε=1e-21 to the absolute value of the global pre-activation. This ratio is multiplied with the sign of the output value after applying batch normalization (bn), as shown in Equation 3. With this approach, a positive ratio indicates that a local pre-activation supports the output after batch normalization, because they have the same sign. We backpropagate the relevance as shown in Equation 4.

$$r_{ij} = \frac{z_{ij}}{|z_j| + \varepsilon} \cdot \mathrm{sgn}\big(\mathrm{bn}(z_j)\big) \qquad (3)$$

$$R^{(l,l+1)}_{i \leftarrow j} = r_{ij} \cdot R^{(l+1)}_j \qquad (4)$$

The original rule in Equation 2 satisfies the conservation law

$$\sum_i R^{(l,l+1)}_{i \leftarrow j} = R^{(l+1)}_j \qquad (5)$$

where no relevance may be lost by distributing the value to lower-layer neurons. In our adaptation, this conservation law is not satisfied, because we change the sign of some ratios to correct for batch normalization. As this procedure does not change absolute values, we do not lose any information about the relevances. Furthermore, using the original rule, relevances can become unbounded for negative ratios z_ij/z_j. To avoid absolute relevances becoming very large, we scale the values by the maximum absolute relevance value in each step of LRP. Scaling the relevances also violates the conservation law in Equation 5. However, the relevances still contain the same information, as the ratio between all relevances is conserved.
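To make the adapted rule concrete, the following sketch applies Equations 3 and 4 to a single dense layer, including the per-step rescaling by the maximum absolute relevance. This is an illustrative assumption: the paper applies the same rule to the 1D convolutions and batch normalization of the speech recognizer, and all names and the interface here are hypothetical.

```python
import torch

def adapted_lrp_step(x, weight, bn, relevance_upper, eps=1e-21):
    """One backward step of the adapted LRP rule (Equations 3 and 4),
    illustrated for a dense layer.

    x:               (n_in,)  lower-layer activations
    weight:          (n_out, n_in) layer weights
    bn:              callable applying this layer's batch normalization
    relevance_upper: (n_out,) relevances R(l+1)
    """
    z_ij = weight * x.unsqueeze(0)                 # local pre-activations z_ij
    z_j = z_ij.sum(dim=1)                          # global pre-activations z_j
    sign_bn = torch.sign(bn(z_j))                  # sign after batch normalization
    r_ij = z_ij / (z_j.abs() + eps).unsqueeze(1) * sign_bn.unsqueeze(1)      # Eq. 3
    relevance_lower = (r_ij * relevance_upper.unsqueeze(1)).sum(dim=0)       # Eq. 4, summed over j
    # rescale by the maximum absolute relevance to keep values bounded
    return relevance_lower / (relevance_lower.abs().max() + eps)
```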

3.5 Normalized averaging of aligned inputs

We perform global introspection by analyzing the training data set, in which we want to find common letter-specific patterns. To this end, we propose a novel approach for global introspection, called normalized averaging of aligned inputs (NAvAI). We describe NAvAI for ASR, but applying it to other domains is straightforward. Our method averages all spectrogram frames predicted as the same letter. This mean spectrogram input should retain information related to the letter and average out values which are related to the context. Averaging only produces meaningful results if the predicted letter is properly aligned to the spectrogram frame. This means that the position of the predicted letter needs to be the same in all frames. Otherwise, even letter-specific information would get averaged out. Therefore, before computing average frames, NAvAI aligns the spectrogram frames as described below. Computing the average over (aligned) letter-specific spectrogram frames retains information about what is common to all frames. However, this is not necessarily exclusive to spectrogram frames of this particular letter. There might be information which is contained for all predicted letters. Therefore, our method normalizes the letter-averaged spectrogram frames by subtracting the mean over spectrogram frames predicted as any letter 'a' to 'z'.
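A minimal sketch of this averaging and normalization step, assuming the aligned frames from Section 3.6 are already grouped by predicted letter; the dictionary interface and function name are assumptions.

```python
import numpy as np

def navai_patterns(aligned_frames_by_letter):
    """Normalized averaging of aligned inputs (Section 3.5).

    aligned_frames_by_letter: dict mapping each letter 'a'-'z' to a list of
    aligned spectrogram frames of shape (128, 206).  Returns one
    letter-specific pattern per letter: the per-letter mean minus the mean
    over all frames predicted as any letter.
    """
    letter_means = {letter: np.mean(frames, axis=0)
                    for letter, frames in aligned_frames_by_letter.items()}
    all_frames = [f for frames in aligned_frames_by_letter.values() for f in frames]
    normalizer = np.mean(all_frames, axis=0)       # mean over frames of all letters
    return {letter: mean - normalizer for letter, mean in letter_means.items()}
```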

3.6 Alignment of spectrogram frames

For proper alignment between the predicted letter and the spectrogram, we make use of the introspection techniques from Section 3.4. We follow the hypothesis that the time step where a predicted letter actually occurs is the one that is most important for the prediction score. We infer this position from local introspection results using sensitivity analysis and LRP. Positive relevances from LRP identify values which have caused the prediction score. Therefore, we use the maximum position from LRP. In contrast, sensitivity can be meaningful both at the maximum and minimum value position. Positive values imply importance of positions, because increasing them would make the prediction more certain. Negative gradients are of interest as well, because they show where a change in input value causes the prediction certainty to drop. Most of the predictions are already very close to being a one-hot vector as softmax output. Then, there might be no or only a small gradient for increasing the prediction certainty. In this case, the minimum value position from sensitivity might be more appropriate than the maximum value position. The alignment procedure crops the spectrogram frames on one side, such that the determined positions are in the center.
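A sketch of one possible alignment step, assuming the relevance map for a frame comes from Section 3.4 and that frames are re-centered on a zero-filled canvas of the original width so they can be averaged afterwards; the zero-padding convention and all names are assumptions on top of the cropping described above.

```python
import numpy as np

def align_frame(frame, relevance, mode="max_lrp"):
    """Center a spectrogram frame at its most relevant time step (Section 3.6).

    frame, relevance: arrays of shape (128, 206); `relevance` comes from
    sensitivity analysis or LRP.  mode is one of "max_lrp",
    "max_sensitivity", "min_sensitivity".
    """
    n_freq, width = frame.shape
    if mode == "min_sensitivity":
        _, t_star = np.unravel_index(relevance.argmin(), relevance.shape)
    else:                                     # "max_lrp" or "max_sensitivity"
        _, t_star = np.unravel_index(relevance.argmax(), relevance.shape)
    shift = width // 2 - int(t_star)          # move the selected time step to the center
    aligned = np.zeros_like(frame)
    src = slice(max(0, -shift), min(width, width - shift))
    dst = slice(max(0, shift), min(width, width + shift))
    aligned[:, dst] = frame[:, src]
    return aligned
```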

4 Results & Discussion

4.1 Optimal inputs by activation maximization

We performed AM for neurons of different layers. For visualization, we chose neurons which are maximally activated for the prediction of letter 'a' in a randomly chosen spectrogram frame. In the output layer, this neuron corresponds to the predicted letter (here it is the 'a'-neuron). As the outputs of the three topmost layers are one-dimensional, we use the neuron of highest activation. In all other layers, we chose neurons with the highest average activation over the time dimension. We only show optimization of strongly activated neurons, because they are evidently sensitive to some pattern and potentially letter-specific. In Figure 1, we representatively show four different layers of the network. The top row shows optimal inputs for a neuron in the first and second layer. In those layers, AM reveals patterns which can be interpreted as features in the spectrogram. For example, the input layer neuron detects a shift of intensity towards higher frequencies. The second-layer neuron combines low-level features, so it is sensitive to different changes in frequency intensities, particularly of lower frequencies.

Figure 1: Optimal inputs for neurons in different layers of the network (panels, left to right and top to bottom: layer 1, layer 2, layer 8, output layer). Each shown neuron has the highest activation for predicting letter 'a' in the respective layer. Optimal inputs to bottom layers (top row) are still interpretable as features in the spectrogram. The higher layers (bottom row), in particular the output neuron for 'a' (bottom right), do not look like spectrograms and cannot be interpreted as particular features which the neuron is sensitive to. The axes are equal to the spectrogram frames in Figure 2.

In contrast, optimizing the neuron output in higher layers (bottom row) does not reveal any interpretable patterns. Those neurons are sensitive to a large variety of different patterns, so that a single optimal input is not natural anymore. This is a common problem for AM. Still, it is easier to detect unnatural but related patterns in real-world images than in audio spectrograms.

To obtain more natural results, one possibility would be using stronger regularization techniques like a GAN penalty. On the other hand, using stronger regularization interferes with determining the actually learned patterns. Regularizing therefore favors results similar to the data over actual insight into the model. For our speech recognizer, we can conclude that the model did not learn a single abstract representation for the letters. This is not surprising, as the same letter is not pronounced equally in every context.

In addition, in the output layer, we observed zero-areas at the beginning and end of the optimal input. This implies that the network capacity is not fully utilized for the prediction and could still be compressed.

Figure 2: Local introspection using sensitivity analysis and LRP. Left: two spectrogram frames, both predicted as letter 'a'. By propagating the prediction score back through the network, important regions are identified. Center: sensitivity analysis results. Right: relevances using LRP. The results of both methods are visualized as an overlay on top of the original spectrogram. Blue values indicate negative sensitivity/relevance, red indicates positive values.

4.2 Sensitivity analysis and LRP

We performed sensitivity analysis and LRP for all spectrogram frames. Here, we show characteristics of these methods based on two spectrogram frames predicted as letter 'a'. Figure 2 shows those two spectrogram frames (left) and the local introspection results. The sensitivity values (center) and LRP-based relevances (right) are visualized superimposed on the input spectrogram. Both methods differ strongly in what they identify as important for the prediction.

Sensitivity analysis identifies relevant positions in a larger area and includes more data points than LRP. As we assumed, if the prediction was already certain, sensitivity analysis results in mostly negative values, as in the top example. In the bottom example, there are more positive gradients, indicating a less certain prediction. Moreover, sensitivity analysis identifies important regions close to the center of the spectrogram frame.

The relevances backpropagated with LRP are much more position-specific than the sensitivity values. The top example shows fewer relevant positions. In the bottom example, relevance is assigned to only two small regions. This indicates that LRP identifies important positions, but emphasizes the most relevant ones. We assume this is due to having negative input values. As mentioned above, relevances can become unbounded for negative input values, which we prevented by scaling them. However, this does not reduce possible large differences between weak and strong relevances. We also observe that LRP identifies relevant regions which are further away from the center of the spectrogram frame.

Neither sensitivity analysis nor LRP reveal patterns which can be interpreted as typical letters for the network. Furthermore, different spectrogram frames predicted as the same letter rarely show common patterns. We would expect that in most cases important regions for predicting letter 'a' are formants in the spectrogram, which are characteristic for vowels. The second example in Figure 2 is one of many examples where this expectation is not met. Sensitivity analysis shows that the beginning of the utterance is important, because highly sensitive positions are distributed over all frequencies at one time point. This can be explained for the model, since it could have learned the context around the formant pattern. The LRP result identifies two small regions, both of negative values in the spectrogram. Although this might be valid for what is important to the model, it cannot be interpreted as features of an 'a'.

4.3 Global introspection

We perform global introspection with our novel method NAvAI. We compute letter-specific spectrograms as the average over aligned spectrogram frames and normalize them.


Figure 3: Averaging and normalizing letter-specific spectrogram frames. Top two rows: mean inputs over spectrograms predicted as letter 'a' and 't', respectively. Middle row: average spectrogram frame over all letters, which is used for normalization. Bottom two rows: by subtracting the mean over all letters from the letter-averaged spectrograms, we obtain patterns specific to the prediction of certain letters. Each analysis was performed with a different alignment method, each visualized in one column (not aligned, minimum sensitivity, maximum sensitivity, maximum LRP). The second to fourth columns correspond to aligning the spectrogram frames to the predicted letter based on local introspection; as a comparison, the first column presents the averaging results for the unaligned spectrogram frames. The axes are equal to the spectrograms in Figure 2. Values in each frame are normalized such that the absolute maximum is 1.

The alignment procedure crops the frames so that they are centered at the most important position. Because of this, the beginning and end of the averaged frame implicitly have lower values.

Mean spectrogram frames over letters   Figure 3 exemplifies the mean spectrogram frames for letters 'a' and 't' (top two rows). It is not possible to see interpretable differences between particular letters. Therefore, we cannot tell anything about the quality of the alignment methods either. Interestingly, for all letters there are higher mean values in the center of the unaligned mean spectrogram frames, where only the position is slightly shifted between the letters. This indicates that the network learned to align the center of the snippet with sounds that have a high value for all frequencies. For example, this could mean that the network can detect release bursts of plosives easily and uses them as a center point for the prediction of letters in their context. Alignment by minimum or maximum sensitivity causes this effect to be less pronounced. When using maximum LRP for alignment, this effect vanishes. This is due to LRP identifying important regions further away from the center than sensitivity. The middle row of Figure 3 shows the mean over all letters. If there was nothing in common between the letters, all information would have been averaged out. On the contrary, we can observe that the overall mean is very similar to the letter-specific means. This shows that there is information common to all letters, which overshadows the spectrogram features that are relevant for prediction. More precisely, this information is not only common to the letters, but to spectrograms in general. For example, in speech, there is higher intensity at low frequencies than at high frequencies. This is reflected in the mean spectrograms, where the mean value decreases with higher frequency.

Mean spectrogram frame normalization   To reveal where the differences between the letter-specific patterns are, we normalized each letter-averaged spectrogram frame with the mean over all letters. Figure 3 visualizes this procedure, where the mean over all letters (middle row) is subtracted from the exemplary mean frames over letters 'a' and 't' (top two rows). The resulting average spectrogram frames after normalization are shown in the bottom two rows of Figure 3. With normalization, we obtain the final result of our NAvAI method and are able to observe letter-specific patterns.

Patterns in normalized mean spectrogram frames   We emphasize that the normalized frames (bottom two rows of Figure 3) are not spectrograms anymore. Positive (red) and negative (blue) values indicate where the average over one letter is higher or lower than the mean over all letters, respectively. This can lead to positive values where the (average) spectrogram was negative, and vice versa. Moreover, our method does not identify which particular features are used by the network for prediction. For example, if NAvAI reveals two formants for a particular letter, it is not certain that both are used for the prediction.

First of all, we averaged spectrogram frames without alignment (first column in Figure 3). There are no letter-specific patterns visible in the resulting frames. For all letters, the normalized frame only shows a transition from positive to negative values (or the other way round) in the center. This simply reflects the above-mentioned high intensities in the center, which are slightly shifted for different letters.

With all investigated alignment methods, we can observe a clear difference between the patterns for different letters. For predicting letter 'a', the network detects a stronger signal at the center and to the right of it for sensitivity-based and LRP alignments. Also, for all alignments, two formants are clearly visible at around 700 Hz and 2700 Hz. This pattern makes sense, as vowels are combinations of different formants. While all alignment methods show this pattern, it is more widespread across time using LRP.

Similarly, we can observe a letter-specific pattern for the letter 't'. For sensitivity-based alignments, there is a quick change from lower to higher signal and back in the center. This transition occurs in all frequencies, while it is more pronounced in the higher ones. This corresponds to the typical pattern of plosives. Their release burst is characterized by a high intensity of all frequencies in a very short time span. With LRP-based alignment, we did not observe this pattern. From the observations for letter 'a', we know that the signal is more widespread for LRP. This does not affect the observed formant pattern of letter 'a', but it affects the plosive pattern. Here, the signal of interest spans the frequency dimension. Spreading the strong signal wider in the time dimension causes the interesting pattern to be averaged out. The weaker signal of low frequencies is detected with both sensitivity and LRP, because it is more consistent in the time dimension. The wide spread of signals when aligning by maximum LRP indicates that the letters were not properly aligned to the spectrogram.

Alignment by minimum or maximum sensitivity both revealed letter-specific patterns which are also specific in the time dimension. There is only a slight difference between minimum and maximum sensitivity alignment, but the resulting normalized mean spectrogram frames seem to be more specific when aligning at the minimum sensitivity. We cannot guarantee that the alignment centers the spectrograms at the real occurrence of the letter. This can be seen in the typical patterns for 'a', which are right of the center. However, as long as the alignment is consistent, we still get meaningful results. We suspected that the network learns to use release bursts of plosives in the center of prediction frames. If this was true, alignment should not change the center position much for letters that are mostly pronounced as plosives. This idea is supported by the shown results, as there is a much smaller difference between aligned and unaligned mean spectrogram frames for 't' compared to 'a'. Patterns of all letters are provided in Supplemental Material A.

5 Conclusion

Applying local and global introspection methods for image classification CNNs to an ASR task is not straightforward. There are difficulties due to the real-valued space of the input data, architectural limitations, and the interpretability of audio data. We showed that local introspection with sensitivity analysis and LRP does not give much insight into the network. Global introspection with weakly regularized AM only produced interpretable patterns for lower layers.

We introduced NAvAI as a novel introspection method, which determines class-specific features by averaging over examples for each class. This approach adapts simple averaging to specific properties of the ASR task by aligning letters to spectrograms through local introspection techniques and normalization. We showed that our method is capable of revealing interpretable patterns which are common to predicting particular letters. Although demonstrated for ASR, NAvAI is generally applicable to other domains.

This work did not cover whether there are different patterns corresponding to particular contexts or pronunciations of letters. In future work, the classes will be separated into different pronunciations, for example by making use of information about phonemes. Although the patterns are interpretable, some knowledge about features in spectrograms is needed. Evaluating the introspection as a sound example would be far more intuitive. Therefore, future work will cover synthesizing sound samples from the introspection results or working with waveforms directly. This work pinpointed several issues where common introspection techniques fail for CNN-based ASR. Following our results, we will further develop or adapt introspection techniques and optimize the architecture towards better applicability of introspection.

Acknowledgments

This research has been funded by the Federal Ministry of Education and Research of Germany (BMBF) and supported by the donation of a GeForce GTX Titan X graphics card from the NVIDIA Corporation.

References

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545.

Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Muller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140.

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4945–4949. IEEE.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.

Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. 2016. Wav2Letter: an end-to-end ConvNet-based speech recognition system. CoRR, abs/1609.03193.

Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner, Stefan Steidl, and Bjorn W Schuller. 2017. An image-based deep spectrum feature representation for the recognition of emotional speech. In Proceedings of the 2017 ACM on Multimedia Conference, pages 478–484. ACM.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.

Muriel Gevrey, Ioannis Dimopoulos, and Sovan Lek. 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249–264.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

Julius Kunze, Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier, and Sebastian Stober. 2017. Transfer learning for speech recognition on a budget. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 168–177. Association for Computational Linguistics.

Aravindh Mahendran and Andrea Vedaldi. 2015. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196.

Abdel-rahman Mohamed, Geoffrey Hinton, and Gerald Penn. 2012. Understanding how deep belief networks perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4273–4276. IEEE.


Gregoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Muller. 2017. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222.

Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June, 20(14):5.

Tasha Nagamine and Nima Mesgarani. 2017. Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition. In International Conference on Machine Learning, pages 2564–2573.

Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. 2015. Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association.

Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. 2016. On the role of nonlinear transformations in deep neural network acoustic models. In Interspeech, pages 803–807.

Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. 2016. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pages 3387–3395.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE.

Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. 2016. Grad-CAM: Visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Bjorn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5200–5204. IEEE.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. 2015. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.

Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.


A Supplemental Material

A.1 NAvAI results for letters ’a’ to ’i’

[Figure: NAvAI patterns for letters 'a' to 'i', one row per letter and one column per alignment method (not aligned, minimum sensitivity, maximum sensitivity, maximum LRP).]


A.2 NAvAI results for letters ’j’ to ’r’

[Figure: NAvAI patterns for letters 'j' to 'r', one row per letter and one column per alignment method (not aligned, minimum sensitivity, maximum sensitivity, maximum LRP).]


A.3 NAvAI results for letters ’s’ to ’z’

[Figure: NAvAI patterns for letters 's' to 'z', one row per letter and one column per alignment method (not aligned, minimum sensitivity, maximum sensitivity, maximum LRP).]