

Explaining Predictions of Non-Linear Classifiers in NLP

Leila Arras 1, Franziska Horn 2, Grégoire Montavon 2, Klaus-Robert Müller 2,3, and Wojciech Samek 1

1 Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
2 Machine Learning Group, Technische Universität Berlin, Berlin, Germany
3 Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea

{leila.arras, wojciech.samek}@hhi.fraunhofer.de
[email protected]

Abstract

Layer-wise relevance propagation (LRP) is a recently proposed technique for explaining predictions of complex non-linear classifiers in terms of input variables. In this paper, we apply LRP for the first time to natural language processing (NLP). More precisely, we use it to explain the predictions of a convolutional neural network (CNN) trained on a topic categorization task. Our analysis highlights which words are relevant for a specific prediction of the CNN. We compare our technique to standard sensitivity analysis, both qualitatively and quantitatively, using a "word deletion" perturbation experiment, a PCA analysis, and various visualizations. All experiments validate the suitability of LRP for explaining the CNN predictions, which is also in line with results reported in recent image classification studies.

1 Introduction

Following seminal work by Bengio et al. (2003) and Collobert et al. (2011), the use of deep learning models for natural language processing (NLP) applications has received increasing attention in recent years. In parallel, initiated by the computer vision domain, there is also a trend toward understanding deep learning models through visualization techniques (Erhan et al., 2010; Landecker et al., 2013; Zeiler and Fergus, 2014; Simonyan et al., 2014; Bach et al., 2015; Lapuschkin et al., 2016a) or through decision tree extraction (Krishnan et al., 1999). Most work dedicated to understanding neural network classifiers for NLP tasks (Denil et al., 2014; Li et al., 2015) uses gradient-based approaches. Recently, a technique called layer-wise relevance propagation (LRP) (Bach et al., 2015) has been shown to produce more meaningful explanations in the context of image classification (Samek et al., 2015). In this paper, we apply the same LRP technique to an NLP task, where a neural network maps a sequence of word2vec vectors representing a text document to its category, and evaluate whether similar benefits in terms of explanation quality are observed.

In the present work we contribute by (1) applying the LRP method to the NLP domain, (2) proposing a technique for quantitative evaluation of explanation methods for NLP classifiers, and (3) qualitatively and quantitatively comparing two different explanation methods, namely LRP and a gradient-based approach, on a topic categorization task using the 20Newsgroups dataset.

2 Explaining Predictions of Classifiers

We consider the problem of explaining a prediction f(x) associated with an input x by assigning to each input variable x_d a score R_d determining how relevant that input variable is for explaining the prediction. The scores can be pooled into groups of input variables (e.g. all word2vec dimensions of a word, or all components of an RGB pixel), such that they can be visualized as heatmaps of highlighted texts, or as images.

2.1 Layer-Wise Relevance Propagation

Layer-wise relevance propagation (Bach et al., 2015) is a newly introduced technique for obtaining these explanations. It can be applied to various machine learning classifiers such as deep convolutional neural networks. The LRP technique produces a decomposition of the function value f(x) onto its input variables that satisfies the conservation property:

f(x) = \sum_d R_d.     (1)


The decomposition is obtained by performing a backward pass on the network, where for each neuron the relevance associated with it is redistributed to its predecessors. Consider neurons mapping a set of n inputs (x_i)_{i \in [1,n]} to the neuron activation x_j through the sequence of functions:

z_{ij} = x_i w_{ij} + \frac{b_j}{n}
z_j = \sum_i z_{ij}
x_j = g(z_j)

where, for convenience, the neuron bias b_j has been distributed equally to each input neuron, and where g(\cdot) is a monotonously increasing activation function. Denoting by R_i and R_j the relevance associated with x_i and x_j, the relevance is redistributed from one layer to the other by defining messages R_{i \leftarrow j} indicating how much relevance must be propagated from neuron x_j to its input neuron x_i in the lower layer. These messages are defined as:

R_{i \leftarrow j} = \frac{z_{ij} + \frac{s(z_j)}{n}}{\sum_{i'} z_{i'j} + s(z_j)} \, R_j

where s(z_j) = \epsilon \cdot (1_{z_j \geq 0} - 1_{z_j < 0}) is a stabilizing term that handles near-zero denominators, with \epsilon set to 0.01. The intuition behind this local relevance redistribution formula is that each input x_i should be assigned relevance proportionally to its contribution in the forward pass, in a way that the relevance is preserved (\sum_i R_{i \leftarrow j} = R_j).

Each neuron in the lower layer receives relevance from all upper-level neurons to which it contributes:

R_i = \sum_j R_{i \leftarrow j}.

This pooling ensures layer-wise conservation: \sum_i R_i = \sum_j R_j. Finally, in a max-pooling layer, all relevance at the output of the layer is redistributed to the pooled neuron with maximum activation (i.e. winner-take-all). An implementation of LRP can be found in (Lapuschkin et al., 2016b) and downloaded from www.heatmapping.org [1].
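For illustration, the \epsilon-stabilized redistribution rule and the winner-take-all rule above can be sketched in a few lines of NumPy. This is a minimal sketch for a single fully-connected layer and a single max-pooling region, not an excerpt from the toolbox of Lapuschkin et al. (2016b); variable names are our own.

import numpy as np

def lrp_dense(x, W, b, R_out, eps=0.01):
    # x: (n,) inputs, W: (n, m) weights, b: (m,) biases,
    # R_out: (m,) relevances of the layer outputs.
    n = x.shape[0]
    z_ij = x[:, None] * W + b[None, :] / n            # z_ij = x_i w_ij + b_j / n
    z_j = z_ij.sum(axis=0)                            # z_j = sum_i z_ij
    s = eps * np.where(z_j >= 0, 1.0, -1.0)           # stabilizer s(z_j)
    R_msg = (z_ij + s[None, :] / n) / (z_j + s)[None, :] * R_out[None, :]
    return R_msg.sum(axis=1)                          # R_i = sum_j R_{i<-j}

def lrp_maxpool(pool_inputs, R_out):
    # Winner-take-all: all relevance goes to the input with maximum activation.
    R_in = np.zeros_like(pool_inputs)
    R_in[np.argmax(pool_inputs)] = R_out
    return R_in

By construction, lrp_dense satisfies the layer-wise conservation property: the returned relevances sum to R_out.sum() (up to the stabilizer).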

2.2 Sensitivity Analysis

An alternative procedure called sensitivity analysis (SA) produces explanations by scoring input variables based on how they affect the decision output locally (Dimopoulos et al., 1995; Gevrey et al., 2003).

[1] Currently, the available code is targeted at image data.

The sensitivity of an input variable is given by its squared partial derivative:

R_d = \left( \frac{\partial f}{\partial x_d} \right)^2.

Here, we note that unlike LRP, sensitivity analysis does not preserve the function value f(x), but the squared l2-norm of the function gradient:

\| \nabla_x f(x) \|_2^2 = \sum_d R_d.     (2)

This quantity is however not directly related to the amount of evidence for the category to detect. Similar gradient-based analyses (Denil et al., 2014; Li et al., 2015) have recently been applied in the NLP domain, and were also used by Simonyan et al. (2014) in the context of image classification. While recent work uses different relevance definitions for a group of input variables (e.g. gradient magnitude in Denil et al. (2014) or max-norm of absolute value of simple derivatives in Simonyan et al. (2014)), in the present work (unless otherwise stated) we employ the squared l2-norm of gradients, allowing for a decomposition of Eq. 2 as a sum over relevances of input variables.
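With automatic differentiation, the SA relevances of Eq. 2 can be obtained directly from the classifier's gradient. The following is a minimal PyTorch sketch under the assumption that `model` accepts a batched input tensor and returns one score per class; it is not the authors' implementation.

import torch

def sa_relevances(model, x, target_class):
    # x: input tensor of one document (e.g. its concatenated word2vec matrix).
    x = x.clone().detach().requires_grad_(True)
    score = model(x.unsqueeze(0))[0, target_class]   # scalar f(x) for the class to explain
    score.backward()
    R = x.grad ** 2                                   # R_d = (df/dx_d)^2 per input variable
    # R.sum() equals the squared l2-norm of the gradient, as in Eq. 2.
    return R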

3 Experiments

For the following experiments we use the 20news-bydate version of the 20Newsgroups [2] dataset, consisting of 11314/7532 train/test documents evenly distributed among twenty fine-grained categories.

3.1 CNN Model

As a document classifier we employ a word-based CNN similar to Kim (2014), consisting of the following sequence of layers:

Conv → ReLU → 1-Max-Pool → FC

By 1-Max-Pool we denote a max-pooling layer where the pooling regions span the whole text length, as introduced in (Collobert et al., 2011). Conv, ReLU and FC denote the convolutional layer, the rectified linear units activation and the fully-connected linear layer, respectively. To build the CNN numerical input we concatenate horizontally 300-dimensional pre-trained word2vec [3] vectors (Mikolov et al., 2013), in the same order the corresponding words appear in the pre-processed document.

[2] http://qwone.com/%7Ejason/20Newsgroups/
[3] GoogleNews-vectors-negative300, https://code.google.com/p/word2vec/


We keep this input representation fixed during training. The convolutional operation we apply in the first neural network layer is one-dimensional and along the text sequence direction (i.e. along the horizontal direction). The receptive field of the convolutional layer neurons spans the entire word embedding space in the vertical direction, and covers two consecutive words in the horizontal direction. The convolutional layer filter bank contains 800 filters.
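The described architecture corresponds, up to implementation details not specified above (weight initialization, padding, etc.), to the following PyTorch sketch; hyperparameters other than the stated 300-dimensional embeddings, filter width of two words, 800 filters and 20 output classes are assumptions.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, emb_dim=300, n_filters=800, n_classes=20):
        super().__init__()
        # 1D convolution along the word sequence; receptive field = 2 consecutive words
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=2)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):               # x: (batch, emb_dim, n_words), fixed word2vec inputs
        h = torch.relu(self.conv(x))    # Conv -> ReLU
        h = h.max(dim=2).values         # 1-Max-Pool over the whole text length
        return self.fc(h)               # FC: unnormalized class scores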

3.2 Experimental Setup

As pre-processing we remove the document headers, tokenize the text with NLTK [4], filter out punctuation and numbers [5], and finally truncate each document to the first 400 tokens. We train the CNN by stochastic mini-batch gradient descent with momentum (with l2-norm penalty and dropout). Our trained classifier achieves a classification accuracy of 80.19% [6].
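A simplified sketch of this tokenization step is given below. The regular expression is our reading of footnote [5] further down, header removal and sentence splitting with sent_tokenize are omitted, and NLTK's punkt tokenizer data is assumed to be installed.

import re
from nltk.tokenize import word_tokenize

# tokens may contain alphabetic characters, apostrophe, hyphen and dot,
# and must contain at least one alphabetic character
TOKEN_RE = re.compile(r"^[A-Za-z'.\-]*[A-Za-z][A-Za-z'.\-]*$")

def preprocess(text, max_tokens=400):
    tokens = [t for t in word_tokenize(text) if TOKEN_RE.match(t)]
    return tokens[:max_tokens]          # truncate to the first 400 tokens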

Due to our input representation, applying LRP or SA to our neural classifier yields one relevance value per word-embedding dimension. To obtain word-level relevances from these single-input-variable relevances, we sum the relevances over the word embedding space in the case of LRP, and (unless otherwise stated) take the squared l2-norm of the corresponding word gradient in the case of SA. More precisely, given an input document d consisting of a sequence (w_1, w_2, ..., w_N) of N words, each word being represented by a D-dimensional word embedding, we compute the relevance R(w_t) of the t-th word in the input document through the summation:

R(w_t) = \sum_{i=1}^{D} R_{i,t}     (3)

where R_{i,t} denotes the relevance of the input variable corresponding to the i-th dimension of the t-th word embedding, obtained by LRP or SA as specified in Sections 2.1 & 2.2.

[4] We employ NLTK's version 3.1 recommended tokenizers sent_tokenize and word_tokenize from the module nltk.tokenize.
[5] We retain only tokens composed of the following characters: alphabetic characters, apostrophe, hyphen and dot, and containing at least one alphabetic character.
[6] To the best of our knowledge, the best published 20Newsgroups accuracy is 83.0% (Paskov et al., 2013). However, we note that for simplification we use a fixed-length document representation, and our main focus is on explaining classifier decisions, not on improving the classification state-of-the-art.

In particular, in the case of SA, the above word relevance can equivalently be expressed as:

R_{SA}(w_t) = \| \nabla_{w_t} f(d) \|_2^2     (4)

where f(d) represents the classifier's prediction for document d.

Note that the resulting LRP word relevance is signed, while the SA word relevance is positive.
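The pooling of Eqs. 3 and 4 amounts to a column-wise reduction of the D x N matrix of per-dimension relevances (or word gradients). A minimal NumPy sketch, with array shapes as an assumption:

import numpy as np

def word_relevance_lrp(R):          # R: (D, N) per-dimension LRP relevances
    return R.sum(axis=0)            # Eq. 3: signed sum over embedding dimensions

def word_relevance_sa(G):           # G: (D, N) per-dimension word gradients
    return (G ** 2).sum(axis=0)     # Eq. 4: squared l2-norm per word (non-negative)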

In all experiments, we use the term target class to identify the function f(x) to analyze in the relevance decomposition. This function maps the neural network input to the neural network output variable corresponding to the target class.

3.3 Evaluating Word-Level Relevances

In order to evaluate different relevance models, we perform a sequence of "word deletions" (here, deleting a word simply means setting its word vector to zero in the input document representation), and track the impact of these deletions on the classification performance. We carry out two deletion experiments, starting either with the set of test documents that are initially classified correctly, or with those that are initially classified wrongly [7]. We estimate the LRP/SA word relevances using as target class the true document class. Subsequently we delete words in decreasing (resp. increasing) order of the obtained word relevances.
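The deletion procedure can be sketched as follows. The callables predict and relevance are hypothetical stand-ins for the trained CNN and an LRP/SA explainer; the relevance ordering is computed once on the original documents, which is our reading of the procedure above.

import numpy as np

def deletion_curve(docs, labels, predict, relevance, max_deletions=50):
    # docs: list of (D, N) word2vec matrices; labels: true classes
    docs = [d.copy() for d in docs]
    orders = [np.argsort(-relevance(d, y)) for d, y in zip(docs, labels)]  # most relevant first
    accuracies = []
    for k in range(max_deletions + 1):
        accuracies.append(np.mean([predict(d) == y for d, y in zip(docs, labels)]))
        for d, order in zip(docs, orders):
            if k < len(order):
                d[:, order[k]] = 0.0        # "delete" the next word: zero its embedding
    return accuracies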

Fig. 1 summarizes our results. We find that LRP yields the best results in both deletion experiments. Thereby we provide evidence that LRP positive relevance is targeted to words that support a classification decision, while LRP negative relevance identifies words that inhibit this decision. In the first experiment the SA classification accuracy curve decreases significantly faster than the random curve, which represents the performance change when randomly deleting words, indicating that SA is able to identify relevant words. However, the SA curve is clearly above the LRP curve, indicating that LRP provides better explanations for the CNN predictions. Similar results have been reported for image classification tasks (Samek et al., 2015). The second experiment indicates that the classification performance increases when deleting words with the lowest LRP relevance, while small SA values point to words that have less influence on the classification performance than random word selection.

[7] For the deletion experiments we consider only the test documents whose pre-processed length is greater than or equal to 100 tokens; this amounts to a total of 4963 documents.


[Figure 1: two panels plotting classification accuracy (y-axis, over 4154 resp. 809 documents) against the number of deleted words (x-axis, 0 to 50), with one curve each for LRP, SA and random deletion.]

Figure 1: Word deletion on initially correctly (left) and falsely (right) classified test documents, using either LRP or SA. The target class is the true document class; words are deleted in decreasing (left) and increasing (right) order of their relevance. Random deletion is averaged over 10 runs (std < 0.0141). A steep decline (left) and incline (right) indicate informative word relevances.

This result can partly be explained by the fact that, in contrast to SA, LRP provides signed explanations. More generally, the different quality of the explanations provided by SA and LRP can be attributed to their different objectives: while LRP aims at decomposing the global amount of evidence for a class f(x), SA is built solely upon derivatives and as such describes the effect of local variations of the input variables on the classifier decision. For a more detailed view of SA, as well as an interpretation of the LRP propagation rules as a deep Taylor decomposition, see Montavon et al. (2015).

3.4 Document Highlighting

Word-level relevances can be used for highlighting purposes. In Fig. 2 we provide such visualizations on one test document for different relevance target classes, using either the LRP or SA relevance model. We can observe that while the word "ride" is highly negatively relevant for LRP when the target class is not rec.motorcycles, it is positively highlighted (even though not heavily) by SA. This suggests that SA does not clearly discriminate between words speaking for or against a specific classifier decision, while LRP is more discerning in this respect.
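Such highlightings can be produced, for instance, by wrapping each word in an HTML span whose background color interpolates between blue (negative relevance) and red (positive relevance). The exact color mapping below is an assumption; the paper only states that positive relevance is mapped to red and negative to blue.

def highlight(words, relevances):
    # words: list of tokens; relevances: word-level relevance per token
    m = max(abs(r) for r in relevances) or 1.0
    spans = []
    for w, r in zip(words, relevances):
        alpha = abs(r) / m                          # opacity proportional to |relevance|
        color = (255, 0, 0) if r > 0 else (0, 0, 255)
        spans.append('<span style="background: rgba(%d, %d, %d, %.2f)">%s</span>'
                     % (color[0], color[1], color[2], alpha, w))
    return " ".join(spans)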

3.5 Document Visualization

Word2vec embeddings are known to exhibit linear regularities representing semantic relationships between words (Mikolov et al., 2013). We explore whether these regularities can be transferred to a document representation, when using as a document vector a linear combination of word2vec embeddings. As a weighting scheme we employ LRP or SA scores, with the classifier's predicted class as the target class for the relevance estimation. For comparison we perform uniform weighting, where we simply sum up the word embeddings of the document words (SUM).

For SA we use either the l2-norm or the squared l2-norm for pooling word gradient values along the word2vec dimensions, i.e. in addition to the standard SA word relevance defined in Eq. 4, we use as an alternative R_{SA(l2)}(w_t) = \| \nabla_{w_t} f(d) \|_2 and denote this relevance model by SA(l2).

For both LRP and SA, we employ different variations of the weighting scheme. More precisely, given an input document d composed of the sequence (w_1, w_2, ..., w_N) of D-dimensional word2vec embeddings, we build new document representations d' and d'_{e.w.} [8] by either using word-level relevances R(w_t) (as in Eq. 3), or through element-wise multiplication of word embeddings with single input variable relevances (R_{i,t})_{i \in [1,D]} (we recall that R_{i,t} is the relevance of the input variable corresponding to the i-th dimension of the t-th word in the input document d). More formally, we use:

d' = \sum_{t=1}^{N} R(w_t) \cdot w_t

or

d'_{e.w.} = \sum_{t=1}^{N} (R_{1,t}, R_{2,t}, ..., R_{D,t})^T \odot w_t

where \odot denotes element-wise multiplication. Finally we normalize the document vectors d' resp. d'_{e.w.} to unit l2-norm and perform a PCA projection. In Fig. 3 we label the resulting 2D-projected test documents using five top-level document categories.
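The construction of these document vectors and their projection can be sketched as follows; array shapes and the use of scikit-learn's PCA are assumptions, not a description of the authors' implementation.

import numpy as np
from sklearn.decomposition import PCA

def doc_vector(E, R_word):
    # E: (D, N) word2vec matrix of one document, R_word: (N,) word-level relevances
    d = (E * R_word[None, :]).sum(axis=1)       # d' = sum_t R(w_t) * w_t
    return d / np.linalg.norm(d)

def doc_vector_elementwise(E, R_elem):
    # R_elem: (D, N) per-dimension relevances
    d = (E * R_elem).sum(axis=1)                # d'_e.w. = sum_t (R_{1,t},...,R_{D,t})^T (elem.-wise) w_t
    return d / np.linalg.norm(d)

# 2D projection of a collection of normalized document vectors:
# coords = PCA(n_components=2).fit_transform(np.stack(doc_vectors))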

For word-based models d', we observe that while standard SA and LRP both provide similar visualization quality, the SA variant with simple l2-norm yields partly overlapping and dense clusters; still, all schemes are better than uniform weighting [9].

[8] The subscript e.w. stands for element-wise.
[9] We also performed a TFIDF weighting of word embeddings; the resulting 2D-visualization was very similar to uniform weighting (SUM).


[Figure 2: six heatmap renderings (three target classes × two explanation methods) of the same test document, an excerpt beginning "It is the body's reaction to a strange environment. ...". Row labels: sci.space, sci.med, rec.motorcycles; columns: LRP, SA.]

Figure 2: Heatmaps for the test document sci.space 61393 (correctly classified), using either layer-wise relevance propagation (LRP) or sensitivity analysis (SA) for highlighting words. Positive relevance is mapped to red, negative to blue. The target class for the LRP/SA explanation is indicated on the left.

[Figure 3: six PCA scatter plots with panel titles LRP e.w., LRP, SUM (top row) and SA e.w., SA, SA(l2) (bottom row); points are colored by the five top-level categories comp, rec, sci, politics and religion.]

Figure 3: PCA projection of the 20Newsgroups test documents formed by linearly combining word2vec embeddings. The weighting scheme is based on word-level relevances, or on single input variable relevances (e.w.), or uniform (SUM). The target class for relevance estimation is the predicted document class. SA(l2) corresponds to a variant of SA with simple l2-norm pooling of word gradient values. All visualizations are provided on the same equal axis scale.


In the case of SA, note that even though the power to which word gradient norms are raised (l2 or squared l2) affects the present visualization experiment, it has no influence on the earlier described "word deletion" analysis.

For element-wise models d'_{e.w.}, we observe slightly better separated clusters for SA, and a clear-cut cluster structure for LRP.

4 Conclusion

Through word deletion we quantitatively evaluated and compared two classifier explanation models, and found LRP to be more effective than SA. We investigated the application of word-level relevance information for document highlighting and visualization. We conclude from our empirical analysis that the superiority of LRP stems from the fact that it not only reliably identifies the determinant words that support a specific classification decision, but further distinguishes, among the preeminent words, those that are opposed to that decision.

Future work would include applying LRP to other neural network architectures (e.g. character-based or recurrent models) on further NLP tasks, as well as exploring how relevance information could be taken into account to improve the classifier's training procedure or prediction performance.

Acknowledgments

This work was supported by the German Ministry for Education and Research as Berlin Big Data Center BBDC (01IS14013A) and the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education.

References

S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE, 10(7):e0130140.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. JMLR, 3:1137–1155.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493–2537.

M. Denil, A. Demiraj, and N. de Freitas. 2014. Extraction of Salient Sentences from Labelled Documents. Technical report, University of Oxford.

Y. Dimopoulos, P. Bourret, and S. Lek. 1995. Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters, 2(6):1–4.

D. Erhan, A. Courville, and Y. Bengio. 2010. Understanding Representations Learned in Deep Architectures. Technical report, University of Montreal.

M. Gevrey, I. Dimopoulos, and S. Lek. 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249–264.

Y. Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proc. of EMNLP, pages 1746–1751.

R. Krishnan, G. Sivakumar, and P. Bhattacharya. 1999. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12):1999–2009.

W. Landecker, M. Thomure, L. Bettencourt, M. Mitchell, G. Kenyon, and S. Brumby. 2013. Interpreting Individual Classifications of Hierarchical Networks. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 32–38.

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016a. Analyzing Classifiers: Fisher Vectors and Deep Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016b. The Layer-wise Relevance Propagation Toolbox for Artificial Neural Networks. JMLR, in press.

J. Li, X. Chen, E. Hovy, and D. Jurafsky. 2015. Visualizing and Understanding Neural Models in NLP. arXiv, (1506.01066).

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proc. ICLR.

G. Montavon, S. Bach, A. Binder, W. Samek, and K.-R. Müller. 2015. Explaining NonLinear Classification Decisions with Deep Taylor Decomposition. arXiv, (1512.02479).

H.S. Paskov, R. West, J.C. Mitchell, and T. Hastie. 2013. Compressive Feature Learning. In Adv. in NIPS.

W. Samek, A. Binder, G. Montavon, S. Bach, and K.-R. Müller. 2015. Evaluating the visualization of what a Deep Neural Network has learned. arXiv, (1509.06321).


K. Simonyan, A. Vedaldi, and A. Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop Proc. ICLR.

M. D. Zeiler and R. Fergus. 2014. Visualizing and Understanding Convolutional Networks. In ECCV, pages 818–833.