Interpretable Neural Predictions with Differentiable Binary Variables

Jasmijn Bastings
ILLC, University of Amsterdam
[email protected]

Wilker Aziz
ILLC, University of Amsterdam
[email protected]

Ivan Titov
ILLC, University of Amsterdam
ILCC, University of Edinburgh
[email protected]

Abstract

The success of neural networks comes hand in hand with a desire for more interpretability. We focus on text classifiers and make them more interpretable by having them provide a justification—a rationale—for their predictions. We approach this problem by jointly training two neural network models: a latent model that selects a rationale (i.e. a short and informative part of the input text), and a classifier that learns from the words in the rationale alone. Previous work proposed to assign binary latent masks to input positions and to promote short selections via sparsity-inducing penalties such as L_0 regularisation. We propose a latent model that mixes discrete and continuous behaviour allowing at the same time for binary selections and gradient-based training without REINFORCE. In our formulation, we can tractably compute the expected value of penalties such as L_0, which allows us to directly optimise the model towards a pre-specified text selection rate. We show that our approach is competitive with previous work on rationale extraction, and explore further uses in attention mechanisms.

1 Introduction

Neural networks are bringing incredible performance gains on text classification tasks (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019). However, this power comes hand in hand with a desire for more interpretability, even though its definition may differ (Lipton, 2016). While it is useful to obtain high classification accuracy, with more data available than ever before it also becomes increasingly important to justify predictions. Imagine having to classify a large collection of documents, while verifying that the classifications make sense. It would be extremely time-consuming to read each document to evaluate the results. Moreover, if we do not

[Figure 1 pipeline: the review "pours a dark amber color with decent head that does not recede much . it 's a tad too dark to see the carbonation , but fairs well . smells of roasted malts and mouthfeel is quite strong in the sense that you can get a good taste of it before you even swallow ." is fed to the Rationale Extractor, which selects a subset of its words; the Classifier then predicts the look rating from the selected words alone.]

Figure 1: Rationale extraction for a beer review.

know why a prediction was made, we do not know if we can trust it.

What if the model could provide us the most important parts of the document, as a justification for its prediction? That is exactly the focus of this paper. We use a setting that was pioneered by Lei et al. (2016). A rationale is defined to be a short yet sufficient part of the input text; short so that it makes clear what is most important, and sufficient so that a correct prediction can be made from the rationale alone. One neural network learns to extract the rationale, while another neural network, with separate parameters, learns to make a prediction from just the rationale. Lei et al. model this by assigning a binary Bernoulli variable to each input word. The rationale then consists of all the words for which a 1 was sampled. Because gradients do not flow through discrete samples, the rationale extractor is optimized using REINFORCE (Williams, 1992). An L_0 regularizer is used to make sure the rationale is short.

We propose an alternative to purely discrete selectors for which gradient estimation is possible without REINFORCE, instead relying on a reparameterization of a random variable that exhibits both continuous and discrete behavior (Louizos et al., 2017).


To promote compact rationales, we employ a relaxed form of L_0 regularization (Louizos et al., 2017), penalizing the objective as a function of the expected proportion of selected text. We also propose the use of Lagrangian relaxation to target a specific rate of selected input text.

Our contributions are summarized as follows:1

1. we present a differentiable approach to extractive rationales (§2) including an objective that allows for specifying how much text is to be extracted (§4);

2. we introduce HardKuma (§3), which gives support to binary outcomes and allows for reparameterized gradient estimates;

3. we empirically show that our approach is competitive with previous work and that HardKuma has further applications, e.g. in attention mechanisms (§6).

2 Latent Rationale

We are interested in making NN-based text classifiers interpretable by (i) uncovering which parts of the input text contribute features for classification, and (ii) basing decisions on only a fraction of the input text (a rationale). Lei et al. (2016) approached (i) by inducing binary latent selectors that control which input positions are available to an NN encoder that learns features for classification/regression, and (ii) by regularising their architectures using sparsity-inducing penalties on latent assignments. In this section we put their approach under a probabilistic light, and this will then more naturally lead to our proposed method.

In text classification, an input x is mapped to a distribution over target labels:

Y | x ∼ Cat(f(x; θ)) ,    (1)

where we have a neural network architecture f(·; θ) parameterize the model—θ collectively denotes the parameters of the NN layers in f. That is, an NN maps from data space (e.g. sentences, short paragraphs, or premise-hypothesis pairs) to the categorical parameter space (i.e. a vector of class probabilities). For the sake of concreteness,

1 Code available at https://github.com/bastings/interpretable_predictions.

consider the input a sequence x = ⟨x_1, . . . , x_n⟩. A target y is typically a categorical outcome, such as a sentiment class or an entailment decision, but with an appropriate choice of likelihood it could also be a numerical score (continuous or integer).

Lei et al. (2016) augment this model with a collection of latent variables which we denote by z = ⟨z_1, . . . , z_n⟩. These variables are responsible for regulating which portions of the input x contribute with predictors (i.e. features) to the classifier. The model formulation changes as follows:

Z_i | x ∼ Bern(g_i(x; φ))
Y | x, z ∼ Cat(f(x ⊙ z; θ))    (2)

where an NN g(·; φ) predicts a sequence of n Bernoulli parameters—one per latent variable—and the classifier is modified such that z_i indicates whether or not x_i is available for encoding. We can think of the sequence z as a binary gating mechanism used to select a rationale, which with some abuse of notation we denote by x ⊙ z. Figure 1 illustrates the approach.

Parameter estimation for this model can be done by maximizing a lower bound E(φ, θ) on the log-likelihood of the data derived by application of Jensen's inequality:2

log P(y|x) = log E_{P(z|x,φ)}[P(y|x, z, θ)]
           ≥ E_{P(z|x,φ)}[log P(y|x, z, θ)] = E(φ, θ) .    (3)

These latent rationales approach the first objective, namely, uncovering which parts of the input text contribute towards a decision. However, note that an NN controls the Bernoulli parameters, thus nothing prevents this NN from selecting the whole of the input, thus defaulting to a standard text classifier. To promote compact rationales, Lei et al. (2016) impose sparsity-inducing penalties on latent selectors. They penalise for the total number of selected words, L_0 in (4), as well as for the total number of transitions, fused lasso in (4), and approach the following optimization problem

min_{φ,θ}  −E(φ, θ) + λ_0 Σ_{i=1}^{n} z_i + λ_1 Σ_{i=1}^{n−1} |z_i − z_{i+1}|    (4)

(the first sum is the L_0(z) penalty, the second the fused lasso penalty)

via gradient-based optimisation, where λ_0 and λ_1 are fixed hyperparameters. The objective is, however, intractable to compute: the lower bound, in particular, requires marginalization over O(2^n) binary sequences. For that reason, Lei et al. sample latent assignments and work with gradient estimates using REINFORCE (Williams, 1992).

2 This can be seen as variational inference (Jordan et al., 1999) where we perform approximate inference using a data-dependent prior P(z|x, φ).

The key ingredients are, therefore, binary latent variables and sparsity-inducing regularization, and therefore the solution is marked by non-differentiability. We propose to replace Bernoulli variables by rectified continuous random variables (Socci et al., 1998), for they exhibit both discrete and continuous behaviour. Moreover, they are amenable to reparameterization in terms of a fixed random source (Kingma and Welling, 2014), in which case gradient estimation is possible without REINFORCE. Following Louizos et al. (2017), we exploit one such distribution to relax L_0 regularization and thus promote compact rationales with a differentiable objective. In section 3, we introduce this distribution and present its properties. In section 4, we employ a Lagrangian relaxation to automatically target a pre-specified selection rate. And finally, in section 5 we present an example for sentiment classification.

3 Hard Kumaraswamy Distribution

Key to our model is a novel distribution that exhibits both continuous and discrete behaviour; in this section we introduce it. With non-negligible probability, samples from this distribution evaluate to exactly 0 or exactly 1. In a nutshell: i) we start from a distribution over the open interval (0, 1) (see dashed curve in Figure 2); ii) we then stretch its support from l < 0 to r > 1 in order to include {0} and {1} (see solid curve in Figure 2); finally, iii) we collapse the probability mass over the interval (l, 0] to {0}, and similarly, the probability mass over the interval [1, r) to {1} (shaded areas in Figure 2). This stretch-and-rectify technique was proposed by Louizos et al. (2017), who rectified samples from the BinaryConcrete (or GumbelSoftmax) distribution (Maddison et al., 2017; Jang et al., 2017). We adapted their technique to the Kumaraswamy distribution motivated by its close resemblance to a Beta distribution, for which we have stronger intuitions (for example, its two shape parameters transit rather naturally from unimodal to bimodal configurations of the distribution). In the following, we introduce this new distribution formally.3

3 We use uppercase letters for random variables (e.g. K, T, and H) and lowercase for assignments (e.g. k, t, h). For a random variable K, f_K(k; α) is the probability density function (pdf), conditioned on parameters α, and F_K(k; α) is the cumulative distribution function (cdf).


Figure 2: The HardKuma distribution: we start from a Kuma(0.5, 0.5), and stretch its support to the interval (−0.1, 1.1); finally we collapse all mass before 0 to {0} and all mass after 1 to {1}.

3.1 Kumaraswamy distribution

The Kumaraswamy distribution (Kumaraswamy, 1980) is a two-parameter distribution over the open interval (0, 1); we denote a Kumaraswamy-distributed variable by K ∼ Kuma(a, b), where a ∈ R_{>0} and b ∈ R_{>0} control the distribution's shape. The dashed curve in Figure 2 illustrates the density of Kuma(0.5, 0.5). For more details including its pdf and cdf, consult Appendix A.

The Kumaraswamy is a close relative of the Beta distribution, though not itself an exponential family, with a simple cdf whose inverse

F_K^{-1}(u; a, b) = (1 − (1 − u)^{1/b})^{1/a} ,    (5)

for u ∈ [0, 1], can be used to obtain samples

F_K^{-1}(U; α, β) ∼ Kuma(α, β)    (6)

by transformation of a uniform random source U ∼ U(0, 1). We can use this fact to reparameterize expectations (Nalisnick and Smyth, 2016).
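Equations (5)-(6) translate directly into a reparameterised sampler: draw u uniformly and push it through the inverse cdf. The following is a minimal sketch in PyTorch (the framework used in our experiments); the function name sample_kuma and the numerical clamping of u are our own illustrative choices rather than part of any released code.

import torch

def sample_kuma(a, b, eps=1e-6):
    # Draw Kuma(a, b) samples via the inverse cdf, Eq. (5)-(6).
    # a, b: tensors of strictly positive shape parameters.
    # The sample is a differentiable function of a and b, so gradients
    # flow back to the network that predicted the shape parameters.
    u = torch.rand_like(a).clamp(eps, 1.0 - eps)       # U ~ U(0, 1), kept away from {0, 1}
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# toy usage: one value per position for 2 sequences of length 5
a = torch.full((2, 5), 0.5, requires_grad=True)
b = torch.full((2, 5), 0.5, requires_grad=True)
k = sample_kuma(a, b)                                  # values in the open interval (0, 1)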

3.2 Rectified Kumaraswamy

We stretch the support of the Kumaraswamy distribution to include 0 and 1. The resulting variable T ∼ Kuma(a, b, l, r) takes on values in the open interval (l, r) where l < 0 and r > 1, with cdf

F_T(t; a, b, l, r) = F_K((t − l)/(r − l); a, b) .    (7)

We now define a rectified random variable, denoted by H ∼ HardKuma(a, b, l, r), by passing

a sample T ∼ Kuma(a, b, l, r) through a hard-sigmoid, i.e. h = min(1, max(0, t)). The resulting variable is defined over the closed interval [0, 1]. Note that while there is 0 probability of sampling t = 0, sampling h = 0 corresponds to sampling any t ∈ (l, 0], a set whose mass under Kuma(t|a, b, l, r) is available in closed form:

P(H = 0) = F_K(−l/(r − l); a, b) .    (8)

That is because all negative values of t are deterministically mapped to zero. Similarly, samples t ∈ [1, r) are all deterministically mapped to h = 1, whose total mass amounts to

P(H = 1) = 1 − F_K((1 − l)/(r − l); a, b) .    (9)

See Figure 2 for an illustration, and Appendix A for the complete derivations.

3.3 Reparameterization and gradients

Because this rectified variable is built upon a Kumaraswamy, it admits a reparameterisation in terms of a uniform variable U ∼ U(0, 1). We need to first sample a uniform variable in the open interval (0, 1) and transform the result to a Kumaraswamy variable via the inverse cdf (10a), then shift and scale the result to cover the stretched support (10b), and finally, apply the rectifier in order to get a sample in the closed interval [0, 1] (10c).

k = F_K^{-1}(u; a, b)    (10a)
t = l + (r − l) k    (10b)
h = min(1, max(0, t)) .    (10c)

We denote this h = s(u; a, b, l, r) for short. Note that this transformation has two discontinuity points, namely, t = 0 and t = 1. Though recall, the probability of sampling t exactly 0 or exactly 1 is zero, which essentially means stochasticity circumvents points of non-differentiability of the rectifier (see Appendix A.3).
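Combining the stretch (10b) with the rectifier (10c), and the closed-form masses (8)-(9), the whole HardKuma sampler can be sketched as below. This is again an illustrative PyTorch transcription under our own naming; the support boundaries default to (l, r) = (−0.1, 1.1) as in Figure 2.

import torch

def kuma_cdf(x, a, b):
    # F_K(x; a, b) = 1 - (1 - x^a)^b, the Kumaraswamy cdf (Appendix A).
    return 1.0 - (1.0 - x ** a) ** b

def sample_hardkuma(a, b, l=-0.1, r=1.1, eps=1e-6):
    # Reparameterised HardKuma(a, b, l, r) sample, Eq. (10a-c).
    u = torch.rand_like(a).clamp(eps, 1.0 - eps)
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)    # (10a) inverse-cdf Kuma sample
    t = l + (r - l) * k                                # (10b) stretch to (l, r)
    return torch.clamp(t, 0.0, 1.0)                    # (10c) rectify to [0, 1]

def p_zero(a, b, l=-0.1, r=1.1):
    # P(H = 0) = F_K(-l / (r - l); a, b), Eq. (8).
    return kuma_cdf(-l / (r - l), a, b)

def p_one(a, b, l=-0.1, r=1.1):
    # P(H = 1) = 1 - F_K((1 - l) / (r - l); a, b), Eq. (9).
    return 1.0 - kuma_cdf((1.0 - l) / (r - l), a, b)

Because torch.clamp implements the hard-sigmoid, samples that land in (l, 0] or [1, r) become exact zeros and ones, while the whole mapping stays differentiable almost everywhere.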

4 Controlled Sparsity

Following Louizos et al. (2017), we relax non-differentiable penalties by computing them on expectation under our latent model p(z|x, φ). In addition, we propose the use of Lagrangian relaxation to target specific values for the penalties.

Thanks to the tractable Kumaraswamy cdf, the expected value of L_0(z) is known in closed form

E_{p(z|x)}[L_0(z)] = Σ_{i=1}^{n} E_{p(z_i|x)}[I[z_i ≠ 0]]
                   = Σ_{i=1}^{n} 1 − P(Z_i = 0) ,    (11)

where P(Z_i = 0) = F_K(−l/(r − l); a_i, b_i). This quantity is a tractable and differentiable function of the parameters φ of the latent model. We can also compute a relaxation of fused lasso by computing the expected number of zero-to-nonzero and nonzero-to-zero changes:

E_{p(z|x)}[Σ_{i=1}^{n−1} I[z_i = 0, z_{i+1} ≠ 0]] + E_{p(z|x)}[Σ_{i=1}^{n−1} I[z_i ≠ 0, z_{i+1} = 0]]
  = Σ_{i=1}^{n−1} P(Z_i = 0)(1 − P(Z_{i+1} = 0)) + (1 − P(Z_i = 0)) P(Z_{i+1} = 0) .    (12)

In both cases, we make the assumption that latent variables are independent given x; in Appendix B.1.2 we discuss how to estimate the regularizers for a model p(z_i|x, z_<i) that conditions on the prefix z_<i of sampled HardKuma assignments.
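Under the independence assumption, both relaxed penalties reduce to simple functions of the per-position probabilities P(Z_i = 0). A minimal PyTorch sketch of (11) and (12), with our own function name and the same fixed support as above:

import torch

def expected_penalties(a, b, l=-0.1, r=1.1):
    # Expected L0 (Eq. 11) and expected fused lasso (Eq. 12) for a batch.
    # a, b: [batch, n] positive HardKuma shape parameters.
    # Both outputs are differentiable functions of a and b.
    x0 = -l / (r - l)
    p0 = 1.0 - (1.0 - x0 ** a) ** b                    # P(Z_i = 0) = F_K(x0; a_i, b_i)
    expected_l0 = (1.0 - p0).sum(dim=-1)               # sum_i P(Z_i != 0)
    # expected number of 0 -> nonzero and nonzero -> 0 transitions
    p0_i, p0_next = p0[..., :-1], p0[..., 1:]
    expected_fused = (p0_i * (1.0 - p0_next) + (1.0 - p0_i) * p0_next).sum(dim=-1)
    return expected_l0, expected_fused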

We can use regularizers to promote sparsity, but just how much text will our final model select? Ideally, we would target specific values r and solve a constrained optimization problem. In practice, constrained optimisation is very challenging, thus we employ Lagrangian relaxation instead:

max_{λ∈R} min_{φ,θ}  −E(φ, θ) + λ^⊤ (R(φ) − r)    (13)

where R(φ) is a vector of regularisers, e.g. expected L_0 and expected fused lasso, and λ is a vector of Lagrangian multipliers. Note how this differs from the treatment of Lei et al. (2016) shown in (4), where regularizers are computed for assignments, rather than on expectation, and where λ_0, λ_1 are fixed hyperparameters.
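In an implementation-level reading of (13), training alternates a descent step on (φ, θ) with an ascent step on λ. The sketch below is one such scheme; the multiplier step size and the use of a single scalar constraint are assumptions made for illustration, not a transcription of the released training loop.

import torch

def lagrangian_objective(loss, reg, target, lmbda):
    # Relaxed Lagrangian of Eq. (13) for one minibatch.
    # loss   : -E(phi, theta), the negative lower bound (tensor)
    # reg    : differentiable regulariser value R(phi), e.g. the expected selection rate
    # target : desired value r for the regulariser (float)
    # lmbda  : current Lagrange multiplier (detached scalar tensor)
    return loss + lmbda * (reg - target)

def update_multiplier(lmbda, reg, target, lr_lambda=0.01):
    # Gradient-ascent step on the multiplier (step size is an assumed value).
    return lmbda + lr_lambda * (reg.detach() - target)

After the usual optimiser step on (φ, θ), the multiplier grows while the constraint is violated and shrinks once the model selects less text than the target, which is what pushes the selection rate towards r.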

5 Sentiment Classification

As a concrete example, consider the case of sentiment classification where x is a sentence and y is a 5-way sentiment class (from very negative to very positive). The model consists of

Z_i ∼ HardKuma(a_i, b_i, l, r)
Y | x, z ∼ Cat(f(x ⊙ z; θ))    (14)

where the shape parameters a, b = g(x; φ), i.e. two sequences of n strictly positive scalars, are predicted by an NN, and the support boundaries (l, r) are fixed hyperparameters.

We first specify an architecture that parameterizes latent selectors and then use a reparameterized sample to restrict which parts of the input contribute encodings for classification:4

e_i = emb(x_i)
h_1^n = birnn(e_1^n; φ_r)
u_i ∼ U(0, 1)
a_i = f_a(h_i; φ_a)
b_i = f_b(h_i; φ_b)
z_i = s(u_i; a_i, b_i, l, r)

where emb(·) is an embedding layer, birnn(·; φ_r) is a bidirectional encoder, f_a(·; φ_a) and f_b(·; φ_b) are feed-forward transformations with softplus outputs, and s(·) turns the uniform sample u_i into the latent selector z_i (see §3). We then use the sampled z to modulate inputs to the classifier:

e_i = emb(x_i)
h_i^(fwd) = rnn(h_{i−1}^(fwd), z_i e_i; θ_fwd)
h_i^(bwd) = rnn(h_{i+1}^(bwd), z_i e_i; θ_bwd)
o = f_o(h_n^(fwd), h_1^(bwd); θ_o)

where rnn(·; θ_fwd) and rnn(·; θ_bwd) are recurrent cells such as LSTMs (Hochreiter and Schmidhuber, 1997) that process the sequence in different directions, and f_o(·; θ_o) is a feed-forward transformation with softmax output. Note how z_i modulates features e_i of the input x_i that are available to the recurrent composition function.
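For concreteness, the two components can be wired together roughly as follows. This is a compressed, illustrative PyTorch sketch: layer names and sizes, the use of plain LSTMs, and the omission of padding, masking and dropout are our simplifications, not a one-to-one copy of the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRationaleModel(nn.Module):
    # Sketch of the Section 5 architecture: a HardKuma selector gates the classifier input.

    def __init__(self, vocab_size, emb_dim=200, hid_dim=200, num_classes=5, l=-0.1, r=1.1):
        super().__init__()
        self.hid, self.l, self.r = hid_dim, l, r
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc_z = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.f_a = nn.Linear(2 * hid_dim, 1)                 # predicts shape parameter a_i
        self.f_b = nn.Linear(2 * hid_dim, 1)                 # predicts shape parameter b_i
        self.enc_y = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, x):                                    # x: [B, n] token ids
        e = self.emb(x)                                      # [B, n, emb_dim]
        h, _ = self.enc_z(e)                                 # latent-model encoder (birnn)
        a = F.softplus(self.f_a(h)).squeeze(-1) + 1e-6       # strictly positive shapes
        b = F.softplus(self.f_b(h)).squeeze(-1) + 1e-6
        u = torch.rand_like(a).clamp(1e-6, 1 - 1e-6)
        k = (1 - (1 - u) ** (1 / b)) ** (1 / a)              # Kuma sample, Eq. (10a)
        z = torch.clamp(self.l + (self.r - self.l) * k, 0.0, 1.0)   # HardKuma, Eq. (10b-c)
        h_y, _ = self.enc_y(z.unsqueeze(-1) * e)             # classifier only sees z_i * e_i
        fwd_last, bwd_first = h_y[:, -1, :self.hid], h_y[:, 0, self.hid:]
        logits = self.out(torch.cat([fwd_last, bwd_first], dim=-1))
        return logits, z, a, b

# usage sketch: z is the stochastic rationale selector fed to the expected penalties above
model = LatentRationaleModel(vocab_size=10000)
logits, z, a, b = model(torch.randint(0, 10000, (2, 12)))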

We then obtain gradient estimates of E(φ, θ) via Monte Carlo (MC) sampling from

E(φ, θ) = E_{U(0,I)}[log P(y|x, s_φ(u, x), θ)]    (15)

where z = s_φ(u, x) is a shorthand for element-wise application of the transformation from uniform samples to HardKuma samples. This reparameterisation is the key to gradient estimation through stochastic computation graphs (Kingma and Welling, 2014; Rezende et al., 2014).

4 We describe architectures using blocks denoted by layer(inputs; subset of parameters), boldface letters for vectors, and the shorthand v_1^n for a sequence ⟨v_1, . . . , v_n⟩.

SVM (Lei et al., 2016)      0.0154
BiLSTM (Lei et al., 2016)   0.0094
BiRCNN (Lei et al., 2016)   0.0087
BiLSTM (ours)               0.0089
BiRCNN (ours)               0.0088

Table 1: MSE on the BeerAdvocate test set.

Deterministic predictions. At test time we make predictions based on what is the most likely assignment for each z_i. We take the arg max across configurations of the distribution, namely, z_i = 0, z_i = 1, or 0 < z_i < 1. When the continuous interval is more likely, we take the expected value of the underlying Kumaraswamy variable.
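This decision rule depends only on the predicted shapes (a_i, b_i). The sketch below spells it out in PyTorch; it assumes, as stated above, that the continuous case falls back to the mean of the underlying Kumaraswamy, E[K] = b·B(1 + 1/a, b), computed here via log-gamma.

import torch

def kuma_cdf(x, a, b):
    return 1.0 - (1.0 - x ** a) ** b

def kuma_mean(a, b):
    # E[K] = b * B(1 + 1/a, b) for K ~ Kuma(a, b), with B the Beta function.
    log_beta = torch.lgamma(1 + 1 / a) + torch.lgamma(b) - torch.lgamma(1 + 1 / a + b)
    return b * torch.exp(log_beta)

def deterministic_z(a, b, l=-0.1, r=1.1):
    # Test-time assignment: argmax over the three configurations z=0, z=1, 0<z<1.
    p0 = kuma_cdf(-l / (r - l), a, b)                 # P(Z = 0)
    p1 = 1.0 - kuma_cdf((1 - l) / (r - l), a, b)      # P(Z = 1)
    pc = 1.0 - p0 - p1                                # P(0 < Z < 1)
    choice = torch.stack([p0, p1, pc], dim=-1).argmax(dim=-1)
    z_cont = kuma_mean(a, b)                          # fallback for the continuous case
    return torch.where(choice == 0, torch.zeros_like(a),
           torch.where(choice == 1, torch.ones_like(a), z_cont))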

6 Experiments

We perform experiments on multi-aspect sentiment analysis to compare with previous work, as well as experiments on sentiment classification and natural language inference. All models were implemented in PyTorch, and Appendix B provides implementation details.

Goal. When rationalizing predictions, our goal is to perform as well as systems using the full input text, while using only a subset of the input text, leaving unnecessary words out for interpretability.

6.1 Multi-aspect Sentiment Analysis

In our first experiment we compare directly with previous work on rationalizing predictions (Lei et al., 2016). We replicate their setting.

Data. A pre-processed subset of the BeerAdvocate5 data set is used (McAuley et al., 2012). It consists of 220,000 beer reviews, where multiple aspects (e.g. look, smell, palate) are rated. As shown in Figure 1, a review typically consists of multiple sentences, and contains a 0-5 star rating (e.g. 3.5 stars) for each aspect. Lei et al. mapped the ratings to scalars in [0, 1].

Model. We use the models described in §5 with two small modifications: 1) since this is a regression task, we use a sigmoid activation in the output layer of the classifier rather than a softmax,6 and

5 https://www.beeradvocate.com/
6 From a likelihood learning point of view, we would have assumed a Logit-Normal likelihood; however, to stay closer to Lei et al. (2016), we employ mean squared error.


Method                   Look                   Smell                  Palate
                         Precision  Selected    Precision  Selected    Precision  Selected
Attention (Lei et al.)   80.6       13          88.4       7           65.3       7
Bernoulli (Lei et al.)   96.3       14          95.1       7           80.2       7
Bernoulli (reimpl.)      94.8       13          95.1       7           80.5       7
HardKuma                 98.1       13          96.8       7           89.8       7

Table 2: Precision (% of selected words that was also annotated as the gold rationale) and selected (% of words not zeroed out) per aspect. In the attention baseline, the top 13% (7%) of words with highest attention weights are used for classification. Models were selected based on validation loss.

2) we use an extra RNN to condition z_i on z_<i:

a_i = f_a(h_i, s_{i−1}; φ_a)    (16a)
b_i = f_b(h_i, s_{i−1}; φ_b)    (16b)
s_i = rnn(h_i, z_i, s_{i−1}; φ_s)    (16c)
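A sketch of this dependent selector is given below; the choice of a GRU cell for the recurrence in (16c), the size of its state, and the way z_i is fed back are stand-ins of ours, since the equations above do not fix them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DependentSelector(nn.Module):
    # Sketch of Eq. (16): shape parameters conditioned on previously sampled z_<i.

    def __init__(self, enc_dim=200, state_dim=30, l=-0.1, r=1.1):
        super().__init__()
        self.cell = nn.GRUCell(enc_dim + 1, state_dim)        # consumes [h_i; z_i], plays the role of (16c)
        self.f_a = nn.Linear(enc_dim + state_dim, 1)
        self.f_b = nn.Linear(enc_dim + state_dim, 1)
        self.state_dim, self.l, self.r = state_dim, l, r

    def forward(self, h):                                     # h: [B, n, enc_dim] encoder states
        s = h.new_zeros(h.size(0), self.state_dim)
        zs = []
        for i in range(h.size(1)):
            inp = torch.cat([h[:, i], s], dim=-1)
            a = F.softplus(self.f_a(inp)).squeeze(-1) + 1e-6  # (16a)
            b = F.softplus(self.f_b(inp)).squeeze(-1) + 1e-6  # (16b)
            u = torch.rand_like(a).clamp(1e-6, 1 - 1e-6)
            k = (1 - (1 - u) ** (1 / b)) ** (1 / a)
            z = torch.clamp(self.l + (self.r - self.l) * k, 0.0, 1.0)
            s = self.cell(torch.cat([h[:, i], z.unsqueeze(-1)], dim=-1), s)   # (16c)
            zs.append(z)
        return torch.stack(zs, dim=1)                         # [B, n] HardKuma selectors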

For a fair comparison we follow Lei et al. by using RCNN7 cells rather than LSTM cells for encoding sentences on this task. Since this cell is not widely used, we verified its performance in Table 1. We observe that the BiRCNN performs on par with the BiLSTM (while using 50% fewer parameters), and similarly to previous results.

Evaluation. A test set with sentence-level rationale annotations is available. The precision of a rationale is defined as the percentage of words with z ≠ 0 that is part of the annotation. We also evaluate the predictions made from the rationale using mean squared error (MSE).

Baselines. For our baseline we reimplemented the approach of Lei et al. (2016), which we call Bernoulli after the distribution they use to sample z from. We also report their attention baseline, in which an attention score is computed for each word, after which it is simply thresholded to select the top-k percent as the rationale.

Results. Table 2 shows the precision and the percentage of selected words for the first three aspects. The models here have been selected based on validation MSE and were tuned to select a similar percentage of words ('selected'). We observe that our Bernoulli reimplementation reaches precision similar to previous work, doing a little bit worse for the 'look' aspect. Our HardKuma managed to get even higher precision, and it extracted exactly the percentage of text that we specified (see §4).8

7 An RCNN cell can replace any LSTM cell and works well on text classification problems. See Appendix B.


Figure 3: MSE of all aspects for various percentages of extracted text. HardKuma (blue crosses) has lower error than Bernoulli (red circles; open circles taken from Lei et al. (2016)) for similar amounts of extracted text. The full-text baseline (black star) gets the best MSE.

Figure 3 shows the MSE for all aspects for various percentages of extracted text. We observe that HardKuma does better with a smaller percentage of text selected. The performance becomes more similar as more text is selected.

6.2 Sentiment Classification

We also experiment on the Stanford Sentiment Treebank (SST) (Socher et al., 2013). There are 5 sentiment classes: very negative, negative, neutral, positive, and very positive. Here we use the HardKuma model described in §5, a Bernoulli model trained with REINFORCE, as well as a BiLSTM.

Results. Figure 4 shows the classification accuracy for various percentages of selected text. We observe that HardKuma outperforms the Bernoulli model at each percentage of selected text. HardKuma reaches full-text baseline performance already around 40% extracted text. At that point, it obtains a test score of 45.84, versus 42.22 for Bernoulli and 47.4 ± 0.8 for the full-text baseline.

8 We tried to use Lagrangian relaxation for the Bernoulli model, but this led to instabilities (e.g. all words selected).



Figure 4: SST validation accuracy for various percentages of extracted text. HardKuma (blue crosses) has higher accuracy than Bernoulli (red circles) for similar amounts of text, and reaches the full-text baseline (black star, 46.3 ± 2σ with σ = 0.7) around 40% text.


Figure 5: The number of words in each sentiment class for the full validation set, the HardKuma (24% selected text) and Bernoulli (25% text).

Analysis. We wonder what kind of words are dropped when we select smaller amounts of text. For this analysis we exploit the word-level sentiment annotations in SST, which allow us to track the sentiment of words in the rationale. Figure 5 shows that a large portion of dropped words have neutral sentiment, and it seems plausible that exactly those words are not important features for classification. We also see that HardKuma drops (relatively) more neutral words than Bernoulli.

6.3 Natural Language Inference

In natural language inference (NLI), given a premise sentence x(p) and a hypothesis sentence x(h), the goal is to predict their relation y, which can be contradiction, entailment, or neutral. As our dataset we use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015).

Baseline. We use the Decomposable Attention model (DA) of Parikh et al. (2016).9 DA does not make use of LSTMs, but rather uses attention to find connections between the premise and the hypothesis that are predictive of the relation.

9 Better results, e.g. Chen et al. (2017), and data sets for NLI exist, but are not the focus of this paper.

Each word in the premise attends to each word in the hypothesis, and vice versa, resulting in a set of comparison vectors which are then aggregated for a final prediction. If there is no link between a word pair, it is not considered for prediction.

Model. Because the premise and hypothesis interact, it does not make sense to extract a rationale for the premise and hypothesis independently. Instead, we replace the attention between premise and hypothesis with HardKuma attention. Whereas in the baseline a similarity matrix is softmax-normalized across rows (premise to hypothesis) and columns (hypothesis to premise) to produce attention matrices, in our model each cell in the attention matrix is sampled from a HardKuma parameterized by (a, b). To promote sparsity, we use the relaxed L_0 to specify the desired percentage of non-zero attention cells. The resulting matrix does not need further normalization.
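A sketch of such a HardKuma attention layer is given below (PyTorch, our naming). How the two shape parameters of each cell are derived from pairwise scores is an assumption made for illustration; the text above only states that each cell is sampled from a HardKuma parameterized by (a, b) and that the non-zero rate is steered through the relaxed L_0.

import torch
import torch.nn.functional as F

def hardkuma_attention(scores_a, scores_b, l=-0.1, r=1.1, eps=1e-6):
    # scores_a, scores_b: [B, m, n] unnormalised scores, one pair per premise-hypothesis cell.
    # Returns an attention matrix with exact zeros (no softmax normalisation) and the
    # expected fraction of non-zero cells, to be pushed towards the target rate.
    a = F.softplus(scores_a) + eps
    b = F.softplus(scores_b) + eps
    u = torch.rand_like(a).clamp(eps, 1 - eps)
    k = (1 - (1 - u) ** (1 / b)) ** (1 / a)
    att = torch.clamp(l + (r - l) * k, 0.0, 1.0)              # [B, m, n], many exact zeros
    p_zero = 1.0 - (1.0 - (-l / (r - l)) ** a) ** b           # F_K(-l/(r-l); a, b) per cell
    expected_nonzero_rate = (1.0 - p_zero).mean()
    return att, expected_nonzero_rate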

Results. With a target rate of 10%, the HardKuma model achieved 8.5% non-zero attention. Table 3 shows that, even with so many zeros in the attention matrices, it only does about 1% worse compared to the DA baseline. Figure 6 shows an example of HardKuma attention, with additional examples in Appendix B. We leave further explorations with HardKuma attention for future work.

Model                           Dev    Test
LSTM (Bowman et al., 2016)      –      80.6
DA (Parikh et al., 2016)        –      86.3
DA (reimplementation)           86.9   86.5
DA with HardKuma attention      86.0   85.5

Table 3: SNLI results (accuracy).


Figure 6: Example of HardKuma attention between a premise (rows) and hypothesis (columns) in SNLI (cell values shown in multiples of 10^−2).


7 Related Work

This work has connections with work on interpretability, learning from rationales, sparse structures, and rectified distributions. We discuss each of those areas.

Interpretability. Machine learning research has been focusing more and more on interpretability (Gilpin et al., 2018). However, there are many nuances to interpretability (Lipton, 2016), and amongst them we focus on model transparency.

One strategy is to extract a simpler, interpretable model from a neural network, though this comes at the cost of performance. For example, Thrun (1995) extracts if-then rules, while Craven and Shavlik (1996) extract decision trees.

There is also work on making word vectors more interpretable. Faruqui et al. (2015) make word vectors more sparse, and Herbelot and Vecchi (2015) learn to map distributional word vectors to model-theoretic semantic vectors.

Similarly to Lei et al. (2016), Titov and McDonald (2008) extract informative fragments of text by jointly training a classifier and a model predicting a stochastic mask, while relying on Gibbs sampling to do so. Their focus is on using the sentiment labels as a weak supervision signal for opinion summarization rather than on rationalizing classifier predictions.

There are also related approaches that aim to interpret an already-trained model, in contrast to Lei et al. (2016) and our approach where the rationale is jointly modeled. Ribeiro et al. (2016) make any classifier interpretable by approximating it locally with a linear proxy model in an approach called LIME, and Alvarez-Melis and Jaakkola (2017) propose a framework that returns input-output pairs that are causally related.

Learning from rationales. Our work is different from approaches that aim to improve classification using rationales as an additional input (Zaidan et al., 2007; Zaidan and Eisner, 2008; Zhang et al., 2016). Instead, our rationales are latent and we are interested in uncovering them. We only use annotated rationales for evaluation.

Sparse layers. Also arguing for enhanced interpretability, Niculae and Blondel (2017) propose a framework for learning sparsely activated attention layers based on smoothing the max operator. They derive a number of relaxations to max, including softmax itself, but in particular, they target relaxations such as sparsemax (Martins and Astudillo, 2016) which, unlike softmax, are sparse (i.e. produce vectors of probability values with components that evaluate to exactly 0). Their activation functions are themselves solutions to convex optimization problems, to which they provide efficient forward and backward passes. The technique can be seen as a deterministic sparsely activated layer which they use as a drop-in replacement to standard attention mechanisms. In contrast, in this paper we focus on binary outcomes rather than K-valued ones. Niculae et al. (2018) extend the framework to structured discrete spaces where they learn sparse parameterizations of discrete latent models. In this context, parameter estimation requires exact marginalization of discrete variables or gradient estimation via REINFORCE. They show that oftentimes distributions are sparse enough to enable exact marginal inference.

Peng et al. (2018) propose SPIGOT, a proxy gradient to the non-differentiable arg max operator. This proxy requires an arg max solver (e.g. Viterbi for structured prediction) and, like the straight-through estimator (Bengio et al., 2013), is a biased estimator. Though, unlike ST, it is efficient for structured variables. In contrast, in this work we chose to focus on unbiased estimators.

Rectified Distributions. The idea of rectified distributions has been around for some time. The rectified Gaussian distribution (Socci et al., 1998), in particular, has found applications to factor analysis (Harva and Kaban, 2005) and approximate inference in graphical models (Winn and Bishop, 2005). Louizos et al. (2017) propose to stretch and rectify samples from the BinaryConcrete (or GumbelSoftmax) distribution (Maddison et al., 2017; Jang et al., 2017). They use rectified variables to induce sparsity in parameter space via a relaxation to L_0. We adapt their technique to promote sparse activations instead. Rolfe (2017) learns a relaxation of a discrete random variable based on a tractable mixture of a point mass at zero and a continuous reparameterizable density, thus enabling reparameterized sampling from the half-closed interval [0, ∞). In contrast, with HardKuma we focused on giving support to both 0s and 1s.

8 Conclusions

We presented a differentiable approach to extractive rationales, including an objective that allows for specifying how much text is to be extracted. To allow for reparameterized gradient estimates and support for binary outcomes we introduced the HardKuma distribution. Apart from extracting rationales, we showed that HardKuma has further potential uses, which we demonstrated on premise-hypothesis attention in SNLI. We leave further explorations for future work.

Acknowledgments

We thank Luca Falorsi for pointing us to Louizos et al. (2017), which inspired the HardKumaraswamy distribution. This work has received funding from the European Research Council (ERC StG BroadSem 678254), the European Union's Horizon 2020 research and innovation programme (grant agreement No 825299, GoURMET), and the Dutch National Science Foundation (NWO VIDI 639.022.518, NWO VICI 277-89-002).

References

David Alvarez-Melis and Tommi Jaakkola. 2017. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412–421. Association for Computational Linguistics.

Yoshua Bengio, Nicholas Leonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1477. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.

Mark Craven and Jude W. Shavlik. 1996. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, pages 24–30.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1491–1500. Association for Computational Linguistics.

Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE.

Markus Harva and Ata Kaban. 2005. A variational Bayesian method for rectified factor analysis. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 1, pages 185–190. IEEE.

Aurelie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 22–32. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. International Conference on Learning Representations.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.

Ponnambalam Kumaraswamy. 1980. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117. Association for Computational Linguistics.

Zachary Chase Lipton. 2016. The mythos of model interpretability. ICML Workshop on Human Interpretability in Machine Learning (WHI 2016).

Christos Louizos, Max Welling, and Diederik P. Kingma. 2017. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations.

Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1020–1025. IEEE.

Eric Nalisnick and Padhraic Smyth. 2016. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, pages 3338–3348.

Vlad Niculae, Andre F. T. Martins, and Claire Cardie. 2018. Towards dynamic computation graphs via sparse latent structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905–911. Association for Computational Linguistics.

Ankur Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.

Hao Peng, Sam Thomson, and Noah A. Smith. 2018. Backpropagating through structured argmax using a SPIGOT. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1863–1873. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Bejing, China. PMLR.

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101. Association for Computational Linguistics.

Jason Tyler Rolfe. 2017. Discrete variational autoencoders. In ICLR.

Nicholas D. Socci, Daniel D. Lee, and H. Sebastian Seung. 1998. The rectified Gaussian distribution. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 350–356. MIT Press.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642. Association for Computational Linguistics.

Sebastian Thrun. 1995. Extracting rules from artificial neural networks with distributed representations. In Advances in Neural Information Processing Systems, pages 505–512.

Ivan Titov and Ryan McDonald. 2008. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of ACL.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

John Winn and Christopher M. Bishop. 2005. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694.

Omar Zaidan and Jason Eisner. 2008. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 31–40, Honolulu, Hawaii. Association for Computational Linguistics.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267. Association for Computational Linguistics.

Ye Zhang, Iain Marshall, and Byron C. Wallace. 2016. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795–804, Austin, Texas. Association for Computational Linguistics.


A Kumaraswamy distribution

[Figure 7 shows Kuma densities for (a, b) ∈ {(0.5, 0.5), (5, 1), (2, 2), (2, 5), (0.1, 0.1), (1.0, 1.0)}.]

Figure 7: Kuma plots for various (a, b) parameters.

A Kumaraswamy-distributed variable K ∼ Kuma(a, b) takes on values in the open interval (0, 1) and has density

f_K(k; a, b) = a b k^{a−1} (1 − k^a)^{b−1} ,    (17)

where a ∈ R_{>0} and b ∈ R_{>0} are shape parameters. Its cumulative distribution takes a simple closed-form expression

F_K(k; a, b) = ∫_0^k f_K(ξ; a, b) dξ    (18a)
             = 1 − (1 − k^a)^b ,    (18b)

with inverse

F_K^{-1}(u; a, b) = (1 − (1 − u)^{1/b})^{1/a} .    (19)

A.1 Generalised-support Kumaraswamy

We can generalise the support of a Kumaraswamy variable by specifying two constants l < r and transforming a random variable K ∼ Kuma(a, b) to obtain T ∼ Kuma(a, b, l, r) as shown in (20, left).

t = l + (r − l) k        k = (t − l)/(r − l)    (20)

The density of the resulting variable is

f_T(t; a, b, l, r) = f_K((t − l)/(r − l); a, b) |dk/dt|
                   = f_K((t − l)/(r − l); a, b) · 1/(r − l) ,    (21)

where r − l > 0 by definition. This affine transformation leaves the cdf unchanged, i.e.

F_T(t_0; a, b, l, r) = ∫_{−∞}^{t_0} f_T(t; a, b, l, r) dt
                     = ∫_{−∞}^{t_0} f_K((t − l)/(r − l); a, b) · 1/(r − l) dt
                     = ∫_{−∞}^{(t_0 − l)/(r − l)} f_K(k; a, b) · 1/(r − l) · (r − l) dk
                     = F_K((t_0 − l)/(r − l); a, b) .    (22)

Thus we can obtain samples from this generalised-support Kumaraswamy by sampling from a uniform distribution U(0, 1), applying the inverse transform (19), then shifting and scaling the sample according to (20, left).

A.2 Rectified Kumaraswamy

First, we stretch a Kumaraswamy distribution to include 0 and 1 in its support, that is, with l < 0 and r > 1, we define T ∼ Kuma(a, b, l, r). Then we apply a hard-sigmoid transformation to this variable, that is, h = min(1, max(0, t)), which results in a rectified distribution which gives support to the closed interval [0, 1]. We denote this rectified variable by

H ∼ HardKuma(a, b, l, r)    (23)

whose distribution function is

f_H(h; a, b, l, r) = P(h = 0) δ(h) + P(h = 1) δ(h − 1)
                     + P(0 < h < 1) · f_T(h; a, b, l, r) 1_{(0,1)}(h) / P(0 < h < 1) ,    (24)

where

P(h = 0) = P(t ≤ 0) = F_T(0; a, b, l, r) = F_K(−l/(r − l); a, b)    (25)

is the probability of sampling exactly 0, where

P(h = 1) = P(t ≥ 1) = 1 − P(t < 1)
         = 1 − F_T(1; a, b, l, r) = 1 − F_K((1 − l)/(r − l); a, b)    (26)

is the probability of sampling exactly 1, and

P(0 < h < 1) = 1 − P(h = 0) − P(h = 1)    (27)

is the probability of drawing a continuous value in (0, 1). Note that we used the result in (22) to express these probabilities in terms of the tractable cdf of the original Kumaraswamy variable.


A.3 Reparameterized gradients

Let us consider the case where we need derivatives of a function L(u) of the underlying uniform variable u, as when we compute reparameterized gradients in variational inference. Note that

∂L/∂u = ∂L/∂h × ∂h/∂t × ∂t/∂k × ∂k/∂u ,    (28)

by the chain rule. The term ∂L/∂h depends on a differentiable observation model and poses no challenge; the term ∂h/∂t is the derivative of the hard-sigmoid function, which is 0 for t < 0 or t > 1, 1 for 0 < t < 1, and undefined for t ∈ {0, 1}; the term ∂t/∂k = r − l follows directly from (20, left); the term ∂k/∂u = ∂/∂u F_K^{-1}(u; a, b) depends on the Kumaraswamy inverse cdf (19) and also poses no challenge. Thus the only two discontinuities happen for t ∈ {0, 1}, which is a 0 measure set under the stretched Kumaraswamy: we say this reparameterisation is differentiable almost everywhere, a useful property which essentially circumvents the discontinuity points of the rectifier.

A.4 HardKumaraswamy PDF and CDF

Figure 8 plots the pdf of the HardKumaraswamy for various a and b parameters. Figure 9 does the same but with the cdf.

Figure 8: HardKuma pdf for various (a, b).

B Implementation Details

B.1 Multi-aspect Sentiment Analysis

Our hyperparameters are taken from Lei et al. (2016) and listed in Table 4. The pre-trained word embeddings and data sets are available online at http://people.csail.mit.edu/taolei/beer/. We train for 100 epochs and select the best models based on validation loss.

Figure 9: HardKuma cdf for various (a, b).

For the MSE trade-off experiments on all aspects combined, we train for a maximum of 50 epochs.

Optimizer         Adam
Learning rate     0.0004
Word embeddings   200D (Wiki, fixed)
Hidden size       200
Batch size        256
Dropout           0.1, 0.2
Weight decay      1 × 10^−6
Cell              RCNN

Table 4: Beer hyperparameters.

For the Bernoulli baselines we vary the L_0 weight λ_1 among {0.0002, 0.0003, 0.0004}, just as in the original paper. We set the fused lasso (coherence) weight λ_2 to 2 · λ_1.

For the HardKuma models we set a target selection rate to the values targeted in Table 2, and optimize to this end using the Lagrange multiplier. We chose the fused lasso weight from {0.0001, 0.0002, 0.0003, 0.0004}.

B.1.1 Recurrent Unit

In our multi-aspect sentiment analysis experiments we use the RCNN of Lei et al. (2016). Intuitively, the RCNN is supposed to capture n-gram features that are not necessarily consecutive. We use the bigram version (filter width n = 2) used in


Lei et al. (2016), which is defined as:

λ_t = σ(W^λ x_t + U^λ h_{t−1} + b^λ)
c_t^(1) = λ_t ⊙ c_{t−1}^(1) + (1 − λ_t) ⊙ W_1 x_t
c_t^(2) = λ_t ⊙ c_{t−1}^(2) + (1 − λ_t) ⊙ (c_{t−1}^(1) + W_2 x_t)
h_t = tanh(c_t^(2) + b)
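As a cross-check, the cell above can be transcribed almost literally into PyTorch; the grouping of the weights into linear layers and the zero initial state below are our own choices.

import torch
import torch.nn as nn

class RCNNCell(nn.Module):
    # Bigram RCNN cell of Lei et al. (2016), as written in the equations above.

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w_lambda = nn.Linear(input_size, hidden_size)               # W^lambda x_t + b^lambda
        self.u_lambda = nn.Linear(hidden_size, hidden_size, bias=False)  # U^lambda h_{t-1}
        self.w1 = nn.Linear(input_size, hidden_size, bias=False)
        self.w2 = nn.Linear(input_size, hidden_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(hidden_size))               # b in the output equation

    def forward(self, x_t, state):
        h_prev, c1_prev, c2_prev = state
        lam = torch.sigmoid(self.w_lambda(x_t) + self.u_lambda(h_prev))
        c1 = lam * c1_prev + (1 - lam) * self.w1(x_t)
        c2 = lam * c2_prev + (1 - lam) * (c1_prev + self.w2(x_t))
        h = torch.tanh(c2 + self.bias)
        return h, (h, c1, c2)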

B.1.2 Expected values for dependent latent variables

The expected L_0 is a chain of nested expectations, and we solve each term

E_{p(z_i|x,z_<i)}[I[z_i ≠ 0] | z_<i] = 1 − F_K(−l/(r − l); a_i, b_i)    (29)

as a function of a sampled prefix, and the shape parameters a_i, b_i = g_i(x, z_<i; φ) are predicted in sequence.

B.2 Sentiment Classification (SST)

For sentiment classification we make use of the PyTorch bidirectional LSTM module for encoding sentences, for both the rationale extractor and the classifier. The BiLSTM final states are concatenated, after which a linear layer followed by a softmax produces the prediction. Hyperparameters are listed in Table 5. We apply dropout to the embeddings and to the input of the output layer.

Optimizer         Adam
Learning rate     0.0002
Word embeddings   300D GloVe (fixed)
Hidden size       150
Batch size        25
Dropout           0.5
Weight decay      1 × 10^−6
Cell              LSTM

Table 5: SST hyperparameters.

B.3 Natural Language Inference (SNLI)

Our hyperparameters are taken from Parikh et al. (2016) and listed in Table 6. Different from Parikh et al., we use Adam as the optimizer and a batch size of 64. Word embeddings are projected to 200 dimensions with a trained linear layer. Unknown words are mapped to 100 unknown word classes based on the MD5 hash function, just as in Parikh et al. (2016), and unknown word vectors are randomly initialized. We train for 100 epochs,

evaluate every 1000 updates, and select the best model based on validation loss. Figure 10 shows a correct and incorrect example with HardKuma attention for each relation type (entailment, contradiction, neutral).

Optimizer         Adam
Learning rate     0.0001
Word embeddings   300D (GloVe, fixed)
Hidden size       200
Batch size        64
Dropout           0.2

Table 6: SNLI hyperparameters.


[Figure 10 panels: (a) Entailment (correct); (b) Entailment (incorrect, pred: neutral); (c) Contradiction (correct); (d) Contradiction (incorrect, pred: entailment); (e) Neutral (correct); (f) Neutral (incorrect, pred: entailment).]

Figure 10: HardKuma attention in SNLI for entailment, contradiction, and neutral.
