
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1657–1668, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1152

Enhanced LSTM for Natural Language Inference

Qian Chen, University of Science and Technology of China, [email protected]

Xiaodan Zhu, National Research Council Canada, [email protected]

Zhenhua Ling, University of Science and Technology of China, [email protected]

Si Wei, iFLYTEK Research, [email protected]

Hui Jiang, York University, [email protected]

Diana Inkpen, University of Ottawa, [email protected]

Abstract

Reasoning and inference are central to human and artificial intelligence. Modeling inference in human language is very challenging. With the availability of large annotated data (Bowman et al., 2015), it has recently become feasible to train neural network based inference models, which have been shown to be very effective. In this paper, we present a new state-of-the-art result, achieving an accuracy of 88.6% on the Stanford Natural Language Inference Dataset. Unlike the previous top models that use very complicated network architectures, we first demonstrate that carefully designing sequential inference models based on chain LSTMs can outperform all previous models. Based on this, we further show that by explicitly considering recursive architectures in both local inference modeling and inference composition, we achieve additional improvement. Particularly, incorporating syntactic parsing information contributes to our best result—it further improves the performance even when added to the already very strong model.

1 Introduction

Reasoning and inference are central to both human and artificial intelligence. Modeling inference in human language is notoriously challenging but is a basic problem towards true natural language understanding, as pointed out by MacCartney and Manning (2008): "a necessary (if not sufficient) condition for true natural language understanding is a mastery of open-domain natural language inference." Previous work has included extensive research on recognizing textual entailment.

Specifically, natural language inference (NLI) is concerned with determining whether a natural-language hypothesis h can be inferred from a premise p, as depicted in the following example from MacCartney (2009), where the hypothesis is regarded to be entailed by the premise.

p: Several airlines polled saw costs grow more than expected, even after adjusting for inflation.

h: Some of the companies in the poll reported cost increases.

The most recent years have seen advances in modeling natural language inference. An important contribution is the creation of a much larger annotated dataset, the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). The corpus has 570,000 human-written English sentence pairs manually labeled by multiple human subjects. This makes it feasible to train more complex inference models. Neural network models, which often need relatively large annotated data to estimate their parameters, have been shown to achieve the state of the art on SNLI (Bowman et al., 2015, 2016; Munkhdalai and Yu, 2016b; Parikh et al., 2016; Sha et al., 2016; Paria et al., 2016).

While some previous top-performing models use rather complicated network architectures to achieve the state-of-the-art results (Munkhdalai and Yu, 2016b), we demonstrate in this paper that enhancing sequential inference models based on chain models can outperform all previous results, suggesting that the potential of such sequential inference approaches has not been fully exploited yet. More specifically, we show that our sequential inference model achieves an accuracy of 88.0% on the SNLI benchmark.

Exploring syntax for NLI is very attractive to us. In many problems, syntax and semantics interact closely, including in semantic composition (Partee, 1995), among others. Complicated tasks such as natural language inference could well involve both, which has been discussed in the context of recognizing textual entailment (RTE) (Mehdad et al., 2010; Ferrone and Zanzotto, 2014). In this paper, we are interested in exploring this within neural network frameworks, with the presence of relatively large training data. We show that by explicitly encoding parsing information with recursive networks in both local inference modeling and inference composition and by incorporating it into our framework, we achieve additional improvement, increasing the performance to a new state of the art with an 88.6% accuracy.

2 Related Work

Early work on natural language inference was performed on rather small datasets with more conventional methods (refer to MacCartney (2009) for a good literature survey), which includes a large bulk of work on recognizing textual entailment, such as (Dagan et al., 2005; Iftene and Balahur-Dobrescu, 2007), among others. More recently, Bowman et al. (2015) made available the SNLI dataset with 570,000 human annotated sentence pairs. They also experimented with simple classification models as well as simple neural networks that encode the premise and hypothesis independently. Rocktäschel et al. (2015) proposed neural attention-based models for NLI, which captured the attention information. In general, attention-based models have been shown to be effective in a wide range of tasks, including machine translation (Bahdanau et al., 2014), speech recognition (Chorowski et al., 2015; Chan et al., 2016), image captioning (Xu et al., 2015), and text summarization (Rush et al., 2015; Chen et al., 2016), among others. For NLI, the idea allows neural models to pay attention to specific areas of the sentences.

A variety of more advanced networks have been developed since then (Bowman et al., 2016; Vendrov et al., 2015; Mou et al., 2016; Liu et al., 2016; Munkhdalai and Yu, 2016a; Rocktäschel et al., 2015; Wang and Jiang, 2016; Cheng et al., 2016; Parikh et al., 2016; Munkhdalai and Yu, 2016b; Sha et al., 2016; Paria et al., 2016). Among them, more relevant to ours are the approaches proposed by Parikh et al. (2016) and Munkhdalai and Yu (2016b), which are among the best performing models.

Parikh et al. (2016) propose a relatively simple but very effective decomposable model. The model decomposes the NLI problem into subproblems that can be solved separately. On the other hand, Munkhdalai and Yu (2016b) propose much more complicated networks that consider sequential LSTM-based encoding, recursive networks, and complicated combinations of attention models, which provide about a 0.5% gain over the results reported by Parikh et al. (2016).

It is, however, not very clear if the potential of the sequential inference networks has been well exploited for NLI. In this paper, we first revisit this problem and show that enhancing sequential inference models based on chain networks can actually outperform all previous results. We further show that explicitly considering recursive architectures to encode syntactic parsing information for NLI could further improve the performance.

3 Hybrid Neural Inference Models

We present here our natural language inference networks, which are composed of the following major components: input encoding, local inference modeling, and inference composition. Figure 1 shows a high-level view of the architecture. Vertically, the figure depicts the three major components, and horizontally, the left side of the figure represents our sequential NLI model named ESIM, and the right side represents networks that incorporate syntactic parsing information in tree LSTMs.

Figure 1: A high-level view of our hybrid neural inference networks.

In our notation, we have two sentences $a = (a_1, \dots, a_{\ell_a})$ and $b = (b_1, \dots, b_{\ell_b})$, where $a$ is a premise and $b$ a hypothesis. Each $a_i$ or $b_j \in \mathbb{R}^l$ is an $l$-dimensional embedding vector, which can be initialized with some pre-trained word embeddings and organized with parse trees. The goal is to predict a label $y$ that indicates the logic relationship between $a$ and $b$.

3.1 Input Encoding

We employ bidirectional LSTM (BiLSTM) as one of our basic building blocks for NLI. We first use it to encode the input premise and hypothesis (Equations (1) and (2)). Here BiLSTM learns to represent a word (e.g., $a_i$) and its context. Later we will also use BiLSTM to perform inference composition to construct the final prediction, where BiLSTM encodes local inference information and its interaction. To bookkeep the notation for later use, we write as $\bar{a}_i$ the hidden (output) state generated by the BiLSTM at time $i$ over the input sequence $a$. The same is applied to $\bar{b}_j$:

$\bar{a}_i = \mathrm{BiLSTM}(a, i), \quad \forall i \in [1, \dots, \ell_a],$ (1)

$\bar{b}_j = \mathrm{BiLSTM}(b, j), \quad \forall j \in [1, \dots, \ell_b].$ (2)

Due to the space limit, we will skip the description of the basic chain LSTM and readers can refer to Hochreiter and Schmidhuber (1997) for details. Briefly, when modeling a sequence, an LSTM employs a set of soft gates together with a memory cell to control message flows, resulting in effective tracking of long-distance information/dependencies in a sequence.

A bidirectional LSTM runs a forward and a backward LSTM on a sequence, starting from the left and the right end, respectively. The hidden states generated by these two LSTMs at each time step are concatenated to represent that time step and its context. Note that we used LSTM memory blocks in our models. We examined other recurrent memory blocks such as GRUs (Gated Recurrent Units) (Cho et al., 2014) and they are inferior to LSTMs on the heldout set for our NLI task.
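To make the encoding step concrete, the following is a minimal sketch of Equations (1) and (2) in PyTorch (a framework assumption; the paper does not publish code here). The toy batch and tensor shapes are illustrative only.

```python
# Minimal sketch of the input-encoding BiLSTM (Equations (1)-(2)), assuming
# PyTorch and batch-first tensors; sizes follow the 300-D setup used later.
import torch
import torch.nn as nn

embed_dim, hidden_dim = 300, 300

# A single BiLSTM encodes both the premise and the hypothesis.
input_encoder = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

def encode(x):
    """x: (batch, seq_len, embed_dim) embeddings -> (batch, seq_len, 2*hidden_dim)."""
    outputs, _ = input_encoder(x)  # hidden (output) state at every time step
    return outputs                 # plays the role of \bar{a}_i / \bar{b}_j

a = torch.randn(8, 20, embed_dim)  # toy premise batch (already embedded)
b = torch.randn(8, 15, embed_dim)  # toy hypothesis batch
a_bar, b_bar = encode(a), encode(b)
```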

As discussed above, it is intriguing to explore the effectiveness of syntax for natural language inference; for example, whether it is useful even when incorporated into the best-performing models. To this end, we will also encode syntactic parse trees of a premise and hypothesis through tree-LSTM (Zhu et al., 2015; Tai et al., 2015; Le and Zuidema, 2015), which extends the chain LSTM to a recursive network (Socher et al., 2011).

Specifically, given the parse of a premise or hypothesis, a tree node is deployed with a tree-LSTM memory block depicted as in Figure 2 and computed with Equations (3)–(10). In short, at each node, an input vector $x_t$ and the hidden vectors of its two children (the left child $h^L_{t-1}$ and the right child $h^R_{t-1}$) are taken as the input to calculate the current node's hidden vector $h_t$.

Figure 2: A tree-LSTM memory block. (The block takes the node input $x_t$ and the children's states $h^L_{t-1}$, $h^R_{t-1}$, $c^L_{t-1}$, $c^R_{t-1}$, and combines them through the input gate $i_t$, output gate $o_t$, and the left and right forget gates $f^L_t$ and $f^R_t$ to produce the cell $c_t$ and hidden vector $h_t$.)

We describe the updating of a node at a high level with Equation (3) to facilitate references later in the paper; the detailed computation is described in Equations (4)–(10). Specifically, the input of a node is used to configure four gates: the input gate $i_t$, output gate $o_t$, and the two forget gates $f^L_t$ and $f^R_t$. The memory cell $c_t$ considers each child's cell vector, $c^L_{t-1}$ and $c^R_{t-1}$, which are gated by the left forget gate $f^L_t$ and the right forget gate $f^R_t$, respectively.

$h_t = \mathrm{TrLSTM}(x_t, h^L_{t-1}, h^R_{t-1}),$ (3)

$h_t = o_t \odot \tanh(c_t),$ (4)

$o_t = \sigma(W_o x_t + U^L_o h^L_{t-1} + U^R_o h^R_{t-1}),$ (5)

$c_t = f^L_t \odot c^L_{t-1} + f^R_t \odot c^R_{t-1} + i_t \odot u_t,$ (6)

$f^L_t = \sigma(W_f x_t + U^{LL}_f h^L_{t-1} + U^{LR}_f h^R_{t-1}),$ (7)

$f^R_t = \sigma(W_f x_t + U^{RL}_f h^L_{t-1} + U^{RR}_f h^R_{t-1}),$ (8)

$i_t = \sigma(W_i x_t + U^L_i h^L_{t-1} + U^R_i h^R_{t-1}),$ (9)

$u_t = \tanh(W_c x_t + U^L_c h^L_{t-1} + U^R_c h^R_{t-1}),$ (10)

where $\sigma$ is the sigmoid function, $\odot$ is the element-wise multiplication of two vectors, and all $W \in \mathbb{R}^{d \times l}$, $U \in \mathbb{R}^{d \times d}$ are weight matrices to be learned.

In the current input encoding layer, $x_t$ is used to encode a word embedding for a leaf node. Since a non-leaf node does not correspond to a specific word, we use a special vector $x'_t$ as its input, which is like an unknown word. However, in the inference composition layer that we discuss later, the goal of using tree-LSTM is very different; the input $x_t$ will be very different as well—it will encode local inference information and will have values at all tree nodes.
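As a rough illustration of the node update in Equations (3)–(10), here is a sketch of a binary tree-LSTM cell in PyTorch. It unties the input-to-gate weights across gates (the paper shares $W_f$ between the two forget gates), so it should be read as an approximation of the block above, not an exact reimplementation.

```python
# Sketch of a binary tree-LSTM node update, approximating Equations (3)-(10).
import torch
import torch.nn as nn

class TreeLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One linear map per source (input, left child, right child), producing
        # the four gates i, o, f^L, f^R and the candidate u in one shot.
        self.W   = nn.Linear(input_dim, 5 * hidden_dim)
        self.U_L = nn.Linear(hidden_dim, 5 * hidden_dim, bias=False)
        self.U_R = nn.Linear(hidden_dim, 5 * hidden_dim, bias=False)

    def forward(self, x_t, h_left, c_left, h_right, c_right):
        z = self.W(x_t) + self.U_L(h_left) + self.U_R(h_right)
        i, o, f_left, f_right, u = z.chunk(5, dim=-1)
        i, o, f_left, f_right = map(torch.sigmoid, (i, o, f_left, f_right))
        u = torch.tanh(u)
        c_t = f_left * c_left + f_right * c_right + i * u  # Equation (6)
        h_t = o * torch.tanh(c_t)                          # Equation (4)
        return h_t, c_t

cell = TreeLSTMCell(300, 300)
x = torch.randn(1, 300)
zero = torch.zeros(1, 300)
h_t, c_t = cell(x, zero, zero, zero, zero)  # a leaf-like node with empty children
```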

3.2 Local Inference Modeling

Modeling local subsentential inference between a premise and hypothesis is the basic component for determining the overall inference between these two statements. To closely examine local inference, we explore both the sequential and syntactic tree models that have been discussed above. The former helps collect local inference for words and their context, and the tree-LSTM helps collect local information between (linguistic) phrases and clauses.

Locality of inference  Modeling local inference needs to employ some form of hard or soft alignment to associate the relevant subcomponents between a premise and a hypothesis. This includes early methods motivated by the alignment used in conventional automatic machine translation (MacCartney, 2009). In neural network models, this is often achieved with soft attention.

Parikh et al. (2016) decomposed this process: the word sequence of the premise (or hypothesis) is regarded as a bag of word embedding vectors, and inter-sentence "alignment" (or attention) is computed individually to softly align each word to the content of the hypothesis (or premise, respectively). While their basic framework is very effective, achieving one of the previous best results, using a pre-trained word embedding by itself does not automatically consider the context around a word in NLI. Parikh et al. (2016) did take into account the word order and context information through an optional distance-sensitive intra-sentence attention.

In this paper, we argue for leveraging attention over the bidirectional sequential encoding of the input, as discussed above. We will show that this plays an important role in achieving our best results, and that the intra-sentence attention used by Parikh et al. (2016) actually does not further improve over our model, while the overall framework they proposed is very effective.

Our soft alignment layer computes the attention weights as the similarity of a hidden state tuple $\langle \bar{a}_i, \bar{b}_j \rangle$ between a premise and a hypothesis with Equation (11). We did study more complicated relationships between $\bar{a}_i$ and $\bar{b}_j$ with multilayer perceptrons, but observed no further improvement on the heldout data.

$e_{ij} = \bar{a}_i^T \bar{b}_j.$ (11)

In the formula, $\bar{a}_i$ and $\bar{b}_j$ are computed earlier in Equations (1) and (2), or with Equation (3) when tree-LSTM is used. Again, as discussed above, we use bidirectional LSTM and tree-LSTM to encode the premise and hypothesis, respectively. In our sequential inference model, unlike Parikh et al. (2016), who proposed to use a function $F(\bar{a}_i)$, i.e., a feedforward neural network, to map the original word representation for calculating $e_{ij}$, we instead advocate using BiLSTM, which encodes the information in the premise and hypothesis very well and achieves better performance, as shown in the experiment section. We tried to apply the $F(\cdot)$ function on our hidden states before computing $e_{ij}$ and it did not further help our models.

Local inference collected over sequences  Local inference is determined by the attention weight $e_{ij}$ computed above, which is used to obtain the local relevance between a premise and hypothesis. For the hidden state of a word in a premise, i.e., $\bar{a}_i$ (already encoding the word itself and its context), the relevant semantics in the hypothesis is identified and composed using $e_{ij}$, more specifically with Equation (12).

$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \bar{b}_j, \quad \forall i \in [1, \dots, \ell_a],$ (12)

$\tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \bar{a}_i, \quad \forall j \in [1, \dots, \ell_b],$ (13)

where $\tilde{a}_i$ is a weighted summation of $\{\bar{b}_j\}_{j=1}^{\ell_b}$. Intuitively, the content in $\{\bar{b}_j\}_{j=1}^{\ell_b}$ that is relevant to $\bar{a}_i$ will be selected and represented as $\tilde{a}_i$. The same is performed for each word in the hypothesis with Equation (13).
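A compact sketch of Equations (11)–(13), assuming batched PyTorch tensors for the encoded premise and hypothesis (the toy tensors stand in for the BiLSTM or tree-LSTM states from the previous layer):

```python
# Sketch of soft alignment: unnormalized attention (Eq. 11) and the two
# attended summaries (Eqs. 12-13).
import torch
import torch.nn.functional as F

a_bar = torch.randn(8, 20, 600)  # premise states (batch, l_a, 2*hidden)
b_bar = torch.randn(8, 15, 600)  # hypothesis states (batch, l_b, 2*hidden)

def soft_align(a_bar, b_bar):
    e = torch.bmm(a_bar, b_bar.transpose(1, 2))      # e_ij, shape (batch, l_a, l_b)
    a_tilde = torch.bmm(F.softmax(e, dim=2), b_bar)   # Eq. (12): premise attends to hypothesis
    b_tilde = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a_bar)  # Eq. (13)
    return a_tilde, b_tilde

a_tilde, b_tilde = soft_align(a_bar, b_bar)
```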

Local inference collected over parse trees  We use tree models to help collect local inference information over linguistic phrases and clauses in this layer. The tree structures of the premise and hypothesis are produced by a constituency parser.

Once the hidden states of a tree are all computed with Equation (3), we treat all tree nodes equally, as we do not have further heuristics to discriminate them, and leave it to the attention weights to figure out their relationships. So, we use Equation (11) to compute the attention weights for all node pairs between a premise and hypothesis. This connects all words, constituent phrases, and clauses between the premise and hypothesis. We then collect the information between all the pairs with Equations (12) and (13) and feed them into the next layer.

Enhancement of local inference information  In our models, we further enhance the local inference information collected. We compute the difference and the element-wise product for the tuple $\langle \bar{a}, \tilde{a} \rangle$ as well as for $\langle \bar{b}, \tilde{b} \rangle$. We expect that such operations could help sharpen local inference information between elements in the tuples and capture inference relationships such as contradiction. The difference and element-wise product are then concatenated with the original vectors, $\bar{a}$ and $\tilde{a}$, or $\bar{b}$ and $\tilde{b}$, respectively (Mou et al., 2016; Zhang et al., 2017). The enhancement is performed for both the sequential and the tree models.

$m_a = [\bar{a}; \tilde{a}; \bar{a} - \tilde{a}; \bar{a} \odot \tilde{a}],$ (14)

$m_b = [\bar{b}; \tilde{b}; \bar{b} - \tilde{b}; \bar{b} \odot \tilde{b}].$ (15)

This process could be regarded as a special case of modeling some high-order interaction between the tuple elements. Along this direction, we have also further modeled the interaction by feeding the tuples into feedforward neural networks and adding the top-layer hidden states to the above concatenation. We found that it does not further help the inference accuracy on the heldout dataset.
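The enhancement in Equations (14) and (15) is a plain concatenation; a minimal sketch, continuing the toy shapes used above:

```python
# Sketch of the local inference enhancement (Equations (14)-(15)).
import torch

a_bar, a_tilde = torch.randn(8, 20, 600), torch.randn(8, 20, 600)
b_bar, b_tilde = torch.randn(8, 15, 600), torch.randn(8, 15, 600)

def enhance(x_bar, x_tilde):
    # [x; x~; x - x~; x * x~] along the feature dimension
    return torch.cat([x_bar, x_tilde, x_bar - x_tilde, x_bar * x_tilde], dim=-1)

m_a = enhance(a_bar, a_tilde)  # (batch, l_a, 4 * 600)
m_b = enhance(b_bar, b_tilde)  # (batch, l_b, 4 * 600)
```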

3.3 Inference Composition

To determine the overall inference relationship between a premise and hypothesis, we explore a composition layer to compose the enhanced local inference information $m_a$ and $m_b$. We perform the composition sequentially or in its parse context using BiLSTM and tree-LSTM, respectively.

The composition layer  In our sequential inference model, we keep using BiLSTM to compose local inference information sequentially. The formulas for this BiLSTM are similar in form to those in Equations (1) and (2), so we skip the details, but the aim is very different here—they are used to capture the local inference information $m_a$ and $m_b$ and their context for inference composition.

In the tree composition, the high-level formulas for how a tree node is updated to compose local inference are as follows:

$v_{a,t} = \mathrm{TrLSTM}(F(m_{a,t}), h^L_{t-1}, h^R_{t-1}),$ (16)

$v_{b,t} = \mathrm{TrLSTM}(F(m_{b,t}), h^L_{t-1}, h^R_{t-1}).$ (17)

We propose to control model complexity in this layer, since the concatenation we described above to compute $m_a$ and $m_b$ can significantly increase the overall parameter size and potentially overfit the models. We propose to use a mapping $F$ as in Equations (16) and (17). More specifically, we use a 1-layer feedforward neural network with the ReLU activation. This function is also applied to the BiLSTM in our sequential inference composition.
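For the sequential model, the composition step then amounts to a ReLU projection $F$ followed by a second BiLSTM; a sketch under the same PyTorch assumptions as above, with illustrative sizes:

```python
# Sketch of sequential inference composition: project the enhanced vectors
# with a 1-layer ReLU feedforward F, then run a composition BiLSTM.
import torch
import torch.nn as nn

hidden_dim = 300
m_a = torch.randn(8, 20, 8 * hidden_dim)  # enhanced premise vectors
m_b = torch.randn(8, 15, 8 * hidden_dim)  # enhanced hypothesis vectors

F_proj = nn.Sequential(nn.Linear(8 * hidden_dim, hidden_dim), nn.ReLU())
composition = nn.LSTM(hidden_dim, hidden_dim, bidirectional=True, batch_first=True)

v_a, _ = composition(F_proj(m_a))  # composed premise states (batch, l_a, 2*hidden_dim)
v_b, _ = composition(F_proj(m_b))  # composed hypothesis states
```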

Pooling  Our inference model converts the resulting vectors obtained above to a fixed-length vector with pooling and feeds it to the final classifier to determine the overall inference relationship.

We consider that summation (Parikh et al., 2016) could be sensitive to the sequence length and hence less robust. We instead suggest the following strategy: compute both average and max pooling, and concatenate all these vectors to form the final fixed-length vector $v$. Our experiments show that this leads to significantly better results than summation. The final fixed-length vector $v$ is calculated as follows:

$v_{a,\mathrm{ave}} = \sum_{i=1}^{\ell_a} \frac{v_{a,i}}{\ell_a}, \quad v_{a,\max} = \max_{i=1}^{\ell_a} v_{a,i},$ (18)

$v_{b,\mathrm{ave}} = \sum_{j=1}^{\ell_b} \frac{v_{b,j}}{\ell_b}, \quad v_{b,\max} = \max_{j=1}^{\ell_b} v_{b,j},$ (19)

$v = [v_{a,\mathrm{ave}}; v_{a,\max}; v_{b,\mathrm{ave}}; v_{b,\max}].$ (20)

Note that for tree composition, Equation (20) is slightly different from that in sequential composition. Our tree composition also concatenates the hidden states computed for the roots with Equations (16) and (17), which are not shown here.
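A sketch of the pooling in Equations (18)–(20) for the sequential model (the tree variant would additionally concatenate the root states, as noted above); the toy tensors are illustrative:

```python
# Sketch of average + max pooling over time, concatenated into the final v.
import torch

v_a = torch.randn(8, 20, 600)  # composed premise states
v_b = torch.randn(8, 15, 600)  # composed hypothesis states

def pool(v_x):
    return torch.cat([v_x.mean(dim=1), v_x.max(dim=1).values], dim=-1)

v = torch.cat([pool(v_a), pool(v_b)], dim=-1)  # (batch, 8 * hidden_dim), Equation (20)
```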

We then put $v$ into a final multilayer perceptron (MLP) classifier. The MLP has a hidden layer with tanh activation and a softmax output layer in our experiments. The entire model (all three components described above) is trained end-to-end. For training, we use the multi-class cross-entropy loss.

Overall inference models  Our model can be based only on the sequential networks by removing all tree components, and we call this variant the Enhanced Sequential Inference Model (ESIM) (see the left part of Figure 1). We will show that ESIM outperforms all previous results. We will also encode parse information with tree LSTMs in multiple layers as described (see the right side of Figure 1). We train this model and incorporate it into ESIM by averaging the predicted probabilities to get the final label for a premise-hypothesis pair. We will show that parsing information complements ESIM very well and further improves the performance, and we call the final model the Hybrid Inference Model (HIM).
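A sketch of the final classifier and of the HIM-style averaging of predicted probabilities; `labels` and the two logit tensors are illustrative stand-ins for real model outputs:

```python
# Sketch of the final MLP classifier (tanh hidden layer, softmax/cross-entropy)
# and of averaging two models' predicted probabilities as in HIM.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 300
v = torch.randn(8, 8 * hidden_dim)      # pooled vector from the previous step
labels = torch.randint(0, 3, (8,))      # toy gold labels

classifier = nn.Sequential(
    nn.Linear(8 * hidden_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, 3),           # entailment / contradiction / neutral
)

logits = classifier(v)
loss = F.cross_entropy(logits, labels)  # multi-class cross-entropy

# HIM: average the class probabilities of the two trained models (stand-ins here).
logits_esim, logits_tree = logits, logits
p_final = 0.5 * (F.softmax(logits_esim, dim=-1) + F.softmax(logits_tree, dim=-1))
label_pred = p_final.argmax(dim=-1)
```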

4 Experimental Setup

Data  The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) focuses on three basic relationships between a premise and a potential hypothesis: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they are not related (neutral). The original SNLI corpus also contains an "other" category, which includes the sentence pairs lacking consensus among multiple human annotators. As in the related work, we remove this category. We used the same split as in Bowman et al. (2015) and other previous work.

The parse trees used in this paper are produced by the Stanford PCFG Parser 3.5.3 (Klein and Manning, 2003) and are delivered as part of the SNLI corpus. We use classification accuracy as the evaluation metric, as in related work.

Training  We use the development set to select models for testing. To help replicate our results, we publish our code at https://github.com/lukecq1231/nli. Below, we list our training details. We use the Adam method (Kingma and Ba, 2014) for optimization. The first momentum is set to 0.9 and the second to 0.999. The initial learning rate is 0.0004 and the batch size is 32. All hidden states of LSTMs, tree-LSTMs, and word embeddings have 300 dimensions.

We use dropout with a rate of 0.5, which is applied to all feedforward connections. We use pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize our word embeddings. Out-of-vocabulary (OOV) words are initialized randomly with Gaussian samples. All vectors, including word embeddings, are updated during training.
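Put together, the training configuration described above looks roughly like the following; `model`, `vocab_size`, and the GloVe loading step are placeholders rather than the authors' actual code:

```python
# Sketch of the reported training setup: Adam with betas (0.9, 0.999),
# learning rate 4e-4, batch size 32, dropout 0.5, 300-D embeddings.
import torch
import torch.nn as nn

model = nn.Linear(300, 3)          # stand-in for the full ESIM/HIM model
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.999))
batch_size = 32
dropout = nn.Dropout(p=0.5)        # applied to all feedforward connections

vocab_size = 42394                 # placeholder vocabulary size
embedding = nn.Embedding(vocab_size, 300)
nn.init.normal_(embedding.weight)  # Gaussian init (used for OOV words in the paper)
# Known words would be overwritten with pre-trained 300-D GloVe 840B vectors,
# and all embeddings stay trainable.
```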

5 Results

Overall performance  Table 1 shows the results of different models. The first row is a baseline classifier presented by Bowman et al. (2015) that considers handcrafted features such as the BLEU score of the hypothesis with respect to the premise, the overlapped words, and the length difference between them, etc.

The next group of models, (2)-(7), are based on sentence encoding. The model of Bowman et al. (2016) encodes the premise and hypothesis with two different LSTMs. The model in Vendrov et al. (2015) uses unsupervised "skip-thoughts" pre-training in GRU encoders. The approach proposed by Mou et al. (2016) considers tree-based CNNs to capture sentence-level semantics, while the model of Bowman et al. (2016) introduces a stack-augmented parser-interpreter neural network (SPINN) which combines parsing and interpretation within a single tree-sequence hybrid model. The work by Liu et al. (2016) uses BiLSTM to generate sentence representations and then replaces average pooling with intra-attention. The approach proposed by Munkhdalai and Yu (2016a) presents a memory augmented neural network, neural semantic encoders (NSE), to encode sentences.


Model   #Para.   Train   Test

(1) Handcrafted features (Bowman et al., 2015)   -   99.7   78.2

(2) 300D LSTM encoders (Bowman et al., 2016)   3.0M   83.9   80.6
(3) 1024D pretrained GRU encoders (Vendrov et al., 2015)   15M   98.8   81.4
(4) 300D tree-based CNN encoders (Mou et al., 2016)   3.5M   83.3   82.1
(5) 300D SPINN-PI encoders (Bowman et al., 2016)   3.7M   89.2   83.2
(6) 600D BiLSTM intra-attention encoders (Liu et al., 2016)   2.8M   84.5   84.2
(7) 300D NSE encoders (Munkhdalai and Yu, 2016a)   3.0M   86.2   84.6

(8) 100D LSTM with attention (Rocktäschel et al., 2015)   250K   85.3   83.5
(9) 300D mLSTM (Wang and Jiang, 2016)   1.9M   92.0   86.1
(10) 450D LSTMN with deep attention fusion (Cheng et al., 2016)   3.4M   88.5   86.3
(11) 200D decomposable attention model (Parikh et al., 2016)   380K   89.5   86.3
(12) Intra-sentence attention + (11) (Parikh et al., 2016)   580K   90.5   86.8
(13) 300D NTI-SLSTM-LSTM (Munkhdalai and Yu, 2016b)   3.2M   88.5   87.3
(14) 300D re-read LSTM (Sha et al., 2016)   2.0M   90.7   87.5
(15) 300D btree-LSTM encoders (Paria et al., 2016)   2.0M   88.6   87.6

(16) 600D ESIM   4.3M   92.6   88.0
(17) HIM (600D ESIM + 300D Syntactic tree-LSTM)   7.7M   93.5   88.6

Table 1: Accuracies of the models on SNLI. Our final model achieves the accuracy of 88.6%, the best result observed on SNLI, while our enhanced sequential encoding model attains an accuracy of 88.0%, which also outperforms the previous models.

The next group of methods in the table, models (8)-(15), are inter-sentence attention-based models. The model of Rocktäschel et al. (2015) uses LSTMs enforcing so-called word-by-word attention. The model of Wang and Jiang (2016) extends this idea to explicitly enforce word-by-word matching between the hypothesis and the premise. Long short-term memory-networks (LSTMN) with deep attention fusion (Cheng et al., 2016) link the current word to previous words stored in memory. Parikh et al. (2016) proposed a decomposable attention model without relying on any word-order information. In general, adding intra-sentence attention yields further improvement, which is not very surprising as it could help align the relevant text spans between the premise and hypothesis. The model of Munkhdalai and Yu (2016b) extends the framework of Wang and Jiang (2016) to a full n-ary tree model and achieves further improvement. Sha et al. (2016) propose a special LSTM variant which considers the attention vector of another sentence as an inner state of the LSTM. Paria et al. (2016) use a neural architecture with complete binary tree-LSTM encoders without syntactic information.

The table shows that our ESIM model achieves an accuracy of 88.0%, which has already outperformed all the previous models, including those using much more complicated network architectures (Munkhdalai and Yu, 2016b).

We ensemble our ESIM model with syntactic tree-LSTMs (Zhu et al., 2015) based on syntactic parse trees and achieve significant improvement over our best sequential encoding model ESIM, attaining an accuracy of 88.6%. This shows that syntactic tree-LSTMs complement ESIM well.

Model   Train   Test

(17) HIM (ESIM + syn.tree)   93.5   88.6
(18) ESIM + tree   91.9   88.2
(16) ESIM   92.6   88.0
(19) ESIM - ave./max   92.9   87.1
(20) ESIM - diff./prod.   91.5   87.0
(21) ESIM - inference BiLSTM   91.3   87.3
(22) ESIM - encoding BiLSTM   88.7   86.3
(23) ESIM - P-based attention   91.6   87.2
(24) ESIM - H-based attention   91.4   86.5
(25) syn.tree   92.9   87.8

Table 2: Ablation performance of the models.

Ablation analysis  We further analyze the major components that are of importance in helping us achieve good performance. From the best model, we first replace the syntactic tree-LSTM with the full tree-LSTM without encoding syntactic parse information. More specifically, two adjacent words in a sentence are merged to form a parent node, and this process continues and results in a full binary tree, where padding nodes are inserted when there are not enough leaves to form a full tree. Each tree node is implemented with a tree-LSTM block (Zhu et al., 2015), the same as in model (17). Table 2 shows that with this replacement, the performance drops to 88.2%.

Figure 3: An example for analysis. Subfigures (a) and (b) are the binarized constituency parse trees of the premise and hypothesis, respectively; "-" means a non-leaf or a null node. The premise leaves read "A man wearing a white shirt and a blue jeans reading a newspaper while standing" and the hypothesis leaves read "A man is sitting down reading a newspaper ."; node numbers shown in the trees (e.g., node 29 "standing" in the premise, node 10 "sitting" in the hypothesis) are referenced in the analysis below. Subfigures (c) and (f) are attention visualizations of the tree model and ESIM, respectively; the darker the color, the greater the value; the premise is on the x-axis and the hypothesis is on the y-axis. Subfigures (d) and (e) are the input gates' l2-norm of the tree-LSTM and BiLSTM in inference composition, respectively.

Furthermore, we note the importance of the layer performing the enhancement of local inference information in Section 3.2 and the pooling layer in inference composition in Section 3.3. Table 2 suggests that the NLI task seems very sensitive to these layers. If we remove the pooling layer in inference composition and replace it with summation as in Parikh et al. (2016), the accuracy drops to 87.1%. If we remove the difference and element-wise product from the local inference enhancement layer, the accuracy drops to 87.0%. To provide a more detailed comparison with Parikh et al. (2016), replacing the bidirectional LSTMs in inference composition and in input encoding with feedforward neural networks reduces the accuracy to 87.3% and 86.3%, respectively.

The difference between ESIM and each of the other models listed in Table 2 is statistically significant under the one-tailed paired t-test at the 99% significance level. The difference between models (17) and (18) is also significant at the same level. Note that we cannot perform significance tests between our models and the other models listed in Table 1 since we do not have the outputs of the other models.

If we remove the premise-based attention from ESIM (model 23), the accuracy drops to 87.2% on the test set. Premise-based attention means that when the system reads a word in the premise, it uses soft attention to consider all relevant words in the hypothesis. Removing the hypothesis-based attention (model 24) decreases the accuracy to 86.5%, where hypothesis-based attention is the attention performed in the other direction for the sentence pairs. The results show that removing hypothesis-based attention affects the performance of our model more, but removing the attention from the other direction impairs the performance too.

The stand-alone syntactic tree-LSTM model achieves an accuracy of 87.8%, which is comparable to that of ESIM. We also computed the oracle score of merging the syntactic tree-LSTM and ESIM, which picks the right answer if either is right. Such an oracle/upper-bound accuracy on the test set is 91.7%, which suggests how much tree-LSTM and ESIM could ideally complement each other. As far as speed is concerned, training the tree-LSTM takes about 40 hours on an Nvidia Tesla K40M and ESIM takes about 6 hours, which is easily extended to larger-scale data.
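The oracle score mentioned here is simple to reproduce given the two models' predictions; a sketch with assumed arrays of predicted and gold label ids:

```python
# Sketch of the oracle/upper-bound accuracy: a pair counts as correct if
# either ESIM or the tree model predicts the gold label.
import numpy as np

def oracle_accuracy(pred_esim, pred_tree, gold):
    pred_esim, pred_tree, gold = map(np.asarray, (pred_esim, pred_tree, gold))
    return float(((pred_esim == gold) | (pred_tree == gold)).mean())

print(oracle_accuracy([0, 1, 2], [0, 2, 1], [0, 2, 2]))  # toy example -> 1.0
```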

Further analysis  We showed that encoding syntactic parsing information helps recognize natural language inference—it additionally improves the strong system. Figure 3 shows an example where the tree-LSTM makes a different and correct decision. In subfigure (d), the larger values at the input gates on nodes 9 and 10 indicate that those nodes are important in making the final decision. We observe that in subfigure (c), nodes 9 and 10 are aligned to node 29 in the premise. Such information helps the system decide that this pair is a contradiction. Accordingly, in subfigure (e) of the sequential BiLSTM, the words sitting and down do not play an important role in making the final decision. Subfigure (f) shows that sitting is equally aligned with reading and standing, and the alignment for the word down is not that useful.

6 Conclusions and Future Work

We propose neural network models for natural language inference, which achieve the best results reported on the SNLI benchmark. The results are first achieved through our enhanced sequential inference model, which outperformed the previous models, including those employing more complicated network architectures, suggesting that the potential of sequential inference models has not been fully exploited yet. Based on this, we further show that by explicitly considering recursive architectures in both local inference modeling and inference composition, we achieve additional improvement. Particularly, incorporating syntactic parsing information contributes to our best result: it further improves the performance even when added to the already very strong model.

Future work interesting to us includes exploring the usefulness of external resources such as WordNet and contrasting-meaning embeddings (Chen et al., 2015) to help increase the coverage of word-level inference relations. Modeling negation more closely within neural network frameworks (Socher et al., 2013; Zhu et al., 2014) may also help contradiction detection.

Acknowledgments

The first and the third author of this paper were supported in part by the Science and Technology Development of Anhui Province, China (Grant No. 2014z02006), the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02070006).


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Samuel Bowman, Gabor Angeli, Christopher Potts, and D. Christopher Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 632–642. https://doi.org/10.18653/v1/D15-1075.

Samuel Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, D. Christopher Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1466–1477. https://doi.org/10.18653/v1/P16-1139.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016. IEEE, pages 4960–4964. https://doi.org/10.1109/ICASSP.2016.7472621.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling document. In Subbarao Kambhampati, editor, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016. IJCAI/AAAI Press, pages 2754–2760. http://www.ijcai.org/Abstract/16/391.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 106–115. https://doi.org/10.3115/v1/P15-1011.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 551–561. http://aclweb.org/anthology/D16-1053.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors, Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, pages 103–111. http://aclweb.org/anthology/W/W14/W14-4012.pdf.

Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. pages 577–585. http://papers.nips.cc/paper/5847-attention-based-models-for-speech-recognition.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers. pages 177–190.

Lorenzo Ferrone and Massimo Fabio Zanzotto. 2014. Towards syntax-aware compositional distributional semantic models. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, pages 721–730. http://aclweb.org/anthology/C14-1068.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Adrian Iftene and Alexandra Balahur-Dobrescu. 2007. Hypothesis transformation and semantic variability rules used in recognizing textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Association for Computational Linguistics, pages 125–130. http://aclweb.org/anthology/W07-1421.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. http://aclweb.org/anthology/P03-1054.

Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, pages 10–19. https://doi.org/10.18653/v1/S15-1002.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR abs/1605.09090. http://arxiv.org/abs/1605.09090.

Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, COLING '08, pages 521–528. http://dl.acm.org/citation.cfm?id=1599081.1599147.

Yashar Mehdad, Alessandro Moschitti, and Massimo Fabio Zanzotto. 2010. Syntactic/semantic structures for textual entailment recognition. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1020–1028. http://aclweb.org/anthology/N10-1146.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 130–136. https://doi.org/10.18653/v1/P16-2022.

Tsendsuren Munkhdalai and Hong Yu. 2016a. Neural semantic encoders. CoRR abs/1607.04315. http://arxiv.org/abs/1607.04315.

Tsendsuren Munkhdalai and Hong Yu. 2016b. Neural tree indexers for text understanding. CoRR abs/1607.04492. http://arxiv.org/abs/1607.04492.

Biswajit Paria, K. M. Annervaz, Ambedkar Dukkipati, Ankush Chatterjee, and Sanjay Podder. 2016. A neural architecture mimicking humans end-to-end for natural language inference. CoRR abs/1611.04741. http://arxiv.org/abs/1611.04741.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2249–2255. http://aclweb.org/anthology/D16-1244.

Barbara Partee. 1995. Lexical semantics and compositionality. Invitation to Cognitive Science 1:311–360.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1532–1543. https://doi.org/10.3115/v1/D14-1162.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. CoRR abs/1509.06664. http://arxiv.org/abs/1509.06664.

Alexander Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 379–389. https://doi.org/10.18653/v1/D15-1044.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, pages 2870–2879. http://aclweb.org/anthology/C16-1270.

Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. Omnipress, pages 129–136.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, D. Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1631–1642. http://aclweb.org/anthology/D13-1170.

Sheng Kai Tai, Richard Socher, and D. Christopher Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1556–1566. https://doi.org/10.3115/v1/P15-1150.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR abs/1511.06361. http://arxiv.org/abs/1511.06361.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 1442–1451. https://doi.org/10.18653/v1/N16-1170.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. pages 2048–2057. http://jmlr.org/proceedings/papers/v37/xuc15.html.

Junbei Zhang, Xiaodan Zhu, Qian Chen, Lirong Dai, Si Wei, and Hui Jiang. 2017. Exploring question understanding and adaptation in neural-network-based question answering. CoRR abs/1703.04617. https://arxiv.org/abs/1703.04617.

Xiaodan Zhu, Hongyu Guo, Saif Mohammad, and Svetlana Kiritchenko. 2014. An empirical study on the effect of negation words on sentiment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 304–313. https://doi.org/10.3115/v1/P14-1029.

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. pages 1604–1612. http://jmlr.org/proceedings/papers/v37/zhub15.html.