Long Short-Term Memory-Networks for Machine Reading

Jianpeng Cheng, Li Dong and Mirella Lapata
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
{jianpeng.cheng,li.dong}@ed.ac.uk, [email protected]

Abstract

In this paper we address the question of how to render sequence-level networks better at handling structured input. We propose a machine reading simulator which processes text incrementally from left to right and performs shallow reasoning with memory and attention. The reader extends the Long Short-Term Memory architecture with a memory network in place of a single memory cell. This enables adaptive memory usage during recurrence with neural attention, offering a way to weakly induce relations among tokens. The system is initially designed to process a single sequence but we also demonstrate how to integrate it with an encoder-decoder architecture. Experiments on language modeling, sentiment analysis, and natural language inference show that our model matches or outperforms the state of the art.

1 Introduction

How can a sequence-level network induce relations which are presumed latent during text processing? How can a recurrent network attentively memorize longer sequences in a way that humans do? In this paper we design a machine reader that automatically learns to understand text. The term machine reading is related to a wide range of tasks, from answering reading comprehension questions (Clark et al., 2013), to fact and relation extraction (Etzioni et al., 2011; Fader et al., 2011), ontology learning (Poon and Domingos, 2010), and textual entailment (Dagan et al., 2005). Rather than focusing on a specific task, we develop a general-purpose reading simulator, drawing inspiration from human language processing and the fact that language comprehension is incremental, with readers continuously extracting the meaning of utterances on a word-by-word basis.

In order to understand texts, our machine reader should provide facilities for extracting and representing meaning from natural language text, storing meanings internally, and working with stored meanings to derive further consequences. Ideally, such a system should be robust, open-domain, and degrade gracefully in the presence of semantic representations which may be incomplete, inaccurate, or incomprehensible. It would also be desirable to simulate the behavior of English speakers, who process text sequentially from left to right, fixating nearly every word while they read (Rayner, 1998) and creating partial representations for sentence prefixes (Konieczny, 2000; Tanenhaus et al., 1995).

Language modeling tools such as recurrent neural networks (RNNs) fit well with this view of human reading behavior (Frank and Bod, 2011). RNNs treat each sentence as a sequence of words and recursively compose each word with its previous memory, until the meaning of the whole sentence has been derived. In practice, however, sequence-level networks face at least three challenges. The first concerns model training problems associated with vanishing and exploding gradients (Hochreiter, 1991; Bengio et al., 1994), which can be partially ameliorated with gated activation functions, such as the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), and gradient clipping (Pascanu et al., 2013). The second issue relates to memory compression: as the input sequence gets compressed and blended into a single dense vector, sufficiently large memory capacity is required to store past information. As a result, the network generalizes poorly to long sequences while wasting memory on shorter ones. Finally, sequence-level networks lack a mechanism for handling the structure of the input. This imposes an inductive bias which is at odds with the fact that language has inherent structure. In this paper, we develop a text processing system which addresses these limitations while maintaining the incremental, generative property of a recurrent language model.
Recent attempts to render neural networks more structure aware have seen the incorporation of external memories in the context of recurrent neural networks (Weston et al., 2015; Sukhbaatar et al., 2015; Grefenstette et al., 2015). The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and, more importantly, the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network, together with attention for memory addressing. The attention acts as a weak inductive module discovering relations between input tokens, and is trained without direct supervision. As a point of departure from previous work, the memory network we employ is internal to the recurrence, thus strengthening the interaction of the two and leading to a representation learner which is able to reason over shallow structures. The resulting model, which we term Long Short-Term Memory-Network (LSTMN), is a reading simulator that can be used for sequence processing tasks.

Figure 1 illustrates the reading behavior of the LSTMN. The model processes text incrementally while learning which past tokens in the memory relate to the current token being processed, and to what extent. As a result, the model induces undirected relations among tokens as an intermediate step of learning representations. We validate the performance of the LSTMN on language modeling, sentiment analysis, and natural language inference. In all cases, we train LSTMN models end-to-end with task-specific supervision signals, achieving performance comparable to or better than state-of-the-art models and superior to vanilla LSTMs.

Figure 1: Illustration of our model while reading the sentence "The FBI is chasing a criminal on the run." Color red represents the current word being fixated, blue represents memories. Shading indicates the degree of memory activation.

2 Related Work

Our machine reader is a recurrent neural network exhibiting two important properties: it is incremental, simulating human behavior, and it performs shallow structure reasoning over input streams.

Recurrent neural networks (RNNs) have been successfully applied to various sequence modeling and sequence-to-sequence transduction tasks. The latter have assumed several guises in the literature, such as machine translation (Bahdanau et al., 2014), sentence compression (Rush et al., 2015), and reading comprehension (Hermann et al., 2015). A key contributing factor to their success has been the ability to handle well-known problems with exploding or vanishing gradients (Bengio et al., 1994), leading to models with gated activation functions (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and more advanced architectures that enhance the information flow within the network (Koutník et al., 2014; Chung et al., 2015; Yao et al., 2015).
A remaining practical bottleneck for RNNs is memory compression (Bahdanau et al., 2014): since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba and Sutskever, 2014). In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories (Bahdanau et al., 2014). In our model, memory and attention are added within a sequence encoder, allowing the network to uncover lexical relations between tokens.

The idea of introducing a structural bias to neural models is by no means new. For example, it is reflected in the work of Socher et al. (2013a), who apply recursive neural networks for learning natural language representations. In the context of recurrent neural networks, efforts to build modular, structured neural models date back to Das et al. (1992), who connect a recurrent neural network with an external memory stack for learning context-free grammars. Recently, Weston et al. (2015) proposed Memory Networks to explicitly segregate memory storage from the computation of neural networks in general. Their model is trained end-to-end with a memory addressing mechanism closely related to soft attention (Sukhbaatar et al., 2015) and has been applied to machine translation (Meng et al., 2015). Grefenstette et al. (2015) define a set of differentiable data structures (stacks, queues, and deques) as memories controlled by a recurrent neural network. Tran et al. (2016) combine the LSTM with an external memory block component which interacts with its hidden state. Kumar et al. (2016) employ a structured neural network with episodic memory modules for natural language and also visual question answering (Xiong et al., 2016).

Similar to the above work, we leverage memory and attention in a recurrent neural network for inducing relations between tokens, as a module in a larger network responsible for representation learning. As a property of soft attention, all intermediate relations we aim to capture are soft and differentiable. This is in contrast to shift-reduce type neural models (Dyer et al., 2015; Bowman et al., 2016), where the intermediate decisions are hard and induction is more difficult. Finally, note that our model captures undirected lexical relations and is thus distinct from work on dependency grammar induction (Klein and Manning, 2004), where the learned head-modifier relations are directed.

3 The Machine Reader

In this section we present our machine reader, which is designed to process structured input while retaining the incrementality of a recurrent neural network. The core of our model is a Long Short-Term Memory (LSTM) unit with an extended memory tape that explicitly simulates the human memory span. The model performs implicit relation analysis between tokens with an attention-based memory addressing mechanism at every time step. In the following, we first review the standard Long Short-Term Memory and then describe our model.

3.1 Long Short-Term Memory

A Long Short-Term Memory (LSTM) recurrent neural network processes a variable-length sequence x = (x_1, x_2, \dots, x_n) by incrementally adding new content into a single memory slot, with gates controlling the extent to which new content should be memorized, old content should be erased, and current content should be exposed. At time step t, the memory c_t and the hidden state h_t are updated with the following equations:

\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} W \cdot [h_{t-1}, x_t]   (1)

c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t   (2)

h_t = o_t \odot \tanh(c_t)   (3)

where i, f, and o are gate activations. Compared to the standard RNN, the LSTM uses additive memory updates and it separates the memory c from the hidden state h, which interacts with the environment when making predictions.
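To make the update concrete, here is a minimal NumPy sketch of one step of Equations (1)–(3). The function names, the absence of bias terms, and the toy dimensions are illustrative choices rather than details of the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM update following Equations (1)-(3).

    W stacks the four blocks of Equation (1): it has shape
    (4 * d, d + d_in) and is applied to the concatenation [h_{t-1}, x_t].
    Bias terms are omitted, as in the equation.
    """
    z = W @ np.concatenate([h_prev, x_t])     # pre-activations of Equation (1)
    d = h_prev.shape[0]
    i_t = sigmoid(z[:d])                      # input gate
    f_t = sigmoid(z[d:2 * d])                 # forget gate
    o_t = sigmoid(z[2 * d:3 * d])             # output gate
    c_hat = np.tanh(z[3 * d:])                # candidate memory content
    c_t = f_t * c_prev + i_t * c_hat          # Equation (2): additive memory update
    h_t = o_t * np.tanh(c_t)                  # Equation (3): expose memory via the output gate
    return h_t, c_t

# toy run over a 5-token sequence with d_in = 4 inputs and d = 3 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * 3, 3 + 4))
h, c = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):
    h, c = lstm_step(x, h, c, W)
```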
3.2 Long Short-Term Memory-Network

The first question that arises with LSTMs is the extent to which they are able to memorize sequences under recursive compression. LSTMs can produce a list of state representations during composition; however, the next state is always computed from the current state. That is to say, given the current state h_t, the next state h_{t+1} is conditionally independent of states h_1, \dots, h_{t-1} and tokens x_1, \dots, x_t. While the recursive state update is performed in a Markov manner, it is assumed that LSTMs maintain unbounded memory (i.e., the current state alone summarizes well the tokens it has seen so far). This assumption may fail in practice, for example when the sequence is long or when the memory size is not large enough. Another undesired property of LSTMs concerns modeling structured input. An LSTM aggregates information on a token-by-token basis in sequential order, but there is no explicit mechanism for reasoning over structure and modeling relations between tokens.

Our model aims to address both limitations. Our solution is to modify the standard LSTM structure by replacing the memory cell with a memory network (Weston et al., 2015). The resulting Long Short-Term Memory-Network (LSTMN) stores the contextual representation of each input token with a unique memory slot, and the size of the memory grows with time until an upper bound of the memory span is reached. This design enables the LSTM to reason about relations between tokens with a neural attention layer and then perform non-Markov state updates. Although it is feasible to apply both write and read operations to the memories with attention, we concentrate on the latter. We conceptualize the read operation as attentively linking the current token to previous memories and selecting useful content when processing it. Although not the focus of this work, the significance of the write operation can be analogously justified as a way of incrementally updating previous memories, e.g., to correct wrong interpretations when processing garden path sentences (Ferreira and Henderson, 1991).

Figure 2: Long Short-Term Memory-Network. Color indicates degree of memory activation.

The architecture of the LSTMN is shown in Figure 2 and the formal definition is provided as follows. The model maintains two sets of vectors, stored in a hidden state tape used to interact with the environment (e.g., computing attention), and a memory tape used to represent what is actually stored in memory.¹ Therefore, each token is associated with a hidden vector and a memory vector. Let x_t denote the current input; C_{t-1} = (c_1, \dots, c_{t-1}) denotes the current memory tape, and H_{t-1} = (h_1, \dots, h_{t-1}) the previous hidden tape. At time step t, the model computes the relation between x_t and x_1, \dots, x_{t-1} through h_1, \dots, h_{t-1} with an attention layer:

a_i^t = v^T \tanh(W_h h_i + W_x x_t + W_{\tilde{h}} \tilde{h}_{t-1})   (4)

s_i^t = \mathrm{softmax}(a_i^t)   (5)

This yields a probability distribution over the hidden state vectors of previous tokens. We can then compute adaptive summary vectors for the previous hidden tape and memory tape, denoted by \tilde{h}_t and \tilde{c}_t respectively:

\begin{bmatrix} \tilde{h}_t \\ \tilde{c}_t \end{bmatrix} = \sum_{i=1}^{t-1} s_i^t \cdot \begin{bmatrix} h_i \\ c_i \end{bmatrix}   (6)

and use them for computing the values of c_t and h_t in the recurrent update as:

\begin{bmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} W \cdot [\tilde{h}_t, x_t]   (7)

c_t = f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t   (8)

h_t = o_t \odot \tanh(c_t)   (9)

where v, W_h, W_x and W_{\tilde{h}} are the new weight terms of the network.

¹ For comparison, LSTMs maintain a hidden vector and a memory vector; memory networks (Weston et al., 2015) have a set of key vectors and a set of value vectors.
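The following NumPy sketch implements one LSTMN step, Equations (4)–(9), assuming the hidden and memory tapes are stored as arrays with one row per previously read token. The parameter layout, the dummy initial memory slot, and the toy dimensions are assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstmn_step(x_t, H, C, h_tilde_prev, params):
    """One LSTMN update following Equations (4)-(9).

    H and C are the hidden and memory tapes: arrays of shape (t-1, d)
    holding one vector per previously read token.
    """
    v, W_h, W_x, W_ht, W = params
    # Equations (4)-(5): attention over all previous hidden states
    scores = np.array([v @ np.tanh(W_h @ h_i + W_x @ x_t + W_ht @ h_tilde_prev) for h_i in H])
    s_t = softmax(scores)
    # Equation (6): adaptive summaries of the hidden and memory tapes
    h_tilde = s_t @ H
    c_tilde = s_t @ C
    # Equations (7)-(9): LSTM-style update driven by the summary h_tilde
    d = H.shape[1]
    z = W @ np.concatenate([h_tilde, x_t])
    i_t, f_t, o_t = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    c_hat = np.tanh(z[3 * d:])
    c_t = f_t * c_tilde + i_t * c_hat
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t, h_tilde

# toy usage: append each new (h_t, c_t) to the tapes so the memory grows with time
rng = np.random.default_rng(1)
d_in, d = 4, 3
params = (rng.normal(scale=0.1, size=d),            # v
          rng.normal(scale=0.1, size=(d, d)),       # W_h
          rng.normal(scale=0.1, size=(d, d_in)),    # W_x
          rng.normal(scale=0.1, size=(d, d)),       # W_h~
          rng.normal(scale=0.1, size=(4 * d, d + d_in)))
H, C = np.zeros((1, d)), np.zeros((1, d))            # dummy first slot so attention is defined at t = 1
h_tilde = np.zeros(d)
for x in rng.normal(size=(5, d_in)):
    h_t, c_t, h_tilde = lstmn_step(x, H, C, h_tilde, params)
    H = np.vstack([H, h_t])
    C = np.vstack([C, c_t])
```

Note how the forget gate acts on the attention summary \tilde{c}_t rather than on c_{t-1} (Equation (8)), which is what makes the state update non-Markov.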
A key idea behind the LSTMN is to use attention for inducing relations between tokens. These relations are soft and differentiable, and components of a larger representation learning network. Although it is appealing to provide direct supervision for the attention layer, e.g., with evidence collected from a dependency treebank, we treat it as a submodule being optimized within the larger network in a downstream task. It is also possible to have a more structured relational reasoning module by stacking multiple memory and hidden layers in an alternating fashion, resembling a stacked LSTM (Graves, 2013) or a multi-hop memory network (Sukhbaatar et al., 2015). This can be achieved by feeding the output h_t^k of the lower layer k as input to the upper layer k+1. The attention at the (k+1)th layer is computed as:

a_{i,k+1}^t = v^T \tanh(W_h h_i^{k+1} + W_l h_t^k + W_{\tilde{h}} \tilde{h}_{t-1}^{k+1})   (10)

Skip-connections (Graves, 2013) can be applied to feed x_t to upper layers as well.

4 Modeling Two Sequences with LSTMN

Natural language processing tasks such as machine translation and textual entailment are concerned with modeling two sequences rather than a single one. A standard tool for modeling two sequences with recurrent networks is the encoder-decoder architecture, where the second sequence (also known as the target) is processed conditioned on the first one (also known as the source). In this section we explain how to combine the LSTMN, which applies attention for intra-relation reasoning, with the encoder-decoder network, whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.

Shallow Attention Fusion  Shallow fusion simply treats the LSTMN as a separate module that can be readily used in an encoder-decoder architecture, in lieu of a standard RNN or LSTM. As shown in Figure 3a, both encoder and decoder are modeled as LSTMNs with intra-attention. Meanwhile, inter-attention is triggered when the decoder reads a target token, similar to the inter-attention introduced in Bahdanau et al. (2014).

Deep Attention Fusion  Deep fusion combines inter- and intra-attention (initiated by the decoder) when computing state updates. We use different notation to represent the two sets of attention. Following Section 3.2, C and H denote the target memory tape and hidden tape, which store representations of the target symbols that have been processed so far. The computation of intra-attention follows Equations (4)–(9). Additionally, we use A = [\alpha_1, \dots, \alpha_m] and Y = [\gamma_1, \dots, \gamma_m] to represent the source memory tape and hidden tape, with m being the length of the source sequence conditioned upon. We compute inter-attention between the input at time step t and tokens in the entire source sequence as follows:

b_j^t = u^T \tanh(W_\gamma \gamma_j + W_x x_t + W_{\tilde{\gamma}} \tilde{\gamma}_{t-1})   (11)

p_j^t = \mathrm{softmax}(b_j^t)   (12)

After that we compute the adaptive representation of the source memory tape \tilde{\alpha}_t and hidden tape \tilde{\gamma}_t as:

\begin{bmatrix} \tilde{\gamma}_t \\ \tilde{\alpha}_t \end{bmatrix} = \sum_{j=1}^{m} p_j^t \cdot \begin{bmatrix} \gamma_j \\ \alpha_j \end{bmatrix}   (13)

We can then transfer the adaptive source representation \tilde{\alpha}_t to the target memory with another gating operation r_t, analogous to the gates in Equation (7):

r_t = \sigma(W_r \cdot [\tilde{\gamma}_t, x_t])   (14)

The new target memory includes the inter-alignment r_t \odot \tilde{\alpha}_t, the intra-relation f_t \odot \tilde{c}_t, and the new input information i_t \odot \hat{c}_t:

c_t = r_t \odot \tilde{\alpha}_t + f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t   (15)

h_t = o_t \odot \tanh(c_t)   (16)

As shown in the equations above and Figure 3b, the major change of deep fusion lies in the recurrent storage of the inter-alignment vector in the target memory network, as a way to help the target network review source information.

Figure 3: LSTMNs for sequence-to-sequence modeling. The encoder uses intra-attention, while the decoder incorporates both intra- and inter-attention. Panel (a) shows a decoder with shallow attention fusion; panel (b) shows a decoder with deep attention fusion.
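A sketch of one decoder step with deep attention fusion, Equations (11)–(16), again in NumPy. The parameter dictionary, the use of separate projection matrices for intra- and inter-attention on the input, and the toy dimensions are assumptions for illustration only, not a description of the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def deep_fusion_step(x_t, H, C, h_tilde_prev, Y, A, g_tilde_prev, p):
    """One decoder update with deep attention fusion, Equations (11)-(16).

    H, C are the target hidden/memory tapes; Y, A (gamma/alpha in the paper)
    are the source hidden/memory tapes produced by the encoder.
    """
    # intra-attention on the target side, Equations (4)-(6)
    s = softmax(np.array([p['v'] @ np.tanh(p['W_h'] @ h + p['W_x'] @ x_t + p['W_ht'] @ h_tilde_prev) for h in H]))
    h_tilde, c_tilde = s @ H, s @ C
    # inter-attention on the source side, Equations (11)-(13)
    q = softmax(np.array([p['u'] @ np.tanh(p['W_g'] @ y + p['W_x2'] @ x_t + p['W_gt'] @ g_tilde_prev) for y in Y]))
    g_tilde, a_tilde = q @ Y, q @ A
    # gates from the intra-attention summary, Equation (7)
    d = H.shape[1]
    z = p['W'] @ np.concatenate([h_tilde, x_t])
    i_t, f_t, o_t = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    c_hat = np.tanh(z[3 * d:])
    # Equation (14): gate controlling how much source content to write
    r_t = sigmoid(p['W_r'] @ np.concatenate([g_tilde, x_t]))
    # Equations (15)-(16): inter-alignment + intra-relation + new input
    c_t = r_t * a_tilde + f_t * c_tilde + i_t * c_hat
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t, h_tilde, g_tilde

# toy usage: decode a 5-token target prefix against a length-6 source
rng = np.random.default_rng(2)
d_in, d, m = 4, 3, 6
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    'v': (d,), 'u': (d,), 'W_h': (d, d), 'W_x': (d, d_in), 'W_ht': (d, d),
    'W_g': (d, d), 'W_x2': (d, d_in), 'W_gt': (d, d),
    'W': (4 * d, d + d_in), 'W_r': (d, d + d_in)}.items()}
Y, A = rng.normal(size=(m, d)), rng.normal(size=(m, d))   # encoder outputs (source tapes)
H, C = np.zeros((1, d)), np.zeros((1, d))
h_tilde, g_tilde = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d_in)):
    h_t, c_t, h_tilde, g_tilde = deep_fusion_step(x, H, C, h_tilde, Y, A, g_tilde, p)
    H, C = np.vstack([H, h_t]), np.vstack([C, c_t])
```

Shallow fusion would instead run the source-side attention outside this update, leaving Equations (7)–(9) untouched; deep fusion writes the aligned source content directly into the target memory through r_t.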
As shown in storage of the inter-alignment vector in the target Figure 3a, both encoder and decoder are modeled memory network, as a way to help the target net- as LSTMNs with intra-attention. Meanwhile, inter- workreviewsourceinformation. attention is triggered when the decoder reads a tar- gettoken,similartotheinter-attentionintroducedin 5 Experiments Bahdanauetal.(2014). Deep Attention Fusion Deep fusion combines In this section we present our experiments for eval- inter- and intra-attention (initiated by the decoder) uating the performance of the LSTMN machine whencomputingstateupdates. Weusedifferentno- reader. We start with language modeling as it tationtorepresentthetwosetsofattention. Follow- is a natural testbed for our model. We then as- ing Section 3.2,C and H denote the target memory sess the model’s ability to extract meaning repre- tapeandhiddentape,whichstorerepresentationsof sentations for generic sentence classification tasks the target symbols that have been processed so far. such as sentiment analysis. Finally, we examine The computation of intra-attention follows Equa- whether the LSTMN can recognize the semantic tions(4)–(9). Additionally,weuseA=[α ,···,α ] relationship between two sentences by applying it 1 m and Y = [γ ,···,γ ] to represent the source mem- to a natural language inference task. Our code 1 m orytapeandhiddentape,withmbeingthelengthof is available at https://github.com/cheng6076/ thesourcesequenceconditionedupon. Wecompute SNLI-attention. (a)Decoderwithshallowattentionfusion. (b)Decoderwithdeepattentionfusion. Figure3: LSTMNsforsequence-to-sequencemodeling. Theencoderusesintra-attention,whilethedecoder incorporates both intra- and inter-attention. The two figures present two ways to combine the intra- and inter-attentioninthedecoder. Models Layers Perplexity thewordembeddingsweresetto150forallmodels. KN5 — 141 In this suite of experiments we compared the RNN 1 129 LSTMN against a variety of baselines. The first LSTM 1 115 oneisaKneser-Ney5-gramlanguagemodel(KN5) LSTMN 1 108 sLSTM 3 115 which generally serves as a non-neural baseline for gLSTM 3 107 the language modeling task. We also present per- dLSTM 3 109 plexity results for the standard RNN and LSTM LSTMN 3 102 models. We also implemented more sophisti- Table 1: Language model perplexity on the Penn catedLSTMarchitectures,suchasastackedLSTM Treebank. Thesizeofmemoryis300forallmodels. (sLSTM),agated-feedbackLSTM(gLSTM;Chung et al. (2015)) and a depth-gated LSTM (dLSTM; 5.1 LanguageModeling Yao et al. (2015)). The gated-feedback LSTM has feedback gates connecting the hidden states across Our language modeling experiments were con- multiple time steps as an adaptive control of the in- ducted on the English Penn Treebank dataset. Fol- formationflow. Thedepth-gatedLSTMusesadepth lowingcommonpractice(Mikolovetal., 2010), we gate to connect memory cells of vertically adjacent trained on sections 0–20 (1M words), used sec- layers. In general, both gLSTM and dLSTM are tions 21–22 for validation (80K words), and sec- abletocapturelong-termdependenciestosomede- tions 23–24 (90K words for testing). The dataset gree,buttheydonotexplicitlykeeppastmemories. contains approximately 1 million tokens and a vo- Wesetthenumberoflayersto3inthisexperiment, cabulary size of 10K. The average sentence length mainly to agree with the language modeling exper- is 21. We use perplexity as our evaluation metric: iments of Chung et al. (2015). 
The results of the language modeling task are shown in Table 1. Perplexity results for KN5 and RNN are taken from Mikolov et al. (2015). As can be seen, the single-layer LSTMN outperforms these two baselines and the LSTM by a significant margin. Amongst all deep architectures, the three-layer LSTMN also performs best. We can study the memory activation mechanism of the machine reader by visualizing the attention scores. Figure 4 shows four sentences sampled from the Penn Treebank validation set. Although we explicitly encourage the reader to attend to any memory slot, much attention focuses on recent memories. This agrees with the linguistic intuition that long-term dependencies are relatively rare. As illustrated in Figure 4, the model captures some valid lexical relations (e.g., the dependency between sits and at, sits and plays, everyone and is, is and watching). Note that arcs here are undirected and are different from the directed arcs denoting head-modifier relations in dependency graphs.

Figure 4: Examples of intra-attention (language modeling). Bold lines indicate higher attention scores. Arrows denote which word is being focused when attention is computed, but not the direction of the relation.

5.2 Sentiment Analysis

Our second task concerns the prediction of sentiment labels of sentences. We used the Stanford Sentiment Treebank (Socher et al., 2013a), which contains fine-grained sentiment labels (very positive, positive, neutral, negative, very negative) for 11,855 sentences. Following previous work on this dataset, we used 8,544 sentences for training, 1,101 for validation, and 2,210 for testing. The average sentence length is 19.1. In addition, we also performed a binary classification task (positive, negative) after removing the neutral label. This resulted in 6,920 sentences for training, 872 for validation and 1,821 for testing. Table 2 reports results on both fine-grained and binary classification tasks.

We experimented with 1- and 2-layer LSTMNs. For the latter model, we predict the sentiment label of the sentence based on the averaged hidden vector passed to a 2-layer neural network classifier with ReLU as the activation function. The memory size for both LSTMN models was set to 168 to be compatible with previous LSTM models (Tai et al., 2015) applied to the same task. We used pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. The gradient for words with GloVe embeddings was scaled by 0.35 in the first epoch, after which all word embeddings were updated normally. We used Adam (Kingma and Ba, 2015) for optimization, with the two momentum parameters set to 0.9 and 0.999 respectively. The initial learning rate was set to 2E-3. The regularization constant was 1E-4 and the mini-batch size was 5. A dropout rate of 0.5 was applied to the neural network classifier.

Table 2: Model accuracy (%) on the Sentiment Treebank (test set). The memory size of LSTMN models is set to 168 to be compatible with previously published LSTM variants (Tai et al., 2015).

Models                            Fine-grained   Binary
RAE (Socher et al., 2011)         43.2           82.4
RNTN (Socher et al., 2013b)       45.7           85.4
DRNN (Irsoy and Cardie, 2014)     49.8           86.6
DCNN (Blunsom et al., 2014)       48.5           86.8
CNN-MC (Kim, 2014)                48.0           88.1
T-CNN (Lei et al., 2015)          51.2           88.6
PV (Le and Mikolov, 2014)         48.7           87.8
CT-LSTM (Tai et al., 2015)        51.0           88.0
LSTM (Tai et al., 2015)           46.4           84.9
2-layer LSTM (Tai et al., 2015)   46.0           86.3
LSTMN                             47.6           86.3
2-layer LSTMN                     47.9           87.0
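A minimal sketch of the classification head described above: the LSTMN hidden states of a sentence are averaged and passed to a 2-layer neural network classifier with ReLU activation. The 64-unit classifier layer and the random toy inputs are hypothetical; the paper does not specify the classifier width.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentiment_probs(H, W1, b1, W2, b2):
    """Average the LSTMN hidden states of a sentence, then apply a
    2-layer ReLU classifier to predict one of the five sentiment labels."""
    sentence = H.mean(axis=0)            # averaged hidden vector
    hidden = relu(W1 @ sentence + b1)    # first classifier layer
    return softmax(W2 @ hidden + b2)     # distribution over the 5 labels

# toy usage with the paper's memory size of 168 and a hypothetical 64-unit classifier layer
rng = np.random.default_rng(3)
H = rng.normal(size=(19, 168))           # hidden tape for a 19-token sentence
W1, b1 = rng.normal(scale=0.05, size=(64, 168)), np.zeros(64)
W2, b2 = rng.normal(scale=0.05, size=(5, 64)), np.zeros(5)
print(sentiment_probs(H, W1, b1, W2, b2))
```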
We compared our model with a wide range of top-performing systems. Most of these models (including ours) are LSTM variants (third block in Table 2), recursive neural networks (first block), or convolutional neural networks (CNNs; second block). Recursive models assume the input sentences are represented as parse trees and can take advantage of annotations at the phrase level. LSTM-type models and CNNs are trained on sequential input, with the exception of CT-LSTM (Tai et al., 2015), which operates over tree-structured network topologies such as constituent trees. For comparison, we also report the performance of the paragraph vector model (PV; Le and Mikolov (2014); see Table 2, second block), which neither operates on trees nor sequences but learns distributed document representations parameterized directly.

The results in Table 2 show that both 1- and 2-layer LSTMNs outperform the LSTM baselines while achieving numbers comparable to the state of the art. The number of layers for our models was set to be comparable to previously published results. On the fine-grained and binary classification tasks our 2-layer LSTMN performs close to the best system, T-CNN (Lei et al., 2015). Figure 5 shows examples of intra-attention for sentiment words. Interestingly, the network learns to associate sentiment-important words such as though and fantastic or not and good.

Figure 5: Examples of intra-attention (sentiment analysis). Bold lines (red) indicate attention between sentiment important words.

5.3 Natural Language Inference

The ability to reason about the semantic relationship between two sentences is an integral part of text understanding. We therefore evaluate our model on recognizing textual entailment, i.e., whether two premise-hypothesis pairs are entailing, contradictory, or neutral. For this task we used the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), which contains premise-hypothesis pairs and target labels indicating their relation. After removing sentences with unknown labels, we end up with 549,367 pairs for training, 9,842 for development and 9,824 for testing. The vocabulary size is 36,809 and the average sentence length is 22. We performed lower-casing and tokenization for the entire dataset.

Recent approaches use two sequential LSTMs to encode the premise and the hypothesis respectively, and apply neural attention to reason about their logical relationship (Rocktäschel et al., 2016; Wang and Jiang, 2016). Furthermore, Rocktäschel et al. (2016) show that a non-standard encoder-decoder architecture which processes the hypothesis conditioned on the premise significantly boosts performance. We use a similar approach to tackle this task with LSTMNs. Specifically, we use two LSTMNs to read the premise and hypothesis, and then match them by comparing their hidden state tapes. We perform average pooling over the hidden state tape of each LSTMN, and concatenate the two averages to form the input to a 2-layer neural network classifier with ReLU as the activation function.

We used pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize the word embeddings. Out-of-vocabulary (OOV) words were initialized randomly with Gaussian samples (µ=0, σ=1). We only updated OOV vectors in the first epoch, after which all word embeddings were updated normally. The dropout rate was selected from [0.1, 0.2, 0.3, 0.4]. We used Adam (Kingma and Ba, 2015) for optimization, with the two momentum parameters set to 0.9 and 0.999 respectively, and the initial learning rate set to 1E-3. The mini-batch size was set to 16 or 32. For a fair comparison against previous work, we report results with different hidden/memory dimensions (i.e., 100, 300, and 450).
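A sketch of the matching architecture described above: the two hidden state tapes (one per LSTMN) are average-pooled, concatenated, and fed to a 2-layer ReLU classifier over the three relation labels. The 200-unit classifier layer and the random toy inputs are hypothetical choices for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entailment_probs(H_premise, H_hypothesis, W1, b1, W2, b2):
    """Average-pool the hidden tapes of the two LSTMNs, concatenate the
    two averages, and classify into entailment / contradiction / neutral."""
    features = np.concatenate([H_premise.mean(axis=0), H_hypothesis.mean(axis=0)])
    hidden = relu(W1 @ features + b1)      # 2-layer ReLU classifier
    return softmax(W2 @ hidden + b2)

# toy usage with 100-dimensional hidden states and a hypothetical 200-unit classifier layer
rng = np.random.default_rng(4)
H_p, H_h = rng.normal(size=(22, 100)), rng.normal(size=(14, 100))
W1, b1 = rng.normal(scale=0.05, size=(200, 200)), np.zeros(200)
W2, b2 = rng.normal(scale=0.05, size=(3, 200)), np.zeros(3)
print(entailment_probs(H_p, H_h, W1, b1, W2, b2))
```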
We compared variants of our model against different types of LSTMs (see the second block in Table 3). Specifically, these include a model which encodes the premise and hypothesis independently with two LSTMs (Bowman et al., 2015), a shared LSTM (Rocktäschel et al., 2016), a word-by-word attention model (Rocktäschel et al., 2016), and a matching LSTM (mLSTM; Wang and Jiang (2016)). The latter model sequentially processes the hypothesis, and at each position tries to match the current word with an attention-weighted representation of the premise (rather than basing its predictions on whole sentence embeddings). We also compared our models with a bag-of-words baseline which averages the pre-trained embeddings for the words in each sentence and concatenates them to create features for a logistic regression classifier (first block in Table 3).

Table 3: Parameter counts |θ|_M, size of hidden unit h, and model accuracy (%) on the natural language inference task.

Models                                h     |θ|_M   Test
BOW concatenation                     —     —       59.8
LSTM (Bowman et al., 2015)            100   221k    77.6
LSTM-att (Rocktäschel et al., 2016)   100   252k    83.5
mLSTM (Wang and Jiang, 2016)          300   1.9M    86.1
LSTMN                                 100   260k    81.5
LSTMN shallow fusion                  100   280k    84.3
LSTMN deep fusion                     100   330k    84.5
LSTMN shallow fusion                  300   1.4M    85.2
LSTMN deep fusion                     300   1.7M    85.7
LSTMN shallow fusion                  450   2.8M    86.0
LSTMN deep fusion                     450   3.4M    86.3

LSTMNs achieve better performance compared to LSTMs (with and without attention; second block in Table 3). We also observe that fusion is generally beneficial, and that deep fusion slightly improves over shallow fusion. One explanation is that with deep fusion the inter-attention vectors are recurrently memorized by the decoder with a gating operation, which also improves the information flow of the network. With standard training, our deep fusion yields state-of-the-art performance on this task. Although encouraging, this result should be interpreted with caution since our model has substantially more parameters than related systems. We could compare different models using the same number of total parameters; however, this would inevitably introduce other biases, e.g., the number of hyper-parameters would become different.
6 Conclusions

In this paper we proposed a machine reading simulator to address the limitations of recurrent neural networks when processing inherently structured input. Our model is based on a Long Short-Term Memory architecture embedded with a memory network, explicitly storing contextual representations of input tokens without recursively compressing them. More importantly, an intra-attention mechanism is employed for memory addressing, as a way to induce undirected relations among tokens. The attention layer is not optimized with a direct supervision signal but with the entire network in downstream tasks. Experimental results across three tasks show that our model yields performance comparable or superior to the state of the art.

Although our experiments focused on LSTMs, the idea of building more structure-aware neural models is general and can be applied to other types of networks. When direct supervision is provided, similar architectures can be adapted to tasks such as dependency parsing and relation extraction. In the future, we hope to develop more linguistically plausible neural architectures able to reason over nested structures, and neural models that learn to discover compositionality with weak or indirect supervision.

Acknowledgments

We thank members of the ILCC at the School of Informatics and the anonymous reviewers for helpful comments. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

References

[Andreas et al. 2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proceedings of the 2016 NAACL: HLT, pages 1545–1554, San Diego, California.

[Bahdanau et al. 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 2014 ICLR, Banff, Alberta.

[Bengio et al. 1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

[Blunsom et al. 2014] Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd ACL, pages 655–665, Baltimore, Maryland.

[Bowman et al. 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 EMNLP, pages 22–32, Lisbon, Portugal.

[Bowman et al. 2016] Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th ACL, pages 1466–1477, Berlin, Germany.

[Cho et al. 2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 EMNLP, pages 1724–1734, Doha, Qatar.

[Chung et al. 2015] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In Proceedings of the 32nd ICML, pages 2067–2075, Lille, France.

[Clark et al. 2013] Peter Clark, Phil Harrison, and Niranjan Balasubramanian. 2013. A study of the knowledge base requirements for passing an elementary science test. In Proceedings of the 3rd Workshop on Automated KB Construction, San Francisco, California.

[Dagan et al. 2005] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.
In Proceedings of the tionalityinlanguage. InAdvancesinNeuralInforma- PASCAL Challenges Workshop on Recognising Tex- tionProcessingSystems,pages2096–2104. tualEntailment. [Dasetal.1992] Sreerupa Das, C. Lee Giles, and Guo [Kim2014] Yoon Kim. 2014. Convolutional neural net- zheng Sun. 1992. Learning context-free grammars: works for sentence classification. In Proceedings of Capabilities and limitations of a recurrent neural net- the2014EMNLP,pages1746–1751,Doha,Qatar. work with an externalstack memory. In Proceedings [KingmaandBa2015] Diederik Kingma and Jimmy Ba. of the 14th Annual Conference of the Cognitive Sci- 2015. Adam: A method for stochastic optimization. enceSociety,pages791–795.MorganKaufmannPub- InProceedingsofthe2015ICLR,SanDiego,Califor- lishers. nia. [Dyeretal.2015] Chris Dyer, Miguel Ballesteros, Wang [KleinandManning2004] Dan Klein and Christopher Ling, Austin Matthews, and Noah A Smith. 2015. Manning. 2004. Corpus-based induction of syntac- Transition-based dependency parsing with stack long ticstructure: Modelsofdependencyandconstituency. short-termmemory. InProceedingsofthe53rdACL, In Proceedings of the 42nd ACL, pages 478–485, pages334–343,Beijing,China. Barcelona,Spain. [Etzionietal.2011] Oren Etzioni, Anthony Fader, Janara [Konieczny2000] Lars Konieczny. 2000. Locality and Christensen, Stephen Soderland, and Mausam. 2011. parsing complexity. Journal of Psycholinguistics, Open information extraction: The second genera- 29(6):627–645. tion. In Proceedings of the 22nd IJCAI, pages 3–10, [Koutn´ıketal.2014] Jan Koutn´ık, Klaus Greff, Faustino Barcelona,Spain. Gomez,andJu¨rgenSchmidhuber. 2014. Aclockwork [Faderetal.2011] Anthony Fader, Stephen Soderland, RNN. InProceedingsofthe31stICML,pages1863– andOrenEtzioni. 2011. Identifyingrelationsforopen 1871,Beijing,China. information extraction. In Proceedings of the 2011 [Kumaretal.2016] Ankit Kumar, Ozan Irsoy, Jonathan EMNLP,pages1535–1545,Edinburgh,Scotland,UK. Su,JamesBradbury,RobertEnglish,BrianPierce,Pe- [FerreiraandHenderson1991] Fernanda Ferreira and ter Ondruska, Ishaan Gulrajani, and Richard Socher. JohnM.Henderson. 1991. Recoveryfrommisanaly- 2016. Ask me anything: Dynamic memory networks sesofgarden-pathsentences. JournalofMemoryand fornaturallanguageprocessing. InProceedingsofthe Language,30:725–745. 33rdICML,NewYork,NY. [FrankandBod2011] Stefan L. Frank and Rens Bod. 2011. Insensitivityofthehumansentence-processing [LeandMikolov2014] Quoc V Le and Tomas Mikolov. system to hierarchical structure. Pyschological Sci- 2014. Distributed representations of sentences and ence,22(6):829–834. documents. In Proceedings of the 31st ICML, pages [Graves2013] AlexGraves. 2013. Generatingsequences 1188–1196,Beijing,China. with recurrent neural networks. arXiv preprint [Leietal.2015] Tao Lei, Regina Barzilay, and Tommi arXiv:1308.0850. Jaakkola. 2015. Molding cnns for text: non-linear, [Grefenstetteetal.2015] Edward Grefenstette, non-consecutive convolutions. In Proceedings of the Karl Moritz Hermann, Mustafa Suleyman, and 2015EMNLP,pages1565–1575,Lisbon,Portugal.