Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching∗

Wenpeng Yin, Hinrich Schütze
The Center for Information and Language Processing, LMU Munich, Germany
[email protected]

∗ EACL 2017 long paper.

Abstract

This work comparatively studies two typical sentence matching tasks, textual entailment (TE) and answer selection (AS), and observes that weaker phrase alignments are more critical in TE, while stronger phrase alignments deserve more attention in AS. The key to reaching this observation lies in phrase detection, phrase representation, phrase alignment and, most importantly, in how to connect the aligned phrases of different matching degrees with the final classifier. Prior work (i) has limitations in phrase generation and representation, or (ii) conducts alignment at word and phrase level by handcrafted features, or (iii) uses a single alignment framework without considering the characteristics of the specific task, which limits the framework's effectiveness across tasks. We propose an architecture based on the Gated Recurrent Unit that supports (i) representation learning for phrases of arbitrary granularity and (ii) task-specific attentive pooling of phrase alignments between two sentences. Experimental results on TE and AS match our observation and show the effectiveness of our approach.

Figure 1: Alignment examples in TE (top) and AS (bottom). Green: identical (subset) alignment; blue: relatedness alignment; red: unrelated alignment. Q: the first sentence in TE or the question in AS; C+, C−: the correct or incorrect counterpart in the sentence pair (Q, C).

1 Introduction

How to model a pair of sentences is a critical issue in many NLP tasks, including textual entailment (Marelli et al., 2014a; Bowman et al., 2015a; Yin et al., 2016a) and answer selection (Yu et al., 2014; Yang et al., 2015; Santos et al., 2016). A key challenge common to these tasks is the lack of explicit alignment annotation between the sentences of a pair. Thus, inferring and assessing the semantic relations between words and phrases in the two sentences is a core issue.

Figure 1 shows examples of human-annotated phrase alignments. In the TE example, we try to figure out whether Q entails C+ (positive) or C− (negative). As human beings, we discover the relationship between two sentences by studying the alignments between their linguistic units. We see that some phrases are kept: "are playing outdoors" (between Q and C+), "are playing" (between Q and C−). Other phrases are deliberately changed into related semantics: "the young boys" (Q) → "the kids" (C+ & C−), "the man is smiling nearby" (Q) → "near a man with a smile" (C+) or → "an old man is standing in the background" (C−). The kept parts have stronger alignments (green), and the changed parts have weaker alignments (blue). Here, by "strong"/"weak" we mean how semantically close the two aligned phrases are. To successfully identify the relationship of (Q, C+) or (Q, C−), studying the changed parts is crucial. Hence, we argue that TE should pay more attention to weaker alignments.
In AS, we try to figure out whether sentence C+ or sentence C− answers question Q. Roughly, the content of candidates C+ and C− can be split into an aligned part (e.g., repeated or relevant parts) and a negligible part. This differs from TE, in which it is hard to claim that some parts are negligible or play only a minor role, as TE requires making clear that each part can entail or be entailed. Hence, TE is considerably sensitive to those "unseen" parts. In contrast, AS is more tolerant of negligible and less related parts. From the AS example in Figure 1, we see that "Auburndale Florida" (Q) finds the related part "the city" (C+), and "Auburndale" and "a city" (C−); "how big" (Q) also matches "had a population of 12,381" (C+) very well. Some unaligned parts exist as well, denoted by red color. Hence, we argue that stronger alignments in AS deserve more attention.

The above analysis suggests that: (i) alignments connecting two sentences can occur between phrases of arbitrary granularity; (ii) phrase alignments can have different intensities; (iii) tasks with different properties require paying different attention to alignments of different intensities.

Alignments at word level (Yih et al., 2013) and phrase level (Yao et al., 2013) have both been studied before. For example, Yih et al. (2013) make use of WordNet (Miller, 1995) and Probase (Wu et al., 2012) for identifying hyper- and hyponymy. Yao et al. (2013) use POS tags, WordNet and a paraphrase database for alignment identification. These approaches rely on manual feature design and linguistic resources. We instead develop a deep neural network (DNN) to learn representations of phrases of arbitrary length, so that alignments can be searched in a more automatic and exhaustive way.

DNNs have been intensively investigated for sentence pair classification (Blacoe and Lapata, 2012; Socher et al., 2011; Yin and Schütze, 2015b), and attention mechanisms have also been applied to individual tasks (Santos et al., 2016; Rocktäschel et al., 2016; Wang and Jiang, 2016). However, most attention-based DNNs make the implicit assumption that stronger alignments deserve more attention (Yin et al., 2016a; Santos et al., 2016; Yin et al., 2016b). Our examples in Figure 1, instead, show that this assumption does not hold invariably: weaker alignments in certain tasks such as TE can be the indicator of the final decision. Our inspiration comes from the analysis of prior work. For TE, Yin et al. (2016a) show that also considering pairs in which overlapping tokens are removed gives a boost. This simple trick matches our motivation that weaker alignments should be given more attention in TE. However, Yin et al. (2016a) remove overlapping tokens completely, potentially obscuring complex alignment configurations. In addition, Yin et al. (2016a) use the same attention mechanism for TE and AS, which is suboptimal based on our observations.

This motivates us to introduce DNNs with a flexible attention mechanism that is adaptable to the specific task. For TE, it makes our system pay more attention to weaker alignments; for AS, it enables our system to focus on stronger alignments. We can treat the pre-processing in (Yin et al., 2016a) as a hard way and ours as a soft way, as our phrases have more flexible lengths and keeping overlapping phrases decreases the risk of losing important alignments. In our experiments, we show that this attention scheme is very effective across tasks.

We make the following contributions. (i) We use the GRU (Gated Recurrent Unit; Cho et al. (2014)) to learn representations for phrases of arbitrary granularity; based on these phrase representations, we can detect phrase alignments of different intensities. (ii) We propose attentive pooling to achieve a flexible choice among alignments, depending on the characteristics of the task. (iii) We achieve state of the art on the TE task.

2 Related Work

Non-DNN for sentence pair modeling. Heilman and Smith (2010) describe tree edit models that generalize tree edit distance by allowing operations that better account for complex reordering phenomena and by learning from data how different edits should affect the model's decisions about sentence relations. Wang and Manning (2010) handle the alignment between a sentence pair with a probabilistic model of tree-edit operations on dependency parse trees; their model treats alignments as structured latent variables and offers a principled framework for incorporating complex linguistic features.
Guo and Diab (2012) identify the degree of sentence similarity by modeling the missing words (words not in the sentence) so as to relieve the sparseness issue of sentence modeling. Yih et al. (2013) try to improve the shallow semantic component, lexical semantics, by formulating the sentence pair as a semantic matching problem with a latent word-alignment structure as in (Chang et al., 2010). More fine-grained word overlap and alignment between two sentences are explored in (Lai and Hockenmaier, 2014), using negation, hypernym/hyponym, synonym and antonym relations. Yao et al. (2013) extend word-to-word alignment to phrase-to-phrase alignment with a semi-Markov CRF. Such approaches often require more computational resources; in addition, using syntactic/semantic parsing at run time to find the best matching between structured representations of sentences is not trivial.

DNN for sentence pair classification. There has recently been great interest in using DNNs for classifying sentence pairs, as they reduce the burden of feature engineering. For TE, Bowman et al. (2015b) employ a recursive DNN to encode entailment on SICK (Marelli et al., 2014b). Rocktäschel et al. (2016) present an attention-based LSTM (long short-term memory; Hochreiter and Schmidhuber (1997)) for the SNLI corpus (Bowman et al., 2015a). For AS, Yu et al. (2014) present a bigram CNN (convolutional neural network; LeCun et al. (1998)) to model question and answer candidates. Yang et al. (2015) extend this method and obtain state-of-the-art performance on the WikiQA dataset. Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance-domain QA dataset. Tan et al. (2015) explore bidirectional LSTMs on the same dataset. Other sentence matching tasks, such as paraphrase identification (Socher et al., 2011; Yin and Schütze, 2015a) and question–Freebase fact matching (Yin et al., 2016b), have also been investigated.

Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures for paraphrasing, sentence completion (SC) and tweet–response matching. Yin and Schütze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching at multiple levels of granularity. Wan et al. (2016) match two sentences in AS and SC with multiple sentence representations, each coming from the local representations of two LSTMs.

Attention-based DNN for alignment. DNNs have been successfully used to detect alignments, e.g., in machine translation (Bahdanau et al., 2015; Luong et al., 2015) and text reconstruction (Li et al., 2015; Rush et al., 2015). Attention-based alignment has also been applied to natural language inference (e.g., Rocktäschel et al. (2016), Wang and Jiang (2016)). However, most of this work aligns word by word. As Figure 1 shows, many sentence relations are better identified through phrase-level alignments. This is one motivation of our work.
3 Model

This section first gives a brief introduction to the GRU and how it performs phrase representation learning, then describes the different attentive poolings over phrase alignments for the TE and AS tasks.

3.1 GRU Introduction

The GRU is a simplified version of the LSTM. Both are effective in sequence modeling, as they are order-sensitive and can capture long-range context. The tradeoffs between the GRU and its competitor LSTM have not been fully explored yet; according to empirical evaluations (Chung et al., 2014; Jozefowicz et al., 2015), there is no clear winner. In many tasks both architectures yield comparable performance, and tuning hyperparameters like layer size is probably more important than picking the ideal architecture. GRUs have fewer parameters and thus may train a bit faster or need less data to generalize. Hence, we use the GRU, as shown in Figure 2, to model text:

Figure 2: Gated Recurrent Unit

    z   = σ(x_t U^z + s_{t−1} W^z)            (1)
    r   = σ(x_t U^r + s_{t−1} W^r)            (2)
    h   = tanh(x_t U^h + (s_{t−1} ◦ r) W^h)   (3)
    s_t = (1 − z) ◦ h + z ◦ s_{t−1}           (4)

x is the input sentence with token x_t ∈ R^d at position t; s_t ∈ R^h is the hidden state at t, which encodes the history x_1, ..., x_t. z and r are the two gates. All U ∈ R^{d×h} and W ∈ R^{h×h} are GRU parameters.
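To make the update concrete, the following NumPy sketch implements one step of Eqs. (1)–(4). It is our own minimal illustration: the dimensionalities, the random initialization and the helper name gru_step are assumptions for the example, not part of the authors' released system.

    # One GRU step as in Eq. (1)-(4); weight shapes follow the paper
    # (U in R^{d x h}, W in R^{h x h}).
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, s_prev, params):
        """One GRU update: returns the new hidden state s_t."""
        Uz, Wz, Ur, Wr, Uh, Wh = (params[k] for k in ("Uz", "Wz", "Ur", "Wr", "Uh", "Wh"))
        z = sigmoid(x_t @ Uz + s_prev @ Wz)        # update gate, Eq. (1)
        r = sigmoid(x_t @ Ur + s_prev @ Wr)        # reset gate, Eq. (2)
        h = np.tanh(x_t @ Uh + (s_prev * r) @ Wh)  # candidate state, Eq. (3)
        return (1.0 - z) * h + z * s_prev          # new hidden state, Eq. (4)

    d, h = 300, 50                                 # word and hidden dimensionality (illustrative)
    rng = np.random.default_rng(0)
    params = {k: rng.normal(scale=0.1, size=(d, h)) for k in ("Uz", "Ur", "Uh")}
    params.update({k: rng.normal(scale=0.1, size=(h, h)) for k in ("Wz", "Wr", "Wh")})

    s = np.zeros(h)
    for x_t in rng.normal(size=(5, d)):            # a toy 5-token "sentence"
        s = gru_step(x_t, s, params)               # s now encodes the prefix read so far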
3.2 Representation Learning for Phrases

For a general sentence s with five consecutive words, ABCDE, each word represented by a word embedding of dimensionality d, we first create four fake sentences, s1: "BCDEA", s2: "CDEAB", s3: "DEABC" and s4: "EABCD", and put them in a matrix (Figure 3, left).

Figure 3: Phrase representation learning by GRU (left), sentence reformatting (right)

We run GRUs on each row of this matrix in parallel. As a GRU encodes the whole sequence up to the current position, this step generates representations for all consecutive phrases of the original sentence s. For example, the GRU hidden state at position "E" at coordinates (1,5) (i.e., 1st row, 5th column) is the representation of the phrase "ABCDE", which is s itself; the hidden state at "E" (2,4) is the representation of phrase "BCDE"; ...; the hidden state at "E" (5,1) is the representation of "E" itself. Hence, for each token, we learn the representations of all phrases ending with this token, and every phrase of any length in s gets a representation vector. The GRUs of the different rows share weights, so that all phrase representations are comparable in the same space.

Now we reformat sentence "ABCDE" into s^* = "(A) (B) (AB) (C) (BC) (ABC) (D) (CD) (BCD) (ABCD) (E) (DE) (CDE) (BCDE) (ABCDE)", as shown by the arrows in Figure 3 (right); the arrow direction indicates the phrase order. Each sequence in parentheses is a phrase (the parentheses only make the phrase boundaries clear). Taking the phrase "CDE" as an example, its representation comes from the hidden state at "E" (3,3) in Figure 3 (left); shaded parts are discarded. The main advantage of reformatting "ABCDE" into the new sentence s^* is that we create phrase-level semantic units while maintaining the order information. Hence, the sentence "how big is Auburndale Florida" in Figure 1 is reformatted into "(how) (big) (how big) (is) (big is) (how big is) (Auburndale) (is Auburndale) (big is Auburndale) (how big is Auburndale) (Florida) (Auburndale Florida) (is Auburndale Florida) (big is Auburndale Florida) (how big is Auburndale Florida)". Phrases are thus detected and represented exhaustively. In the experiments of this work, we use phrases of maximal length 7 instead of arbitrary lengths; a sketch of this enumeration is given below.
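The following sketch (again ours, building on the toy gru_step above) mimics the effect of the shifted matrix of Figure 3: running the same weight-shared GRU from every start position yields a representation for every phrase w_i..w_j of length at most 7, and reformat reproduces the phrase order of s^*. All function and variable names are illustrative assumptions.

    import numpy as np

    def phrase_representations(embeddings, step_fn, hidden_dim, max_len=7):
        """Return {(i, j): vector} for every phrase w_i..w_j with j - i + 1 <= max_len."""
        reps = {}
        for i in range(len(embeddings)):             # start position = one row of Figure 3
            s = np.zeros(hidden_dim)
            for j in range(i, min(i + max_len, len(embeddings))):
                s = step_fn(embeddings[j], s)        # the same GRU weights for every row
                reps[(i, j)] = s                     # hidden state = representation of w_i..w_j
        return reps

    def reformat(tokens, max_len=7):
        """Phrase order of s^*: for each end token, spans of increasing length."""
        return [tokens[i:j + 1] for j in range(len(tokens))
                for i in range(j, max(0, j - max_len + 1) - 1, -1)]

    print(reformat(list("ABCDE")))
    # [['A'], ['B'], ['A','B'], ['C'], ['B','C'], ['A','B','C'], ..., ['A','B','C','D','E']]

    # Illustrative use with the toy GRU above:
    # reps = phrase_representations(rng.normal(size=(5, d)),
    #                               lambda x, s: gru_step(x, s, params), h)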
3.3 Attentive Pooling

As each sentence s^* consists of a sequence of phrases, and each phrase is denoted by a representation vector generated by the GRU, we can compute an alignment matrix A between two sentences s^*_1 and s^*_2 by comparing every pair of phrases, one from s^*_1 and one from s^*_2. Letting s^*_1 and s^*_2 also denote the respective lengths, A ∈ R^{s^*_1 × s^*_2}. While there are many ways of computing the entries of A, we found that cosine similarity works well in our setting.

The first step is to detect the best alignment for each phrase by leveraging A. Concretely, for sentence s^*_1, we do row-wise max-pooling over A to obtain the attention vector a_1:

    a_{1,i} = max(A[i, :])    (5)

The entry a_{1,i} denotes the best alignment for the i-th phrase of sentence s^*_1. Similarly, column-wise max-pooling generates the attention vector a_2 for sentence s^*_2.

The question now is whether we should pay most attention to the phrases that are aligned very well or to the phrases that are aligned badly. According to the analysis of the two examples in Figure 1, we should pay more attention to weaker (resp. stronger) alignments in TE (resp. AS). To this end, we adopt a different second step over the attention vectors a_i (i = 1, 2) for TE and AS.

For TE, in which weaker alignments are supposed to contribute more, we do k-min-pooling over a_i, i.e., we keep only the k phrases that are aligned worst. For the (Q, C+) pair in the TE example of Figure 1, we expect this step to put most of our attention on the phrases "the kids", "the young boys", "near a man with a smile" and "and the man is smiling nearby", as they have relatively weaker alignments while their relations are the indicator of the final decision.

For AS, in which stronger alignments are supposed to contribute more, we do k-max-pooling over a_i, i.e., we keep only the k phrases that are aligned best. For the (Q, C+) pair in the AS example of Figure 1, we expect this k-max-pooling to put most of our attention on the phrases "how big", "Auburndale Florida", "the city" and "had a population of 12,381", as they have relatively stronger alignments and their relations are the indicator of the final decision. We keep the original order of the extracted phrases after k-min/max-pooling.

In summary, for TE we first do row-wise max-pooling over the alignment matrix and then k-min-pooling over the resulting alignment vector; we denote the whole process k-min-max-pooling. In contrast, we use k-max-max-pooling for AS. We refer to this method of two successive min/max pooling steps as attentive pooling. A sketch of both variants follows.
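Below is a minimal NumPy sketch of the two pooling steps, under the assumption that each sentence is already given as a matrix of phrase vectors (one row per phrase of s^*_1 or s^*_2, produced as in Section 3.2). Cosine similarity is used for the alignment matrix as in the paper; the function and variable names are our own.

    import numpy as np

    def cosine_matrix(P1, P2):
        """A[i, j] = cosine similarity of phrase i in s1 and phrase j in s2."""
        P1 = P1 / np.linalg.norm(P1, axis=1, keepdims=True)
        P2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)
        return P1 @ P2.T

    def attentive_pooling(P1, P2, k, task):
        """Return the k selected phrase vectors per sentence, in original order."""
        A = cosine_matrix(P1, P2)
        a1 = A.max(axis=1)            # best alignment of each phrase in s1 (Eq. 5)
        a2 = A.max(axis=0)            # best alignment of each phrase in s2
        picked = []
        for P, a in ((P1, a1), (P2, a2)):
            if task == "TE":          # k-min-max: keep the k worst-aligned phrases
                idx = np.argsort(a)[:k]
            else:                     # AS, k-max-max: keep the k best-aligned phrases
                idx = np.argsort(a)[-k:]
            picked.append(P[np.sort(idx)])   # restore the original phrase order
        return picked

    rng = np.random.default_rng(1)
    f1, f2 = attentive_pooling(rng.normal(size=(15, 50)), rng.normal(size=(12, 50)),
                               k=5, task="TE")
    print(f1.shape, f2.shape)         # (5, 50) (5, 50): the new per-sentence feature maps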
3.4 The Whole Architecture

We now present the whole system in Figure 4, taking the sentences s_1 "ABC" and s_2 "DEFG" as illustration. Each token, i.e., A to G, is denoted in the figure by an embedding vector, hence each sentence enters as an order-3 tensor (the rectangles are drawn only for simplicity). Based on this tensor-style sentence input, we have described the phrase representation learning by GRU_1 in Section 3.2 and attentive pooling in Section 3.3.

Figure 4: The whole architecture

Attentive pooling generates a new feature map for each sentence, as shown in Figure 4 (third layer from the bottom). Each column in the feature map represents a key phrase of the sentence that, based on our modeling assumptions, should be a good basis for the correct final decision. For instance, we expect such a feature map to contain representations of "the young boys", "outdoors" and "and the man is smiling nearby" for the sentence Q in the TE example of Figure 1.

We then run another GRU, GRU_2, over: 1) the new feature map of each sentence, to encode all its key phrases as the sentence representation; and 2) the concatenation of the two new sentence feature maps, to encode all key phrases of the two sentences sequentially as the representation of the sentence pair. As the GRU generates a hidden state at each position, we always take the last hidden state as the representation of the sentence or sentence pair. In Figure 4 (fourth layer), these final GRU-generated representations for sentence s_1, sentence s_2 and the sentence pair are depicted as green columns: s_1, s_2 and s_p, respectively.

The input to the final classifier is flexible; it can include representation vectors (rep), similarity scores between s_1 and s_2 (simi), and extra linguistic features (extra), and can vary with the task. We give details in Section 4.

4 Experiments

We test the proposed architecture on TE and AS benchmark datasets.

4.1 Common Setup

For both TE and AS, words are initialized by 300-dimensional GloVe embeddings (nlp.stanford.edu/projects/glove/; Pennington et al., 2014) and are not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [−.01, .01]. We use ADAM (Kingma and Ba, 2015) with a first momentum coefficient of 0.9 and a second momentum coefficient of 0.999 (the standard configuration recommended by Kingma and Ba), L2 regularization and Diversity Regularization (Xie et al., 2015). Table 1 shows the hyperparameter values, tuned on dev.

Table 1: Hyperparameters. d: dimensionality of hidden states in the GRU layers; lr: learning rate; bs: mini-batch size; L2: L2 normalization; div: diversity regularizer; k: k-min/max-pooling size.

         d           lr      bs   L2      div   k
    TE   [256, 256]  .0001   1    .0006   .06   5
    AS   [50, 50]    .0001   1    .0006   .06   6

Classifier. Following Yin et al. (2016a), we use three classifiers – logistic regression inside the DNN, and logistic regression and linear SVM with default parameters (both from scikit-learn, http://scikit-learn.org/stable/) directly on the feature vector – and report the performance of the best.

Common Baselines. (i) Addition: we sum word embeddings element-wise to form a sentence representation, then concatenate the two sentence representation vectors (s^0_1, s^0_2) as classifier input. (ii) A-LSTM: the pioneering attention-based LSTM system for a specific sentence pair classification task, natural language inference (Rocktäschel et al., 2016); A-LSTM has the same dimensionality as our GRU system in terms of initialized word representations and hidden states. (iii) ABCNN (Yin et al., 2016a): the state-of-the-art system in both TE and AS.

Based on the motivation in Section 1, the main hypothesis to be tested is: k-min-max-pooling is superior for TE and k-max-max-pooling is superior for AS. In addition, we would like to determine whether the second pooling step of attentive pooling, i.e., the k-min/max-pooling, is more effective than "full-pooling", in which all generated phrases are forwarded to the next layer.

4.2 Textual Entailment

SemEval 2014 Task 1 (Marelli et al., 2014a) evaluates system predictions of textual entailment (TE) relations on sentence pairs from the SICK dataset (Marelli et al., 2014b). The three classes are entailment, contradiction and neutral. The SICK train, dev and test sets contain 4439, 495 and 4906 pairs, respectively. We choose the SICK benchmark so that our result is directly comparable with that of (Yin et al., 2016a), in which non-overlapping text is utilized explicitly to boost performance; that trick inspired this work. Following Lai and Hockenmaier (2014), we train our final system (after fixing the hyperparameters) on train plus dev (4,934 pairs). Our evaluation measure is accuracy.

4.2.1 Feature Vector

The final feature vector used as classifier input contains three parts: rep, simi, extra.

Rep. Five vectors in total: the top sentence representations s_1 and s_2 and the top sentence pair representation s_p (shown in green in Figure 4), plus s^0_1 and s^0_2 from the Addition baseline.

Simi. Four similarity scores: cosine similarity and Euclidean distance between s_1 and s_2, and cosine similarity and Euclidean distance between s^0_1 and s^0_2. The Euclidean distance ‖·‖ is transformed into 1/(1 + ‖·‖).

Extra. We include the same 22 linguistic features as Yin et al. (2016a). They cover 15 machine translation metrics between the two sentences; whether or not the two sentences contain negation tokens like "no", "not" etc.; whether or not they contain synonyms, hypernyms or antonyms; and the two sentence lengths. See Yin et al. (2016a) for details. A sketch of how these parts are assembled is given below.
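As an illustration of how the three parts can be put together for the TE classifier, the sketch below assembles rep, simi and extra into a single input vector. The dimensionalities follow Table 1 and the 300-dimensional GloVe input, but the function itself is our own reconstruction under those assumptions, not the authors' code.

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def euc(a, b):
        return 1.0 / (1.0 + np.linalg.norm(a - b))    # Euclidean distance mapped to (0, 1]

    def te_feature_vector(s1, s2, sp, s01, s02, extra):
        rep = np.concatenate([s1, s2, sp, s01, s02])             # "rep": five vectors
        simi = np.array([cos(s1, s2), euc(s1, s2),               # "simi": four scores
                         cos(s01, s02), euc(s01, s02)])
        return np.concatenate([rep, simi, extra])                # "extra": 22 features appended

    rng = np.random.default_rng(2)
    x = te_feature_vector(*(rng.normal(size=256) for _ in range(3)),   # s1, s2, sp (d = 256 for TE)
                          *(rng.normal(size=300) for _ in range(2)),   # s^0_1, s^0_2 (GloVe sums)
                          rng.normal(size=22))                         # stand-in linguistic features
    print(x.shape)   # (1394,) = 3*256 + 2*300 + 4 + 22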
4.2.2 Results

Table 2 shows that the GRU with k-min-max-pooling achieves state-of-the-art performance on SICK and significantly outperforms k-max-max-pooling and full-pooling. Full-pooling feeds in more phrases than either k-max-max-pooling or k-min-max-pooling, which may cause two problems: (i) noisy alignments increase; (ii) the sentence pair representation s_p is no longer discriminative – s_p does not know whether its semantics comes from phrases of s_1 or of s_2, because different sentences have different lengths and the boundary separating the two sentences therefore varies across pairs. This boundary, however, is crucial for determining whether s_1 entails s_2.

Table 2: Results on SICK. Significant improvement over both k-max-max-pooling and full-pooling is marked with ∗ (test of equal proportions, p < .05).

    method                                             acc
    Top-3 SemEval   (Jimenez et al., 2014)             83.1
                    (Zhao et al., 2014)                83.6
                    (Lai and Hockenmaier, 2014)        84.6
    Baselines       TrRNTN (Bowman et al., 2015b)      76.9
                    Addition, no features              73.1
                    Addition, plus features            79.4
                    A-LSTM, no features                78.0
                    A-LSTM, plus features              81.7
                    ABCNN (Yin et al., 2016a)          86.2
    GRU k-min-max   – rep                              86.4
    ablation        – simi                             85.1
                    – extra                            85.5
    GRU             k-max-max-pooling                  84.9
                    full-pooling                       85.2
                    k-min-max-pooling                  87.1∗

ABCNN (Yin et al., 2016a) is based on an assumption similar to k-max-max-pooling: words/phrases with higher matching values should contribute more to this task. However, ABCNN reaches its optimal performance only by additionally using a reformatted SICK version in which overlapping tokens of the two sentences are removed. This hints that non-overlapping units are a big help for this task, which is exactly the strength of our k-min-max-pooling.

4.3 Answer Selection

We use the WikiQA (http://aka.ms/WikiQA; Yang et al., 2015) subtask that assumes there is at least one correct answer for each question. The dataset consists of 20,360, 1,130 and 2,352 question–candidate pairs in train, dev and test, respectively. Following Yang et al. (2015), we truncate answers to 40 tokens and report mean average precision (MAP) and mean reciprocal rank (MRR); a reference sketch of these two metrics follows.
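For reference, the small sketch below (our own illustration, not tied to the paper's evaluation scripts) computes MAP and MRR from per-question candidate lists that have been sorted by model score in descending order.

    def average_precision(labels):
        """labels: 0/1 relevance of the ranked candidates for one question."""
        hits, precisions = 0, []
        for rank, y in enumerate(labels, start=1):
            if y:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / max(hits, 1)

    def reciprocal_rank(labels):
        for rank, y in enumerate(labels, start=1):
            if y:
                return 1.0 / rank
        return 0.0

    def map_mrr(ranked_label_lists):
        n = len(ranked_label_lists)
        return (sum(map(average_precision, ranked_label_lists)) / n,
                sum(map(reciprocal_rank, ranked_label_lists)) / n)

    print(map_mrr([[0, 1, 0, 1], [1, 0, 0]]))   # (0.75, 0.75)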
Apart from the common baselines Addition, A-LSTM and ABCNN, we further compare with: (i) CNN-Cnt (Yang et al., 2015), which combines a CNN with two linguistic features, "WordCnt" (the number of non-stopwords in the question that also occur in the answer) and "WgtWordCnt" (the same counts reweighted by the IDF values of the question words); and (ii) AP-CNN (Santos et al., 2016).

4.3.1 Feature Vector

The final feature vector in AS has the same (rep, simi, extra) structure as in TE, except that simi consists of only the two cosine similarity scores, and extra consists of four entries: the two sentence lengths, WordCnt and WgtWordCnt.

4.3.2 Results

Table 3 shows that the GRU with k-max-max-pooling is significantly better than its k-min-max-pooling and full-pooling versions.

Table 3: Results on WikiQA. Significant improvement over both k-min-max-pooling and full-pooling is marked with ∗ (t-test, p < .05). The state of the art is 74.17 (MAP) / 75.88 (MRR) (Tymoshenko et al., 2016).

    method                                MAP      MRR
    Baselines       CNN-Cnt               0.6520   0.6652
                    Addition              0.5021   0.5069
                    Addition-Cnt          0.5888   0.5929
                    A-LSTM                0.5321   0.5469
                    A-LSTM-Cnt            0.6388   0.6529
                    AP-CNN                0.6886   0.6957
                    ABCNN                 0.6921   0.7127
    GRU k-max-max   – rep                 0.6913   0.6994
    ablation        – simi                0.6764   0.6875
                    – extra               0.6802   0.6899
    GRU             k-min-max-pooling     0.6674   0.6791
                    full-pooling          0.6693   0.6785
                    k-max-max-pooling     0.7124∗  0.7237∗

The GRU with k-max-max-pooling relies on an assumption similar to ABCNN (Yin et al., 2016a) and AP-CNN (Santos et al., 2016): units with higher matching scores are supposed to contribute more to this task. Our improvement can be attributed to two factors: (i) our linguistic units cover phrases more exhaustively, enabling alignments over a wider range; (ii) we have two max-pooling steps in our attentive pooling, and especially the second one removes some noisily aligned phrases. Both ABCNN and AP-CNN are based on convolutional layers, so their phrase detection is constrained by the filter sizes. Even though ABCNN uses a second CNN layer to detect larger-granularity phrases, phrases in different CNN layers cannot be aligned directly because they live in different spaces. The GRU in this work uses the same weights to learn representations of phrases of arbitrary granularity; hence, all phrases share the same representation space and can be compared directly.

4.4 Visual Analysis

In this subsection, we visualize the attention distributions over phrases, i.e., a_i in Equation 5, for the example sentences in Figure 1 (for space reasons, we show this only for the TE example). Figures 5(a)–5(b) show the attention values of each phrase in the (Q, C+) pair of the TE example in Figure 1.

Figure 5: Attention visualization. (a) Attention distribution over the phrases of Q in the TE example of Figure 1. (b) Attention distribution over the phrases of C+ in the TE example of Figure 1.

We find that k-min-pooling over these distributions indeed detects key phrases that determine the pair relation. Taking Figure 5(a) as an example, the phrase "young boys", phrases ending with "and", the phrases "smiling", "is smiling" and "nearby", and a couple of phrases ending with "nearby" have the lowest attention values; according to our k-min-pooling step, these phrases are selected as key phrases. In Figure 5(b), the phrase "kids", phrases ending with "near", and a couple of phrases ending with "smile" are selected as key phrases.

Looking at the key phrases of both sentences, we find that their discovery matches our analysis of the TE example in Section 1: "kids" corresponds to "young boys", and "smiling nearby" corresponds to "near ... smile". Another interesting phenomenon is that, taking Figure 5(b) as an example, even though "are playing outdoors" could be aligned well (it appears in both sentences), the visualization shows that the attention values of "are playing outdoors and" in Q and "are playing outdoors near" in C+ drop dramatically. This hints that our model can get rid of some surface matching: the key token "and" or "near" makes the semantics of "are playing outdoors and" and "are playing outdoors near" quite different from their sub-phrase "are playing outdoors". This matters, because "and" and "near" are the crucial units connecting the following key phrases "smiling nearby" in Q and "a smile" in C+. If we connect the key phrases sequentially as a new fake sentence, as we do in the attentive pooling layer of Figure 4, the fake sentence roughly "reconstructs" the meaning of the original sentence while being composed of phrase-level semantic units.

4.5 Effects of Pooling Size k

The key idea of the proposed method is realized by the k-min/max pooling. We show how the hyperparameter k influences the results by tuning on the dev sets.

Figure 6: Effects of pooling size k (cf. Table 1): performance on dev (SICK accuracy, WikiQA MAP) as a function of k.

Figure 6 shows the performance trends when varying k between 1 and 10 in the two tasks. Roughly, k > 4 gives competitive results, but larger values bring a performance drop.

5 Conclusion

In this work, we investigated the contribution of phrase alignments of different intensities in different tasks. We argue that it is not true that stronger alignments always matter more: we found that the TE task prefers weaker alignments while the AS task prefers stronger alignments. We proposed flexible attentive poolings in a GRU system to satisfy the different requirements of the two tasks. Experimental results show the soundness of our argument and the effectiveness of our attentive-pooling-based GRU systems.

As future work, we plan to investigate phrase representation learning in context and how to conduct the attentive pooling automatically, regardless of the category of the task.

Acknowledgments

We gratefully acknowledge the support of Deutsche Forschungsgemeinschaft for this work (SCHU 2246/8-2).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, pages 546–556.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Recursive neural networks can learn logical semantics. In Proceedings of the CVSC workshop, pages 12–21.

Ming-Wei Chang, Dan Goldwasser, Dan Roth, and Vivek Srikumar. 2010. Discriminative learning over constrained latent representations. In Proceedings of NAACL-HLT, pages 429–437.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. In Proceedings of the IEEE ASRU Workshop.

Weiwei Guo and Mona Diab. 2012. Modeling sentences in the latent space. In Proceedings of ACL, pages 864–872.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Proceedings of NAACL-HLT, pages 1011–1019.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NIPS, pages 2042–2050.

Sergio Jimenez, George Duenas, Julia Baquero, and Alexander Gelbukh. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In SemEval, pages 732–742.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of ICML, pages 2342–2350.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In SemEval, pages 329–334.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval, pages 1–8.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC, pages 216–223.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, pages 379–389.

Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609.
Dy- cardinality features for semantic textual similarity, namicpoolingandunfoldingrecursiveautoencoders relatedness and entailment. SemEval, pages 732– for paraphrase detection. In Proceedings of NIPS, 742. pages801–809.