Language to Logical Form with Neural Attention

Li Dong and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
li.dong@ed.ac.uk, mlap@inf.ed.ac.uk
Abstract

Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.

[Figure 1: Input utterances and their logical forms are encoded and decoded with neural networks. An attention layer is used to learn soft alignments. The figure depicts the pipeline "Input Utterance → Sequence Encoder → Sequence/Tree Decoder → Logical Form" for the question "what microsoft jobs do not require a bscs?".]

1 Introduction

Semantic parsing is the task of translating text to a formal meaning representation such as logical forms or structured queries. There has recently been a surge of interest in developing machine learning methods for semantic parsing (see the references in Section 2), due in part to the existence of corpora containing utterances annotated with formal meaning representations. Figure 1 shows an example of a question (left hand-side) and its annotated logical form (right hand-side), taken from JOBS (Tang and Mooney, 2001), a well-known semantic parsing benchmark. In order to predict the correct logical form for a given utterance, most previous systems rely on predefined templates and manually designed features, which often render the parsing model domain- or representation-specific. In this work, we aim to use a simple yet effective method to bridge the gap between natural language and logical form with minimal domain knowledge.

Encoder-decoder architectures based on recurrent neural networks have been successfully applied to a variety of NLP tasks ranging from syntactic parsing (Vinyals et al., 2015a), to machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), and image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b). As shown in Figure 1, we adapt the general encoder-decoder paradigm to the semantic parsing task. Our model learns from natural language descriptions paired with meaning representations; it encodes sentences and decodes logical forms using recurrent neural networks with long short-term memory (LSTM) units. We present two model variants: the first one treats semantic parsing as a vanilla sequence transduction task, whereas our second model is equipped with a hierarchical tree decoder which explicitly captures the compositional structure of logical forms. We also introduce an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015b) allowing the model to learn soft alignments between natural language and logical forms, and present an argument identification step to handle rare mentions of entities and numbers.

Evaluation results demonstrate that compared to previous methods our model achieves similar or better performance across datasets and meaning representations, despite using no hand-engineered domain- or representation-specific features.
2 Related Work

Our work synthesizes two strands of research, namely semantic parsing and the encoder-decoder architecture with neural networks.

The problem of learning semantic parsers has received significant attention, dating back to Woods (1973). Many approaches learn from sentences paired with logical forms following various modeling strategies. Examples include the use of parsing models (Miller et al., 1996; Ge and Mooney, 2005; Lu et al., 2008; Zhao and Huang, 2015), inductive logic programming (Zelle and Mooney, 1996; Tang and Mooney, 2000; Thompson and Mooney, 2003), probabilistic automata (He and Young, 2006), string/tree-to-tree transformation rules (Kate et al., 2005), classifiers based on string kernels (Kate and Mooney, 2006), machine translation (Wong and Mooney, 2006; Wong and Mooney, 2007; Andreas et al., 2013), and combinatory categorial grammar induction techniques (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011). Other work learns semantic parsers without relying on logical-form annotations, e.g., from sentences paired with conversational logs (Artzi and Zettlemoyer, 2011), system demonstrations (Chen and Mooney, 2011; Goldwasser and Roth, 2011; Artzi and Zettlemoyer, 2013), question-answer pairs (Clarke et al., 2010; Liang et al., 2013), and distant supervision (Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Reddy et al., 2014).

Our model learns from natural language descriptions paired with meaning representations. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. We instead present a general method that can be easily adapted to different domains and meaning representations. We adopt the general encoder-decoder framework based on neural networks which has been recently repurposed for various NLP tasks such as syntactic parsing (Vinyals et al., 2015a), machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b), question answering (Hermann et al., 2015), and summarization (Rush et al., 2015).

Mei et al. (2016) use a sequence-to-sequence model to map navigational instructions to actions. Our model works on more well-defined meaning representations (such as Prolog and lambda calculus) and is conceptually simpler; it does not employ bidirectionality or multi-level alignments. Grefenstette et al. (2014) propose a different architecture for semantic parsing based on the combination of two neural network models. The first model learns shared representations from pairs of questions and their translations into knowledge base queries, whereas the second model generates the queries conditioned on the learned representations. However, they do not report empirical evaluation results.

3 Problem Formulation

Our aim is to learn a model which maps natural language input q = x_1 ⋯ x_{|q|} to a logical form representation of its meaning a = y_1 ⋯ y_{|a|}. The conditional probability p(a|q) is decomposed as:

  p(a|q) = \prod_{t=1}^{|a|} p(y_t \mid y_{<t}, q)    (1)

where y_{<t} = y_1 ⋯ y_{t−1}.

Our method consists of an encoder which encodes natural language input q into a vector representation and a decoder which learns to generate y_1, ⋯, y_{|a|} conditioned on the encoding vector. In the following we describe two models varying in the way in which p(a|q) is computed.

3.1 Sequence-to-Sequence Model

This model regards both input q and output a as sequences. As shown in Figure 2, the encoder and decoder are two different L-layer recurrent neural networks with long short-term memory (LSTM) units which recursively process tokens one by one. The first |q| time steps belong to the encoder, while the following |a| time steps belong to the decoder. Let h_t^l ∈ R^n denote the hidden vector at time step t and layer l. h_t^l is then computed by:

  h_t^l = \mathrm{LSTM}\big(h_{t-1}^l, h_t^{l-1}\big)    (2)

where LSTM refers to the LSTM function being used. In our experiments we follow the architecture described in Zaremba et al. (2015), however other types of gated activation functions are possible (e.g., Cho et al. (2014)). For the encoder, h_t^0 = W_q e(x_t) is the word vector of the current input token, with W_q ∈ R^{n×|V_q|} being a parameter matrix, and e(·) the index of the corresponding token. For the decoder, h_t^0 = W_a e(y_{t−1}) is the word vector of the previously predicted word, where W_a ∈ R^{n×|V_a|}. Notice that the encoder and decoder have different LSTM parameters.
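To make the recurrence concrete, the following is a minimal NumPy sketch of how an L-layer encoder advances one token at a time as in Equation (2). This is not the implementation released with the paper; the plain LSTM cell, the toy dimensions, and the initialization are assumptions made only to keep the example runnable.

```python
# Illustrative sketch (not the authors' code): the stacked recurrence of
# Equation (2) with a standard LSTM cell and toy dimensions.
import numpy as np

n = 4  # hidden size (assumption for the sketch)

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step; W maps [x; h_prev] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x, h_prev])
    i, f, o, g = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    c = f * c_prev + i * np.tanh(g)
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
L = 2                                   # number of layers
W = [rng.normal(scale=0.08, size=(4*n, 2*n)) for _ in range(L)]

def step(x_embed, h, c):
    """Advance all L layers by one time step (Equation (2))."""
    inp = x_embed                       # h_t^0: the input word vector
    for l in range(L):
        h[l], c[l] = lstm_cell(inp, h[l], c[l], W[l])
        inp = h[l]                      # layer l's output feeds layer l+1
    return h, c

# Encode a toy sequence of |q| word vectors one token at a time.
h = [np.zeros(n) for _ in range(L)]
c = [np.zeros(n) for _ in range(L)]
for x_t in rng.normal(size=(3, n)):     # three dummy word vectors
    h, c = step(x_t, h, c)
print(h[-1])                            # topmost hidden vector h_t^L
```

The decoder reuses the same kind of stack with its own parameters, feeding in the embedding of the previously predicted token instead of an input word.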
[Figure 2: Sequence-to-sequence (SEQ2SEQ) model with two-layer recurrent neural networks.]

Once the tokens of the input sequence x_1, ⋯, x_{|q|} are encoded into vectors, they are used to initialize the hidden states of the first time step in the decoder. Next, the hidden vector of the topmost LSTM h_t^L in the decoder is used to predict the t-th output token as:

  p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h_t^L\big)^{\top} e(y_t)    (3)

where W_o ∈ R^{|V_a|×n} is a parameter matrix, and e(y_t) ∈ {0,1}^{|V_a|} a one-hot vector for computing y_t's probability from the predicted distribution.

We augment every sequence with a "start-of-sequence" <s> and "end-of-sequence" </s> token. The generation process terminates once </s> is predicted. The conditional probability of generating the whole sequence p(a|q) is then obtained using Equation (1).

3.2 Sequence-to-Tree Model

The SEQ2SEQ model has a potential drawback in that it ignores the hierarchical structure of logical forms. As a result, it needs to memorize various pieces of auxiliary information (e.g., bracket pairs) to generate well-formed output. In the following we present a hierarchical tree decoder which is more faithful to the compositional nature of meaning representations. A schematic description of the model is shown in Figure 3.

[Figure 3: Sequence-to-tree (SEQ2TREE) model with a hierarchical tree decoder.]

The present model shares the same encoder with the sequence-to-sequence model described in Section 3.1 (essentially it learns to encode input q as vectors). However, its decoder is fundamentally different as it generates logical forms in a top-down manner. In order to represent tree structure, we define a "nonterminal" <n> token which indicates subtrees. As shown in Figure 3, we preprocess the logical form "lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci))" to a tree by replacing tokens between pairs of brackets with nonterminals. Special tokens <s> and <(> denote the beginning of a sequence and nonterminal sequence, respectively (omitted from Figure 3 due to lack of space). Token </s> represents the end of sequence.

After encoding input q, the hierarchical tree decoder uses recurrent neural networks to generate tokens at depth 1 of the subtree corresponding to parts of logical form a. If the predicted token is <n>, we decode the sequence by conditioning on the nonterminal's hidden vector. This process terminates when no more nonterminals are emitted. In other words, a sequence decoder is used to hierarchically generate the tree structure.

In contrast to the sequence decoder described in Section 3.1, the current hidden state does not only depend on its previous time step. In order to better utilize the parent nonterminal's information, we introduce a parent-feeding connection where the hidden vector of the parent nonterminal is concatenated with the inputs and fed into the LSTM.
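For illustration, a small sketch of the preprocessing just described is given below; it replaces every bracketed subexpression with an <n> placeholder while keeping the hidden subtrees for later expansion. This is an assumption of how the step could be implemented, not the released code, and it presumes a whitespace-separated tokenization.

```python
# Illustrative sketch (not the authors' code): replace every bracketed
# subexpression with an <n> nonterminal, keeping the subtrees so that
# they can be expanded later by the tree decoder.
def to_tree(tokens):
    """Return (sequence, subtrees): the depth-1 sequence with <n> markers
    and the token lists hidden behind each <n>."""
    seq, subtrees, depth, buf = [], [], 0, []
    for tok in tokens:
        if tok == "(":
            if depth == 0:
                seq.append("<n>")
                buf = []
            else:
                buf.append(tok)
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth == 0:
                subtrees.append(buf)
            else:
                buf.append(tok)
        elif depth == 0:
            seq.append(tok)
        else:
            buf.append(tok)
    return seq, subtrees

lf = "lambda $0 e ( and ( > ( departure_time $0 ) 1600:ti ) ( from $0 dallas:ci ) )"
seq, subs = to_tree(lf.split())
print(seq)                   # ['lambda', '$0', 'e', '<n>']
print(to_tree(subs[0])[0])   # ['and', '<n>', '<n>']
```

Each <n> in the output sequence corresponds to a subtree whose tokens are decoded later, conditioned on that nonterminal's hidden vector.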
[Figure 4: A SEQ2TREE decoding example for the logical form "A B (C)".]

As an example, Figure 4 shows the decoding tree corresponding to the logical form "A B (C)", where y_1 ⋯ y_6 are predicted tokens, and t_1 ⋯ t_6 denote different time steps. Span "(C)" corresponds to a subtree. Decoding in this example has two steps: once input q has been encoded, we first generate y_1 ⋯ y_4 at depth 1 until token </s> is predicted; next, we generate y_5, y_6 by conditioning on nonterminal t_3's hidden vectors. The probability p(a|q) is the product of these two sequence decoding steps:

  p(a|q) = p(y_1 y_2 y_3 y_4 \mid q)\, p(y_5 y_6 \mid y_{\leq 3}, q)    (4)

where Equation (3) is used for the prediction of each output token.

3.3 Attention Mechanism

As shown in Equation (3), the hidden vectors of the input sequence are not directly used in the decoding process. However, it makes intuitive sense to consider relevant information from the input to better predict the current token. Following this idea, various techniques have been proposed to integrate encoder-side information (in the form of a context vector) at each time step of the decoder (Bahdanau et al., 2015; Luong et al., 2015b; Xu et al., 2015).

[Figure 5: Attention scores are computed by the current hidden vector and all the hidden vectors of the encoder. Then, the encoder-side context vector c^t is obtained in the form of a weighted sum, which is further used to predict y_t.]

As shown in Figure 5, in order to find relevant encoder-side context for the current hidden state h_t^L of the decoder, we compute its attention score with the k-th hidden state in the encoder as:

  s_k^t = \frac{\exp\{h_k^L \cdot h_t^L\}}{\sum_{j=1}^{|q|} \exp\{h_j^L \cdot h_t^L\}}    (5)

where h_1^L, ⋯, h_{|q|}^L are the top-layer hidden vectors of the encoder. Then, the context vector is the weighted sum of the hidden vectors in the encoder:

  c^t = \sum_{k=1}^{|q|} s_k^t\, h_k^L    (6)

In lieu of Equation (3), we further use this context vector, which acts as a summary of the encoder, to compute the probability of generating y_t as:

  h_t^{att} = \tanh\big(W_1 h_t^L + W_2 c^t\big)    (7)

  p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h_t^{att}\big)^{\top} e(y_t)    (8)

where W_o ∈ R^{|V_a|×n} and W_1, W_2 ∈ R^{n×n} are three parameter matrices, and e(y_t) is a one-hot vector used to obtain y_t's probability.

3.4 Model Training

Our goal is to maximize the likelihood of the generated logical forms given natural language utterances as input. So the objective function is:

  \mathrm{minimize}\; -\sum_{(q,a) \in D} \log p(a|q)    (9)

where D is the set of all natural language–logical form training pairs, and p(a|q) is computed as shown in Equation (1).

The RMSProp algorithm (Tieleman and Hinton, 2012) is employed to solve this non-convex optimization problem. Moreover, dropout is used for regularizing the model (Zaremba et al., 2015). Specifically, dropout operators are used between different LSTM layers and for the hidden layers before the softmax classifiers. This technique can substantially reduce overfitting, especially on datasets of small size.

3.5 Inference

At test time, we predict the logical form for an input utterance q by:

  \hat{a} = \arg\max_{a'} p(a' \mid q)    (10)

where a' represents a candidate output. However, it is impractical to iterate over all possible results to obtain the optimal prediction. According to Equation (1), we decompose the probability p(a|q) so that we can use greedy search (or beam search) to generate tokens one by one.
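Concretely, one attentive prediction step, Equations (5)–(8), used inside a greedy search loop can be sketched as follows. The sketch is illustrative only: the encoder states and parameter matrices are random placeholders, the decoder dynamics are a stand-in, and the vocabulary and end-of-sequence index are assumptions.

```python
# Illustrative sketch (not the authors' code) of Equations (5)-(8) applied
# greedily at inference time, with random placeholders for a trained model.
import numpy as np

rng = np.random.default_rng(0)
n, vocab = 4, 6                          # toy hidden size and output vocabulary
H_enc = rng.normal(size=(3, n))          # top-layer encoder vectors h_1^L..h_|q|^L
W1, W2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Wo = rng.normal(size=(vocab, n))
EOS = 0                                  # assumed index of </s>

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(h_t):
    scores = softmax(H_enc @ h_t)        # Equation (5): dot-product scores
    c_t = scores @ H_enc                 # Equation (6): context vector
    return np.tanh(W1 @ h_t + W2 @ c_t)  # Equation (7): attended hidden state

def decoder_step(h_t):
    """Placeholder for the decoder LSTM; returns the next hidden state."""
    return np.tanh(h_t + 0.1)            # stand-in dynamics for the sketch

h_t, output = H_enc[-1], []              # initialize from the encoder
for _ in range(20):                      # greedy search, token by token
    h_t = decoder_step(h_t)
    probs = softmax(Wo @ attend(h_t))    # Equation (8)
    y_t = int(np.argmax(probs))          # greedy choice of y_t
    output.append(y_t)
    if y_t == EOS:
        break
print(output)
```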
Algorithm 1 describes the decoding process for SEQ2TREE. The time complexity of both decoders is O(|a|), where |a| is the length of the output. The extra computation of SEQ2TREE compared with SEQ2SEQ is to maintain the nonterminal queue, which can be ignored because most of the time is spent on matrix operations. We implement the hierarchical tree decoder in a batch mode, so that it can fully utilize GPUs. Specifically, as shown in Algorithm 1, every time we pop multiple nonterminals from the queue and decode these nonterminals in one batch.

3.6 Argument Identification

The majority of semantic parsing datasets have been developed with question-answering in mind. In the typical application setting, natural language questions are mapped into logical forms and executed on a knowledge base to obtain an answer. Due to the nature of the question-answering task, many natural language utterances contain entities or numbers that are often parsed as arguments in the logical form. Some of them are unavoidably rare or do not appear in the training set at all (this is especially true for small-scale datasets). Conventional sequence encoders simply replace rare words with a special unknown word symbol (Luong et al., 2015a; Jean et al., 2015), which would be detrimental for semantic parsing.

We have developed a simple procedure for argument identification. Specifically, we identify entities and numbers in input questions and replace them with their type names and unique IDs. For instance, we pre-process the training example "jobs with a salary of 40000" and its logical form "job(ANS), salary_greater_than(ANS, 40000, year)" as "jobs with a salary of num_0" and "job(ANS), salary_greater_than(ANS, num_0, year)". We use the pre-processed examples as training data. At inference time, we also mask entities and numbers with their types and IDs. Once we obtain the decoding result, a post-processing step recovers all the markers type_i to their corresponding logical constants.
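A minimal sketch of this masking and recovery step is shown below. It is illustrative only: the entity lexicon, type names, and matching rules are assumptions (the paper simply uses plain string matching, as noted in Section 4.2).

```python
# Illustrative sketch (not the authors' code): mask entities/numbers with
# typed placeholders before training/decoding, then restore them afterwards.
import re

def mask_arguments(question, logical_form, lexicon):
    """Replace known entity mentions and numbers with type_i markers."""
    mapping, counters = {}, {}
    # entities from a (hypothetical) lexicon of mention -> type
    for mention, typ in lexicon.items():
        if mention in question:
            marker = f"{typ}{counters.get(typ, 0)}"
            counters[typ] = counters.get(typ, 0) + 1
            mapping[marker] = mention
            question = question.replace(mention, marker)
            logical_form = logical_form.replace(mention, marker)
    # plain numbers
    for num in re.findall(r"\d+", question):
        marker = f"num{counters.get('num', 0)}"
        counters["num"] = counters.get("num", 0) + 1
        mapping[marker] = num
        question = question.replace(num, marker, 1)
        logical_form = logical_form.replace(num, marker, 1)
    return question, logical_form, mapping

def unmask(decoded, mapping):
    """Post-processing: recover markers to their logical constants."""
    for marker, value in mapping.items():
        decoded = decoded.replace(marker, value)
    return decoded

q, lf, m = mask_arguments("jobs with a salary of 40000",
                          "job(ANS),salary_greater_than(ANS,40000,year)", {})
print(q)              # jobs with a salary of num0
print(unmask(lf, m))  # job(ANS),salary_greater_than(ANS,40000,year)
```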
4 Experiments

We compare our method against multiple previous systems on four datasets. We describe these datasets below, and present our experimental settings and results. Finally, we conduct model analysis in order to understand what the model learns. The code is available at https://github.com/donglixp/lang2logic.

4.1 Datasets

Our model was trained on the following datasets, covering different domains and using different meaning representations. Examples for each domain are shown in Table 1.

Dataset | Length | Example
JOBS    | 9.80   | what microsoft jobs do not require a bscs?
        | 22.90  | answer(company(J,'microsoft'),job(J),not((req_deg(J,'bscs'))))
GEO     | 7.60   | what is the population of the state with the largest area?
        | 19.10  | (population:i (argmax $0 (state:t $0) (area:i $0)))
ATIS    | 11.10  | dallas to san francisco leaving after 4 in the afternoon please
        | 28.10  | (lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci) (to $0 san_francisco:ci)))
IFTTT   | 6.95   | Turn on heater when temperature drops below 58 degree
        | 21.80  | TRIGGER: Weather - Current_temperature_drops_below - ((Temperature (58)) (Degrees_in (f)))
        |        | ACTION: WeMo_Insight_Switch - Turn_on - ((Which_switch? ("")))

Table 1: Examples of natural language descriptions and their meaning representations from four datasets. The average length of input and output sequences is shown in the second column.

JOBS  This benchmark dataset contains 640 queries to a database of job listings. Specifically, questions are paired with Prolog-style queries. We used the same training-test split as Zettlemoyer and Collins (2005) which contains 500 training and 140 test instances. Values for the variables company, degree, language, platform, location, job area, and number are identified.

GEO  This is a standard semantic parsing benchmark which contains 880 queries to a database of U.S. geography. GEO has 880 instances split into a training set of 680 training examples and 200 test examples (Zettlemoyer and Collins, 2005). We used the same meaning representation based on lambda-calculus as Kwiatkowski et al. (2011). Values for the variables city, state, country, river, and number are identified.

ATIS  This dataset has 5,410 queries to a flight booking system. The standard split has 4,480 training instances, 480 development instances, and 450 test instances. Sentences are paired with lambda-calculus expressions. Values for the variables date, time, city, aircraft code, airport, airline, and number are identified.
IFTTT  Quirk et al. (2015) created this dataset by extracting a large number of if-this-then-that recipes from the IFTTT website¹. Recipes are simple programs with exactly one trigger and one action which users specify on the site. Whenever the conditions of the trigger are satisfied, the action is performed. Actions typically revolve around home security (e.g., "turn on my lights when I arrive home"), automation (e.g., "text me if the door opens"), well-being (e.g., "remind me to drink water if I've been at a bar for more than two hours"), and so on. Triggers and actions are selected from different channels (160 in total) representing various types of services, devices (e.g., Android), and knowledge sources (such as ESPN or Gmail). In the dataset, there are 552 trigger functions from 128 channels, and 229 action functions from 99 channels. We used Quirk et al.'s (2015) original split which contains 77,495 training, 5,171 development, and 4,294 test examples. The IFTTT programs are represented as abstract syntax trees and are paired with natural language descriptions provided by users (see Table 1). Here, numbers and URLs are identified.

¹ http://www.ifttt.com

Method | Accuracy
COCKTAIL (Tang and Mooney, 2001) | 79.4
PRECISE (Popescu et al., 2003) | 88.0
ZC05 (Zettlemoyer and Collins, 2005) | 79.3
DCS+L (Liang et al., 2013) | 90.7
TISP (Zhao and Huang, 2015) | 85.0
SEQ2SEQ | 87.1
  −attention | 77.9
  −argument | 70.7
SEQ2TREE | 90.0
  −attention | 83.6

Table 2: Evaluation results on JOBS.

4.2 Settings

Natural language sentences were lowercased; misspellings were corrected using a dictionary based on the Wikipedia list of common misspellings. Words were stemmed using NLTK (Bird et al., 2009). For IFTTT, we filtered tokens, channels and functions which appeared less than five times in the training set. For the other datasets, we filtered input words which did not occur at least two times in the training set, but kept all tokens in the logical forms. Plain string matching was employed to identify arguments as described in Section 3.6. More sophisticated approaches could be used, however we leave this to future work.

Model hyper-parameters were cross-validated on the training set for JOBS and GEO. We used the standard development sets for ATIS and IFTTT. We used the RMSProp algorithm (with batch size set to 20) to update the parameters. The smoothing constant of RMSProp was 0.95. Gradients were clipped at 5 to alleviate the exploding gradient problem (Pascanu et al., 2013). Parameters were randomly initialized from a uniform distribution U(−0.08, 0.08). A two-layer LSTM was used for IFTTT, while a one-layer LSTM was employed for the other domains. The dropout rate was selected from {0.2, 0.3, 0.4, 0.5}. Dimensions of hidden vector and word embedding were selected from {150, 200, 250}. Early stopping was used to determine the number of epochs. Input sentences were reversed before feeding into the encoder (Sutskever et al., 2014). We use greedy search to generate logical forms during inference. Notice that two decoders with shared word embeddings were used to predict triggers and actions for IFTTT, and two softmax classifiers are used to classify channels and functions.
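For convenience, the settings above can be summarized as a single configuration; the sketch below is illustrative only, and the field names are assumptions (the released code may organize them differently).

```python
# Illustrative summary of the reported training settings; field names are
# assumptions, and list-valued entries are per-dataset search spaces.
config = {
    "optimizer": "RMSProp",
    "batch_size": 20,
    "rmsprop_smoothing": 0.95,
    "gradient_clip": 5,
    "init_uniform": (-0.08, 0.08),
    "lstm_layers": {"IFTTT": 2, "JOBS": 1, "GEO": 1, "ATIS": 1},
    "dropout_rate_choices": [0.2, 0.3, 0.4, 0.5],
    "hidden_and_embedding_dim_choices": [150, 200, 250],
    "early_stopping": True,
    "reverse_input": True,      # Sutskever et al. (2014)
    "inference": "greedy",
}
print(config["lstm_layers"]["ATIS"])
```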
4.3 Results

We first discuss the performance of our model on JOBS, GEO, and ATIS, and then examine our results on IFTTT. Tables 2–4 present comparisons against a variety of systems previously described in the literature. We report results with the full models (SEQ2SEQ, SEQ2TREE) and two ablation variants, i.e., without an attention mechanism (−attention) and without argument identification (−argument). We report accuracy, which is defined as the proportion of the input sentences that are correctly parsed to their gold standard logical forms. Notice that DCS+L, KCAZ13 and GUSP output answers directly, so accuracy in this setting is defined as the percentage of correct answers.

Method | Accuracy
SCISSOR (Ge and Mooney, 2005) | 72.3
KRISP (Kate and Mooney, 2006) | 71.7
WASP (Wong and Mooney, 2006) | 74.8
λ-WASP (Wong and Mooney, 2007) | 86.6
LNLZ08 (Lu et al., 2008) | 81.8
ZC05 (Zettlemoyer and Collins, 2005) | 79.3
ZC07 (Zettlemoyer and Collins, 2007) | 86.1
UBL (Kwiatkowski et al., 2010) | 87.9
FUBL (Kwiatkowski et al., 2011) | 88.6
KCAZ13 (Kwiatkowski et al., 2013) | 89.0
DCS+L (Liang et al., 2013) | 87.9
TISP (Zhao and Huang, 2015) | 88.9
SEQ2SEQ | 84.6
  −attention | 72.9
  −argument | 68.6
SEQ2TREE | 87.1
  −attention | 76.8

Table 3: Evaluation results on GEO. 10-fold cross-validation is used for the systems shown in the top half of the table. The standard split of ZC05 is used for all other systems.

Method | Accuracy
ZC07 (Zettlemoyer and Collins, 2007) | 84.6
UBL (Kwiatkowski et al., 2010) | 71.4
FUBL (Kwiatkowski et al., 2011) | 82.8
GUSP-FULL (Poon, 2013) | 74.8
GUSP++ (Poon, 2013) | 83.5
TISP (Zhao and Huang, 2015) | 84.2
SEQ2SEQ | 84.2
  −attention | 75.7
  −argument | 72.3
SEQ2TREE | 84.6
  −attention | 77.5

Table 4: Evaluation results on ATIS.

(a) Omit non-English.
Method | Channel | +Func | F1
retrieval | 28.9 | 20.2 | 41.7
phrasal | 19.3 | 11.3 | 35.3
sync | 18.1 | 10.6 | 35.1
classifier | 48.8 | 35.2 | 48.4
posclass | 50.0 | 36.9 | 49.3
SEQ2SEQ | 54.3 | 39.2 | 50.1
  −attention | 54.0 | 37.9 | 49.8
  −argument | 53.9 | 38.6 | 49.7
SEQ2TREE | 55.2 | 40.1 | 50.4
  −attention | 54.3 | 38.2 | 50.0

(b) Omit non-English & unintelligible.
Method | Channel | +Func | F1
retrieval | 36.8 | 25.4 | 49.0
phrasal | 27.8 | 16.4 | 39.9
sync | 26.7 | 15.5 | 37.6
classifier | 64.8 | 47.2 | 56.5
posclass | 67.2 | 50.4 | 57.7
SEQ2SEQ | 68.8 | 50.5 | 60.3
  −attention | 68.7 | 48.9 | 59.5
  −argument | 68.8 | 50.4 | 59.7
SEQ2TREE | 69.6 | 51.4 | 60.4
  −attention | 68.7 | 49.5 | 60.2

(c) ≥3 turkers agree with gold.
Method | Channel | +Func | F1
retrieval | 43.3 | 32.3 | 56.2
phrasal | 37.2 | 23.5 | 45.5
sync | 36.5 | 24.1 | 42.8
classifier | 79.3 | 66.2 | 65.0
posclass | 81.4 | 71.0 | 66.5
SEQ2SEQ | 87.8 | 75.2 | 73.7
  −attention | 88.3 | 73.8 | 72.9
  −argument | 86.8 | 74.9 | 70.8
SEQ2TREE | 89.7 | 78.4 | 74.2
  −attention | 87.6 | 74.9 | 73.5

Table 5: Evaluation results on IFTTT.

Overall, SEQ2TREE is superior to SEQ2SEQ. This is to be expected since SEQ2TREE explicitly models compositional structure. On the JOBS and GEO datasets which contain logical forms with nested structures, SEQ2TREE outperforms SEQ2SEQ by 2.9% and 2.5%, respectively. SEQ2TREE achieves better accuracy over SEQ2SEQ on ATIS too; however, the difference is smaller, since ATIS is a simpler domain without complex nested structures. We find that adding attention substantially improves performance on all three datasets. This underlines the importance of utilizing soft alignments between inputs and outputs. We further analyze what the attention layer learns in Figure 6. Moreover, our results show that argument identification is critical for small-scale datasets. For example, about 92% of city names appear less than 4 times in the GEO training set, so it is difficult to learn reliable parameters for these words. In relation to previous work, the proposed models achieve comparable or better performance. Importantly, we use the same framework (SEQ2SEQ or SEQ2TREE) across datasets and meaning representations (Prolog-style logical forms in JOBS and lambda calculus in the other two datasets) without modification. Despite this relatively simple approach, we observe that SEQ2TREE ranks second on JOBS, and is tied for first place with ZC07 on ATIS.
[Figure 6: Alignments (same color rectangles) produced by the attention mechanism (darker color represents higher attention score). Input sentences are reversed and stemmed. Model output is shown for SEQ2SEQ (a, b) and SEQ2TREE (c, d). Panels: (a) which jobs pay num0 that do not require a degid0; (b) what's first class fare round trip from ci0 to ci1; (c) what is the earliest flight from ci0 to ci1 tomorrow; (d) what is the highest elevation in the co0.]
We illustrate examples of alignments produced by SEQ2SEQ in Figures 6a and 6b. Alignments produced by SEQ2TREE are shown in Figures 6c and 6d. Matrices of attention scores are computed using Equation (5) and are represented in grayscale. Aligned input words and logical form predicates are enclosed in (same color) rectangles whose overlapping areas contain the attention scores. Also notice that attention scores are computed by LSTM hidden vectors which encode context information rather than just the words in their current positions. The examples demonstrate that the attention mechanism can successfully model the correspondence between sentences and logical forms, capturing reordering (Figure 6b), many-to-many (Figure 6a), and many-to-one alignments (Figures 6c, d).

For IFTTT, we follow the same evaluation protocol introduced in Quirk et al. (2015). The dataset is extremely noisy and measuring accuracy is problematic since predicted abstract syntax trees (ASTs) almost never exactly match the gold standard. Quirk et al. view an AST as a set of productions and compute balanced F1 instead, which we also adopt. The first column in Table 5 shows the percentage of channels selected correctly for both triggers and actions. The second column measures accuracy for both channels and functions. The last column shows balanced F1 against the gold tree over all productions in the proposed derivation. We compare our model against posclass, the method introduced in Quirk et al., and several of their baselines. posclass is reminiscent of KRISP (Kate and Mooney, 2006); it learns distributions over productions given input sentences represented as a bag of linguistic features. The retrieval baseline finds the closest description in the training data based on character string-edit-distance and returns the recipe for that training program. The phrasal method uses phrase-based machine translation to generate the recipe, whereas sync extracts synchronous grammar rules from the data, essentially recreating WASP (Wong and Mooney, 2006). Finally, they use a binary classifier to predict whether a production should be present in the derivation tree corresponding to the description.

Quirk et al. (2015) report results on the full data and smaller subsets after noise filtering, e.g., when non-English and unintelligible descriptions are removed (Tables 5a and 5b). They also ran their system on a high-quality subset of description-program pairs which were found in the gold standard and at least three humans managed to independently reproduce (Table 5c). Across all subsets our models outperform posclass and related baselines. Again we observe that SEQ2TREE consistently outperforms SEQ2SEQ, albeit with a small margin.
Compared to the previous datasets, the attention mechanism and our argument identification method yield less of an improvement. This may be due to the size of the Quirk et al. (2015) dataset and the way it was created: user-curated descriptions are often of low quality, and thus align very loosely to their corresponding ASTs.

4.4 Error Analysis

Finally, we inspected the output of our model in order to identify the most common causes of errors, which we summarize below.

Under-Mapping  The attention model used in our experiments does not take the alignment history into consideration. So some question words, especially in longer questions, may be ignored in the decoding process. This is a common problem for encoder-decoder models and can be addressed by explicitly modelling the decoding coverage of the source words (Tu et al., 2016; Cohn et al., 2016). Keeping track of the attention history would help adjust future attention and guide the decoder towards untranslated source words.

Argument Identification  Some mentions are incorrectly identified as arguments. For example, the word may is sometimes identified as a month when it is simply a modal verb. Moreover, some argument mentions are ambiguous. For instance, 6 o'clock can be used to express either 6 am or 6 pm. We could disambiguate arguments based on contextual information. The execution results of logical forms could also help prune unreasonable arguments.

Rare Words  Because the data size of JOBS, GEO, and ATIS is relatively small, some question words are rare in the training set, which makes it hard to estimate reliable parameters for them. One solution would be to learn word embeddings on unannotated text data, and then use these as pre-trained vectors for question words.

5 Conclusions

In this paper we presented an encoder-decoder neural network model for mapping natural language descriptions to their meaning representations. We encode natural language utterances into vectors and generate their corresponding logical forms as sequences or trees using recurrent neural networks with long short-term memory units. Experimental results show that enhancing the model with a hierarchical tree decoder and an attention mechanism improves performance across the board. Extensive comparisons with previous methods show that our approach performs competitively, without recourse to domain- or representation-specific features. Directions for future work are many and varied. For example, it would be interesting to learn a model from question-answer pairs without access to target logical forms. Beyond semantic parsing, we would also like to apply our SEQ2TREE model to related structured prediction tasks such as constituency parsing.

Acknowledgments  We would like to thank Luke Zettlemoyer and Tom Kwiatkowski for sharing the ATIS dataset. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

References

[Andreas et al. 2013] Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st ACL, pages 47–52, Sofia, Bulgaria.

[Artzi and Zettlemoyer 2011] Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 EMNLP, pages 421–432, Edinburgh, United Kingdom.

[Artzi and Zettlemoyer 2013] Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1(1):49–62.

[Bahdanau et al. 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR, San Diego, California.

[Bird et al. 2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

[Cai and Yates 2013] Qingqing Cai and Alexander Yates. 2013. Semantic parsing freebase: Towards open-domain semantic parsing. In 2nd Joint Conference on Lexical and Computational Semantics, pages 328–338, Atlanta, Georgia.

[Chen and Mooney 2011] David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 15th AAAI, pages 859–865, San Francisco, California.

[Cho et al. 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 EMNLP, pages 1724–1734, Doha, Qatar.
[Clarke et al. 2010] James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of CoNLL, pages 18–27, Uppsala, Sweden.

[Cohn et al. 2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of the 2016 NAACL, San Diego, California.

[Ge and Mooney 2005] Ruifang Ge and Raymond J. Mooney. 2005. A statistical semantic parser that integrates syntax and semantics. In Proceedings of CoNLL, pages 9–16, Ann Arbor, Michigan.

[Goldwasser and Roth 2011] Dan Goldwasser and Dan Roth. 2011. Learning from natural instructions. In Proceedings of the 22nd IJCAI, pages 1794–1800, Barcelona, Spain.

[Grefenstette et al. 2014] Edward Grefenstette, Phil Blunsom, Nando de Freitas, and Karl Moritz Hermann. 2014. A deep architecture for semantic parsing. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, Atlanta, Georgia.

[He and Young 2006] Yulan He and Steve Young. 2006. Semantic processing using the hidden vector state model. Speech Communication, 48(3-4):262–275.

[Hermann et al. 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1684–1692. Curran Associates, Inc.

[Jean et al. 2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd ACL and 7th IJCNLP, pages 1–10, Beijing, China.

[Kalchbrenner and Blunsom 2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 EMNLP, pages 1700–1709, Seattle, Washington.

[Karpathy and Fei-Fei 2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, pages 3128–3137, Boston, Massachusetts.

[Kate and Mooney 2006] Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st COLING and 44th ACL, pages 913–920, Sydney, Australia.

[Kate et al. 2005] Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the 20th AAAI, pages 1062–1068, Pittsburgh, Pennsylvania.

[Krishnamurthy and Mitchell 2012] Jayant Krishnamurthy and Tom Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 EMNLP, pages 754–765, Jeju Island, Korea.

[Kwiatkowski et al. 2010] Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 EMNLP, pages 1223–1233, Cambridge, Massachusetts.

[Kwiatkowski et al. 2011] Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the 2011 EMNLP, pages 1512–1523, Edinburgh, United Kingdom.

[Kwiatkowski et al. 2013] Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 EMNLP, pages 1545–1556, Seattle, Washington.

[Liang et al. 2013] Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

[Lu et al. 2008] Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proceedings of the 2008 EMNLP, pages 783–792, Honolulu, Hawaii.

[Luong et al. 2015a] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015a. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd ACL and 7th IJCNLP, pages 11–19, Beijing, China.

[Luong et al. 2015b] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 EMNLP, pages 1412–1421, Lisbon, Portugal.

[Mei et al. 2016] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of the 30th AAAI, Phoenix, Arizona. To appear.

[Miller et al. 1996] Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL, pages 55–61.