Language to Logical Form with Neural Attention

Li Dong and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
li.dong@ed.ac.uk, mlap@inf.ed.ac.uk
Abstract

Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.

[Figure 1: Input utterances and their logical forms are encoded and decoded with neural networks. An attention layer is used to learn soft alignments. The figure depicts the pipeline "Input Utterance → Sequence Encoder → Sequence/Tree Decoder → Logical Form" for the question "what microsoft jobs do not require a bscs?".]

1 Introduction

Semantic parsing is the task of translating text to a formal meaning representation such as logical forms or structured queries. There has recently been a surge of interest in developing machine learning methods for semantic parsing (see the references in Section 2), due in part to the existence of corpora containing utterances annotated with formal meaning representations. Figure 1 shows an example of a question (left hand-side) and its annotated logical form (right hand-side), taken from JOBS (Tang and Mooney, 2001), a well-known semantic parsing benchmark. In order to predict the correct logical form for a given utterance, most previous systems rely on predefined templates and manually designed features, which often render the parsing model domain- or representation-specific. In this work, we aim to use a simple yet effective method to bridge the gap between natural language and logical form with minimal domain knowledge.

Encoder-decoder architectures based on recurrent neural networks have been successfully applied to a variety of NLP tasks ranging from syntactic parsing (Vinyals et al., 2015a), to machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), and image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b). As shown in Figure 1, we adapt the general encoder-decoder paradigm to the semantic parsing task. Our model learns from natural language descriptions paired with meaning representations; it encodes sentences and decodes logical forms using recurrent neural networks with long short-term memory (LSTM) units. We present two model variants: the first one treats semantic parsing as a vanilla sequence transduction task, whereas our second model is equipped with a hierarchical tree decoder which explicitly captures the compositional structure of logical forms. We also introduce an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015b) allowing the model to learn soft alignments between natural language and logical forms, and present an argument identification step to handle rare mentions of entities and numbers.

Evaluation results demonstrate that compared to previous methods our model achieves similar or better performance across datasets and meaning representations, despite using no hand-engineered domain- or representation-specific features.
2 Related Work

Our work synthesizes two strands of research, namely semantic parsing and the encoder-decoder architecture with neural networks.

The problem of learning semantic parsers has received significant attention, dating back to Woods (1973). Many approaches learn from sentences paired with logical forms following various modeling strategies. Examples include the use of parsing models (Miller et al., 1996; Ge and Mooney, 2005; Lu et al., 2008; Zhao and Huang, 2015), inductive logic programming (Zelle and Mooney, 1996; Tang and Mooney, 2000; Thompson and Mooney, 2003), probabilistic automata (He and Young, 2006), string/tree-to-tree transformation rules (Kate et al., 2005), classifiers based on string kernels (Kate and Mooney, 2006), machine translation (Wong and Mooney, 2006; Wong and Mooney, 2007; Andreas et al., 2013), and combinatory categorial grammar induction techniques (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011). Other work learns semantic parsers without relying on logical-form annotations, e.g., from sentences paired with conversational logs (Artzi and Zettlemoyer, 2011), system demonstrations (Chen and Mooney, 2011; Goldwasser and Roth, 2011; Artzi and Zettlemoyer, 2013), question-answer pairs (Clarke et al., 2010; Liang et al., 2013), and distant supervision (Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Reddy et al., 2014).

Our model learns from natural language descriptions paired with meaning representations. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. We instead present a general method that can be easily adapted to different domains and meaning representations. We adopt the general encoder-decoder framework based on neural networks which has been recently repurposed for various NLP tasks such as syntactic parsing (Vinyals et al., 2015a), machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b), question answering (Hermann et al., 2015), and summarization (Rush et al., 2015).

Mei et al. (2016) use a sequence-to-sequence model to map navigational instructions to actions. Our model works on more well-defined meaning representations (such as Prolog and lambda calculus) and is conceptually simpler; it does not employ bidirectionality or multi-level alignments. Grefenstette et al. (2014) propose a different architecture for semantic parsing based on the combination of two neural network models. The first model learns shared representations from pairs of questions and their translations into knowledge base queries, whereas the second model generates the queries conditioned on the learned representations. However, they do not report empirical evaluation results.

3 Problem Formulation

Our aim is to learn a model which maps natural language input q = x_1 ⋯ x_{|q|} to a logical form representation of its meaning a = y_1 ⋯ y_{|a|}. The conditional probability p(a|q) is decomposed as:

  p(a|q) = \prod_{t=1}^{|a|} p(y_t \mid y_{<t}, q)    (1)

where y_{<t} = y_1 ⋯ y_{t−1}.

Our method consists of an encoder which encodes natural language input q into a vector representation and a decoder which learns to generate y_1, ⋯, y_{|a|} conditioned on the encoding vector. In the following we describe two models varying in the way in which p(a|q) is computed.

3.1 Sequence-to-Sequence Model

This model regards both input q and output a as sequences. As shown in Figure 2, the encoder and decoder are two different L-layer recurrent neural networks with long short-term memory (LSTM) units which recursively process tokens one by one. The first |q| time steps belong to the encoder, while the following |a| time steps belong to the decoder. Let h_t^l ∈ R^n denote the hidden vector at time step t and layer l. h_t^l is then computed by:

  h_t^l = \mathrm{LSTM}\big(h_{t-1}^l, h_t^{l-1}\big)    (2)

where LSTM refers to the LSTM function being used. In our experiments we follow the architecture described in Zaremba et al. (2015), however other types of gated activation functions are possible (e.g., Cho et al. (2014)). For the encoder, h_t^0 = W_q e(x_t) is the word vector of the current input token, with W_q ∈ R^{n×|V_q|} being a parameter matrix, and e(·) the index of the corresponding token. For the decoder, h_t^0 = W_a e(y_{t−1}) is the word vector of the previously predicted word, where W_a ∈ R^{n×|V_a|}. Notice that the encoder and decoder have different LSTM parameters.
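To make the recurrence concrete, the following is a minimal NumPy sketch of how an L-layer encoder advances one token at a time as in Equation (2). This is not the implementation released with the paper; the plain LSTM cell, the toy dimensions, and the initialization are assumptions made only to keep the example runnable.

```python
# Illustrative sketch (not the authors' code): the stacked recurrence of
# Equation (2) with a standard LSTM cell and toy dimensions.
import numpy as np

n = 4  # hidden size (assumption for the sketch)

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step; W maps [x; h_prev] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x, h_prev])
    i, f, o, g = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    c = f * c_prev + i * np.tanh(g)
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
L = 2                                   # number of layers
W = [rng.normal(scale=0.08, size=(4*n, 2*n)) for _ in range(L)]

def step(x_embed, h, c):
    """Advance all L layers by one time step (Equation (2))."""
    inp = x_embed                       # h_t^0: the input word vector
    for l in range(L):
        h[l], c[l] = lstm_cell(inp, h[l], c[l], W[l])
        inp = h[l]                      # layer l's output feeds layer l+1
    return h, c

# Encode a toy sequence of |q| word vectors one token at a time.
h = [np.zeros(n) for _ in range(L)]
c = [np.zeros(n) for _ in range(L)]
for x_t in rng.normal(size=(3, n)):     # three dummy word vectors
    h, c = step(x_t, h, c)
print(h[-1])                            # topmost hidden vector h_t^L
```

The decoder reuses the same kind of stack with its own parameters, feeding in the embedding of the previously predicted token instead of an input word.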
[Figure 2: Sequence-to-sequence (SEQ2SEQ) model with two-layer recurrent neural networks.]

Once the tokens of the input sequence x_1, ⋯, x_{|q|} are encoded into vectors, they are used to initialize the hidden states of the first time step in the decoder. Next, the hidden vector of the topmost LSTM h_t^L in the decoder is used to predict the t-th output token as:

  p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h_t^L\big)^{\top} e(y_t)    (3)

where W_o ∈ R^{|V_a|×n} is a parameter matrix, and e(y_t) ∈ {0,1}^{|V_a|} a one-hot vector for computing y_t's probability from the predicted distribution.

We augment every sequence with a "start-of-sequence" <s> and "end-of-sequence" </s> token. The generation process terminates once </s> is predicted. The conditional probability of generating the whole sequence p(a|q) is then obtained using Equation (1).

3.2 Sequence-to-Tree Model

The SEQ2SEQ model has a potential drawback in that it ignores the hierarchical structure of logical forms. As a result, it needs to memorize various pieces of auxiliary information (e.g., bracket pairs) to generate well-formed output. In the following we present a hierarchical tree decoder which is more faithful to the compositional nature of meaning representations. A schematic description of the model is shown in Figure 3.

[Figure 3: Sequence-to-tree (SEQ2TREE) model with a hierarchical tree decoder.]

The present model shares the same encoder with the sequence-to-sequence model described in Section 3.1 (essentially it learns to encode input q as vectors). However, its decoder is fundamentally different as it generates logical forms in a top-down manner. In order to represent tree structure, we define a "nonterminal" <n> token which indicates subtrees. As shown in Figure 3, we preprocess the logical form "lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci))" to a tree by replacing tokens between pairs of brackets with nonterminals. Special tokens <s> and <(> denote the beginning of a sequence and nonterminal sequence, respectively (omitted from Figure 3 due to lack of space). Token </s> represents the end of sequence.

After encoding input q, the hierarchical tree decoder uses recurrent neural networks to generate tokens at depth 1 of the subtree corresponding to parts of logical form a. If the predicted token is <n>, we decode the sequence by conditioning on the nonterminal's hidden vector. This process terminates when no more nonterminals are emitted. In other words, a sequence decoder is used to hierarchically generate the tree structure.

In contrast to the sequence decoder described in Section 3.1, the current hidden state does not only depend on its previous time step. In order to better utilize the parent nonterminal's information, we introduce a parent-feeding connection where the hidden vector of the parent nonterminal is concatenated with the inputs and fed into the LSTM.
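For illustration, a small sketch of the preprocessing just described is given below; it replaces every bracketed subexpression with an <n> placeholder while keeping the hidden subtrees for later expansion. This is an assumption of how the step could be implemented, not the released code, and it presumes a whitespace-separated tokenization.

```python
# Illustrative sketch (not the authors' code): replace every bracketed
# subexpression with an <n> nonterminal, keeping the subtrees so that
# they can be expanded later by the tree decoder.
def to_tree(tokens):
    """Return (sequence, subtrees): the depth-1 sequence with <n> markers
    and the token lists hidden behind each <n>."""
    seq, subtrees, depth, buf = [], [], 0, []
    for tok in tokens:
        if tok == "(":
            if depth == 0:
                seq.append("<n>")
                buf = []
            else:
                buf.append(tok)
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth == 0:
                subtrees.append(buf)
            else:
                buf.append(tok)
        elif depth == 0:
            seq.append(tok)
        else:
            buf.append(tok)
    return seq, subtrees

lf = "lambda $0 e ( and ( > ( departure_time $0 ) 1600:ti ) ( from $0 dallas:ci ) )"
seq, subs = to_tree(lf.split())
print(seq)                   # ['lambda', '$0', 'e', '<n>']
print(to_tree(subs[0])[0])   # ['and', '<n>', '<n>']
```

Each <n> in the output sequence corresponds to a subtree whose tokens are decoded later, conditioned on that nonterminal's hidden vector.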
[Figure 4: A SEQ2TREE decoding example for the logical form "A B (C)".]

As an example, Figure 4 shows the decoding tree corresponding to the logical form "A B (C)", where y_1 ⋯ y_6 are predicted tokens, and t_1 ⋯ t_6 denote different time steps. Span "(C)" corresponds to a subtree. Decoding in this example has two steps: once input q has been encoded, we first generate y_1 ⋯ y_4 at depth 1 until token </s> is predicted; next, we generate y_5, y_6 by conditioning on nonterminal t_3's hidden vectors. The probability p(a|q) is the product of these two sequence decoding steps:

  p(a|q) = p(y_1 y_2 y_3 y_4 \mid q)\, p(y_5 y_6 \mid y_{\leq 3}, q)    (4)

where Equation (3) is used for the prediction of each output token.

3.3 Attention Mechanism

As shown in Equation (3), the hidden vectors of the input sequence are not directly used in the decoding process. However, it makes intuitive sense to consider relevant information from the input to better predict the current token. Following this idea, various techniques have been proposed to integrate encoder-side information (in the form of a context vector) at each time step of the decoder (Bahdanau et al., 2015; Luong et al., 2015b; Xu et al., 2015).

[Figure 5: Attention scores are computed by the current hidden vector and all the hidden vectors of the encoder. Then, the encoder-side context vector c^t is obtained in the form of a weighted sum, which is further used to predict y_t.]

As shown in Figure 5, in order to find relevant encoder-side context for the current hidden state h_t^L of the decoder, we compute its attention score with the k-th hidden state in the encoder as:

  s_k^t = \frac{\exp\{h_k^L \cdot h_t^L\}}{\sum_{j=1}^{|q|} \exp\{h_j^L \cdot h_t^L\}}    (5)

where h_1^L, ⋯, h_{|q|}^L are the top-layer hidden vectors of the encoder. Then, the context vector is the weighted sum of the hidden vectors in the encoder:

  c^t = \sum_{k=1}^{|q|} s_k^t\, h_k^L    (6)

In lieu of Equation (3), we further use this context vector, which acts as a summary of the encoder, to compute the probability of generating y_t as:

  h_t^{att} = \tanh\big(W_1 h_t^L + W_2 c^t\big)    (7)

  p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h_t^{att}\big)^{\top} e(y_t)    (8)

where W_o ∈ R^{|V_a|×n} and W_1, W_2 ∈ R^{n×n} are three parameter matrices, and e(y_t) is a one-hot vector used to obtain y_t's probability.

3.4 Model Training

Our goal is to maximize the likelihood of the generated logical forms given natural language utterances as input. So the objective function is:

  \mathrm{minimize}\; -\sum_{(q,a) \in D} \log p(a|q)    (9)

where D is the set of all natural language–logical form training pairs, and p(a|q) is computed as shown in Equation (1).

The RMSProp algorithm (Tieleman and Hinton, 2012) is employed to solve this non-convex optimization problem. Moreover, dropout is used for regularizing the model (Zaremba et al., 2015). Specifically, dropout operators are used between different LSTM layers and for the hidden layers before the softmax classifiers. This technique can substantially reduce overfitting, especially on datasets of small size.

3.5 Inference

At test time, we predict the logical form for an input utterance q by:

  \hat{a} = \arg\max_{a'} p(a' \mid q)    (10)

where a' represents a candidate output. However, it is impractical to iterate over all possible results to obtain the optimal prediction. According to Equation (1), we decompose the probability p(a|q) so that we can use greedy search (or beam search) to generate tokens one by one.
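Concretely, one attentive prediction step, Equations (5)–(8), used inside a greedy search loop can be sketched as follows. The sketch is illustrative only: the encoder states and parameter matrices are random placeholders, the decoder dynamics are a stand-in, and the vocabulary and end-of-sequence index are assumptions.

```python
# Illustrative sketch (not the authors' code) of Equations (5)-(8) applied
# greedily at inference time, with random placeholders for a trained model.
import numpy as np

rng = np.random.default_rng(0)
n, vocab = 4, 6                          # toy hidden size and output vocabulary
H_enc = rng.normal(size=(3, n))          # top-layer encoder vectors h_1^L..h_|q|^L
W1, W2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Wo = rng.normal(size=(vocab, n))
EOS = 0                                  # assumed index of </s>

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(h_t):
    scores = softmax(H_enc @ h_t)        # Equation (5): dot-product scores
    c_t = scores @ H_enc                 # Equation (6): context vector
    return np.tanh(W1 @ h_t + W2 @ c_t)  # Equation (7): attended hidden state

def decoder_step(h_t):
    """Placeholder for the decoder LSTM; returns the next hidden state."""
    return np.tanh(h_t + 0.1)            # stand-in dynamics for the sketch

h_t, output = H_enc[-1], []              # initialize from the encoder
for _ in range(20):                      # greedy search, token by token
    h_t = decoder_step(h_t)
    probs = softmax(Wo @ attend(h_t))    # Equation (8)
    y_t = int(np.argmax(probs))          # greedy choice of y_t
    output.append(y_t)
    if y_t == EOS:
        break
print(output)
```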
Algorithm 1 describes the decoding process for SEQ2TREE. The time complexity of both decoders is O(|a|), where |a| is the length of the output. The extra computation of SEQ2TREE compared with SEQ2SEQ is to maintain the nonterminal queue, which can be ignored because most of the time is spent on matrix operations. We implement the hierarchical tree decoder in a batch mode, so that it can fully utilize GPUs. Specifically, as shown in Algorithm 1, every time we pop multiple nonterminals from the queue and decode these nonterminals in one batch.

3.6 Argument Identification

The majority of semantic parsing datasets have been developed with question-answering in mind. In the typical application setting, natural language questions are mapped into logical forms and executed on a knowledge base to obtain an answer. Due to the nature of the question-answering task, many natural language utterances contain entities or numbers that are often parsed as arguments in the logical form. Some of them are unavoidably rare or do not appear in the training set at all (this is especially true for small-scale datasets). Conventional sequence encoders simply replace rare words with a special unknown word symbol (Luong et al., 2015a; Jean et al., 2015), which would be detrimental for semantic parsing.

We have developed a simple procedure for argument identification. Specifically, we identify entities and numbers in input questions and replace them with their type names and unique IDs. For instance, we pre-process the training example "jobs with a salary of 40000" and its logical form "job(ANS), salary_greater_than(ANS, 40000, year)" as "jobs with a salary of num_0" and "job(ANS), salary_greater_than(ANS, num_0, year)". We use the pre-processed examples as training data. At inference time, we also mask entities and numbers with their types and IDs. Once we obtain the decoding result, a post-processing step recovers all the markers type_i to their corresponding logical constants.
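A minimal sketch of this masking and recovery step is shown below. It is illustrative only: the entity lexicon, type names, and matching rules are assumptions (the paper simply uses plain string matching, as noted in Section 4.2).

```python
# Illustrative sketch (not the authors' code): mask entities/numbers with
# typed placeholders before training/decoding, then restore them afterwards.
import re

def mask_arguments(question, logical_form, lexicon):
    """Replace known entity mentions and numbers with type_i markers."""
    mapping, counters = {}, {}
    # entities from a (hypothetical) lexicon of mention -> type
    for mention, typ in lexicon.items():
        if mention in question:
            marker = f"{typ}{counters.get(typ, 0)}"
            counters[typ] = counters.get(typ, 0) + 1
            mapping[marker] = mention
            question = question.replace(mention, marker)
            logical_form = logical_form.replace(mention, marker)
    # plain numbers
    for num in re.findall(r"\d+", question):
        marker = f"num{counters.get('num', 0)}"
        counters["num"] = counters.get("num", 0) + 1
        mapping[marker] = num
        question = question.replace(num, marker, 1)
        logical_form = logical_form.replace(num, marker, 1)
    return question, logical_form, mapping

def unmask(decoded, mapping):
    """Post-processing: recover markers to their logical constants."""
    for marker, value in mapping.items():
        decoded = decoded.replace(marker, value)
    return decoded

q, lf, m = mask_arguments("jobs with a salary of 40000",
                          "job(ANS),salary_greater_than(ANS,40000,year)", {})
print(q)              # jobs with a salary of num0
print(unmask(lf, m))  # job(ANS),salary_greater_than(ANS,40000,year)
```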
4 Experiments

We compare our method against multiple previous systems on four datasets. We describe these datasets below, and present our experimental settings and results. Finally, we conduct model analysis in order to understand what the model learns. The code is available at https://github.com/donglixp/lang2logic.

4.1 Datasets

Our model was trained on the following datasets, covering different domains and using different meaning representations. Examples for each domain are shown in Table 1.

Dataset | Length | Example
JOBS    | 9.80   | what microsoft jobs do not require a bscs?
        | 22.90  | answer(company(J,'microsoft'),job(J),not((req_deg(J,'bscs'))))
GEO     | 7.60   | what is the population of the state with the largest area?
        | 19.10  | (population:i (argmax $0 (state:t $0) (area:i $0)))
ATIS    | 11.10  | dallas to san francisco leaving after 4 in the afternoon please
        | 28.10  | (lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci) (to $0 san_francisco:ci)))
IFTTT   | 6.95   | Turn on heater when temperature drops below 58 degree
        | 21.80  | TRIGGER: Weather - Current_temperature_drops_below - ((Temperature (58)) (Degrees_in (f)))
        |        | ACTION: WeMo_Insight_Switch - Turn_on - ((Which_switch? ("")))

Table 1: Examples of natural language descriptions and their meaning representations from four datasets. The average length of input and output sequences is shown in the second column.

JOBS  This benchmark dataset contains 640 queries to a database of job listings. Specifically, questions are paired with Prolog-style queries. We used the same training-test split as Zettlemoyer and Collins (2005) which contains 500 training and 140 test instances. Values for the variables company, degree, language, platform, location, job area, and number are identified.

GEO  This is a standard semantic parsing benchmark which contains 880 queries to a database of U.S. geography. GEO has 880 instances split into a training set of 680 training examples and 200 test examples (Zettlemoyer and Collins, 2005). We used the same meaning representation based on lambda-calculus as Kwiatkowski et al. (2011). Values for the variables city, state, country, river, and number are identified.

ATIS  This dataset has 5,410 queries to a flight booking system. The standard split has 4,480 training instances, 480 development instances, and 450 test instances. Sentences are paired with lambda-calculus expressions. Values for the variables date, time, city, aircraft code, airport, airline, and number are identified.
IFTTT  Quirk et al. (2015) created this dataset by extracting a large number of if-this-then-that recipes from the IFTTT website¹. Recipes are simple programs with exactly one trigger and one action which users specify on the site. Whenever the conditions of the trigger are satisfied, the action is performed. Actions typically revolve around home security (e.g., "turn on my lights when I arrive home"), automation (e.g., "text me if the door opens"), well-being (e.g., "remind me to drink water if I've been at a bar for more than two hours"), and so on. Triggers and actions are selected from different channels (160 in total) representing various types of services, devices (e.g., Android), and knowledge sources (such as ESPN or Gmail). In the dataset, there are 552 trigger functions from 128 channels, and 229 action functions from 99 channels. We used Quirk et al.'s (2015) original split which contains 77,495 training, 5,171 development, and 4,294 test examples. The IFTTT programs are represented as abstract syntax trees and are paired with natural language descriptions provided by users (see Table 1). Here, numbers and URLs are identified.

¹ http://www.ifttt.com

Method | Accuracy
COCKTAIL (Tang and Mooney, 2001) | 79.4
PRECISE (Popescu et al., 2003) | 88.0
ZC05 (Zettlemoyer and Collins, 2005) | 79.3
DCS+L (Liang et al., 2013) | 90.7
TISP (Zhao and Huang, 2015) | 85.0
SEQ2SEQ | 87.1
  −attention | 77.9
  −argument | 70.7
SEQ2TREE | 90.0
  −attention | 83.6

Table 2: Evaluation results on JOBS.

4.2 Settings

Natural language sentences were lowercased; misspellings were corrected using a dictionary based on the Wikipedia list of common misspellings. Words were stemmed using NLTK (Bird et al., 2009). For IFTTT, we filtered tokens, channels and functions which appeared less than five times in the training set. For the other datasets, we filtered input words which did not occur at least two times in the training set, but kept all tokens in the logical forms. Plain string matching was employed to identify arguments as described in Section 3.6. More sophisticated approaches could be used, however we leave this to future work.

Model hyper-parameters were cross-validated on the training set for JOBS and GEO. We used the standard development sets for ATIS and IFTTT. We used the RMSProp algorithm (with batch size set to 20) to update the parameters. The smoothing constant of RMSProp was 0.95. Gradients were clipped at 5 to alleviate the exploding gradient problem (Pascanu et al., 2013). Parameters were randomly initialized from a uniform distribution U(−0.08, 0.08). A two-layer LSTM was used for IFTTT, while a one-layer LSTM was employed for the other domains. The dropout rate was selected from {0.2, 0.3, 0.4, 0.5}. Dimensions of hidden vector and word embedding were selected from {150, 200, 250}. Early stopping was used to determine the number of epochs. Input sentences were reversed before feeding into the encoder (Sutskever et al., 2014). We use greedy search to generate logical forms during inference. Notice that two decoders with shared word embeddings were used to predict triggers and actions for IFTTT, and two softmax classifiers are used to classify channels and functions.
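For convenience, the settings above can be summarized as a single configuration; the sketch below is illustrative only, and the field names are assumptions (the released code may organize them differently).

```python
# Illustrative summary of the reported training settings; field names are
# assumptions, and list-valued entries are per-dataset search spaces.
config = {
    "optimizer": "RMSProp",
    "batch_size": 20,
    "rmsprop_smoothing": 0.95,
    "gradient_clip": 5,
    "init_uniform": (-0.08, 0.08),
    "lstm_layers": {"IFTTT": 2, "JOBS": 1, "GEO": 1, "ATIS": 1},
    "dropout_rate_choices": [0.2, 0.3, 0.4, 0.5],
    "hidden_and_embedding_dim_choices": [150, 200, 250],
    "early_stopping": True,
    "reverse_input": True,      # Sutskever et al. (2014)
    "inference": "greedy",
}
print(config["lstm_layers"]["ATIS"])
```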
4.3 Results

We first discuss the performance of our model on JOBS, GEO, and ATIS, and then examine our results on IFTTT. Tables 2–4 present comparisons against a variety of systems previously described in the literature. We report results with the full models (SEQ2SEQ, SEQ2TREE) and two ablation variants, i.e., without an attention mechanism (−attention) and without argument identification (−argument). We report accuracy, which is defined as the proportion of the input sentences that are correctly parsed to their gold standard logical forms. Notice that DCS+L, KCAZ13 and GUSP output answers directly, so accuracy in this setting is defined as the percentage of correct answers.

Method | Accuracy
SCISSOR (Ge and Mooney, 2005) | 72.3
KRISP (Kate and Mooney, 2006) | 71.7
WASP (Wong and Mooney, 2006) | 74.8
λ-WASP (Wong and Mooney, 2007) | 86.6
LNLZ08 (Lu et al., 2008) | 81.8
ZC05 (Zettlemoyer and Collins, 2005) | 79.3
ZC07 (Zettlemoyer and Collins, 2007) | 86.1
UBL (Kwiatkowski et al., 2010) | 87.9
FUBL (Kwiatkowski et al., 2011) | 88.6
KCAZ13 (Kwiatkowski et al., 2013) | 89.0
DCS+L (Liang et al., 2013) | 87.9
TISP (Zhao and Huang, 2015) | 88.9
SEQ2SEQ | 84.6
  −attention | 72.9
  −argument | 68.6
SEQ2TREE | 87.1
  −attention | 76.8

Table 3: Evaluation results on GEO. 10-fold cross-validation is used for the systems shown in the top half of the table. The standard split of ZC05 is used for all other systems.

Method | Accuracy
ZC07 (Zettlemoyer and Collins, 2007) | 84.6
UBL (Kwiatkowski et al., 2010) | 71.4
FUBL (Kwiatkowski et al., 2011) | 82.8
GUSP-FULL (Poon, 2013) | 74.8
GUSP++ (Poon, 2013) | 83.5
TISP (Zhao and Huang, 2015) | 84.2
SEQ2SEQ | 84.2
  −attention | 75.7
  −argument | 72.3
SEQ2TREE | 84.6
  −attention | 77.5

Table 4: Evaluation results on ATIS.

(a) Omit non-English.
Method | Channel | +Func | F1
retrieval | 28.9 | 20.2 | 41.7
phrasal | 19.3 | 11.3 | 35.3
sync | 18.1 | 10.6 | 35.1
classifier | 48.8 | 35.2 | 48.4
posclass | 50.0 | 36.9 | 49.3
SEQ2SEQ | 54.3 | 39.2 | 50.1
  −attention | 54.0 | 37.9 | 49.8
  −argument | 53.9 | 38.6 | 49.7
SEQ2TREE | 55.2 | 40.1 | 50.4
  −attention | 54.3 | 38.2 | 50.0

(b) Omit non-English & unintelligible.
Method | Channel | +Func | F1
retrieval | 36.8 | 25.4 | 49.0
phrasal | 27.8 | 16.4 | 39.9
sync | 26.7 | 15.5 | 37.6
classifier | 64.8 | 47.2 | 56.5
posclass | 67.2 | 50.4 | 57.7
SEQ2SEQ | 68.8 | 50.5 | 60.3
  −attention | 68.7 | 48.9 | 59.5
  −argument | 68.8 | 50.4 | 59.7
SEQ2TREE | 69.6 | 51.4 | 60.4
  −attention | 68.7 | 49.5 | 60.2

(c) ≥3 turkers agree with gold.
Method | Channel | +Func | F1
retrieval | 43.3 | 32.3 | 56.2
phrasal | 37.2 | 23.5 | 45.5
sync | 36.5 | 24.1 | 42.8
classifier | 79.3 | 66.2 | 65.0
posclass | 81.4 | 71.0 | 66.5
SEQ2SEQ | 87.8 | 75.2 | 73.7
  −attention | 88.3 | 73.8 | 72.9
  −argument | 86.8 | 74.9 | 70.8
SEQ2TREE | 89.7 | 78.4 | 74.2
  −attention | 87.6 | 74.9 | 73.5

Table 5: Evaluation results on IFTTT.

Overall, SEQ2TREE is superior to SEQ2SEQ. This is to be expected since SEQ2TREE explicitly models compositional structure. On the JOBS and GEO datasets which contain logical forms with nested structures, SEQ2TREE outperforms SEQ2SEQ by 2.9% and 2.5%, respectively. SEQ2TREE achieves better accuracy over SEQ2SEQ on ATIS too; however, the difference is smaller, since ATIS is a simpler domain without complex nested structures. We find that adding attention substantially improves performance on all three datasets. This underlines the importance of utilizing soft alignments between inputs and outputs. We further analyze what the attention layer learns in Figure 6. Moreover, our results show that argument identification is critical for small-scale datasets. For example, about 92% of city names appear less than 4 times in the GEO training set, so it is difficult to learn reliable parameters for these words. In relation to previous work, the proposed models achieve comparable or better performance. Importantly, we use the same framework (SEQ2SEQ or SEQ2TREE) across datasets and meaning representations (Prolog-style logical forms in JOBS and lambda calculus in the other two datasets) without modification. Despite this relatively simple approach, we observe that SEQ2TREE ranks second on JOBS, and is tied for first place with ZC07 on ATIS.
[Figure 6: Alignments (same color rectangles) produced by the attention mechanism (darker color represents higher attention score). Input sentences are reversed and stemmed. Model output is shown for SEQ2SEQ (a, b) and SEQ2TREE (c, d). Panels: (a) which jobs pay num0 that do not require a degid0; (b) what's first class fare round trip from ci0 to ci1; (c) what is the earliest flight from ci0 to ci1 tomorrow; (d) what is the highest elevation in the co0.]
We illustrate examples of alignments produced by SEQ2SEQ in Figures 6a and 6b. Alignments produced by SEQ2TREE are shown in Figures 6c and 6d. Matrices of attention scores are computed using Equation (5) and are represented in grayscale. Aligned input words and logical form predicates are enclosed in (same color) rectangles whose overlapping areas contain the attention scores. Also notice that attention scores are computed by LSTM hidden vectors which encode context information rather than just the words in their current positions. The examples demonstrate that the attention mechanism can successfully model the correspondence between sentences and logical forms, capturing reordering (Figure 6b), many-to-many (Figure 6a), and many-to-one alignments (Figures 6c, d).

For IFTTT, we follow the same evaluation protocol introduced in Quirk et al. (2015). The dataset is extremely noisy and measuring accuracy is problematic since predicted abstract syntax trees (ASTs) almost never exactly match the gold standard. Quirk et al. view an AST as a set of productions and compute balanced F1 instead, which we also adopt. The first column in Table 5 shows the percentage of channels selected correctly for both triggers and actions. The second column measures accuracy for both channels and functions. The last column shows balanced F1 against the gold tree over all productions in the proposed derivation. We compare our model against posclass, the method introduced in Quirk et al., and several of their baselines. posclass is reminiscent of KRISP (Kate and Mooney, 2006); it learns distributions over productions given input sentences represented as a bag of linguistic features. The retrieval baseline finds the closest description in the training data based on character string-edit-distance and returns the recipe for that training program. The phrasal method uses phrase-based machine translation to generate the recipe, whereas sync extracts synchronous grammar rules from the data, essentially recreating WASP (Wong and Mooney, 2006). Finally, they use a binary classifier to predict whether a production should be present in the derivation tree corresponding to the description.

Quirk et al. (2015) report results on the full data and smaller subsets after noise filtering, e.g., when non-English and unintelligible descriptions are removed (Tables 5a and 5b). They also ran their system on a high-quality subset of description-program pairs which were found in the gold standard and at least three humans managed to independently reproduce (Table 5c). Across all subsets our models outperform posclass and related baselines. Again we observe that SEQ2TREE consistently outperforms SEQ2SEQ, albeit with a small margin.
Compared to the previous datasets, the attention mechanism and our argument identification method yield less of an improvement. This may be due to the size of the Quirk et al. (2015) dataset and the way it was created: user-curated descriptions are often of low quality, and thus align very loosely to their corresponding ASTs.

4.4 Error Analysis

Finally, we inspected the output of our model in order to identify the most common causes of errors, which we summarize below.

Under-Mapping  The attention model used in our experiments does not take the alignment history into consideration. So some question words, especially in longer questions, may be ignored in the decoding process. This is a common problem for encoder-decoder models and can be addressed by explicitly modelling the decoding coverage of the source words (Tu et al., 2016; Cohn et al., 2016). Keeping track of the attention history would help adjust future attention and guide the decoder towards untranslated source words.

Argument Identification  Some mentions are incorrectly identified as arguments. For example, the word may is sometimes identified as a month when it is simply a modal verb. Moreover, some argument mentions are ambiguous. For instance, 6 o'clock can be used to express either 6 am or 6 pm. We could disambiguate arguments based on contextual information. The execution results of logical forms could also help prune unreasonable arguments.

Rare Words  Because the data size of JOBS, GEO, and ATIS is relatively small, some question words are rare in the training set, which makes it hard to estimate reliable parameters for them. One solution would be to learn word embeddings on unannotated text data, and then use these as pre-trained vectors for question words.

5 Conclusions

In this paper we presented an encoder-decoder neural network model for mapping natural language descriptions to their meaning representations. We encode natural language utterances into vectors and generate their corresponding logical forms as sequences or trees using recurrent neural networks with long short-term memory units. Experimental results show that enhancing the model with a hierarchical tree decoder and an attention mechanism improves performance across the board. Extensive comparisons with previous methods show that our approach performs competitively, without recourse to domain- or representation-specific features. Directions for future work are many and varied. For example, it would be interesting to learn a model from question-answer pairs without access to target logical forms. Beyond semantic parsing, we would also like to apply our SEQ2TREE model to related structured prediction tasks such as constituency parsing.

Acknowledgments  We would like to thank Luke Zettlemoyer and Tom Kwiatkowski for sharing the ATIS dataset. The support of the European Research Council under award number 681760 "Translating Multiple Modalities into Text" is gratefully acknowledged.

References

[Andreas et al. 2013] Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st ACL, pages 47–52, Sofia, Bulgaria.

[Artzi and Zettlemoyer 2011] Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 EMNLP, pages 421–432, Edinburgh, United Kingdom.

[Artzi and Zettlemoyer 2013] Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1(1):49–62.

[Bahdanau et al. 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR, San Diego, California.

[Bird et al. 2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

[Cai and Yates 2013] Qingqing Cai and Alexander Yates. 2013. Semantic parsing freebase: Towards open-domain semantic parsing. In 2nd Joint Conference on Lexical and Computational Semantics, pages 328–338, Atlanta, Georgia.

[Chen and Mooney 2011] David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 15th AAAI, pages 859–865, San Francisco, California.

[Cho et al. 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 EMNLP, pages 1724–1734, Doha, Qatar.
[Clarke et al. 2010] James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of CoNLL, pages 18–27, Uppsala, Sweden.

[Cohn et al. 2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of the 2016 NAACL, San Diego, California.

[Ge and Mooney 2005] Ruifang Ge and Raymond J. Mooney. 2005. A statistical semantic parser that integrates syntax and semantics. In Proceedings of CoNLL, pages 9–16, Ann Arbor, Michigan.

[Goldwasser and Roth 2011] Dan Goldwasser and Dan Roth. 2011. Learning from natural instructions. In Proceedings of the 22nd IJCAI, pages 1794–1800, Barcelona, Spain.

[Grefenstette et al. 2014] Edward Grefenstette, Phil Blunsom, Nando de Freitas, and Karl Moritz Hermann. 2014. A deep architecture for semantic parsing. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, Atlanta, Georgia.

[He and Young 2006] Yulan He and Steve Young. 2006. Semantic processing using the hidden vector state model. Speech Communication, 48(3-4):262–275.

[Hermann et al. 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1684–1692. Curran Associates, Inc.

[Jean et al. 2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd ACL and 7th IJCNLP, pages 1–10, Beijing, China.

[Kalchbrenner and Blunsom 2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 EMNLP, pages 1700–1709, Seattle, Washington.

[Karpathy and Fei-Fei 2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, pages 3128–3137, Boston, Massachusetts.

[Kate and Mooney 2006] Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st COLING and 44th ACL, pages 913–920, Sydney, Australia.

[Kate et al. 2005] Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the 20th AAAI, pages 1062–1068, Pittsburgh, Pennsylvania.

[Krishnamurthy and Mitchell 2012] Jayant Krishnamurthy and Tom Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of the 2012 EMNLP, pages 754–765, Jeju Island, Korea.

[Kwiatkowski et al. 2010] Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 EMNLP, pages 1223–1233, Cambridge, Massachusetts.

[Kwiatkowski et al. 2011] Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the 2011 EMNLP, pages 1512–1523, Edinburgh, United Kingdom.

[Kwiatkowski et al. 2013] Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 EMNLP, pages 1545–1556, Seattle, Washington.

[Liang et al. 2013] Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

[Lu et al. 2008] Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proceedings of the 2008 EMNLP, pages 783–792, Honolulu, Hawaii.

[Luong et al. 2015a] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015a. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd ACL and 7th IJCNLP, pages 11–19, Beijing, China.

[Luong et al. 2015b] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 EMNLP, pages 1412–1421, Lisbon, Portugal.

[Mei et al. 2016] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of the 30th AAAI, Phoenix, Arizona. To appear.

[Miller et al. 1996] Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL, pages 55–61.