Table Of Content

Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network KimAnhNguyen and SabineSchulteimWalde and NgocThangVu Institutfu¨rMaschinelleSprachverarbeitung Universita¨tStuttgart Pfaffenwaldring5B,70569Stuttgart,Germany {nguyenkh,schulte,thangvu}@ims.uni-stuttgart.de Abstract in NLP. Both make use of distributional vector representations, relying on the distributional hy- Distinguishing between antonyms and pothesis (Harris, 1954; Firth, 1957), that words synonyms is a key task to achieve high 7 with similar distributions have related meanings: 1 performance in NLP systems. While co-occurrence models and pattern-based models. 0 theyarenotoriouslydifficulttodistinguish These distributional semantic models (DSMs) of- 2 by distributional co-occurrence models, ferameanstorepresentmeaningvectorsofwords n pattern-based methods have proven effec- a or word pairs, and to determine their semantic re- J tive to differentiate between the relations. latedness(TurneyandPantel,2010). 1 In this paper, we present a novel neu- In co-occurrence models, each word is repre- 1 ral network model AntSynNET that ex- sented by a weighted feature vector, where fea- ] ploits lexico-syntactic patterns from syn- tures typically correspond to words that co-occur L tactic parse trees. In addition to the lexi- in particular contexts. When using word embed- C calandsyntacticinformation,wesuccess- dings,thesemodelsrelyonneuralmethodstorep- . s fullyintegratethedistancebetweenthere- resent words as low-dimensional vectors. To cre- c [ lated words along the syntactic path as a ate the word embeddings, the models either make newpatternfeature. Theresultsfromclas- 1 use of neural-based techniques, such as the skip- v sification experiments show that AntSyn- gram model (Mikolov et al., 2013), or use matrix 2 NETimprovestheperformanceoverprior factorization (Pennington et al., 2014) that builds 6 pattern-basedmethods. 9 wordembeddingsbyfactorizingword-contextco- 2 occurrence matrices. In comparison to standard 0 1 Introduction co-occurrence vector representations, word em- . 1 Antonymy and synonymy represent lexical se- beddingsaddresstheproblematicsparsityofword 0 7 manticrelationsthatarecentraltotheorganization vectors and have achieved impressive results in 1 ofthementallexicon(MillerandFellbaum,1991). many NLP tasks such as word similarity (e.g., : v Whileantonymyisdefinedastheoppositenessbe- Pennington et al. (2014)), relation classification i X tween words, synonymy refers to words that are (e.g.,Vuetal.(2016)),andantonym-synonymdis- r similar in meaning (Deese, 1965; Lyons, 1977). tinction(e.g.,Nguyenetal.(2016)). a From a computational point of view, distinguish- Inpattern-basedmodels,vectorrepresentations ing between antonymy and synonymy is impor- make use of lexico-syntactic surface patterns to tantforNLPapplicationssuchasMachineTrans- distinguish between the relations of word pairs. lationandTextualEntailment,whichgobeyonda For example, Justeson and Katz (1991) suggested generalnotionofsemanticrelatednessandrequire that adjectival opposites co-occur with each other to identify specific semantic relations. However, in specific linear sequences, such as between duetointerchangeablesubstitution,antonymsand X and Y.Hearst(1992)determinedsurfacepat- synonyms often occur in similar contexts, which terns, e.g., X such as Y, to identify nomi- makes it challenging to automatically distinguish nal hypernyms. Lin et al. (2003) proposed two betweenthem. textual patterns indicating semantic incompatibil- Two families of approaches to differentiate be- ity, from X to Y and either X or Y, to tween antonyms and synonyms are predominent distinguish opposites from semantically similar words. Roth and Schulte im Walde (2014) pro- short-term memory units which is used to en- posed a method that combined patterns with dis- code patterns within a vector representation (Sec- course markers for classifying paradigmatic relation 3.2), and describe two models to classify tions including antonymy, synonymy, and hyper- antonyms and synonyms: the pure pattern-based nymy. Recently, Schwartz et al. (2015) used two model (Section 3.3.1) and the combined model prominent patterns from Lin et al. (2003) to learn (Section3.3.2). Afterintroducingtwobaselinesin word embeddings that distinguished antonyms Section 4, we describe our dataset, experimental fromsimilarwordsindeterminingdegreesofsim- settings, results of our methods, the effects of the ilarityandwordanalogy. newlyproposeddistancefeature,andtheeffectsof In this paper, we present a novel pattern- the various types of word embeddings. Section 6 based neural method AntSynNET to distinguish concludesthepaper. antonyms from synonyms. We hypothesize that 2 RelatedWork antonymous word pairs co-occur with each other inlexico-syntacticpatternswithinasentencemore Pattern-based methods: Regarding the task of often than would be expected by synonymous antonym-synonym distinction, there exist a vari- pairs. Thishypothesisisinspiredbycorpus-based ety of approaches which rely on patterns. Lin studies on antonymy and synonymy. Among oth- et al. (2003) used bilingual dependency triples ers, Charles and Miller (1989) suggested that ad- and patterns to extract distributionally similar jectival opposites co-occur in patterns; Fellbaum words. Theyreliedonclearantonympatternssuch (1995) stated that nominal and verbal opposites as from X to Y and either X or Y in a co-occur in the same sentence significantly more post-processingsteptodistinguishantonymsfrom often than chance; Lin et al. (2003) argued that if synonyms. The main idea is that if two words X two words appear in clear antonym patterns, they and Y appear in one of these patterns, they are areunlikelytorepresentsynonymouspair. unlikely to represent synonymous pair. Schulte WestartoutbyinducingpatternsbetweenXand im Walde and Ko¨per (2013) proposed a method Y from a large-scale web corpus, where X and Y to distinguish between the paradigmatic relations represent two words of an antonym or synonym antonymy, synonymy and hypernymy in German, wordpair,andthepatternisderivedfromthesim- based on automatically acquired word patterns. ple paths between X and Y in a syntactic parse RothandSchulteimWalde(2014)combinedgen- tree. Eachnodeinthesimplepathcombineslexi- erallexico-syntacticpatternswithdiscoursemark- calandsyntacticinformation;inaddition,wesug- ers as indicators for the same relations, both for gest a novel feature for the patterns, i.e., the dis- German and for English. They assumed that if tance between the two words along the syntactic two phrases frequently co-occur with a specific path. All pattern features are fed into a recur- discourse marker, then the discourse relation ex- rent neural network with long short-term mem- pressed by the corresponding marker should also ory (LSTM) units (Hochreiter and Schmidhuber, indicate the relation between the words in the af- 1997), which encode the patterns as vector repre- fected phrases. By using the raw corpus and a sentations. Afterwards, the vector representations fixedlistofdiscoursemarkers,themodelcaneas- of the patterns are used in a classifier to distin- ily be extended to other languages. More re- guish between antonyms and synonyms. The recently,Schwartzetal.(2015)presentedasymmet- sultsfromexperimentsshowthatAntSynNETim- ric pattern-based model for word vector represen- proves the performance over prior pattern-based tation in which antonyms are assigned to dissim- methods. Furthermore,theimplementationofour ilar vector representations. Differently to the pre- modelsismadepubliclyavailable1. viouspattern-basedmethodswhichusedthestan- Theremainderofthispaperisorganizedasfol- dard distribution of patterns, Schwartz et al. used lows: InSection2, wepresentpreviousworkdis- patternstolearnwordembeddings. tinguishing antonyms and synonyms. Section 3 Vector representation methods: Yih et al. describes our proposed AntSynNET model. We (2012) introduced a new vector representation presenttheinductionofthepatterns(Section3.1), whereantonymslieonoppositesidesofasphere. describe the recurrent neural network with long They derived this representation with the incor- 1https://github.com/nguyenkh/AntSynNET poration of a thesaurus and latent semantic analysis, by assigning signs to the entries in the co- 3.1 InductionofPatterns occurrence matrix on which latent semantic anal- Corpus-based studies on antonymy have sug- ysis operates, such that synonyms would tend to gested that opposites co-occur with each other have positive cosine similarities, and antonyms within a sentence significantly more often than would tend to have negative cosine similarities. would be expected by chance. Our method thus Scheible et al. (2013) showed that the distribu- makes use of patterns as the main indicators of tionaldifferencebetweenantonymsandsynonyms word pair co-occurrence, to enforce a distinction canbeidentifiedviaasimplewordspacemodelby betweenantonymsandsynonyms. Figure1shows using appropriate features. Instead of taking into a syntactic parse tree of the sentence “My old account all words in a window of a certain size village has been provided with the new services”. for feature extraction, the authors experimented Following the characterizations of a tree in graph with only words of a certain part-of-speech, and theory, any two nodes (vertices) of a tree are restricted distributions. Santus et al. (2014) pro- connected by a simple path (or one unique path). posed a different method to distinguish antonyms The simple path is the shortest path between any from synonyms by identifying the most salient two nodes in a tree and does not contain repeated dimensions of meaning in vector representations nodes. In the example, the lexico-syntactic tree and reporting a new average-precision-based dis- pattern of the antonymous pair old–new is deter- tributional measure and an entropy-based mea- mined by finding the simple path (in red) from sure. Ono et al. (2015) trained supervised word the lemma old to the lemma new. It focuses embeddings for the task of identifying antonymy. on the most relevant information and ignores They proposed two models to learn word embed- irrelevant information which does not appear in dings: thefirstmodelreliedonthesaurusinforma- the simple path (i.e., has, been). The example tion; the second model made use of distributional patternbetweenX = oldandY = newinFig- information and thesaurus information. More re- ure 1 is represented as follows: X/JJ/amod/2 -- cently, Nguyen et al. (2016) proposed two meth- village/NN/nsubj/1 -- provide/VBN/ROOT/0 ods to distinguish antonyms from synonyms: in -- with/IN/prep/1 -- service/NNS/pobj/2 the first method, the authors improved the qual- -- Y/JJ/amod/3. ity of weighted feature vectors by strengthening those features that are most salient in the vec- Node Representation: The path patterns make tors, and by putting less emphasis on those that use of four features to represent each node in the are of minor importance when distinguishing de- syntaxtree: lemma,part-of-speech(POS)tag,de- grees of similarity between words. In the second pendencylabelanddistancelabel. Thelemmafea- method,thelexicalcontrastinformationwasinte- ture captures the lexical information of words in grated into the skip-gram model (Mikolov et al., the sentence, while the POS and dependency fea- 2013)tolearnwordembeddings. Thismodelsuc- turescapturethemorpho-syntacticinformationof cessfullypredicteddegreesofsimilarityandiden- thesentence. Thedistancelabelmeasuresthepath tifiedantonymsandsynonyms. distancebetweenthetargetwordnodesinthesyn- tactictree. Eachstepbetweenaparentandachild node represents a distance of 1; and the ancestor 3 AntSynNET:LSTM-based nodes of the remaining nodes in the path are rep- Antonym-SynonymDistinction resentedbyadistanceof0. Forexample,thenode providedisanancestornodeofthesimplepath In this section, we describe the AntSynNET from old to new. The distances from the node model, using a pattern-based LSTM for distin- provided to the nodes village and old are guishing antonyms from synonyms. We first 1and2,respectively. presenttheinductionofpatternsfromaparsedcor- Thevectorrepresentationofeachnodeconcate- pus (Section 3.1). Section 3.2 then describes how natesthefour-featurevectorsasfollows: we utilize the recurrent neural network with long short-term memory units to encode the patterns (cid:126)v = [(cid:126)v ⊕(cid:126)v ⊕(cid:126)v ⊕(cid:126)v ] node lemma pos dep dist as vector representation. Finally, we present the AntSynNETmodelandtwoapproachestoclassify where (cid:126)v ,(cid:126)v ,(cid:126)v ,(cid:126)v represent the em- lemma pos dep dist antonymsandsynonyms(Section3.3). beddingsofthelemma,POStag,dependencylabel Root ROOT provided/VBN nsubj prep 1 1 village/NN has/VBZ been/VBN with/IN amod pobj 2 2 My/PRP$ old/JJ services/NNS a m o d 3 new/JJ the/DT Figure 1: Illustration of the syntactic tree for the sentence “My old village has been provided with the newservices”. Redlinesindicatethepathfromthewordoldtothewordnew. anddistancelabel,respectively;andthe⊕denotes ically, an LSTM comprises four components: an theconcatenationoperation. input gate i , a forget gate f , an output gate o , t t t and a memory cell c . The state of an LSTM at t Pattern Representation: For a pattern p eachtimesteptisformalizedasfollows: which is constructed by the sequence of nodes n1,n2,...,nk, the pattern representation of p is it = σ(Wi·xt+Ui·ht−1+bi) a sequence of vectors: p = [(cid:126)n1,(cid:126)n2,...,(cid:126)nk]. The ft = σ(Wf ·xt+Uf ·ht−1+bf) pattern vector (cid:126)v is then encoded by applying a ot = σ(Wo·xt+Uo·ht−1+bo) p recurrentneuralnetwork. gt = tanh(Wc·xt+Uc·ht−1+bc) c = i ⊗g +f ⊗c t t t t t−1 3.2 RecurrentNeuralNetworkwithLong W refers to a matrix of weights that projects in- Short-TermMemoryUnits formationbetweentwolayers;bisalayer-specific A recurrent neural network (RNN) is suitable for vector of bias terms; σ denotes the sigmoid func- modeling sequential data by a vector representation. The output of an LSTM at a time step t is tion. In our methods, we use a long short-term computedasfollows: memory (LSTM) network, a variant of a recur- h = o ⊗tanh(c ) rent neural network to encode patterns, for the t t t following reasons. Given a sequence of words where ⊗ denotes element-wise multiplication. In p = [n1,n2,...,nk] as input data, an RNN pro- ourmethods,werelyonthelaststatehk torepre- cesses each word nt at a time, and returns a vec- sentthevector(cid:126)vp ofapatternp = [(cid:126)n1,(cid:126)n2,...,(cid:126)nk]. tor of state h for the complete input sequence. k 3.3 TheProposedAntSynNETModel For each time step t, the RNN updates an inter- nalmemorystateh whichdependsonthecurrent In this section, we present two models to distin- t inputntandthepreviousstateht−1. Yet,ifthese- guish antonyms from synonyms. The first model quentialinputisalong-termdependency,anRNN makes use of patterns to classify antonyms and facestheproblemofgradientvanishingorexplod- synonyms, by using an LSTM to encode pat- ing,leadingtodifficultiesintrainingthemodel. terns as vector representations and then feeding LSTM units address these problems. The un- those vectors to a logistic regression layer (Sec- derlyingideaofanLSTMistouseanadaptivegat- tion 3.3.1). The second model creates combined ingmechanismtodecideonthedegreethatLSTM vector representations of word pairs, which con- unitskeepthepreviousstateandmemorizetheex- catenate the vectors of the words and the patterns tracted features of the current input. More specif- (Section3.3.2). ~v Logistic Regression lemma ~v pos ~v dep ~v Mean Pooling dist ~v ~v p p LSTM LSTM LSTM LSTM LSTM LSTM X/ADJ/amod/0 from/ADP/prep/1 Y/ADJ/pobj/2 Y/ADJ/amod/1 world/NOUN/pobj/0 X/ADJ/conj/1 Figure 2: Illustration of the AntSynNET model. Each word pair is represented by several patterns, and eachpatternrepresentsapathinthegraphofthesyntactictree. Patternsconsistofseveralnodeswhere each node is represented by a vector with four features: lemma, POS, dependency label, and distance label. The mean pooling of the pattern vectors is the vector representation of each word pair, which is thenfedtothelogisticregressionlayertoclassifyantonymsandsynonyms. 3.3.1 Pattern-basedAntSynNET take into account the patterns and distribution of In this model, we make use of a recurrent neural target pairs to create their combined vector rep- networkwithLSTMunitstoencodepatternscon- resentations. Given a word pair (x,y), the com- taining a sequence of nodes. Figure 2 illustrates binedvectorrepresentationofthepair(x,y)isde- theAntSynNETmodel. Givenawordpair(x,y), terminedbyusingboththeco-occurrencedistribu- weinducepatternsfor(x,y)fromacorpus,where tionofthewordsandthesyntacticpathpatterns: eachpatternrepresentsapathfromxtoy(cf. Sec- (cid:126)v = [(cid:126)v ⊕(cid:126)v ⊕(cid:126)v ] (2) tion3.1). Wethenfeedeachpatternpoftheword comb(x,y) x xy y pair (x,y) into an LSTM to obtain(cid:126)v , the vector p (cid:126)v refers to the combined vector of the representation of the pattern p (cf. Section 3.2). comb(x,y) wordpair(x,y);(cid:126)v and(cid:126)v arethevectorsofword For each word pair (x,y), the vector representa- x y xandwordy,respectively;(cid:126)v isthevectorofthe tionof(x,y)iscomputedasfollows: xy patternthatcorrespondstothepair(x,y),cf. Sec- (cid:80) (cid:126)v ·c tion3.3.1. Similartothepattern-basedmodel,the p∈P(x,y) p p (cid:126)v = (1) xy (cid:80)p∈P(x,y)cp combined vector (cid:126)vcomb(x,y) is fed into the logistic regression layer to classify antonyms and syn- (cid:126)vxy refers to the vector of the word pair (x,y); onyms. P(x,y)isthesetofpatternscorrespondingtothe pair (x,y); c is the frequency of the pattern p. 4 BaselineModels p The vector (cid:126)v is then fed into a logistic regres- xy TocompareAntSynNETwithbaselinemodelsfor sion layer whose target is the class label associ- pattern-based classification of antonyms and syn- atedwiththepair(x,y). Finally,thepair(x,y)is onyms, we introduce two pattern-based baseline predicted as positive (i.e., antonymous) word pair methods: the distributional method (Section 4.1), iftheprobabilityofthepredictionfor(cid:126)v islarger xy andthedistributedmethod(Section4.2). than0.5. 3.3.2 CombinedAntSynNET 4.1 DistributionalBaseline Inspired by the supervised distributional concate- Asafirstbaseline,weapplytheapproachbyRoth nation method in Baroni et al. (2012) and the in- andSchulteimWalde(2014),henceforthR&SiW. tegrated path-based and distributional method for Theyusedavectorspacemodeltorepresentpairs hypernymy detection in Shwartz et al. (2016), we of words by a combination of standard lexico- syntactic patterns and discourse markers. In ad- 5 Experiments ditiontothepatterns,thediscoursemarkersadded 5.1 Dataset information to express discourse relations, which inturnmayindicatethespecificsemanticrelation Fortrainingthemodels,neuralnetworksrequirea between the two words in a word pair. For ex- largeamountoftrainingdata. Weusetheexisting ample,contrastrelationsmightindicateantonymy, large-scaleantonymandsynonympairspreviously whereas elaborations may indicate synonymy or usedbyNguyenetal.(2016). Originally,thedata hyponymy. pairswerecollectedfromWordNet(Miller,1995) andWordnik4. MichaelRoth,thefirstauthorofR&SiW,kindly Inordertoinducepatternsforthewordpairsin computed the relation classification results of the the dataset, we identify the sentences in the cor- pattern–discourse model for our test sets. The pus that contain the word pair. Thereafter, we ex- weights between marker-based and pattern-based tractallpatternsforthewordpair. Wefilteroutall modelsweretunedonthevalidationsets;otherhy- patterns which occur less than five times; and we perparametersweresetexactlyasdescribedbythe only take into account word pairs that contain at R&SiWmethod. leastfivepatternsfortraining,validatingandtest- ing. For the proportion of positive and negative pairs,wekeeparatioof1:1positive(antonym)to 4.2 DistributedBaseline negative (synonym) pairs in the dataset. In order to create the sets of training, testing and valida- TheSPmethodproposedbySchwartzetal.(2015) tion data, we perform random splitting with 70% uses symmetric patterns for generating word em- train, 25% test, and 5% validation sets. The final beddings. Inthiswork,theauthorsappliedanun- datasetcontainsthenumberofwordpairsaccord- supervised algorithm for the automatic extraction ing to word classes described in Table 1. More- of symmetric patterns from plain text. The sym- over, Table 2 shows the average number of pat- metric patterns were defined as a sequence of 3-5 ternsforeachwordpairinourdataset. tokens consisting of exactly two wildcards and 1- 3words. Thepatternswerefilteredbasedontheir WordClass Train Test Validation Total frequencies,suchthattheresultingpatternsetcon- Adjective 5562 1986 398 7946 tained 11 patterns. For generating word embed- Verb 2534 908 182 3624 dings, a matrix of co-occurrence counts between Noun 2836 1020 206 4062 patterns and words in the vocabulary was com- Table1: Ourdataset. puted, using positive point-wise mutual information. The sparsity problem of vector representations was addressed by smoothing. For antonym WordClass Train Test Validation representation, the authors relied on two patterns Adjective 135 131 141 suggested by Lin et al. (2003) to construct word Verb 364 332 396 embeddingscontaininganantonymparameterthat Noun 110 132 105 canbeturnedoninordertorepresentantonymsas dissimilar, and that can be turned off to represent Table2: Averagenumberofpatternsperwordpair antonymsassimilar. acrosswordclasses. To apply the SP method to our data, we make use of the pre-trained SP embeddings2 with 500 5.2 ExperimentalSettings dimensions3. We calculate the cosine similarity We use the English Wikipedia dump5 from June of word pairs and then use a Support Vector Ma- 2016 as the corpus resource for our methods chinewithRadialBasisFunctionkerneltoclassify and baselines. For parsing the corpus, we antonymsandsynonyms. rely on spaCy6. For the lemma embeddings, we rely on the word embeddings of the dLCE 2http://homes.cs.washington.edu/˜roysch/papers/ 4http://www.wordnik.com sp_embeddings/sp_embeddings.html 5https://dumps.wikimedia.org/enwiki/latest/ 3The 500-dimensional embeddings outperformed the enwiki-latest-pages-articles.xml.bz2 300-dimensionalembeddingsforourdata. 6https://spacy.io Adjective Verb Noun Model P R F P R F P R F 1 1 1 SPbaseline 0.730 0.706 0.718 0.560 0.609 0.584 0.625 0.393 0.482 R&SiWbaseline 0.717 0.717 0.717 0.789 0.787 0.788 0.833 0.831 0.832 Pattern-basedAntSynNET 0.764 0.788 0.776∗ 0.741 0.833 0.784 0.804 0.851 0.827 CombinedAntSynNET 0.763 0.807 0.784∗ 0.743 0.815 0.777 0.816 0.898 0.855∗∗ Table3: PerformanceoftheAntSynNETmodelsincomparisontothebaselinemodels. Adjective Verb Noun Feature Model P R F P R F P R F 1 1 1 Pattern-based 0.752 0.755 0.753 0.734 0.819 0.774 0.800 0.825 0.813 Direction Combined 0.754 0.784 0.769 0.739 0.793 0.765 0.829 0.810 0.819 Pattern-based 0.764 0.788 0.776 0.741 0.833 0.784 0.804 0.851 0.827 Distance Combined 0.763 0.807 0.784∗∗ 0.743 0.815 0.777 0.816 0.898 0.855∗∗ Table4: ComparingthenoveldistancefeaturewithSchwarzetal.’sdirectionfeature,acrosswordclasses. model7 (Nguyen et al., 2016) which is the state- R&SiW baseline, but we achieve a much better of-the-art vector representation for distinguishing performancein comparison tothe SPbaseline, an antonyms from synonyms. We re-implemented increase of .37 F . Regarding verbs, we do not 1 thiscutting-edgemodelonWikipediawith100di- outperform the more advanced R&SiW baseline mensions, and then make use of the dLCE word in terms of the F score, but we obtain higher re- 1 embeddings for initialization the lemma embed- call scores. In comparison to the SP baseline, our dings. The embeddings of POS tags, dependency modelsstillshowaclearF improvement. 1 labels,distancelabels,andout-of-vocabularylem- Overall, our proposed models achieve compar- mas are initialized randomly. The number of atively high recall scores compared to the two dimensions is set to 10 for the embeddings of baselines. This strengthens our hypothesis that POS tags, dependency labels and distance labels. there is a higher possibility for the co-occurrence We use the validation sets to tune the number of ofantonymouspairsinpatternsoversynonymous dimensions for these labels. For optimization, pairs within a sentence. Because, when the pro- we rely on the cross-entropy loss function and posed models obtain high recall scores, the mod- StochasticGradientDescentwiththeAdadeltaup- els are able to retrieve most relevant information date rule (Zeiler, 2012). For training, we use the (antonymous pairs) corresponding to the patterns. Theanoframework(TheanoDevelopmentTeam, Regarding the low precision in the two proposed 2016). Regularization is applied by a dropout of models,wesampledrandomly5pairsineachpop- 0.5 on each of component’s embeddings (dropout ulation: truepositive,truenegative,falsepositive, rate is tuned on the validation set). We train the false negative. We then compared the overlap of modelswith40epochsandupdateallembeddings patternsforthetruepredictions(truepositivepairs duringtraining. and true negative pairs) and the false predictions (false positive pairs and false negative pairs). We 5.3 OverallResults foundoutthatthereisnooverlapbetweenpatterns Table3showsthesignificant8 performanceofour of true predictions; and the number overlap be- models in comparison to the baselines. Concern- tween patterns of false predictions is 2, 2, and 4 ing adjectives, the two proposed models signif- patterns for noun, adjective, and verb classes, re- icantly outperform the two baselines: The per- spectively. This shows that the low precision of formance of the baselines is around .72 for F , our models stems from the patterns which repre- 1 and the corresponding results for the combined sentbothantonymousandsynonymouspairs. AntSynNET model achieve an improvement of >.06. Regarding nouns, the improvement of the 5.4 EffectoftheDistanceFeature new methods is just .02 F in comparison to the 1 In our models, the novel distance feature is suc- cessfully integrated along the syntactic path to 7https://github.com/nguyenkh/AntSynDistinction 8t-test,*p<0.05,**p<0.1 represent lexico-syntactic patterns. The intu- Adjective Verb Noun Model WordEmbeddings P R F P R F P R F 1 1 1 GloVe 0.763 0.770 0.767 0.705 0.852 0.772 0.789 0.849 0.818 Pattern-basedModel dLCE 0.764 0.788 0.776 0.741 0.833 0.784 0.804 0.851 0.827 Glove 0.750 0.798 0.773 0.717 0.826 0.768 0.807 0.827 0.817 CombinedModel dLCE 0.763 0.807 0.784 0.743 0.815 0.777 0.816 0.898 0.855 Table5: Comparingpre-trainedGloVeanddLCEwordembeddings. ition behind the distance feature exploits prop- pre-trained GloVe word embeddings, by around erties of trees in graph theory, which show that .01 F for the pattern-based AntSynNET model 1 there exist differences in the degree of relation- and the combined AntSynNET model regarding ship between the parent node and the child nodes adjective and verb pairs. Regarding noun pairs, (distance = 1) and in the degree of relation- the improvements of the dLCE word embeddings shipbetweentheancestornodeandthedescendant over pre-trained GloVe word embeddings achieve nodes(distance > 1). Hence,weusethedistance around.01and.04F forthepattern-basedmodel 1 featuretoeffectivelycapturetheserelationships. andthecombinedmodel,respectively. In order to evaluate the effect of our novel distance feature, we compare the distance feature to the direction feature proposed by Shwartz et al. 6 Conclusion (2016). In their approach, the authors combined lemma, POS, dependency, and direction features for the task of hypernym detection. The direc- In this paper, we presented a novel pattern- tionfeaturerepresentedthedirectionofthedepen- based neural method AntSynNET to distinguish dencylabelbetweentwonodesinapathfromXto antonyms from synonyms. We hypothesized that Y. antonymous word pairs co-occur with each other For evaluation, we make use of the same infor- inlexico-syntacticpatternswithinasentencemore mation regarding dataset and patterns as in Sec- oftenthansynonymouswordpairs. tion 5.3, and then replace the distance feature by Thepatternswerederivedfromthesimplepaths thedirectionfeature. TheresultsareshowninTa- between semantically related words in a syntac- ble 4. The distance feature enhances the perfor- tic parse tree. In addition to lexical and syntactic mance of our proposed models more effectively information, we suggested a novel path distance than the direction feature does, across all word feature. The AntSynNET model consists of two classes. approaches to classify antonyms and synonyms. In the first approach, we used a recurrent neural 5.5 EffectofWordEmbeddings networkwithlongshort-termmemoryunitstoen- Our methods rely on the word embeddings of the code the patterns as vector representations; in the dLCE model, state-of-the-art word embeddings second approach, we made use of the distribution for antonym-synonym distinction. Yet, the word and encoded patterns of the target pairs to gener- embeddings of the dLCE model, i.e., supervised ate combined vector representations. The result- wordembeddings,representinformationcollected ingvectorsofpatternsinbothapproacheswerefed fromlexicalresources. Inordertoevaluatetheef- intothelogisticregressionlayerforclassification. fectofthesewordembeddingsontheperformance Our proposed models significantly outper- ofourmodels,wereplacethembythepre-trained formed two baselines relying on previous work, GloVe word embeddings9 with 100 dimensions, mainly in terms of recall. Moreover, we demon- and compare the effects of the GloVe word em- strated that the distance feature outperformed a beddings and the dLCE word embeddings on the previously suggested direction feature, and that performanceofthetwoproposedmodels. our embeddings outperformed the state-of-the-art Table 5 illustrates the performance of our two GloVe embeddings. Last but not least, our two models on all word classes. The table shows that proposed models only rely on corpus data, such the dLCE word embeddings are better than the that the models are easily applicable to other lan- 9http://www-nlp.stanford.edu/projects/glove/ guagesandrelations. Acknowledgements [Lyons1977] JohnLyons. 1977. Semantics, volume1. CambridgeUniversityPress. We would like to thank Michael Roth for helping ustocomputetheresultsoftheR&SiWmodelon [Mikolovetal.2013] TomasMikolov,Wen-tauYih,and Geoffrey Zweig. 2013. Linguistic regularities in ourdataset. continuous space word representations. In Pro- The research was supported by the Ministry of ceedings of the Conference of the North American Education and Training of the Socialist Republic Chapter of the Association for Computational Lin- of Vietnam (Scholarship 977/QD-BGDDT; Kim- guistics: HumanLanguageTechnologies(NAACL), pages746–751,Atlanta,Georgia. Anh Nguyen), the DFG Collaborative Research Centre SFB 732 (Kim-Anh Nguyen, Ngoc Thang [MillerandFellbaum1991] George A. Miller and Vu), and the DFG Heisenberg Fellowship SCHU- Christiane Fellbaum. 1991. Semantic networks of 2580/1(SabineSchulteimWalde). English. Cognition,41:197–229. [Miller1995] GeorgeA.Miller. 1995. WordNet:Alex- ical database for English. Communications of the References ACM,38(11):39–41. [Baronietal.2012] Marco Baroni, Raffaella Bernardi, Ngoc-QuynhDo,andChungchiehShan. 2012. En- [Nguyenetal.2016] Kim Anh Nguyen, Sabine tailment above the word level in distributional se- Schulte im Walde, and Ngoc Thang Vu. 2016. mantics. In Proceedings of the 13th Conference of Integrating distributional lexical contrast into word theEuropeanChapteroftheAssociationforCompu- embeddings for antonym-synonym distinction. In tationalLinguistics(EACL),pages23–32,Avignon, Proceedings of the 54th Annual Meeting of the France. Association for Computational Linguistics (ACL), pages454–459,Berlin,Germany. [CharlesandMiller1989] Walter G. Charles and George A. Miller. 1989. Contexts of antonymous [Onoetal.2015] Masataka Ono, Makoto Miwa, and adjectives. AppliedPsychology,10:357–375. Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional [Deese1965] James Deese. 1965. The Structure of information. In Proceedings of the Conference of Associations in Language and Thought. The John the North American Chapter of the Association for HopkinsPress,Baltimore,MD. ComputationalLinguistics:HumanLanguageTech- nologies (NAACL), pages 984–989, Denver, Col- [Fellbaum1995] Christiane Fellbaum. 1995. Co- orado. occurrenceandantonymy. InternationalJournalof Lexicography,8:281–303. [Penningtonetal.2014] Jeffrey Pennington, Richard Socher,andChristopherD.Manning. 2014. Glove: [Firth1957] JohnR.Firth. 1957. PapersinLinguistics Globalvectorsforwordrepresentation. InProceed- 1934-51. Longmans,London,UK. ings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages [Harris1954] Zellig S. Harris. 1954. Distributional 1532–1543,Doha,Qatar. structure. Word,10(23):146–162. [RothandSchulteimWalde2014] Michael Roth and [Hearst1992] Marti Hearst. 1992. Automatic acquisi- Sabine Schulte im Walde. 2014. Combining word tion of hyponyms from large text corpora. In In patternsanddiscoursemarkersforparadigmaticre- Proceedingsofthe14thInternationalConferenceon lationclassification. InProceedingsofthe52ndAn- Computational Linguistics (COLING), pages 539– nual Meeting of the Association for Computational 545,Nantes,France. Linguistics(ACL),pages524–530,Baltimore,MD. [HochreiterandSchmidhuber1997] Sepp Hochreiter and Ju¨rgen Schmidhuber. 1997. Long short-term [Santusetal.2014] Enrico Santus, Alessandro Lenci, memory. NeuralComputation,9(8):1735–1780. QinLu,andSabineSchulteimWalde. 2014. Chas- ing hypernyms in vector spaces with entropy. In [JustesonandKatz1991] JohnS.JustesonandSlavaM. Proceedingsofthe14thConferenceoftheEuropean Katz. 1991. Co-occurrences of antonymous adjec- Chapter of the Association for Computational Lin- tivesandtheircontexts. ComputationalLinguistics, guistics(EACL),pages38–42,Gothenburg,Sweden. 17:1–19. [Scheibleetal.2013] Silke Scheible, Sabine Schulte [Linetal.2003] Dekang Lin, Shaojun Zhao, Lijuan imWalde,andSylviaSpringorum. 2013. Uncover- Qin, and Ming Zhou. 2003. Identifying syn- ingdistributionaldifferencesbetweensynonymsand onyms among distributionally similar words. In antonyms in a word space model. In Proceedings Proceedingsofthe18thInternationalJointConfer- of the 6th International Joint Conference on Natu- enceonArtificialIntelligence(IJCAI),pages1492– ralLanguageProcessing(IJCNLP),pages489–497, 1493,Acapulco,Mexico. Nagoya,Japan. [SchulteimWaldeandKo¨per2013] Sabine Schulte im WaldeandMaximilianKo¨per. 2013. Pattern-based distinction of paradigmatic relations for german nouns,verbs,adjectives. InProceedingsofthe25th InternationalConferenceoftheGermanSocietyfor Computational Linguistics and Language Technol- ogy(GSCL),pages189–198,Darmstadt,Germany. [Schwartzetal.2015] RoySchwartz, RoiReichart, and Ari Rappoport. 2015. Symmetric pattern based wordembeddingsforimprovedwordsimilaritypre- diction. In Proceedings of the 19th Conference on ComputationalLanguageLearning(CoNLL),pages 258–267,Beijing,China. [Shwartzetal.2016] Vered Shwartz, Yoav Goldberg, andIdoDagan. 2016. Improvinghypernymydetec- tionwithanintegratedpath-basedanddistributional method. In Proceedings of the 54th Annual Meet- ingoftheAssociationforComputationalLinguistics (ACL),pages2389–2398,Berlin,Germany. [TheanoDevelopmentTeam2016] Theano Develop- ment Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXive-prints,abs/1605.02688. [TurneyandPantel2010] Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vec- torspacemodelsofsemantics. JournalofArtificial IntelligenceResearch,37:141–188. [Vuetal.2016] Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schu¨tze. 2016. Combining recurrent and convolutional neural networks for rela- tionclassification. InProceedingsofthe2016Con- ferenceoftheNorthAmericanChapteroftheAsso- ciationforComputationalLinguistics: HumanLan- guageTechnologies(NAACL),pages534–539. [Yihetal.2012] Wen-tau Yih, Geoffrey Zweig, and John C. Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Lan- guageProcessingandComputationalNaturalLan- guage Learning (EMNLP), pages 1212–1222, Jeju Island,Korea. [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.