ebook img

Improved Relation Classification by Deep Recurrent Neural Networks with Data Augmentation PDF

0.45 MB·
by  Yan Xu
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Improved Relation Classification by Deep Recurrent Neural Networks with Data Augmentation

Improved Relation Classification by Deep Recurrent Neural Networks with Data Augmentation YanXu,1,∗,‡ RanJia,1,∗ LiliMou,1 GeLi,1,† YunchuanChen,2 YangyangLu,1 ZhiJin1,† 1KeyLaboratoryofHighConfidenceSoftwareTechnologies(PekingUniversity), MinistryofEducation,China; InstituteofSoftware,PekingUniversity {xuyan14,lige,luyy11,zhijin}@sei.pku.edu.cn {jiaran1994,doublepower.mou}gmail.com 2UniversityofChineseAcademyofSciences [email protected] Abstract 6 Nowadays, neural networks play an important role in the task of relation classification. By de- 1 signing different neural architectures, researchers have improved the performance to a large ex- 0 tent in comparison with traditional methods. However, existing neural networks for relation 2 classificationareusuallyofshallowarchitectures(e.g., one-layerconvolutionalneuralnetworks t c or recurrent networks). They may fail to explore the potential representation space in different O abstraction levels. In this paper, we propose deep recurrent neural networks (DRNNs) for rela- 3 tion classification to tackle this challenge. Further, we propose a data augmentation method by 1 leveragingthedirectionalityofrelations. WeevaluatedourDRNNsontheSemEval-2010Task8, ] andachieveanF1-scoreof86.1%,outperformingpreviousstate-of-the-artrecordedresults.1 L C 1 Introduction . s c Classifyingrelationsbetweentwoentitiesinagivencontextisanimportanttaskinnaturallanguagepro- [ cessing(NLP).Takethefollowingsentenceasanexample: “Jewelryandothersmaller[valuables]e1 were 2 locked in a [safe] or a closet with a deadbolt.” The marked entities valuables and safe are of relation v e2 Content-Container(e ,e ). Relation classification plays a key role in various NLP applications, 1 1 2 5 andhasbecomeahotresearchtopicinrecentyears. 6 Nowadays, neural network-based approaches have made significant improvement in relation classifi- 3 0 cation, compared with traditional methods based on either human-designed features (Kambhatla, 2004; . Hendrickxetal.,2009)orkernels(BunescuandMooney,2005; PlankandMoschitti,2013). Forexam- 1 0 ple, Zeng et al. (2014) and Xu et al. (2015a) utilize convolutional neural networks (CNNs) for relation 6 classification. Xuetal.(2015b)applylongshorttermmemory(LSTM)-basedrecurrentneuralnetworks 1 (RNNs) along the shortest dependency path. Nguyen and Grishman (2015) build ensembles of gated : v recurrentunit(GRU)-basedRNNsandCNNs. i X Wehavenoticedthattheseneuralmodelsaretypicallydesignedinshallowarchitectures,e.g.,onelayer r ofCNNorRNN,whereasevidenceinthedeeplearningcommunitysuggeststhatdeeparchitecturesare a more capable of information integration and abstraction (Graves et al., 2013; Hermans and Schrauwen, 2013;IrsoyandCardie,2014). Anaturalquestionisthenwhethersuchdeeparchitecturesarebeneficial totherelationclassificationtask. In this paper, we propose the deep recurrent neural networks (DRNNs) to classify relations. The deep RNNs can explore the representation space in different levels of abstraction and granularity. By visualizinghowRNNunitsarerelatedtotheultimateclassification,wedemonstratethatdifferentlayers indeed learn different representations: low-level layers enable sufficient information mix, while high- levellayersaremorecapableofpreciselylocatingtheinformationrelevanttothetargetrelationbetween ∗Equalcontribution. †Correspondingauthors. ‡YanXuiscurrentlyaresearchscientistatInvenoCo.,Ltd. . . 1Codereleasedonhttps://sites.google.com/site/drnnre/ This work is licenced under a Creative Commons Attribution 4.0 International License. License details: http:// creativecommons.org/licenses/by/4.0/ AcceptedbyCOLING-2016 (a)                                                                   (b)                            locked                                                               locked           jewelry   were              in                                             in            jewelry                                    and     [valuables]e  1        closet                              closet            [valuables]e1              other  smaller    [safe] e 2   a    with                   [safe]e2                                       a       or        deadbolt                                                                 a Figure 1: (a) The dependency parse tree corresponding to the sentence “Jewelry and other smaller [valuables] were locked in a [safe] or a closet with a deadbolt.” Red arrows indicate the shortest e1 e2 dependencypathbetweene ande . (b)Theaugmenteddatasample. 1 2 two entities. Following our previous work (Xu et al., 2015b), we leverage the shortest dependency path(SDP,Figure1)asthebackboneofourRNNs. We further observe that the relationship between two entities are directed. Two sub-paths, separated by entities’ common ancestor, can be mapped to subject-predicate and object-predicate components of a relation. By changing the order of these two sub-paths, we obtain a new data sample withtheinversedrelationship(Figure1b). Suchdataaugmentationtechniquecanprovideadditionaldata sampleswithoutusingexternaldataresources. WeevaluatedourproposedmethodontheSemEval-2010relationclassificationtask. Evenifwedonot apply data augmentation, the DRNNs model has achieved a high performance of 84.2% F -score with 1 a depth of 3, but the performance decreases when the depth is too large. This is because the deep RNN is a large model, which necessitates more data samples for training. Applying data augmentation can alleviatetheproblemofdatasparsenessandsustainadeeperRNNtoimprovetheperformanceto86.1%. Theresultsshowthatbothourdeepnetworksandthedataaugmentationstrategyhavecontributedtothe relationclassificationtask,andthattheyarecoupledwelltogetherforfurtherperformanceimprovement. Therestofthispaperisorganizedasfollows. Section2reviewsrelatedwork;Section3describesour DRNNs model in detail. Section 4 presents in-depth experimental results. Finally, we have conclusion inSection5. 2 RelatedWork Traditionalmethodsforrelationclassificationmainlyfallintotwogroups: feature-basedorkernel-based. Theformerapproachesextractdifferenttypesoffeaturesandfeedthemintoaclassifier,e.g.,amaximum entropy model (Kambhatla, 2004). Various features, including lexical, syntactic, as well as semantic ones,areshowntobeusefultorelationclassification(Hendrickxetal.,2009). Bycontrast,kernel-based methods do not have explicit feature representations, but require predefined similarity measure of two data samples. Bunescu and Mooney (2005) design a kernel along the shortest dependency path (SDP) betweentwoentitiesbyobservingthattherelationstronglyreliesonSDPs. PlankandMoschitti(2013) combinestructuralinformationandsemanticinformationinatreekernel. Neural networks have now become a prevailing technique in this task. Socher et al. (2011) design a recursiveneuralnetworkalongtheconstituencyparsetree. Hashimotoetal.(2013),alsoonthebasisof recursive networks, emphasize more on important phrases; Ebrahimi and Dou (2015) restrict recursive networks to SDP. In our previous study (Xu et al., 2015b), we introduce SDP-based recurrent neural networktoclassifyrelations. Zengetal.(2014),ontheotherhand,applyCNNstorelationclassification. Alongthisline,dosSantos et al. (2015) replace the common softmax loss function with a ranking loss in their CNN model. Xu et al.(2015a)designanegativesamplingmethodforSDP-basedCNNs. Besides, representative hybrid models of CNNs and recursive/recurrent networks include Liu et al. (2015)andNguyenandGrishman(2015). 3 TheProposedMethodology Inthissection,wedescribeourmethodologyindetail. Subsection3.1providesanoverallpictureofour DRNNs model. Subsections 3.2 and 3.3 describe deep recurrent neural networks. The proposed data augmentation technique is introduced in Subsection 3.4. Finally, we present our training objective in Subsection3.5. 3.1 Overview Figure2depictstheoverallarchitectureoftheDRNNsmodel. Givenasentenceanditsdependencyparse tree,1 wefollowourpreviouswork(Xuetal.,2015b)andbuildDRNNsontheshortestdependencypath (SDP), which serves as a backbone. In particular, an RNN picks up information along each sub-path, separated by the common ancestor of marked entities. Also, we take advantage of four information channels,namely,wordembeddings,POSembeddings,grammaticalrelationembeddings,andWordNet embeddings. DifferentfromXuetal.(2015b),wedesigndeepRNNswithuptofourhiddenlayerssoastocapture informationindifferentlevelsofabstraction. ForeachRNNlayer,maxpoolinggathersinformationfrom differentrecurrentnodes. Noticethatthefourchannels(witheightsub-paths)areprocessedinasimilar way. Then all pooling layers are concatenated and fed into a hidden layer for information integration. Finally,wehaveasoftmaxoutputlayerforclassification. 3.2 RecurrentNeuralNetworksonShortestDependencyPath Inthissubsection,weintroduceasinglelayerofRNNbasedonSDP,servingasabuildingblockofour deeparchitecture. Compared with a raw word sequence or a whole parse tree, the shortest dependency path (SDP) be- tweentwoentitieshastwomainadvantages. First,itreducesirrelevantinformation;second,grammatical relations between words focus on the action and agents in a sentence and are naturally suitable for re- lation classification. Existing studies have demonstrated the effectiveness of SDP (Ebrahimi and Dou, 2015;Liuetal.,2015;Xuetal.,2015b;Xuetal.,2015a);detailsarenotrepeatedhere. FocusedontheSDP,anRNNkeepsahiddenstatevectorh,changingwiththeinputwordateachstep accordingly. Concretely, the hidden state h , for the t-th word in the sub-path, depends on its previous t state h and the current word’s embedding x . For the simplicity and without loss of generality, we t−1 t use vanilla recurrent networks with perceptron-like interaction, that is, the input is linearly transformed byaweightmatrixandnon-linearlysquashedbyanactivationfunction,i.e., h = f(W x +W h +b ) (1) t in t rec t−1 h where W and W are weight matrices for the input and recurrent connections, respectively. b is a in rec h biasterm,andf isanon-linearactivationfunction(ReLUinourexperiment). 3.3 DeepRecurrentNeuralNetworks Although an RNN, as described above, is suitable for picking information along a sequence (a subpath inourtask)byitsiterativenature,themachinelearningcommunitysuggeststhatdeeparchitecturesmay bemorecapableofinformationintegration,andcancapturedifferentlevelsofabstraction. A single-layer RNN can be viewed that it is deep along time steps. When unfolded, however, the RNN has only one hidden layer to capture the current input, as well as to retain the information in its previousstep. Inthissense,single-layerRNNsareactuallyshallowininformationprocessing(Hermans andSchrauwen,2013;IrsoyandCardie,2014). 1ParsedbytheStanfordparser(deMarneffeetal.,2006). Softmax Word/GR/POS/WordNet embeddings Hidden layers ... Hidden layer for each channel Feed-forward connection Max pooling connection valuables jewelry locked locked in closet safe Figure2: TheoverallarchitectureofDRNNs. Tworecurrentneuralnetworkspickupinformationalong the shortest dependency path, separated by its common ancestor. We use four information channels, namelywords,part-of-speechtags,grammaticalrelations(GR),andWordNethypernyms. Intherelationclassificationtask,wordsalongSDPsprovideinformationfromdifferentperspectives. On the one hand, the marked entities themselves are informative. On the other hand, the entities’ com- mon ancestor (typically verbs) tells how the two entities are related to each other. Such heterogeneous informationmightnecessitatemorecomplexmachinerythanasingleRNNlayer. Followingsuchintuition, weinvestigatedeepRNNsbystackingmultiplehiddenlayersonthetopof one another, that is, every layer treats its previous layer as input, and computes its activation similar to Equation1. Formally,wehave h(i) = f(W(i−1)h(i−1)+W(i)h(i) +W(i−1)h(i−1)+b(i)) (2) t in t rec t−1 cross t−1 where the subscripts refer to time steps, and superscripts indicate the layer number. To enhance infor- mation propagation, we add a “cross” connection for hidden layers (i ≥ 2) from the lower layer in the (i−1) (i−1) previoustimestep,givenbyW h inEquation2. (Seealso(cid:37)and(cid:45)arrowsinFigure2). cross t−1 3.4 DataAugmentation Neuralnetworks,especiallydeepones,arelikelytobepronetooverfitting. TheSemEval-2010relation classificationdataset,weuse,comprisesonlyseveralthousandsamples,whichmaynotfullysustainthe trainingofdeepRNNs. To mitigate this problem, we propose a data augmentation technique for relation classification by makinguseofthedirectionalityofrelationships. Thetwosub-paths [valuables] →jewelry→locked e1 locked←in←closet←[safe] e2 in Figure 1, for example, can be mapped to the subject-predicate and object- predicate components in the relation Content-Container(e ,e ). If we change the order of these two sub- 1 2 paths,weobtain [safe] →closet→in→locked e1 locked←jewelry←[valuables] e2 Then the relationship becomes Container-Content(e ,e ), which is exactly the inverse of 1 2 Content-Container(e ,e ). In this way, we can augment the dataset without using additional 1 2 resources. 3.5 TrainingObjective For each recurrent layer and embedding layer (over each sub-path for each channel), we apply a max pooling layer to gather information. In total, we have 40 pools, which are concatenated and fed to a hiddenlayerforinformationintegration. Finally,asoftmaxlayeroutputstheestimatedprobabilitythattwosub-paths(sleftandsright)areofrela- tionr. Forasingledatasamplei,weapplythestandardcross-entropyloss,denotedasJ(sleft,sright,r ). i i i Withthedataaugmentationtechnique,ouroveralltrainingobjectiveis m ω (cid:88) (cid:88) J = J(sleft,sright,r )+J(sright,sleft,r−1)+λ (cid:107)W (cid:107) i i i i i i i F i=1 i=1 where r−1 refers to the inverse of relation r. m is the number of data samples in the original training set. ω isthenumberofweightmatricesinDRNNs. λisaregularizationcoefficient, and(cid:107)·(cid:107) denotes F Frobeniusnormofamatrix. Fordecoding(predictingtherelationofanunseensample),thedataaugmentationtechniqueprovides new opportunities, because we can use the probability of r(e ,e ), r−1(e ,e ), or both. Section 4.3 1 2 2 1 providesdetaileddiscussion. 4 Experiments Inthissection,wepresentourexperimentsindetail. Subsection4.1introducesthedataset;Subsection4.2 describeshyperparametersettings. WediscussthedetailsofdataaugmentationinSubsection4.3andthe rationale for using RNNs in Subsection 4.4. Subsection 4.5 compares our DRNNs model with other methods in the literature. In Subsection 4.6, we have quantitative and qualitative analysis of how the depthaffectsourmodel. 4.1 Dataset WeevaluatedourDRNNsmodelontheSemEval-2010Task8dataset,whichisanestablishedbenchmark forrelationclassification(Hendrickxetal.,2009). Thedatasetcontains8000sentencesfortraining,and 2717fortesting. Wesplit800samplesoutofthetrainingsetforvalidation. There are 9 directed relations and an undirected default relation Other; thus, we have 19 different labels in total. However, the Other class is not taken into consideration when we compute the official measures. 4.2 HyperparameterSettings This subsection presents hyperparameters of our proposed model. We basically followed the settings inourpreviouswork(Xuetal., 2015b). Wordembeddingswere200-dimensional, pretrainedourselves using word2vec (Mikolov et al., 2013) on the Wikipedia corpus; embeddings in other channels were 50-dimensionalinitializedrandomly. Thehiddenlayersineachchannelhadthesamenumberofunitsas their embeddings (either 200 or 50); the penultimate hidden layer was 100-dimensional. An (cid:96) penalty 2 of 10−5 was also applied as in Xu et al. (2015b), but we chose the dropout rate by validation with a granularityof5%forourmodelvariants(withdifferentdepths). WealsochosethedepthofDRNNsbyvalidationfromtheset{1,2,··· ,6}. The3-layerand4-layer DRNNs yield the highest performance with and without data augmentation, respectively. Section 4.6 providesbothquantitativeandqualitativeanalysisregardingtheeffectofdepth. Weappliedmini-batchedstochasticgradientdescentforoptimization,wheregradientswerecomputed bystandardback-propagation. VariantofDataaugmentation F1 Depth NoAugmentation 84.16 1 2 Augmentallrelations 83.43 CNN 84.01 83.78 AugmentOtheronly 83.01 RNN 84.43 85.04 Augmentdirectedrelationsonly 86.10 Table 2: Comparing CNNs and RNNs Table1: Comparingvariantsofdataaugmentation. (alsousingF -scoreasthemeasurement). 1 4.3 DataAugmentationDetails As mentioned in Section 4.1, the SemEval-2010 Task 8 dataset contains an undirected class Other in addition to 9 directed relations (18 classes). For data augmentation, it is natural that the inversed Other relation is also in the Other class itself. However, if we augment all the relations, we observe a performance degradation of 0.7% (Table 1). We deem the Other class contains mainly noise, and is inimical to our model. Then we conducted another experiment where we only augmented the Other class. Theresultverifiesourconjectureasweobtainedanevenlargerdegradationof1.1%inthissetting. Thepilotexperimentssuggestthatweshouldtakeintoconsiderationunfavorablenoisewhenperform- ingdataaugmentation. Inthisexperiment,ifwereversethedirectedrelationsonlyandleavetheOther classintact,theperformanceisimprovedbyalargemarginof1.9%. Thisshowsthatourproposeddata augmentationtechniquedoeshelptomitigatetheproblemofdatasparseness,ifwecarefullyruleoutthe impactofnoise. During validation and testing, we shall decode the target label of an unseen data sample (with two entities e and e ). Through data augmentation, we are equipped with the probability of r−1(e ,e ) 1 2 2 1 in addition to r(e ,e ). In our experiment, we tried several settings and chose to use r−1(e ,e ) only, 1 2 2 1 because it yields the highest the validation result. We think this is probably because the Other class bringsmorenoisetor thanr−1,astheOtherclassisnotaugmented(andhenceasymmetric). We would like to point out that our data augmentation method is a general technique for relation classification, which is not ad hoc to a specific dataset; that the methodology for dealing with noise is alsopotentiallyapplicabletootherdatasets. 4.4 RNNsvs. CNNs As both RNNs and CNNs are prevailing neural models for NLP, we are curious whether deep architec- tures are also beneficial to CNNs. We tried a CNN with a sliding window of size 3 based on SDPs, similartoXuetal.(2015a);othersettingswereasourDRNNs. The results are shown in Table 2. We observe that a single layer of CNN is also effective, yielding an F -score slightly worse than our RNN. But the deep architecture hurts the performance of CNNs in 1 this task. One plausible explanation is that, when convolution is performed, the beginning and end of a sentence are typically padded with a special symbol or simply zero. However, the shortest dependency path between two entities is usually not very long (∼4 on average). Hence, sentence boundaries may playalargeroleinconvolution,whichmakesCNNsvulnerable. Onthecontrary,RNNscandealwithsentenceboundariessmoothly,andtheperformancecontinuesto increasewithupto4hiddenlayers. (DetailsaredeferredtoSubsection4.6.) 4.5 OverallPerformance Table3comparesourDRNNsmodelwithpreviousstate-of-the-artmethods.2 Thefirstentryinthetable presentsthehighestperformanceachievedbytraditionalfeature-basedmethods. Hendrickxetal.(2009) feedavarietyofhandcraftedfeaturestotheSVMclassifierandachieveanF -scoreof82.2%. 1 Recent performance improvements on this dataset are mostly achieved with the help of neural net- works. In an early study, Socher et al. (2012) build a recursive network on constituency trees, but 2ThispaperwaspreprintedonarXivon14Jan2016. Model Features F1 POS,WordNet,prefixesandothermorphologicalfeatures, SVM depdencyparse,Levinclasses,PropBank,FanmeNet, 82.2 (Hendrickxetal.,2009) NomLex-Plus,Googlen-gram,paraphrases,TextRunner RNN Wordembeddings 74.8 (Socheretal.,2011) +POS,NER,WordNet 77.6 MVRNN Wordembeddings 79.1 (Socheretal.,2012) +POS,NER,WordNet 82.4 CNN Wordembeddings 69.7 (Zengetal.,2014) +positionembeddings,WordNet 82.7 ChainCNN Wordembeddings,POS,NER,WordNet 82.7 (EbrahimiandDou,2015) CR-CNN Wordembeddings 82.8 (dosSantosetal.,2015) +positionembeddings 84.1 FCM Wordembeddings 80.6 (Yuetal.,2014) +dependencyparsing,NER 83.0 SDP-LSTM Wordembeddings 82.4 (Xuetal.,2015b) Word+POS+GR+WordNetembeddings 83.7 DepNN Wordembeddings+WordNet 83.0 (Liuetal.,2015) Wordembeddings+NER 83.6 depLCNN Word+WordNet+wordsaroundnominals 83.7 (Xuetal.,2015a) +negativesamplingfromNYTdataset 85.6 EnsembleMethods Word+POS+NER+WordNetembeddings,CNNs,RNNs+Stacking 83.4 (NguyenandGrishman,2015) Word+POS+NER+WordNetembeddings,CNNs,RNNs+Voting 84.1 Word+POS+GR+WordNetembeddingsw/odataaugmentation 84.2 DRNNs +dataaugmentation 86.1 Table3: Comparisonofpreviousrelationclassificationsystems. achieve a performance worse than Hendrickx et al. (2009). They extend their recursive network with matrix-vectorinteractionandelevatetheF -scoreto82.4%. EbrahimiandDou(2015)restricttherecur- 1 sivenetworktoSDP,whichisslightlybetterthanasentence-widenetwork. Inourpreviousstudy(Xuet al.,2015b),weintroducerecurrentneuralnetworksbasedonSDPandimprovetheF -scoreto83.7%. 1 In the school of convolution, Zeng et al. (2014) construct a CNN on the word sequence; they also integrate word position embeddings, which benefit the CNN architecture. dos Santos et al. (2015) pro- pose a similar CNN model, named CR-CNN, by replacing the common softmax cost function with a ranking-basedcostfunction. BydiminishingtheimpactoftheOtherclass,theyachieveanF -scoreof 1 84.1%. Xuetal.(2015a)designanSDP-basedCNNwithnegativesampling,improvingtheperformance to85.6%. Hybrid models of CNNs and RNNs do not appear to be very useful, achieving up to an F -score of 1 84.1%(Liuetal.,2015;NguyenandGrishman,2015). Yu et al. (2014) propose a Feature-rich Compositional Embedding Model (FCM), which combines unlexicalizedlinguisticcontextsandwordembeddings. Theydonotuseneuralnetworks(atleastinthe usualsense)andachieveanF -scoreof83.0%. 1 Our DRNNs model, along with data augmentation, achieves an F -score of 86.1%. Even if we do 1 notapplydataaugmentation, theDRNNsmodelyields84.2%F -score, whichisalsothehighestscore 1 achieved without special treatment to the noisy Other class. The above results show the effectiveness ofDRNNs,especiallytrainedwithalarge(augmented)dataset. 4.6 AnalysisofDRNNs’Depth In this subsection, we analyze the effect of depth in our 88 DRNNs model. We have tested the depth from the set 87 {1,2,··· ,6}, and plot the results in Figure 3. Initially, the 86 86.1 performanceincreasesifthedepthislargerinbothsettings %)85 ( 84.2 withandwithoutaugmentation. However,ifwedonotaug- ore84 c mentdata,theperformancepeakswhenthedepthis3. Pro- F-s183 vided with augmented training samples, the F -score con- 1 82 w/dataaugmentation tinues to increase with up to 4 layers, and ends up with an 81 w/odataaugmentation F -scoreof86.1%. 1 1 2 3 4 5 6 WenextinvestigatehowRNNunitsindifferentlayersare DepthsofDRNNs relatedtotheultimatetaskofinterest. Thisisaccomplished by tracing back information from pooling layers. Noticing Figure3: Analysisofthedepth.3 that the pooling layer takes maximum value in each dimension, we can compute how much a hidden layer’sunitsaregathered bypoolingforfurtherprocessing. Inthisway, weareable todemonstratethe information flow in RNN hidden units. We plot three examples in Figure 4. Here, rectangles refer to RNNhiddenlayers, unfoldedalong time. (Roundedrectanglesarewordembeddings.) Theintensityof colorreflectstheratioofthepoolingproportion. • Sample1: “Until1864[vessels] intheserviceofcertainUKpublicofficesdefacedtheRedEnsign e1 withthe[badge] oftheiroffice”withlabelInstrument-Agency(e ,e ). Itstwosub-pathsof e2 2 1 SDPare [vessels] → until → defaced e1 defaced←with←[badge] e2 FromFigure4a,weseethatentitieslikevesselsandbadgearedarkerthantheverbphrasedefaced with on the embedding layer. When information is propagating horizontally and vertically, these entities are getting lighter, while the verb phrase becomes darker gradually. Intuitively, we think that, consideringtherelationInstrument-Agency(e ,e ), itislessinformativewithonlytwo 2 1 entities vessels and badge. When adding the semantic of verb phrase defaced with, we are more awareofthetargetrelation.3 • Sample2: “Mostofthe[verses] oftheplantationsongshadsomereferenceto[freedom] ”with e1 e2 labelMessage-Topic(e ,e ). Itstwosub-pathsofSDPare 1 2 [verses] → of → most → had e1 had←reference←to←[freedom] e2 SimilartoSample1,weseefromFigure4bthatthecolorofthe“pivot”verbhad isgettingdarker vertically,andbecomesthedarkestinthefourthRNNlayer,indicatingthehighestpoolingportion. Thisisprobablybecausehad linkstwoendsoftherelation,MessageandTopic. • Sample3: “Amorespare,lessrobustuseofclassical[motifs] isevidentina[ewer] of1784-85” e1 e2 withlabelComponent-Whole(e ,e ). Itstwosub-pathsofSDPare 1 2 [motifs] → of → use → evident e1 evident←in←[ewer] e2 Different from Figures 4a and 4b, higher layers pay more attention to entities rather than their ancestor. In this example, motifs and ewer appear to be more relevant to the relation Component-Wholethantheircommonancestorevident. Thepoolingproportionofentities(mo- tifs,ewer)isincreasing,whileotherwords’proportionisdecreasing. 3UsingvanillaRNNwithadepthof1,weobtainedaslightlybetteraccuracyinthispaperthanXuetal.(2015b). 30C32C34C36C38C40C 20C22C24C26C28C30C 10C18C26C34C42C50C vessels until defaced defaced with badge verses of most had had eference to freedom motifs of use evident evident in ewer r MaTIIInstrument-AgencyMe2,Ie1T MbTIIMessage-TopicMe1,Ie2T McTIIComponent-WholeMe1,Ie2T Figure4: VisualizationofinformationpropagationalongmultipleRNNlayers. Wesummarizeourfindingsasfollows. (1)Pooledinformationusuallypeaksatoneorafewwordsin theembeddinglayer. Thismakessensebecausethereisnoinformationflowinthislayer. (2)Information scattersoverawiderrangeinhiddenlayers,showingthattherecurrentpropagationdoesmixinformation. (3)Forahigher-levellayer,thenetworkpaysmoreattentiontothosewordsthataremorerelevanttothe relation,butwhetherentitiesortheircommonancestorismorerelevantisnotconsistentamongdifferent datasamples. 5 Conclusion Inthispaper,weproposeddeeprecurrentneuralnetworks,namedDRNNs,toimprovetheperformance ofrelationclassification. TheDRNNsmodel,consistingofseveralRNNlayers,explorestherepresenta- tionspaceofdifferentabstractionlevels. ByvisualizingDRNNs’units,wedemonstratedthathigh-level layers are more capable of integrating information relevant to target relations. In addition, we have de- signedadataaugmentationstrategybyleveragingthedirectionalityofrelations. Whenevaluatedonthe SemEval dataset, our DRNNs model results in substantial performance boost. The performance gener- allyimproveswhenthedepthincreases;withadepthof4,ourmodelreachesthehighestF -measureof 1 86.1%. Acknowledgments Wethankallreviewersfortheirconstructivecomments. ThisresearchissupportedbytheNationalBasic Research Program of China (the 973 Program) under Grant No. 2015CB352201, the National Natural ScienceFoundationofChinaunderGrantNos.61232015,91318301,61421091,and61502014. References [BunescuandMooney2005] Razvan C. Bunescu and Raymond J. Mooney. 2005. A shortest path dependency kernelforrelationextraction. InProceedingsoftheConferenceonHumanLanguageTechnologyandEmpirical MethodsinNaturalLanguageProcessing,pages724–731. [deMarneffeetal.2006] Marie-CatherinedeMarneffe,BillMacCartney,andChristopherD.Manning. 2006. Gen- eratingtypeddependencyparsesfromphrasestructureparses. InProceedingsoftheInternationalConference onLanguageResourcesandEvaluation,volume6,pages449–454. [dosSantosetal.2015] Cıcero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations byrankingwithconvolutionalneuralnetworks. InProceedingsof53rdAnnualMeetingoftheAssociationfor ComputationalLinguistics,pages626–634. [EbrahimiandDou2015] Javid Ebrahimi and Dejing Dou. 2015. Chain based rnn for relation classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: HumanLanguageTechnologies,pages1244–1249. [Gravesetal.2013] AlanGraves,Abdel-rahmanMohamed,andGeoffreyHinton. 2013. Speechrecognitionwith deeprecurrentneuralnetworks. InProceedingsoftheIEEEInternationalConferenceonAcoustics,Speechand SignalProcessing,pages6645–6649. [Hashimotoetal.2013] KazumaHashimoto,MakotoMiwa,YoshimasaTsuruoka,andTakashiChikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In Proceedings of the 2013ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pages1372–1376. [Hendrickxetal.2009] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O´ Se´aghdha, Sebastian Pado´, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. Semeval-2010 task 8: Multi-wayclassificationofsemanticrelationsbetweenpairsofnominals. InProceedingsoftheWorkshopon SemanticEvaluations: RecentAchievementsandFutureDirections,pages94–99. [HermansandSchrauwen2013] Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrentneuralnetworks. InAdvancesinNeuralInformationProcessingSystems,pages190–198. [IrsoyandCardie2014] OzanIrsoyandClaireCardie. 2014. Opinionminingwithdeeprecurrentneuralnetworks. InProceedingsofthe2014ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pages720–728. [Kambhatla2004] Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of the ACL Interactive Poster and Demonstration Sessions,pages178–181. [Liuetal.2015] YangLiu,FuruWei,SujianLi,HengJi,MingZhou,andHoufengWANG. 2015. Adependency- basedneuralnetworkforrelationclassification. InProceedingsofthe53rdAnnualMeetingoftheAssociation for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages285–290. [Mikolovetal.2013] TomasMikolov,IlyaSutskever,KaiChen,GregS.Corrado,andJeffDean. 2013. Distributed representationsofwordsandphrasesandtheircompositionality. InAdvancesinNeuralInformationProcessing Systems,pages3111–3119. [NguyenandGrishman2015] Thien Huu Nguyen and Ralph Grishman. 2015. Combining neural networks and log-linearmodelstoimproverelationextraction. arXivpreprintarXiv:1511.05926. [PlankandMoschitti2013] Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the AssociationforComputationalLinguistics,pages1498–1507. [Socheretal.2011] RichardSocher,JeffreyPennington,EricH.Huang,AndrewY.Ng, andChristopherD.Man- ning. 2011. Semi-supervisedrecursiveautoencodersforpredictingsentimentdistributions. InProceedingsof theConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pages151–161. [Socheretal.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Seman- tic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. [Xuetal.2015a] Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015a. Semantic relation clas- sificationviaconvolutionalneuralnetworkswithsimplenegativesampling. InProceedingsofConferenceon EmpiricalMethodsinNaturalLanguageProcessing,pages536–540. [Xuetal.2015b] YanXu,LiliMou,GeLi,YunchuanChen,HaoPeng,andZhiJin. 2015b. Classifyingrelationsvia longshorttermmemorynetworksalongshortestdependencypaths. InProceedingsofConferenceonEmpirical MethodsinNaturalLanguageProcessing,pages1785–1794. [Yuetal.2014] MoYu,MatthewGormley,andMarkDredze. 2014. Factor-basedcompositionalembeddingmod- els. InNIPSWorkshoponLearningSemantics. [Zengetal.2014] DaojianZeng,KangLiu,SiweiLai,GuangyouZhou,andJunZhao. 2014. Relationclassification viaconvolutionaldeepneuralnetwork. InProceedingsofthe25thInternationalConferenceonComputational Linguistics: TechnicalPapers,pages2335–2344.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.