Collective Vertex Classification Using Recursive Neural Network

Qiongkai Xu, The Australian National University, Data61 CSIRO, [email protected]
Qing Wang, The Australian National University, [email protected]
Chenchen Xu, The Australian National University, Data61 CSIRO, [email protected]
Lizhen Qu, The Australian National University, Data61 CSIRO, [email protected]

Abstract

Collective classification of vertices is a task of assigning categories to each vertex in a graph based on both vertex attributes and link structure. Nevertheless, some existing approaches do not use the features of neighbouring vertices properly, due to the noise introduced by these features. In this paper, we propose a graph-based recursive neural network framework for collective vertex classification. In this framework, we generate hidden representations from both the attributes of vertices and the representations of neighbouring vertices via recursive neural networks. Under this framework, we explore two types of recursive neural units, the naive recursive neural unit and the long short-term memory unit. We have conducted experiments on four real-world network datasets. The experimental results show that our framework with the long short-term memory model achieves better results and outperforms several competitive baseline methods.

Introduction

In everyday life, graphs are ubiquitous, e.g. social networks, sensor networks, and citation networks. Mining useful knowledge from graphs and studying the properties of various kinds of graphs have been gaining popularity in recent years. Many studies formulate their graph problems as predictive tasks such as vertex classification (London and Getoor 2014), link prediction (Tang et al. 2015), and graph classification (Niepert, Ahmed, and Kutzkov 2016).

In this paper, we focus on the vertex classification task, which studies the properties of vertices by categorising them. Algorithms for classifying vertices are widely adopted in web page analysis, citation analysis and social network analysis (London and Getoor 2014). Naive approaches for vertex classification use traditional machine learning techniques to classify a vertex based only on the attributes or features provided by this vertex; for example, such attributes can be the words of web pages or user profiles in a social network. Another series of approaches is collective vertex classification, where instances are classified simultaneously rather than independently. Based on the observation that information from neighbouring vertices may help classify the current vertex, some approaches incorporate the attributes of neighbouring vertices into the classification process, which, however, introduces noise at the same time and can result in reduced performance (Chakrabarti, Dom, and Indyk 1998; Myaeng and Lee 2000). Other approaches incorporate the labels of a vertex's neighbours. For instance, the Iterative Classification Approach (ICA) integrates the label distribution of neighbouring vertices to assist classification (Lu and Getoor 2003), and the Label Propagation approach (LP) fine-tunes the predictions of a vertex using the labels of its neighbouring vertices (Wang and Zhang 2008). However, the labels of neighbouring vertices are not representative enough for learning sophisticated relationships between vertices, while using their attributes directly would introduce noise. We thus need an approach that is capable of capturing information from neighbouring vertices while at the same time reducing the noise of their attributes. Utilizing representations learned by neural networks, instead of neighbouring attributes or labels, is one possible approach (Schmidhuber 2015). As graphs normally provide rich structural information, we exploit neural networks with sufficient complexity to capture such structures.

Recurrent neural networks were developed to utilize sequence structures by processing inputs in order, where the representation of the previous node is used to generate the representation of the current node. Recursive neural networks exploit representation learning on tree structures. Both approaches have achieved success in learning representations of data with implicit structures that indicate the order in which vertices are processed (Tang, Qin, and Liu 2015; Tai, Socher, and Manning 2015). Following these works, we explore the possibility of integrating graph structures into recursive neural networks.
However, graph structures, especially cyclic graphs, do not provide such a processing order. In this paper, we propose a graph-based recursive neural network (GRNN), which allows us to transform graph structures into tree structures and use recursive neural networks to learn representations for the vertices to classify. This framework consists of two main components, as illustrated in Figure 1:

1. Tree construction: For each vertex to classify v_t, we generate a search tree rooted at v_t. Starting from v_t, we add its neighbouring vertices into the tree layer by layer.

2. Recursive neural network construction: We build a recursive neural network for the constructed tree by augmenting each vertex with one recursive neural unit. The inputs of each vertex are its features and the hidden states of its child vertices. The output of a vertex is its hidden state.

Figure 1: Main components of the Graph-based Recursive Neural Network framework (GRNN): 1) constructing a tree from the vertex to classify (vertex 1); 2) building a recursive neural network on the tree.

Our main contributions are: (1) We introduce recursive neural networks to solve the collective vertex classification problem. (2) We propose a method that transfers the vertices to classify into a locally constructed tree; based on the tree, a recursive neural network can extract representations for target vertices. (3) Our experimental results show that the proposed approach outperforms several baseline methods. In particular, we demonstrate that including information from neighbouring vertices can improve classification performance.

Related Work

There has been a growing trend to represent data using graphs (Angles and Gutierrez 2008).
Discovering knowledge from graphs has become an exciting research area, covering tasks such as vertex classification (London and Getoor 2014) and graph classification (Niepert, Ahmed, and Kutzkov 2016). Graph classification analyzes the properties of a graph as a whole, while vertex classification focuses on predicting the labels of vertices in the graph. In this paper, we discuss the problem of vertex classification. The mainstream approaches for vertex classification are collective vertex classification (Lu and Getoor 2003), which classify vertices using information provided by neighbouring vertices. The iterative classification approach (Lu and Getoor 2003) models neighbours' label distributions as link features to facilitate classification. The label propagation approach (Wang and Zhang 2008) assigns a probabilistic label to each vertex and then fine-tunes the probabilities using the graph structure. However, the labels of neighbouring vertices are not representative enough to include all useful information. Some researchers have tried to introduce attributes from neighbouring vertices to improve classification performance. Nevertheless, as reported in (Chakrabarti, Dom, and Indyk 1998; Myaeng and Lee 2000), naively incorporating these features may reduce classification performance when the original features of neighbouring vertices are too noisy.

Recently, some researchers have analysed graphs using deep neural network technologies. Deepwalk (Perozzi, Al-Rfou, and Skiena 2014) is an unsupervised learning algorithm that learns vertex embeddings from the link structure, but the content of each vertex is not considered. The convolutional neural network for graphs (Niepert, Ahmed, and Kutzkov 2016) learns feature representations for graphs as a whole. Recurrent neural collective classification (Monner and Reggia 2013) encodes neighbouring vertices via a recurrent neural network, which makes it hard to capture information from vertices that are more than several steps away.

Recursive neural networks (RNN) are a series of models that deal with tree-structured information. RNN has been applied to natural scene parsing (Socher et al. 2011) and tree-structured sentence representation learning (Tai, Socher, and Manning 2015). Under this framework, representations can be learned from both input features and the representations of child nodes. Graph structures are more widely used and more complicated than tree or sequence structures. Due to the lack of a notable order for processing the vertices in a graph, few studies have investigated the vertex classification problem using recursive neural network techniques. The graph-based recursive neural network framework proposed in this paper can generate the processing order for the neural network according to the vertex to classify and the local graph structure.

Graph-based Recursive Neural Networks

In this section, we present the framework of Graph-based Recursive Neural Networks (GRNN). A graph G = (V, E) consists of a set of vertices V = {v_1, v_2, ..., v_N} and a set of edges E ⊆ V × V. Graphs may contain cycles, where a cycle is a path from a vertex back to itself. Let X = {x_1, x_2, ..., x_N} be a set of feature vectors, where each x_i ∈ X is associated with a vertex v_i ∈ V, let L be a set of labels, and let v_t ∈ V be a vertex to be classified, called the target vertex. Then, the collective vertex classification problem is to predict the label y_t of v_t such that

    \hat{y}_t = \arg\max_{y_t \in L} P_\theta(y_t \mid v_t, G, X)    (1)

using a recursive neural network with parameters θ.

Tree Construction

In a neural network, neurons are arranged in layers and different layers are processed following a predefined order. For example, recurrent neural networks process inputs in sequential order and recursive neural networks deal with tree structures in a bottom-up manner. However, graphs, particularly cyclic graphs, do not have an explicit order. How to construct an ordered structure from a graph is challenging.

Given a graph G = (V, E), a target vertex v_t ∈ V and a tree depth d ∈ N, we can construct a tree T = (V_T, E_T) rooted at v_t using breadth-first search, where V_T is a vertex set, E_T is an edge set, and (v, w) ∈ E_T means an edge from parent vertex v to child vertex w. The depth of a vertex v in T is the length of the path from v_t to v, denoted as T.depth(v). The depth of a tree is the maximum depth of the vertices in T. We use G.outgoingVertices(v) to denote the set of outgoing vertices from v, i.e. {w | (v, w) ∈ E}. The tree construction algorithm is described in Algorithm 1. First, a first-in-first-out queue Q is initialized with v_t (lines 1-3). The algorithm then iteratively checks the vertices in Q: it pops a vertex v from Q (line 5) and, if the depth of v does not exceed d (line 6), adds all the neighbouring vertices of v in G as its children in T and pushes them to the end of Q (lines 9-11).

Algorithm 1: Tree Construction Algorithm
Require: Graph G, target vertex v_t, tree depth d
Ensure: Tree T = (V_T, E_T)
 1: Let Q be a queue
 2: V_T = {v_t}, E_T = ∅
 3: Q.push(v_t)
 4: while Q is not empty do
 5:   v = Q.pop()
 6:   if T.depth(v) ≤ d then
 7:     for w in G.outgoingVertices(v) do
 8:       // add w to T as a child of v
 9:       V_T = V_T ∪ {w}
10:       E_T = E_T ∪ {(v, w)}
11:       Q.push(w)
12:     end for
13:   end if
14: end while
15: return T

In general, there are two approaches to deal with cycles in a graph. One is to remove vertices that have been visited, and the other is to keep duplicate vertices. Figure 2.a gives an example of a graph with a cycle between v_1 and v_2. Let us start with the target vertex v_1. In the first approach, there will be no child vertex for v_2, since v_1 is already visited. The corresponding tree is shown in Figure 2.b. In the second approach, we add v_1 as a child vertex of v_2 iteratively and terminate after a certain number of steps. The corresponding tree is illustrated in Figure 2.c. When generating the representation of a vertex, say v_2, any information from its neighbours may help. We thus include v_1 as a child vertex of v_2. In this paper, we use the second manner for tree construction.

Figure 2: (a) A graph with a cycle; (b) A tree constructed without duplicate vertices; (c) A tree constructed with duplicate vertices.
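As a concrete illustration of Algorithm 1, the following Python sketch builds the tree with breadth-first search while keeping duplicate vertices, as chosen above. The adjacency-list graph representation, the node-id bookkeeping and the reading of the depth check (a vertex is expanded only while its depth is below d, so that d = 0 yields a single-vertex tree, as noted in a later footnote) are our assumptions for illustration, not the authors' released code.

```python
from collections import deque

def build_tree(graph, target, d):
    """Breadth-first tree construction (cf. Algorithm 1), keeping duplicate
    vertices so that cycles simply unroll for a bounded number of steps.

    graph  : dict mapping each vertex to a list of its outgoing vertices
    target : the target vertex v_t
    d      : maximum tree depth (d = 0 gives a single-vertex tree)

    Returns a dict of tree nodes: node_id -> {"vertex": v, "children": [ids]}.
    Fresh node ids are used because the same graph vertex may appear
    several times in the tree when the graph contains cycles.
    """
    tree = {0: {"vertex": target, "children": []}}
    depth = {0: 0}
    queue = deque([0])
    next_id = 1
    while queue:
        node = queue.popleft()
        if depth[node] < d:  # expand only while the depth limit has not been reached
            for w in graph.get(tree[node]["vertex"], []):
                tree[next_id] = {"vertex": w, "children": []}
                tree[node]["children"].append(next_id)
                depth[next_id] = depth[node] + 1
                queue.append(next_id)
                next_id += 1
    return tree
```

For example, build_tree({1: [2], 2: [1]}, 1, 2) unrolls a two-vertex cycle like the one in Figure 2.a into a chain in which v_1 reappears as a child of v_2, in the spirit of Figure 2.c.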
Recursive Neural Network Construction

Now we construct a recursive neural unit (RNU) for each vertex v_k ∈ T. Each RNU takes a feature vector x_k and the hidden states of its child vertices as input. We explore two kinds of recursive neural units, which are discussed in (Socher et al. 2011; Tai, Socher, and Manning 2015).

Naive Recursive Neural Unit (NRNU)

Each NRNU for a vertex v_k ∈ T takes a feature vector x_k and the aggregation of the hidden states from all children of v_k. The transition equations of NRNU are given as follows:

    \tilde{h}_k = \max_{v_r \in C(v_k)} \{ h_r \}                          (2)
    h_k = \tanh(W^{(h)} x_k + U^{(h)} \tilde{h}_k + b^{(h)})               (3)

where C(v_k) is the set of child vertices of v_k, W^{(h)} and U^{(h)} are weight matrices, and b^{(h)} is the bias. The generated hidden state h_k of v_k depends on the input vector x_k and the aggregated hidden state \tilde{h}_k. Different from summing up all hidden states as in (Tai, Socher, and Manning 2015), we use max pooling for \tilde{h}_k (see Eq. 2). This is because, in real-life situations, the number of neighbours of a vertex can be very large and some of them are irrelevant to the vertex to classify (Zhang et al. 2013). We use G-NRNN to refer to the graph-based naive recursive neural network which incorporates NRNU as its recursive neural units.

Long Short-Term Memory Unit (LSTMU)

LSTMU is a variation of RNU which can handle the long-term dependency problem by introducing memory cells and gated units (Hochreiter and Schmidhuber 1997). LSTMU is composed of an input gate i_k, a forget gate f_{kr} for each child vertex v_r, an output gate o_k, a memory cell c_k and a hidden state h_k. The transition equations of LSTMU are given as follows:

    \tilde{h}_k = \max_{v_r \in C(v_k)} \{ h_r \}                          (4)
    i_k = \sigma(W^{(i)} x_k + U^{(i)} \tilde{h}_k + b^{(i)})              (5)
    f_{kr} = \sigma(W^{(f)} x_k + U^{(f)} h_r + b^{(f)})                   (6)
    o_k = \sigma(W^{(o)} x_k + U^{(o)} \tilde{h}_k + b^{(o)})              (7)
    u_k = \tanh(W^{(u)} x_k + U^{(u)} \tilde{h}_k + b^{(u)})               (8)
    c_k = i_k \odot u_k + \sum_{v_r \in C(v_k)} f_{kr} \odot c_r           (9)
    h_k = o_k \odot \tanh(c_k)                                             (10)

where C(v_k) is the set of child vertices of v_k, v_r ∈ C(v_k), x_k is the feature vector of vertex v_k, ⊙ is element-wise multiplication, σ is the sigmoid function, W^{(*)} and U^{(*)} are weight matrices, and b^{(*)} are the biases. \tilde{h}_k is a vector aggregated from the hidden states of the child vertices. We use G-LSTM to refer to the graph-based long short-term memory network which incorporates LSTMU as its recursive neural units.
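To make Eqs. (2)-(10) concrete, here is a minimal NumPy sketch of a single NRNU step and a single LSTMU step: each computes the hidden state (and, for LSTMU, the memory cell) of one tree vertex from its feature vector and the already-computed states of its children. The dictionary-of-matrices parameterisation and the zero aggregation used for leaf vertices are our assumptions for illustration; the paper does not prescribe this particular layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nrnu(x, child_h, W_h, U_h, b_h):
    """Naive recursive neural unit, Eqs. (2)-(3): max-pool the children's hidden
    states, then combine them with the vertex's own features through a tanh layer."""
    hidden_dim = U_h.shape[1]
    h_tilde = np.max(child_h, axis=0) if len(child_h) else np.zeros(hidden_dim)
    return np.tanh(W_h @ x + U_h @ h_tilde + b_h)

def lstmu(x, child_h, child_c, W, U, b):
    """Graph LSTM unit, Eqs. (4)-(10). W, U, b are dicts keyed by 'i', 'f', 'o', 'u'."""
    hidden_dim = U["i"].shape[1]
    h_tilde = np.max(child_h, axis=0) if len(child_h) else np.zeros(hidden_dim)  # Eq. (4)
    i = sigmoid(W["i"] @ x + U["i"] @ h_tilde + b["i"])    # input gate, Eq. (5)
    o = sigmoid(W["o"] @ x + U["o"] @ h_tilde + b["o"])    # output gate, Eq. (7)
    u = np.tanh(W["u"] @ x + U["u"] @ h_tilde + b["u"])    # candidate update, Eq. (8)
    c = i * u                                              # Eq. (9), input term
    for h_r, c_r in zip(child_h, child_c):
        f_r = sigmoid(W["f"] @ x + U["f"] @ h_r + b["f"])  # per-child forget gate, Eq. (6)
        c = c + f_r * c_r                                  # Eq. (9), gated child cells
    h = o * np.tanh(c)                                     # Eq. (10)
    return h, c
```

A full G-NRNN or G-LSTM would evaluate such units bottom-up over the constructed tree and pass the root's hidden state to the classifier described next.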
After constructing the GRNN, we calculate the hidden states of all vertices in T from the leaves to the root, and then use a softmax classifier to predict the label y_t of the target vertex v_t from its hidden state h_t (see Eq. 11):

    P_\theta(y_t \mid v_t, G, X) = \mathrm{softmax}(W^{(s)} h_t + b^{(s)})    (11)
    \hat{y}_t = \arg\max_{y_t \in L} P_\theta(y_t \mid v_t, G, X)             (12)

The cross-entropy J(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(y_t \mid v_t, G, X) is used as the cost function, where N is the number of vertices in the training set.
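The classification head in Eqs. (11)-(12) and the cross-entropy cost can be sketched as follows; here h_t stands for the root hidden state produced bottom-up by the recursive units above, and the parameter shapes (a single weight matrix W_s and bias b_s over |L| classes) are a natural reading of Eq. (11) rather than a quotation of the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(h_t, W_s, b_s):
    """Eqs. (11)-(12): class distribution over L from the root hidden state,
    followed by an argmax to pick the predicted label."""
    p = softmax(W_s @ h_t + b_s)
    return int(np.argmax(p)), p

def cross_entropy(probs, gold):
    """J(theta): average negative log-likelihood of the gold labels over the
    training set, where probs[t] is the predicted distribution for vertex t."""
    return -float(np.mean([np.log(p[y]) for p, y in zip(probs, gold)]))
```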
Experimental Setup

To verify the effectiveness of our approach, we have conducted experiments on four datasets and compared our approach with three baseline methods. We describe the datasets, the baseline methods, and the experimental settings below.

Datasets

We have tested our approach on four real-world network datasets.

• Cora (McCallum et al. 2000) is a citation network dataset composed of 2708 scientific publications and 5429 citations between publications. All publications are classified into seven classes: Rule Learning (RU), Genetic Algorithms (GE), Reinforcement Learning (RE), Neural Networks (NE), Probabilistic Methods (PR), Case Based (CA) and Theory (TH).

• Citeseer (Giles, Bollacker, and Lawrence 1998) is another citation network dataset, larger than Cora. Citeseer is composed of 3312 scientific publications and 4723 citations. All publications are classified into six classes: Agents, AI, DB, IR, ML and HCI.

• WebKB (Craven et al. 1998) is a website network collected from four computer science departments in different universities, which consists of 877 web pages and 1608 hyper-links between web pages. All web pages are classified into five classes: faculty, students, project, course and other.

• WebKB-sim is a network dataset generated from WebKB by connecting each vertex to its top 3 most similar vertices according to the cosine similarity of their feature vectors (Baeza-Yates, Ribeiro-Neto, and others 1999). We use the same feature vectors as the ones in WebKB. This dataset is used to demonstrate the effectiveness of our framework on datasets which may not have explicit relationships represented as edges between vertices, but can be treated as graphs whose edges are based on some metric such as the similarity of vertices.

We use the abstracts of publications in Cora and Citeseer, and the contents of web pages in WebKB, to generate the features of vertices. For the above datasets, all words are stemmed first, then stop words and words with document frequency less than 10 are discarded. A dictionary is generated by including all the remaining words; the resulting dictionaries contain 1433, 3793, and 1703 words for Cora, Citeseer and WebKB, respectively. Each vertex is represented by a bag-of-words vector where each dimension indicates the absence or occurrence of a word in the dictionary of the corresponding dataset¹.

¹ Cora, Citeseer and WebKB can be downloaded from LINQS. We will publish WebKB-sim along with our code.
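The description of WebKB-sim leaves the exact construction details open (for example, whether the top-3 edges are kept directed). The sketch below shows one plausible way to derive such a similarity graph from the bag-of-words feature matrix; it is an assumption for illustration rather than the authors' published procedure.

```python
import numpy as np

def similarity_graph(features, k=3):
    """Link each vertex to its top-k most similar vertices by cosine similarity,
    one way to build a WebKB-sim style graph from bag-of-words feature vectors.

    features : (N, V) matrix of vertex feature vectors
    returns  : dict vertex -> list of k neighbour indices
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.maximum(norms, 1e-12)   # avoid division by zero
    sims = normed @ normed.T                       # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)                # exclude self-similarity
    return {v: list(np.argsort(-sims[v])[:k]) for v in range(features.shape[0])}
```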
Baseline Methods

We have implemented the following three baseline methods:

• Logistic Regression (LR) (Hosmer Jr and Lemeshow 2004) predicts the label of a vertex using its own attributes through a logistic regression model.

• Iterative classification approach (ICA) (Lu and Getoor 2003; London and Getoor 2014) utilizes the combination of link structure and vertex features as the input of a statistical machine learning model. We use two variants of ICA: ICA-binary uses the occurrence of the labels of neighbouring vertices, and ICA-count uses the frequency of the labels of neighbouring vertices.

• Label propagation (LP) (Wang and Zhang 2008; London and Getoor 2014) uses a statistical machine learning model to give a label probability for each vertex, and then propagates the label probabilities to all its neighbours. The propagation steps are repeated until all label probabilities converge.

To make the experiments consistent, logistic regression is used for all statistical machine learning components in ICA and LP. We run 5 iterations for each ICA experiment and 20 iterations for each LP experiment².

² According to our preliminary experiments, LP converges slower than ICA.

Experimental Settings

In our experiments, we split each dataset into two parts, a training set and a testing set, with different proportions (70% to 95% for training). For each proportion setting, we randomly generate 5 pairs of training and testing sets. For each experiment on a pair of training and testing sets, we run 10 epochs on the training set and record the highest Micro-F1 score (Baeza-Yates, Ribeiro-Neto, and others 1999) on the testing set. We then report the averaged results of the experiments with the same proportion setting. According to preliminary experiments, the learning rate is set to 0.1 for LR, ICA and LP, and to 0.01 for all GRNN models. We empirically set the number of hidden states to 200 for all GRNN models. Adagrad (Duchi, Hazan, and Singer 2011) is used as the optimization method in our experiments.

Figure 3: Comparison of G-NRNN and G-LSTM on four datasets: (a) Cora G-NRNN, (b) Citeseer G-NRNN, (c) WebKB G-NRNN, (d) WebKB-sim G-NRNN, (e) Cora G-LSTM, (f) Citeseer G-LSTM, (g) WebKB G-LSTM and (h) WebKB-sim G-LSTM. Each panel plots the Micro-F1 score (%) against the training proportion for tree depths d0, d1 and d2.

Figure 4: Comparison of G-LSTM with LR, LP and ICA on four datasets: (a) Cora, (b) Citeseer, (c) WebKB and (d) WebKB-sim. Each panel plots the Micro-F1 score (%) against the training proportion.

Results and Discussion

Model and Parameter Selection

Figure 3 illustrates the performance of G-NRNN and G-LSTM on the four datasets. We use G-NRNN_di and G-LSTM_di to refer to G-NRNN and G-LSTM over trees of depth i, respectively, where i = 0, 1, 2.

For the experiments on G-NRNN and G-LSTM over trees of different depths, d1 and d2 outperform d0 in most cases³. In particular, the experiments with d1 and d2 perform better than d0 by more than 2% on Cora, and d1 and d2 enjoy a consistent improvement over d0 on Citeseer and WebKB-sim. The performance difference is also obvious on WebKB when the training proportion is larger than 85%. This means that introducing neighbouring vertices can improve classification performance, and that more neighbouring information can be obtained by increasing the depth of the trees. Using the same RNU setting, d2 outperforms d1 in most experiments on Cora, Citeseer and WebKB-sim. However, for WebKB, d2 does not always outperform d1. That is to say, introducing more layers of vertices may help improve performance, while the choice of the best tree depth depends on the application.

³ When d = 0, each constructed tree contains only one vertex.

Table 1: Micro-F1 score (%) of the G-LSTM model with different pooling strategies on Cora, Citeseer, WebKB and WebKB-sim.

Method      Pooling strategy   Cora    Citeseer   WebKB   WebKB-sim
G-LSTM_d1   sum                83.05   74.81      86.21   87.42
G-LSTM_d1   mean               84.18   74.77      86.21   87.58
G-LSTM_d1   max                83.83   74.89      87.12   87.42
G-LSTM_d2   sum                84.03   75.05      85.61   87.42
G-LSTM_d2   mean               84.47   75.33      86.21   87.42
G-LSTM_d2   max                84.72   75.45      86.06   87.73

In Table 1, we compare the three different pooling strategies used in G-LSTM⁴: sum, mean and max pooling. We use 85% of each dataset for training here. In general, mean and max outperform sum, which is the strategy used in (Tai, Socher, and Manning 2015). This is probably because the number of neighbours of a vertex can be very large, and summing them up can make \tilde{h} large in some extreme cases. Max slightly outperforms mean in our experiments, probably because max pooling can select the most influential information from the child vertices, which filters out noise to some extent.

⁴ As G-NRNN gives similar results, we only illustrate results for G-LSTM here.

Baseline Comparison

We compare our approach with the baseline methods on the four network datasets. As our models with d2 provide better performance, we choose G-NRNN_d2 and G-LSTM_d2 as the representative models.

In Figures 4.a and 4.b, we compare our approach with the baseline methods on the two citation networks, Cora and Citeseer. The collective vertex classification approaches, i.e. LP, ICA, G-NRNN and G-LSTM, largely outperform LR, which only uses the attributes of the target vertex. Both G-NRNN_d2 and G-LSTM_d2 consistently outperform all baseline methods on the citation networks, which indicates the effectiveness of our approach. In Figures 4.c and 4.d, we compare our approach with the baseline methods on WebKB and WebKB-sim. For WebKB, our method obtains competitive results in comparison with ICA. LP works worse than LR on WebKB, where its Micro-F1 score is less than 0.7; for this reason, it is not presented in Figure 4.c, and we discuss it in detail in the next subsection. For WebKB-sim, our approaches consistently outperform all baseline methods, and G-LSTM_d2 outperforms G-NRNN_d1 when the training proportion is larger than 80%.

In general, G-LSTM_d2 outperforms G-NRNN_d2. This is likely due to LSTMU's capability of memorizing information using memory cells. G-LSTM can thus better capture correlations between representations with long dependencies (Hochreiter and Schmidhuber 1997).

Dataset Comparison

To analyse the co-occurrence of neighbouring labels, we compute the transition probability from target vertices to their neighbouring vertices. We first calculate the label co-occurrence matrix M^d, where M^d_{i,j} indicates the number of co-occurrences of label l_i of target vertices v_i and label l_j of vertices v_j that are d steps away. We then obtain a transition probability matrix T^d, where T^d_{i,j} = M^d_{i,j} / (\sum_k M^d_{i,k}). The heat maps of T^d on the four datasets are shown in Figure 5.

Figure 5: Distribution of label co-occurrence for Cora, Citeseer, WebKB and WebKB-sim, where d = 1, 2.
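A small sketch of how the co-occurrence matrix M^d and the row-normalised transition matrix T^d can be computed; the enumeration of (target, d-step neighbour) pairs is left abstract, and the helper is an illustrative assumption rather than the authors' analysis script.

```python
import numpy as np

def transition_matrix(labels, pairs, num_classes):
    """Build M^d and T^d from (target, d-step neighbour) vertex pairs.

    labels      : labels[v] is the class index of vertex v
    pairs       : iterable of (v_i, v_j) pairs where v_j is d steps away from v_i
    num_classes : |L|
    """
    M = np.zeros((num_classes, num_classes))
    for vi, vj in pairs:
        M[labels[vi], labels[vj]] += 1
    row_sums = np.maximum(M.sum(axis=1, keepdims=True), 1)  # avoid dividing by zero
    T = M / row_sums                                         # T_ij = M_ij / sum_k M_ik
    return M, T
```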
For Cora and Citeseer, neighbouring vertices tend to share the same labels. When d increases to 2, the labels are still tightly correlated. That is probably why ICA, LP, G-NRNN and G-LSTM all work well on Cora and Citeseer. In this situation, GRNN integrates the features of vertices d steps away, which may directly help classify a target vertex. For WebKB, the correlation of labels is not clear: a label can be strongly related to more than two labels, e.g. student connects to course, project and student. Introducing vertices that are more steps away makes the correlation even worse for WebKB, e.g. all labels become most related to student. In this situation, LP totally fails, while ICA can learn correlations between labels that are not the same, i.e. student may relate to course instead of student itself. For this dataset, GRNN still achieves results competitive with the best baseline approach. For WebKB-sim, although student is still the label with the highest frequency, the correlation between labels is clearer than in WebKB, i.e. project relates to project and student. That is probably the reason why the performance of our approach is good on WebKB-sim for both settings, and why G-LSTM_d2 achieves better results than G-NRNN_d2 when the training proportion is larger.

Conclusions and Future Work

In this paper, we have presented a graph-based recursive neural network framework (GRNN) for vertex classification on graphs. We have compared two recursive units, NRNU and LSTMU, within this framework. It turns out that LSTMU works better than NRNU in most experiments. Finally, our proposed methods outperformed several state-of-the-art statistical machine learning based methods.

In the future, we intend to extend this work in several directions. We aim to apply GRNN to large-scale graphs. We also aim to improve the efficiency of GRNN and conduct a time complexity analysis.

References

Angles, R., and Gutierrez, C. 2008. Survey of graph database models. ACM Computing Surveys (CSUR) 40(1):1.

Baeza-Yates, R.; Ribeiro-Neto, B.; et al. 1999. Modern Information Retrieval, volume 463. ACM Press New York.

Chakrabarti, S.; Dom, B.; and Indyk, P. 1998. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD Record, volume 27, 307–318. ACM.

Craven, M.; DiPasquo, D.; Freitag, D.; and McCallum, A. 1998. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence, 509–516. American Association for Artificial Intelligence.

Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Giles, C. L.; Bollacker, K. D.; and Lawrence, S. 1998. Citeseer: An automatic citation indexing system. In Proceedings of the 3rd ACM Conference on Digital Libraries, 89–98. ACM.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Hosmer Jr, D. W., and Lemeshow, S. 2004. Applied Logistic Regression. John Wiley & Sons.

London, B., and Getoor, L. 2014. Collective classification of network data. Data Classification: Algorithms and Applications 399.

Lu, Q., and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning, volume 3, 496–503.
McCallum, A. K.; Nigam, K.; Rennie, J.; and Seymore, K. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3(2):127–163.

Monner, D. D., and Reggia, J. A. 2013. Recurrent neural collective classification. IEEE Transactions on Neural Networks and Learning Systems 24(12):1932–1943.

Myaeng, S. H., and Lee, M.-h. 2000. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval. ACM.

Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710. ACM.

Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural Networks 61:85–117.

Socher, R.; Lin, C. C.; Manning, C.; and Ng, A. Y. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, 129–136.

Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. ACL.

Tang, J.; Chang, S.; Aggarwal, C.; and Liu, H. 2015. Negative link prediction in social media. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 87–96. ACM.

Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432.

Wang, F., and Zhang, C. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20(1):55–67.

Zhang, J.; Liu, B.; Tang, J.; Chen, T.; and Li, J. 2013. Social influence locality for modeling retweeting behaviors. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, volume 13, 2761–2767. Citeseer.