
Fast and Exact Top-k Search for Random Walk with Restart

Yasuhiro Fujiwara, Makoto Nakatsuji, Makoto Onizuka, Masaru Kitsuregawa
NTT Cyber Space Labs, NTT Cyber Solutions Labs, The University of Tokyo
{fujiwara.yasuhiro, nakatsuji.makoto, onizuka.makoto}@lab.ntt.co.jp, [email protected]
arXiv:1201.6566v1 [cs.DB] 31 Jan 2012

ABSTRACT

Graphs are fundamental data structures and have been employed for centuries to model real-world systems and phenomena. Random walk with restart (RWR) provides a good proximity score between two nodes in a graph, and it has been successfully used in many applications such as automatic image captioning, recommender systems, and link prediction. The goal of this work is to find nodes that have top-k highest proximities for a given node. Previous approaches to this problem find nodes efficiently at the expense of exactness. The main motivation of this paper is to answer, in the affirmative, the question, 'Is it possible to improve the search time without sacrificing the exactness?'. Our solution, K-dash, is based on two ideas: (1) It computes the proximity of a selected node efficiently by sparse matrices, and (2) It skips unnecessary proximity computations when searching for the top-k nodes. Theoretical analyses show that K-dash guarantees result exactness. We perform comprehensive experiments to verify the efficiency of K-dash. The results show that K-dash can find top-k nodes significantly faster than the previous approaches while it guarantees exactness.

1. INTRODUCTION

Recent advances in social and information science have shown that linked data pervade our society and the natural world around us [24]. Graphs have become increasingly important for representing complicated structures and schema-less data such as is generated by Wikipedia (footnote 1), Freebase (footnote 2), and various social networks [10]. Due to the extensive applications of graph models, vast amounts of graph data have been collected and graph databases have attracted significant attention in the database community. In recent years, various approaches have been proposed to deal with graph-related research problems such as subgraph search [24], shortest-path query [4], pattern match [5], and graph clustering [25] to get insights into graph structures.

With the rapidly increasing amounts of graph data, searching graph databases to identify high proximity nodes, where a proximity measure is used to rank nodes in accordance with relevance to a query node [13], has become an important research problem. Many papers in the database community have addressed node-to-node proximities [20, 1, 13]. For example, Sun et al. proposed a novel proximity measure called PathSim which produces good similarity qualities given heterogeneous information networks [20]. Simrank++, proposed by Antonellis et al., finds high proximity nodes effectively for historical click data [1]. One of the most successful techniques known to the academic communities is based on random walk with restart (RWR) [19]. This is because the proximity defined by RWR yields the following benefits: (1) it captures the global structure of the graph [8], and (2) it captures multi-facet relationships between two nodes unlike traditional graph distances [21].

However, computing proximities by RWR is expensive. Consider a random particle that starts from query node q. The particle iteratively moves to one of its neighbors with a probability that is proportional to the edge weights. Additionally, in each step there is a probability that it will restart at node q. The probability of each node changes over time during the iterations as the above procedure is applied recursively. As a result, the steady-state probability can be obtained. The proximity of node u with respect to node q is defined as the steady-state probability with which the particle stays at node u.

Although RWR has been receiving increasing interest from many applications [15, 11, 12, 19], its excessive CPU time led to the introduction of approximate approaches [22, 19]. These approaches have the advantage of speed at the expense of exactness. However, approximate algorithms are not well adopted. This is because it is difficult for approximate algorithms to enhance the quality of real applications. Therefore, we address the following problem in this paper:

Problem (Top-k search for RWR).
Given: The query node q, and the required number of answer nodes K.
Find: Top K nodes with the highest proximities with respect to node q exactly.

To the best of our knowledge, our approach to finding top-k nodes in RWR is the first solution to achieve both exactness and efficiency at the same time.

Footnote 1: http://www.wikipedia.org/
Footnote 2: http://www.freebase.com/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th-31st 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 5. Copyright 2012 VLDB Endowment 2150-8097/12/01...$10.00.

1.1 Contributions

We propose a novel method called K-dash that can efficiently find top-k nodes in RWR. In order to reduce search cost, (1) we use sparse matrices to compute the exact proximity of a selected node, and (2) we prune low proximity nodes by estimating the proximities of those nodes without computing their exact proximities. K-dash has the following attractive characteristics based on the above ideas:

• Exact: K-dash does not sacrifice accuracy even though it exploits an estimation-based approach to prune unlikely nodes; it returns top-k nodes without error unlike the previous approximate approaches.

• Efficient: K-dash practically requires $O(n+m)$ time where n and m are the number of nodes and edges, respectively. By comparison, solutions based on existing approximate algorithms are expensive; they need $O(n^2)$ time to find the answer nodes. Note that $m \ll n^2$ in practice [14].

• Nimble: K-dash practically needs $O(n+m)$ space while the previous approaches need $O(n^2)$ space. The required memory space of K-dash is smaller than that of the previous approximate approaches.

• Parameter-free: The previous approaches require careful setting of the inner-parameter [22], since it impacts the search results. K-dash, however, is completely automatic; this means it does not require the user to set any inner-parameters.

While RWR has been used in many applications, it has been difficult to utilize it due to its high computational cost. However, by providing exact solutions in a highly efficient manner, K-dash will allow many more RWR-based applications to be developed in the future.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 overviews the background of this work. Section 4 introduces the main ideas of K-dash. Section 5 gives theoretical analyses of K-dash. Section 6 reviews the results of our experiments. Section 7 provides our brief conclusion.

2. RELATED WORK

Node-to-node proximity is an important property. One of the most popular proximity measurements is RWR, and researchers of data engineering have published many papers on RWR and its applications [15, 11, 12, 19, 22]. With our approach, many applications can be processed more efficiently.

Application. Automatic image captioning is a technique which automatically assigns caption words to a query image. Pan et al. proposed a graph-based automatic caption method in which images and caption words are treated as nodes in a mixed media graph [15]. They utilized RWR to estimate the correlations between the query image and the captions. They reported that their method provided 10 percent higher captioning accuracy than a fine-tuned method.

Recommendation systems aim to provide personalized recommendations of items to a user. One recent recommendation technique proposed by Konstas et al. is based on RWR over a graph that connects users to tags and tags to items, where the probabilities of relevance for items are given by RWR proximities; high interest items would have high proximities. They incorporate additional information such as friendship and social tagging embedded in social knowledge to improve the accuracy of item recommendations [11]. They also applied a standard collaborative filtering method as a baseline, and showed that their method was superior.

The question of 'which new interactions among social network members are more likely to occur in the near future?' is being avidly pursued by many researchers. Schifanella et al. proposed a metadata based approach for this problem [17]. Their idea is that members with similar interests are more likely to be friends, so semantic similarity measures among members based on their annotation metadata should be predictive of social links. Liben-Nowell et al. explored this question by using RWR [12]; the probability of a future collaboration between authors is computed from RWR proximity. Their approach is based on the observation that the topology of the network can suggest many new collaborations. For example, two researchers who are close in the network will have many colleagues in common, and thus are more likely to collaborate in the near future. They took the RWR-based approach since it can capture the global structure of the graph. They showed that RWR provides better link predictions than the random prediction approach.

Approximation method. Even though RWR is very useful, one problem is its large CPU time. Sun et al. observed that the distribution of RWR proximities is highly skewed. Based on this observation, combined with the fact that many real graphs have block-wise/partition structure, they proposed an approximation approach for RWR [19]; they performed RWR only on the partition that contains the query node. All nodes outside the partition are simply assigned RWR proximities of 0. In other words, their approach outputs a local estimation of RWR proximities.

Tong et al. proposed a fast approximation solution for RWR. They designed B_LIN and its derivative, NB_LIN [22]. These methods take advantage of the block-wise structure and linear correlations in the adjacency matrix of real graphs, using the Sherman-Morrison Lemma [16] and the singular value decomposition (SVD). For NB_LIN in particular, they proved an error bound. The experimental results showed that their methods outperformed the approximation method of Sun et al. [19]. Their methods require $O(n^2)$ space and $O(n^2)$ time. This is because their methods utilize $O(n^2)$ size matrices to approximate the adjacency matrix for proximity computation.

3. PRELIMINARY

In this section, we formally define the notations and introduce the background of this paper. Table 1 lists the main symbols and their definitions.

Table 1: Definition of main symbols.

Symbol   Definition
q        Query node
K        Number of answer nodes
n        Number of nodes
m        Number of edges
c        The restart probability
p        n × 1 vector, $p_u$ is the proximity of node u
q        n × 1 vector, the q-th element is 1 and all others are 0
A        The column normalized adjacency matrix

Measuring the proximity of two nodes in a graph can be achieved using RWR. Starting from a node, a RWR is performed by iteratively following an edge to another node at each step. Additionally, at every step, there is a non-zero probability, c, of returning to the start node. Let p be a column vector where $p_u$ denotes the probability that the random walk is at node u. q is a column vector of zeros with the element corresponding to the starting node q set to 1, i.e. $q_q = 1$. Also let A be the column normalized adjacency matrix of the graph. In other words, A is the transition probability table where its element $A_{uv}$ gives the probability of node u being the next state given that the current state is node v. The steady-state, or stationary, probabilities for each node can be obtained by recursively applying the following equation until convergence:

$$p = (1 - c) A p + c q \tag{1}$$

where the convergence of the equation is guaranteed [18]. The steady-state probabilities give us the long term visit rate of each node given a bias toward query node q. Therefore, $p_u$ can be considered as a measure of proximity of node u with respect to node q.

This method needs $O(mt)$ time where t is the number of iteration steps. This incurs excessive CPU time for large graphs, and a fast solution is demanded as illustrated by the statement 'its on-line response time is not acceptable in real life situations' made in a previous study [11]. It should be emphasized that shortening response time is critical to enhancing business success in real web applications (footnote 3).

Footnote 3: http://www.keynote.com/downloads/ZonaNeedForSpeed.pdf

4. PROPOSED METHOD

In this section, we explain the two main ideas underlying K-dash. The main advantage of our approach is that it exactly and efficiently finds the top-k highest proximity nodes for RWR. First, we give an overview of each idea and then a full description. Proofs of lemmas or theorems in this section are shown in Appendix A.

4.1 Ideas behind K-dash

Our solution is based on the following two approaches:

Sparse matrices computation. The proximities for a query node are the steady-state probabilities which are computed by recursive procedures as described in Section 3. This approach requires high computation time because it computes the proximities of all the nodes in the graph. Our idea is simple; we compute the proximities of only the selected nodes needed to find the top-k nodes, instead of computing the proximities of all nodes.

As described in Section 4.2.1, the proximities of selected nodes are naively computed by the inverse matrix that can be directly obtained from Equation (1). Therefore, if we precompute and store this inverse matrix, we can get the proximities efficiently. However, this approach is impractical when the dataset is large, because it requires quadratic space to hold the inverse matrix.

We introduce an efficient approach that can compute the proximities from sparse matrices. In the precomputing process, we reorder nodes and compute the inverse matrices of the lower/upper triangulars obtained by LU decomposition as described in Section 4.2.2. A lower/upper triangular matrix is a matrix where all the elements above/below the main diagonal are zero. As a result, the inverse matrices are sparse, and we can compute the proximities of the selected nodes with low memory consumption by using the adjacency-list representation [6].

This new idea has the following two major advantages besides the one described above. First, we can compute the proximities exactly. This is because LU decomposition, unlike SVD which is used in the previous methods [22], is not an approximation method. The second advantage is that we can compute the proximities efficiently. This is because we use sparse matrices to compute the proximities.

Tree estimation. Although our sparse matrices approach is able to compute the proximities of selected nodes, we have the following two questions to answer in order to find the top-k nodes: (1) 'What nodes should be selected to compute the proximities in the search process?', and (2) 'Can we avoid computing the proximities of unselected nodes?'. The second approach is designed to answer these two questions.

These questions can be answered by estimating what nodes can be expected to have high/low proximities. Our proposal exploits the following observations: the proximity of a node declines as the number of hops from the query node increases, and proximities of unselected nodes can be estimated from computed proximities. Our search algorithm first constructs a single breadth-first search tree rooted at the query node. We compute the proximities of the top-k nearest nodes from the root node to discover answer candidate nodes. We then estimate the proximities of unselected nodes from the proximities of already selected nodes to obtain the upper proximity bound. The time incurred to estimate node proximity is $O(1)$ for each node. In the search process, if the upper proximity bound of a node gives a score lower than the K-th highest proximity of the candidate nodes, the node cannot be one of the top-k highest proximity nodes. Accordingly, unnecessary proximity computations can be skipped.

This estimation allows us to find the top-k nodes exactly while we prune unselected nodes. This means we can safely discard unlikely nodes at low CPU cost. This estimation approach also allows us to automatically determine the nodes for which we compute the proximities. This implies that our approach needs no user-defined inner-parameters.

4.2 Sparse matrices computation

Our first approach is to obtain sparse inverse matrices to compute the proximities of selected nodes efficiently. In this section, we first describe how to compute the proximities by inverse matrices. We then show that obtaining the sparse inverse matrices is an NP-complete problem, and present our approximate approach for the problem.

4.2.1 Proximity computation

From Equation (1), we can obtain the following equation:

$$p = c \{ I - (1 - c) A \}^{-1} q = c W^{-1} q \tag{2}$$

where I represents the identity matrix and $W = I - (1 - c) A$. This equation implies that we can compute the proximities of selected nodes by obtaining the corresponding elements in the inverse matrix $W^{-1}$. However, this approach requires high memory consumption. This is because the inverse matrix $W^{-1}$ would be dense even though the matrix W itself is sparse [16] (in many real graphs, the number of edges is much smaller than the squared number of nodes [14]). That is, this approach requires $O(n^2)$ space.
      We utilize the inverse matrices of lower/upper triangu-             lars to compute the proximities in our approach. Formally,             the following equation gives the proximities for the query node,wherethematrixWisdecomposedtoLUbytheLU (1) Degree (2) Cluster (3) Hybrid decomposition (i.e. W=LU). p=cU−1L−1q (3) Figure 1: Reordering methods. Note that the matrices L−1 and U−1 are lower and upper thesparseinversematricesbylettingupper/leftelementsof triangular, respectively. matrix A be zero. 4.2.2 Inversematricesproblem Weintroducethefollowingthreeapproximationsolutions against the inverse matrices problem: As shown in Equation (3), if we precompute the matrices L−1 and U−1, we can compute proximities of selected Degreereordering. Inthisapproach,wearrangenodesof nodes. However,thisraisesthefollowingquestion: ‘Canthe the given graph in ascending order of degree (the number matricesL−1 andU−1 besparseifmatrixW−1 isdense?’. ofedgesincidenttoanode)andrenamethembytheorder. Our answer to this question is to compute the sparse ma- Low degree nodes have few edges, and the upper/left ele- trices L−1 and U−1 by reordering the columns and rows of mentsofcorrespondingmatrixAareexpectedtobe0with the sparse matrix A. But finding the node order in matrix this approach. A that yields the sparse matrices is NP-complete. Cluster reordering. This approach first divides the given Theorem 1 (Inverse matrices problem). graph into κ partitions by Louvain Method [3], and it ar- Determining the node order that minimizes non-zero ele- ranges nodes according to the partitions. Note that the ments in matrices L−1 and U−1 is NP-complete. numberofpartitions,κ,isautomaticallydeterminedbyLou- vain Method. It then creates new empty κ+1-th partition. Becausetheinverse matricesproblemisNP-complete, Finally, if a node of a partition has an edge to another par- weintroduceanapproximationtoaddressthisproblem. 
Be- tition, it rearranges the node to the κ+1-th partition. As a fore we describe our approaches in detail, we show the ma- result,matrixAwouldbeadoubly-borderedblockdiagonal trixelementsofL−1andU−1canberepresentedbythoseof matrix [16] as shown in Figure 1-(2); elements correspond L and U by forward/backward substitution [16] as follows: to cross-partitions edges would be 0 for κ partitions 4. 0 (i<j) We use the Louvain Method because it is an efficient ap- L−ij1 = 1−/1L/iLjii ik−=1jLikL−kj1 ((ii>=jj)) (4) pmnreeosasascouhfren5ofoadrnedppaairtrttiuitttiioiolninziinensgg.,thiTnehtmeheomdsouednlausreliattryhitay[3t]atshasseerstesheaesretqhumeaalfiinttyy-  0 ∑ (i>j) edgeswithinapartitionandonlyafewbetweenthem. That Ui−j1 = −1/1U/iUjii jk=i+1UikUk−j1 ((ii<=jj)) (5) iesd,gLeso.uvAasinaMreestuhlot,dthreisduacpepsrotahcehnsuhmoubledryoifelcdromsso-rpearstpiatirosne inverse matrices. where the matrixelements ∑of L and U can be represented by those of W by Crout’s algorithm [16] as follows: Hybridreordering. Thisapproachisacombinationofthe degreeandtheclusterreordering. Thatis,wearrangenodes 0 (i<j) byclusterreordering,andwethensortnodesineachparti- Lij = 1 (i=j) (6) tion by their degree. This approach makes matrix A have  1/Ujj Wij− jk−=11LikUkj (i>j) nocross-partitionedgesforκpartitions,andtheupper/left ( ) elements of each partition are expected to be 0.  0 ∑ (i>j) Uij = WWiijj− ik−=11LikUkj ((ii≤≤jj ∩∩ ii≠=11)) (7) pwrhoFexirgiemurzaeetri1ooniallnaupdsptrnraootanec-shzemrfooartretilxhemeAeinnotvbsetraairsneeesdmhboawytrnaibcionevswepheraiotcebhlaeanmpd-, Equation (4),(5), (6), a∑nd (7) imply that elements of L−1, gray, respectively. Algorithm 1, 2, and 3 in Appendix B U−1,L,andUarecomputedfromthecolumnsfromleftto showthedetailsofdegreereordering,clusterreordering,and right, and within each column from top to bottom. For ex- hybrid reordering, respectively. 
ample, element L−ij1 can be computed from the correspond- Owingtothesethreeapproaches,wecaneffectivelyobtain ing upper/left elements of L and L−1, and element Lij can sparsematricesLandU,andthensparsematricesL−1 and becomputedfromthecorrespondingupper/leftelementsof U−1. As demonstrated in Section 6, these approaches can W, L, and U. drasticallyreducethememoryneededtoholdmatricesL−1 Ourapproachesarebasedonthefollowingthreeobserva- and U−1; they have practically linear space complexity in tionsintheabovefourequations: (1)elementsL−ij1andUi−j1 thesizeofedgesinthegivengraphbyusingtheadjacency- wouldbezeroifthecorrespondingupper/leftelementsofL list representation [6]. and U are zero, (2) upper/left elements of L and U would 4 be zero if the corresponding upper/left elements of W are Doubly-bordered block diagonal matrix D is defined by Duv = 0, (u,v)P(u) = P(v),P(u) = κ+1,P(v) = κ+1 where P(u) zero, and (3) the upper/left elements of W would be zero ∀{ | ̸ ̸ ̸ } isthepartitionnumberofnodeu. ifthecorrespondingupper/leftelementsofAarezerosince 5 Foralldatainourexperiments,LouvainMethodcancomputepar- A = (I W)/(1 c). That is, we can effectively compute titionsinafewseconds. − − 445 4.3 Treeestimation The determination of the root node of the tree and the Weintroduceanalgorithmforestimatingtheproximities selection of proximity computation nodes are important in of unselected nodes in the search process effectively and ef- achieving efficient pruning. We determine the query node ficiently. Inthisapproach,anodeisvisitedonebyone,and as the root node, and we visit and select nodes in increas- weestimateitsproximity. Iftheestimatedproximityisnot ing order of layer number. This is because: (1) a few nodes lower than the K-th highest proximity of candidate nodes, which are just a few hops from the query node have high then the node is selected to compute the exact proximity. 
proximities, and (2) we can estimate the proximities of vis- Otherwiseweskipsubsequentexactproximitycomputations ited nodes from those of selected nodes (see Definition 1). of visited nodes. This approach, based on a single breadth- As a result, we can effectively estimate the proximities of firstsearchtree,yieldstheupperboundingscoreestimations visited nodes. of visited nodes. In this section, we first give the notations Iftheestimated proximityofa visitednodeis lowerthan for the estimation, next we formally introduce the estima- theK-thhighestproximityofthecandidatenodes,wetermi- tion, and then our approach to incremental estimation in nate the search process without computing the estimations the search process. ofotherunvisitedandunselectednodes. However,thisraise the following question: ‘Can we find the answer nodes ex- 4.3.1 Notation actly if we terminate the search process?’. To answer this In the search process, we construct a single breadth-first question, we show the following lemma: search tree that is rooted on the query node; thus it forms Lemma 2 (Layer search). If nodes are visited and layer0. The direct neighbors of the root node form layer1. selectedin ascendingorderof layers, p¯ p¯ holds for node u v All nodes that are i hops from the root node form layer i. ≥ u and v such that l l and u,v=q. u v ThesetofnodesinthegraphisdefinedasV,andthesetof ≤ ̸ selected(i.e.,exactproximitycomputed)nodesisdefinedas Lemma 2 implies that the estimated proximity of a vis- V . Thelayernumberofnodeuisdenotedasl . Moreover, ited node can not be lower than that of an unvisited and s u thesetofselectednodespriornodeuwhoselayernumberis unselected node on the same/lower layer. Therefore, if the l isdefinedasV (u),thatisV (u)= v:(v V ) (l = estimatedproximityofavisitednodeislowerthantheK-th lu) . A is thleumaximum elleument in{matri∈x As,∩thavt is highestproximityofthecandidatenodes,allotherunvisited u max A } = max A : i,j V . 
A (u) is the maximum and unselected nodes have lower proximities than the K-th max ij max element from{node u, tha∈t is A} (u)=max A :i V . highest proximities of the candidate nodes. Thus we can max iu Note that both A and A (u) can be{precomp∈ute}d. safely terminate the search process. max max It requires O(1) space to hold A , and it requires O(n) max 4.3.3 Incrementalcomputation space to hold A (u) of all n nodes. max AsdescribedinSection4.3.2,byDefinition1,O(n)timeis 4.3.2 Proximityestimation requiredtocomputetheestimatedproximityforeachnode. We describe the definition of the proximity estimation in Inthissection,weshowourapproachtoefficientlycompute thissection. Wealsoshowthattheestimationgivesavalid the estimated proximity. We assume that node u is visited upperproximitybound. Weestimatetheproximityofnode andselectedimmediatelyafternodeu′inthesearchprocess. u via breadth-first search tree as follows: In other words, we visit and select these nodes in order u′ andu. Inthissection,letp¯ ,p¯ ,andp¯ befirst,second, u,1 u,2 u,3 Definition 1 (Proximity estimation). If node u is andthirdtermsinEquation(8),respectively. Thatis,p¯ = u not the query node (i.e. u̸=q), the following equation gives c′(p¯u,1+p¯u,2+p¯u,3). the definition of proximity estimation of node u, p¯u, to skip We compute the estimation of u as follows: proximity computation in the search process: Definition 2 (Incremenatal update). Forthegiven graph and query node, if u′ = q, we compute the first, sec- ̸ p¯u =c′ pvAmax(v)+ pvAmax(v) ond, and third terms of the estimation of node u from those v∈V∑lu−1(u) v∈∑Vlu(u) (8) of u′ in the search process as follows: where c′ =(1 c)/(1 Auu++cA(u1u)−. v∑∈Vspv)Amax} pp¯¯uu,,12 =={ ppp0¯¯¯uuu′′′,,,122++ppuu′′AAmmaaxx((uu′′)) iooifftthhll((eeuurrww))ii==sseell((uu′′)) (9) − − If node u is the query node (i.e. u=q), p¯ =1. 
{ u p¯ =(p¯ /A p )A u,3 u′,3 max− u′ max It needs O(n) time to compute the estimation for each node if we compute it according to Definition 1. This is If u′ = q, p¯u,1 = pqAmax(q), p¯u,2 = 0, and p¯u,3 = (1 − p )A (u). becauseV (u),V (u),andV allhavesizeofO(n). We, q max however, cluo−m1pute thlue estimations in O(1) time as described Weprovidethefollowinglemmafortheincrementalcom- in Section 4.3.3. putation in the search process: Toshowthepropertyofourproximityestimation, wein- troduce the following lemma: Lemma 3 (Incremenatal update). For node u, the estimated proximity can be exactly computed at the cost of Lemma 1 (Proximity estimation). p¯u pu holds O(1) time by Definition 2. ≥ for node u in the given graph. This property enables K-dash to efficiently compute the This lemma enables us to find the answer nodes exactly. estimated proximity in the search process. 446 4.4 Searchalgorithm efficiently for Personalized PageRank. The definitions of Even though a detailed search algorithm of K-dash is de- RWR and Personalized PageRank are very similar 6. Even scribed in Algorithm 4 in Appendix B.2, we outline our though Avrachenkov et al. also proposed an efficient ap- searchalgorithmasfollowstomakethepaperself-contained: proach for Personalized PageRank based top-k search [2], we compared K-dash to Basic Push Algorithm. This is be- 1. We construct a breadth-first search tree rooted at the cause Basic Push Algorithm uses precomputed proximities query node. of hub nodes to estimate the upper bounding proximities 2. We visit a node in ascending order of tree layer and [7]; this implies Basic Push Algorithm theoretically guar- compute its estimated proximity by Definition 2. antees that the recall of its answer result is always 1 while 3. If the estimated proximity of the visited node is not the approach of Avrachenkov et al. does not. 
Basic Push lowerthantheK-thproximityofthecandidatenodes, Algorithm is an approximate approach and the number of the node can be an answer node. Therefore, we com- answernodesyieldedbythisapproachcanbemorethanK. pute the proximity of the node and return to step 2. Our experiments will demonstrate that: 4. Otherwise, we terminate the search process since the Efficiency: K-dash can outperform the approximate node and other unselected nodes can not be answer • approach by several order of magnitude in terms of nodes (Lemma 2). searchtimefortherealdatasetstested(Section6.1). 5. THEORETICALANALYSIS Exactness: Unlike the approximate approach, which • This section provides a theoretical analysis that confirms sacrifices accuracy, K-dash can find the top-k nodes the accuracy and complexity of K-dash. Let n be the num- exactly (Section 6.2). ber of nodes. Proofs of each theorem in this section are Effectiveness: The components of K-dash, sparse ma- shown in Appendix A. • tricescomputationandtreeestimation,areveryeffec- We show that K-dash finds the top-k highest proximity tive in identifying top-k nodes (Section 6.3). nodes accurately (without fail) as follows: The results of the application of K-dash to a real dataset Theorem 2 (Exact search). K-dash guarantees the are reported in Appendix D. exact answer in finding the top-k highest proximity nodes. We used the following five public datasets in the experiments: Dictionary, Internet, Citation, Social, and Email. We discuss the complexity of the existing approximate The details of datasets are reported in Appendix C. In this algorithmB LINandNB LIN[22]andthenthatofK-dash. section,K-dash representstheresultsoffindingthetopfive Theorem 3 (The approximate algorithm). B LIN nodes with the hybrid reordering approach. We set the andNB LINbothrequireO(n2)spaceandO(n2)timetofind restart parameter, c, at 0.95 as in the previous works [22, the top-k highest proximity nodes. 8]. Weevaluatedthesearchperformancethroughwallclock time. 
AllexperimentswereconductedonaLinuxquad3.33 Theorem 4 (Space complexity of K-dash). K-dash GHz Intel Xeon server with 32GB of main memory. We requiresO(n2)spacetofindthetop-khighestproximitynodes. implemented our algorithms using GCC. Theorem 5 (Time complexity of K-dash). K-dash 6.1 EfficiencyofK-dash requiresO(n2)timetofindthetop-khighestproximitynodes. WeassessedthesearchtimeneededforK-dash, NB LIN, Theorems 3, 4, and 5 show that K-dash has, in the worst andBasicPushAlgorithm. Figure2showstheresults. The case, the same space and time complexities as the previ- results of K-dash are referred to as K-dash(K), where K ous approximate approaches. However, the space and time is the number of answer nodes. We set the target rank of complexities of K-dash is, in practice, O(n+m), which are SVD to 100 and 1,000 (referred to as NB LIN(100) and smallerthanthoseofthepreviousapproximateapproaches. NB LIN(1,000)). Note that the number of answer nodes, This is because the number of non-zero elements in the in- K,hasnoimpactinNB LINsinceitcomputesapproximate versematricesisO(m)asshowninthenextsection. Inthe proximityscoresforallnodes. BPA(K)indicatestheresults nextsection,weconfirmtheeffectivenessofourapproaches of Basic Push Algorithm where K is the number of answer by presenting the results of extensive experiments. nodes and the number of hub nodes is set to 1,000. Thisfigureshowsthatourmethodismuchfasterthanthe 6. EXPERIMENTALEVALUATION previousapproachesunderallconditionsexamined. Specifi- cally,K-dashismorethan68,000timesfasterthanNB LIN We performed experiments to demonstrate K-dash’s ef- and 690,000 times faster than Basic Push Algorithm. As fectiveness in a comparison to NB LIN by Tong et al. [22] described in Section 5, NB LIN takes O(n2) time to com- and Basic Push Algorithm by Gupta et al. [7]. NB LIN puteproximities. 
was selected since, as reported in [22], it outperforms the iterative approach and the approximation approach by Sun et al. [19], and it yields similar results to B_LIN, which was also proposed by Tong et al., in all of our datasets. NB_LIN has one parameter: the target rank of the low-rank approximation. We used SVD as the low-rank approximation as proposed by Tong et al. Note that NB_LIN can compute proximities quickly at the expense of exactness. Basic Push Algorithm is an approach that can find top-k nodes

6 In Personalized PageRank, a random particle returns to the start node set, not the start node.

Even though K-dash theoretically requires O(n^2) time as shown in Theorem 5, it does not, in practice, take O(n^2) time to find the answer nodes. This is because the number of non-zero elements in the inverse matrices is O(m) in practice as shown in Section 6.3.1. That is, the time complexity of K-dash is, in practice, O(n+m), because it takes O(n+m) time for breadth-first tree construction and O(m) time for proximity computations. Therefore, our approach can find the answer nodes more efficiently than the previous approaches.

Figure 2: Efficiency of K-dash. (Wall clock time [s] per dataset.)
Figure 3: Precision of NB_LIN. (Precision against target rank of SVD / number of hub nodes.)
Figure 4: Wall clock time of NB_LIN. (Wall clock time [s] against target rank of SVD / number of hub nodes.)
Figure 5: Effect of reordering approaches. (Ratio of the number of non-zero elements per dataset.)

proximities of selected nodes in the search process. The number of non-zero elements in these inverse matrices is a factor that influences the search and memory cost.
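The role of the inverse triangular matrices in computing a single proximity can be illustrated with a small sketch. It assumes the RWR recurrence p = (1 - c)Ap + c e_q with a column-normalized matrix A, so that p = c U^{-1} L^{-1} e_q where LU = I - (1 - c)A. The function names (`lu_no_pivot`, `rwr_iterative`, `proximity`) and the toy graph are illustrative assumptions, and dense NumPy arrays stand in for the sparse structures a real implementation would need.

```python
import numpy as np

def lu_no_pivot(W):
    """Doolittle LU factorization without pivoting, so that W = L @ U."""
    n = W.shape[0]
    L = np.eye(n)
    U = W.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = U[i, k] / U[k, k]
            L[i, k] = factor
            U[i, k:] -= factor * U[k, k:]
            U[i, k] = 0.0  # clear round-off below the diagonal
    return L, U

def rwr_iterative(A, q, c, iters=200):
    """Reference solution: iterate p = (1 - c) * A @ p + c * e_q."""
    e_q = np.zeros(A.shape[0])
    e_q[q] = 1.0
    p = e_q.copy()
    for _ in range(iters):
        p = (1 - c) * (A @ p) + c * e_q
    return p

# Toy undirected graph with 4 nodes (hypothetical data).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
A = adj / adj.sum(axis=0)          # column-normalized adjacency matrix
c = 0.95                           # restart probability used in Section 6
W = np.eye(4) - (1 - c) * A
L, U = lu_no_pivot(W)
L_inv, U_inv = np.linalg.inv(L), np.linalg.inv(U)

def proximity(u, q):
    """Proximity of node u for query q: c * (row u of U^-1) . (column q of L^-1)."""
    return c * (U_inv[u, :] @ L_inv[:, q])

p_lu = np.array([proximity(u, 0) for u in range(4)])
p_ref = rwr_iterative(A, q=0, c=c)
```

Each proximity needs only one row of U^{-1} and one column of L^{-1}, which is why the number of non-zero elements in these matrices drives both the search cost and the memory cost.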
6.2 Exactness of the search results
One major advantage of K-dash is that it guarantees the exact answer, but this raises the following simple question: ‘How successful are the approximate approaches in providing the exact answer even though they sacrifice exactness?’.
To answer this question, we conducted comparative experiments. We used precision as the metric of accuracy. Precision is the fraction of answer nodes among top-k results by each approach that match those of the original iterative algorithm. Figure 3 shows the precision and Figure 4 shows the wall clock time. In this experiment, we used various target rank settings and various numbers of hub nodes for NB_LIN and Basic Push Algorithm, respectively. We used Dictionary as the dataset in these experiments.
As we can see from Figure 3, the precision of K-dash is 1 because it finds the top-k nodes without fail. NB_LIN, on the other hand, has lower precision. Figure 4 shows that K-dash greatly reduces the computation time while it guarantees the exact answer.

We take three approaches to reduce the number of non-zero elements as described in Section 4.2.2. Accordingly, we evaluated the ratio of the number of non-zero elements to that of edges in each reordering approach. Figure 5 shows the results. In this figure, Random represents the results achieved when nodes are arranged in random order.
As we can see from the figure, our approaches (Degree, Cluster, and Hybrid reordering) yield many fewer non-zero elements than random reordering. This figure also indicates that our approach makes the number of non-zero elements near to that of the edges of the given graph in all datasets if we adopt the Hybrid reordering approach. That is, the space complexity of K-dash is O(m). Owing to the small size of the inverse matrices, K-dash achieves excellent search performance as shown in Figure 2.

6.3.2 Precomputation time
Our approach uses the inverse matrices of lower/upper triangulars in the search process. That is, these matrices
must be computed in the precomputing process. Figure 6 shows the process time in the precomputing process.
Figure 6 indicates that our reordering approach can enhance the process time; it is up to 140 times faster than the Random reordering approach. There are two reasons for this result. The first is that the inverse matrices have a sparse data structure as shown in Figure 5. The second is that elements of the inverse matrices which correspond to cross partition edges between the 1st to κ-th partitions are zero due to Equations (4), (5), (6), and (7) (footnote 7). Therefore we can effectively skip the computations of these elements. As a result, we can efficiently compute the inverse matrices. Additional experiments confirmed that our approach needs less precomputation time than the other approaches due to its sophisticated sparse data structure. For example, NB_LIN needs several weeks to compute SVD because SVD requires O(n^3) time, while our approach needs several hours.

The efficiency of NB_LIN depends on the parameters used. Figure 3 and Figure 4 show that NB_LIN forces a trade-off between speed and accuracy. That is, as the target rank decreases, the wall clock time decreases but the precision decreases. NB_LIN does not guarantee that the answer results are accurate, and so can miss the exact top-k nodes. K-dash also computes estimated proximities, but unlike the approximate approach, K-dash does not discard any answer nodes in the search process.
Figure 3 shows that the precision of Basic Push Algorithm is almost constant against the number of hub nodes. Figure 4 indicates that the search speed of the approach increases as the number of hub nodes increases. This is because Basic Push Algorithm utilizes precomputed proximities of hub nodes to estimate the proximities of a query node. Figures 3 and 4 show that our approach is much faster than the previous approaches while guaranteeing exactness.
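The precision metric used in Section 6.2 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `precision_at_k` and the example node IDs are hypothetical, and dividing by the size of the approximate answer (which may exceed K for Basic Push Algorithm) is an assumption about how the fraction is normalized.

```python
def precision_at_k(approx_answer, exact_answer):
    """Fraction of nodes returned by an approach that also appear
    in the exact top-k answer of the iterative algorithm."""
    if not approx_answer:
        return 0.0
    exact = set(exact_answer)
    hits = sum(1 for node in approx_answer if node in exact)
    return hits / len(approx_answer)

# Hypothetical top-5 answers: three of the five approximate nodes are correct.
exact_top5 = [3, 7, 1, 9, 4]
approx_top5 = [3, 1, 8, 4, 2]
score = precision_at_k(approx_top5, exact_top5)
```

An exact method returns the same node set as the iterative algorithm, so its precision under this definition is 1.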
6.3 Effectiveness of each approach
In the following experiments, we examine the effectiveness of the two core techniques of K-dash: sparse matrices computation and tree estimation.

6.3.1 Reordering approach
K-dash utilizes the inverse matrices of lower/upper triangulars obtained by LU decomposition to compute the prox-

6.3.3 Tree estimation
As mentioned in Section 4.3, K-dash skips unnecessary proximity computations in the search process. To show the effectiveness of this idea, we removed the pruning technique from K-dash, and reexamined the wall clock time. Figure 7 shows the result. K-dash without the pruning technique is abbreviated to Without pruning in this figure.

7 For Dictionary, the improvement yielded by our approach was marginal. This is because the Louvain Method divides this dataset into one large partition and many small partitions which limits the effectiveness of our approach.

Figure 6: Precomputation time. (Wall clock time [s] per dataset for Degree, Cluster, Hybrid, and Random reordering.)
Figure 7: Effect of tree estimation. (Wall clock time [s] per dataset for K-dash and Without pruning.)

[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast Unfolding of Communities in Large Networks. J. Statistical Mechanics: Theory and Experiment, 2008(10), 2008.
[4] E. P. F. Chan and H. Lim. Optimization and Evaluation of Shortest Path Queries. VLDB J., 16(3):343–369, 2007.
[5] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast Graph Pattern Matching. In ICDE, pages 913–922, 2008.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2009.
[7] M. S. Gupta, A. Pathak, and S. Chakrabarti. Fast Algorithms for Top-k Personalized PageRank Queries. In WWW, pages 1225–1226, 2008.
[8] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang. Manifold-Ranking Based Image Retrieval. In ACM Multimedia, pages 9–16, 2004.
[9] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of
Parallel and Distributed Computing, 48:96–129, 1998.
[10] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao. Neighborhood Based Fast Graph Search in Large Networks. In SIGMOD Conference, pages 901–912, 2011.
[11] I. Konstas, V. Stathopoulos, and J. M. Jose. On Social Networks and Collaborative Recommendation. In SIGIR, pages 195–202, 2009.
[12] D. Liben-Nowell and J. M. Kleinberg. The Link Prediction Problem for Social Networks. In CIKM, pages 556–559, 2003.
[13] D. Lizorkin, P. Velikhov, M. N. Grinev, and D. Turdakov. Accuracy Estimate and Optimization Techniques for SimRank computation. PVLDB, 1(1):422–433, 2008.
[14] M. Newman, A.-L. Barabasi, and D. J. Watts. The Structure And Dynamics of Networks. Princeton University Press, 2006.
[15] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. Automatic Multimedia Cross-modal Correlation Discovery. In KDD, pages 653–658, 2004.
[16] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 3rd Edition. Cambridge University Press, 2007.

The results show that the pruning technique can provide efficient search for all datasets used; these results indicate that our approach is effective for various edge distributions. K-dash is up to 1,020 times faster if the pruning method is used. This is because we can effectively terminate the search process with the estimation technique. These results (compare Figure 2 to Figure 7) also show that K-dash can find the top-k nodes faster than NB_LIN even if K-dash computes the proximities of all nodes. To evaluate the effectiveness of our approach for various proximity distributions, we subjected it to additional evaluations using various values of the restart probability c. The results confirmed that our approach can efficiently find the top-k nodes under all conditions examined; we can effectively prune unnecessary proximity computations.

7. CONCLUSIONS
This paper addressed the problem of finding the top-k nodes for a given node efficiently.
As far as we know, this is the first study to address the top-k node search problem with the guarantee of exactness. Our proposal, K-dash, is based on two ideas: (1) It computes the proximities of selected nodes efficiently by use of inverse matrices, and (2) It skips unnecessary proximity computations in finding the top-k nodes, which greatly improves overall efficiency. Our experiments show that K-dash works as expected; it can find the top-k nodes at high speed; specifically, it is significantly faster than existing approximate methods. Top-k search based on RWR is fundamental for many applications in various domains such as image captioning, recommender systems, and link prediction. The proposed solution allows the top-k nodes to be detected exactly and efficiently, and so will help to improve the effectiveness of future applications.

8. REFERENCES
[1] I. Antonellis, H. Garcia-Molina, and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. PVLDB, 1(1):408–421, 2008.
[17] R. Schifanella, A. Barrat, C. Cattuto, B. Markines, and F. Menczer. Folks in Folksonomies: Social Link Prediction from Shared Metadata. In WSDM, pages 271–280, 2010.
[18] G. Strang. Introduction to Linear Algebra. Wellesley Cambridge Press, 2009.
[19] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, pages 418–425, 2005.
[20] Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu. PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks. PVLDB, 4(1), 2011.
[21] H. Tong and C. Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. In KDD, pages 404–413, 2006.
[22] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast Random Walk with Restart and Its Applications. In ICDM, pages 613–622, 2006.
[23] M. Yannakakis. Computing the Minimum Fill-In is NP-Complete. SIAM. J. on Algebraic and Discrete Methods, 2(1):77–79, 1981.
[24] P. Zhao, J. X. Yu, and P. S. Yu. Graph Indexing: Tree + Delta >= Graph. In VLDB, pages 938–949, 2007.
[2] K. Avrachenkov, N. Litvak, D. Nemirovsky, E. Smirnova, and M. Sokol. Quick Detection of Top-k Personalized PageRank Lists. In WAW, pages 50–61, 2011.
[25] Y. Zhou, H. Cheng, and J. X. Yu. Graph Clustering Based on Structural/Attribute Similarities. PVLDB, 2(1):718–729, 2009.

Figure 8: An example graph.

$$p_{u_5} = c'(A_{51}p_1 + A_{52}p_2 + A_{53}p_3 + A_{54}p_4 + A_{56}p_6 + A_{57}p_7) = c'(A_{52}p_2 + A_{54}p_4 + A_{56}p_6)$$
Since the proximities of node $u_1$, $u_2$, $u_3$ and $u_4$ are already computed before node $u_5$ and node $u_6$, the following equation holds:
$$p_{u_5} \le c'\big(p_2 A_{max}(u_2) + p_4 A_{max}(u_4) + (1 - p_1 - p_2 - p_3 - p_4) A_{max}\big) = \bar{p}_{u_5}$$
Note that our estimation approach takes into account edges of selected nodes and unvisited nodes as $A_{max}(u)$ and $A_{max}$, respectively. For example, non-tree edges $A_{54}$ and $A_{56}$ are taken as $A_{max}(u_4)$ and $A_{max}$, respectively.

A.3 Lemma 2
Proof. If $l_u = l_v$, it is obvious that $\bar{p}_u = \bar{p}_v$.

APPENDIX

A. PROOFS
In this section, we show the proofs of all lemmas and theorems in this paper.

A.1 Theorem 1
Proof. We prove the theorem by a reduction from the elimination ordering problem [23]. An instance of the elimination ordering problem consists of a graph, node elimination ordering, and the chordal graph that can be obtained by the graph and the elimination ordering. Given the graph, the elimination ordering problem finds the minimum number of edges whose addition makes the graph chordal by changing node ordering.
We transform an instance of the elimination ordering problem to an instance of the inverse matrices problem as follows: for the graph of the elimination ordering problem, we create matrix A. That is, we make the adjacency-list matrix from the graph. For the node elimination ordering, we create node ordering, and we create matrix $L^{-1}$ / $U^{-1}$ for the chordal graph.
Given this mapping, it is easy to show that there exists
a solution to the elimination ordering problem with the minimum number of edge additions if and only if there exists a solution to the inverse matrices problem with the minimum increase of non-zero elements in the inverse matrices. The inverse matrices problem is trivially in NP. □

A.2 Lemma 1
Proof. If node u is not the query node, the following equation holds from Equation (1):
$$p_u = (1-c)(A_{u1}p_1 + A_{u2}p_2 + \ldots + A_{uu}p_u + \ldots + A_{un}p_n)$$
Since more than two upper/lower layer nodes can not be directly connected to node u in the breadth-first search tree, if the set of directly neighboring nodes (adjacent nodes) of node u is $N_u$, $p_u$ is represented as follows:
$$p_u = c' \sum_{v \neq u} A_{uv}p_v = c' \sum_{v \in N_u} A_{uv}p_v \le c' \Big( \sum_{v \in \{V_{l_u-1}(u)+V_{l_u}(u)\}} A_{uv}p_v + \sum_{v \in V \setminus V_s} A_{uv}p_v \Big)$$
Since $p_v$ is probability, $\sum_{v \in V \setminus V_s} p_v = 1 - \sum_{v \in V_s} p_v$ holds. Therefore,
$$p_u \le c' \Big( \sum_{v \in V_{l_u-1}(u)} p_v A_{max}(v) + \sum_{v \in V_{l_u}(u)} p_v A_{max}(v) + \big(1 - \sum_{v \in V_s} p_v\big) A_{max} \Big) = \bar{p}_u$$
If node u is the query node, it is obvious that $\bar{p}_u = 1$ and $0 \le p_u \le 1$. Thus, the inequality $\bar{p}_u \ge p_u$ holds. □
Example. Let node $u_1$ be a query node in a directed graph in Figure 8.

If $l_u = l_v - 1$, the following inequality holds since $V_{l_v}(v) = \emptyset$:
$$\bar{p}_v = c' \Big( \sum_{w \in V_{l_u}(u)} p_w A_{max}(w) + \big(1 - \sum_{w \in V_s} p_w\big) A_{max} \Big) \le \bar{p}_u$$
And if $l_u \le l_v - 2$, the following inequality similarly holds since $V_{l_v}(v) = \emptyset$ and $V_{l_v-1}(v) = \emptyset$:
$$\bar{p}_v = c' \big(1 - \sum_{w \in V_s} p_w\big) A_{max} \le \bar{p}_u$$
which completes the proof. □

A.4 Lemma 3
Proof. We first prove that, if $u' = q$, $\bar{p}_u$ can be exactly computed from $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$. In this case, $V_{l_u-1}(u) = q$, $V_{l_u}(u) = \emptyset$, and $V_s = q$. Therefore, it is obvious that $c'(\bar{p}_{u,1} + \bar{p}_{u,2} + \bar{p}_{u,3}) = \bar{p}_u$ holds by Definition 2.
We next prove that, if $u' \neq q$, the estimate proximity of node u can be exactly computed from that of node $u'$. If $l_u = l_{u'}$, $V_{l_u-1}(u) = V_{l_{u'}-1}(u')$ and $V_{l_u}(u) = V_{l_{u'}}(u') + u'$. Therefore,
$$\bar{p}_{u,1} - \bar{p}_{u',1} = \sum_{v \in \{V_{l_u-1}(u) - V_{l_{u'}-1}(u')\}} p_v A_{max}(v) = 0$$
and
$$\bar{p}_{u,2} - \bar{p}_{u',2} = \sum_{v \in \{V_{l_u}(u) - V_{l_{u'}}(u')\}} p_v A_{max}(v) = p_{u'} A_{max}(u')$$
Otherwise (i.e. $l_u = l_{u'} + 1$), $V_{l_u-1}(u) = V_{l_{u'}}(u') + u'$ and $V_{l_u}(u) = \emptyset$. Therefore,
$$\bar{p}_{u,1} = \sum_{v \in \{V_{l_{u'}}(u') + u'\}} p_v A_{max}(v) = \bar{p}_{u',2} + p_{u'} A_{max}(u')$$
and
$$\bar{p}_{u,2} = \sum_{v \in \emptyset} p_v A_{max}(v) = 0$$
Since node u is visited immediately after node $u'$,
$$(\bar{p}_{u',3} - \bar{p}_{u,3}) / A_{max} = p_{u'}$$
Therefore, $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$ can be exactly computed from $\bar{p}_{u',1}$, $\bar{p}_{u',2}$, and $\bar{p}_{u',3}$, respectively.
We finally prove that it takes O(1) time to compute $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$ in the search process. If $u' = q$, $\bar{p}_{u,1}$, $\bar{p}_{u,2}$, and $\bar{p}_{u,3}$ are defined by Definition 2. As described in Section 4.3.1, both $A_{max}$ and $A_{max}(u)$ can be precomputed. $\bar{p}_{u',1}$, $\bar{p}_{u',2}$, $\bar{p}_{u',3}$, and $p_{u'}$ are already computed before computing $\bar{p}_u$ in the search process if $u' \neq q$. This completes the proof. □

As described in Section 4.3, node/layer numbers of all the nodes are assigned by breadth-first search tree; node $u_1$ forms layer 0, node $u_2$ and $u_3$ form layer 1, node $u_4$ and $u_5$ form layer 2, and node $u_6$ and $u_7$ form layer 3. And we assume that we visit and select nodes in ascending order of their node number. For node $u_5$, the following equation holds from Equation (1) since $A_{51}, A_{53}, A_{57} = 0$:

Algorithm 1 Degree reordering
Input: A, the column normalized adjacent matrix
Output: A′, the reordered matrix of A
1: arrange nodes in ascending order of their degrees;
2: compute matrix A′ by interchanging the rows and columns of matrix A by the degree order;
3: return A′;

Algorithm 2 Cluster reordering
Input: A, the column normalized adjacent matrix
Output: A′, the reordered matrix of A
1: divide nodes into κ partitions P_1, P_2, ..., P_κ by Louvain method;
2: create new partition P_κ+1 = ∅;
3: for i := 1 to κ do
4:   remove nodes whose edges cross more than two partitions from partition P_i;
5:   append the removed nodes to P_κ+1;
6: end for
7: compute matrix A′ by interchanging the rows and columns of matrix A by the partitions;
8: return A′;

A.5 Theorem 2
Proof. Let θ be the K-th highest proximity among the candidate nodes in the search process. And let θ_K be the K-th highest proximity among the answer nodes (i.e. θ_K is the lowest proximity among the answer nodes).
We first prove that θ is monotonic non-decreasing in the search process of K-dash.
To find the answer nodes in the search process, we first set θ at 0 and set the dummy nodes as the candidates. We maintain the candidates as the best result; when we find a node whose proximity is greater than θ, a candidate is replaced by the new node (see Algorithm 4). This makes θ higher. Therefore, θ never decreases in the search process.
In the search process, since θ is monotonic non-decreasing and θ_K ≥ θ, the estimate proximities of answer nodes are never lower than θ (Lemma 1). The algorithm discards a node if (1) its estimated proximity is lower than θ, or (2) its upper/same layer unselected node has estimated proximity lower than θ. Since the estimated proximity of a node can not be lower than that of a node on the same or lower layer (Lemma 2), the answer nodes can never be pruned during the search process. □

A.6 Theorem 3
Proof. We first prove that B_LIN and NB_LIN [22] both need O(n^2) space. The off-line process of B_LIN first partitions the adjacency matrix by METIS [9], and then decomposes the matrix into the within-partition edge matrix and the cross-partition edge matrix. It next performs low-rank approximation for the cross-partition edge matrix and obtains two orthogonal matrices and one diagonal matrix. It then computes the product of the within-partition edge matrix and the orthogonal matrices.
The off-line process of NB_LIN first performs low-rank approximation for the adjacency matrix and obtains two orthogonal matrices and one diagonal matrix. It then computes the product of these matrices.
Both B_LIN and NB_LIN hold the matrix product and two orthogonal matrices to compute the proximities. The matrix product and orthogonal matrices have size of O(n^2). Therefore, B_LIN and NB_LIN both require O(n^2) space.
Next, we prove that B_LIN and NB_LIN both need O(n^2) time. They compute the proximities of nodes by multiplying the vector q, the matrix product, and orthogonal matrices. Even though the size of the vector q is O(n), that of the matrix product and orthogonal matrices is O(n^2). Therefore, B_LIN and NB_LIN both require O(n^2) time. □

A.7 Theorem 4
Proof. To compute the estimation, K-dash holds the maximum elements of the matrix A, the previous estimated proximity, and the previous proximity. It needs O(n) space to hold these values. K-dash keeps the inverse matrices to compute the proximities. The number of non-zero elements of these matrices is O(n^2) in the worst case. Therefore, it requires O(n^2) space to keep the inverse matrices. Consequently, our approach requires O(n^2) space to find the top-k highest proximity nodes. □

A.8 Theorem 5
Proof. To find the answer nodes, K-dash first constructs the breadth-first search tree, and computes the estimated proximity of the visited node. It next computes the proximity of the node if the node is not pruned by the estimation. K-dash needs O(n+m) time to construct the tree and O(n) time for the estimations if it can not prune any nodes by the estimation. This is because it takes O(1) time to compute the estimation for each node (Lemma 3). K-dash needs O(n^2) time to compute the proximities of all nodes since the inverse matrices have size of O(n^2) in the worst case. So K-dash needs O(n^2) time to find the top-k highest proximity nodes. □

B. ALGORITHMS
In this section, we show the algorithms for reordering approaches and K-dash.

B.1 Reordering approach
We interchange the rows and columns of matrix A to reduce the number of non-zero elements in the inverse matrices. Since the inverse matrices problem is NP-complete, we take three approximation solutions to the problem: degree reordering, cluster reordering, and hybrid reordering.
Algorithm 1 depicts our degree reordering approach. This approach reduces non-zero elements by arranging low degree nodes to the upper/left elements in matrix A. It first computes the degrees of all nodes and arranges the nodes according to their degrees (line 1). It then computes the reordered matrix by the degree order (line 2).
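The effect of degree reordering on the inverse matrices can be sketched on a toy star graph. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`lu_no_pivot`, `degree_reorder`, `inverse_nnz`) are hypothetical, a node's degree is taken to be the number of non-zero entries in its row plus its column of A, and dense NumPy arrays replace sparse storage.

```python
import numpy as np

def lu_no_pivot(W):
    """Doolittle LU factorization without pivoting, so that W = L @ U."""
    n = W.shape[0]
    L = np.eye(n)
    U = W.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = U[i, k] / U[k, k]
            L[i, k] = factor
            U[i, k:] -= factor * U[k, k:]
            U[i, k] = 0.0  # clear round-off below the diagonal
    return L, U

def degree_reorder(A):
    """Sketch of Algorithm 1: low-degree nodes move to the upper/left."""
    degrees = np.count_nonzero(A, axis=0) + np.count_nonzero(A, axis=1)
    order = np.argsort(degrees, kind="stable")   # line 1: ascending degree
    return A[np.ix_(order, order)], order        # line 2: permute rows and columns

def inverse_nnz(A, c, tol=1e-12):
    """Count non-zero elements of L^-1 and U^-1 for W = I - (1 - c) A."""
    W = np.eye(A.shape[0]) - (1 - c) * A
    L, U = lu_no_pivot(W)
    return sum(int(np.count_nonzero(np.abs(np.linalg.inv(M)) > tol))
               for M in (L, U))

# Star graph with 6 nodes whose hub is node 0, the worst position for fill-in.
n = 6
adj = np.zeros((n, n))
adj[0, 1:] = 1.0
adj[1:, 0] = 1.0
A = adj / adj.sum(axis=0)                        # column-normalized adjacency
c = 0.95

nnz_before = inverse_nnz(A, c)
A_reordered, order = degree_reorder(A)
nnz_after = inverse_nnz(A_reordered, c)
```

With the hub first, eliminating it couples every pair of leaves and the inverse triangular matrices fill in; after reordering the hub is eliminated last, and only the diagonals plus the hub's row and column remain non-zero.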