ebook img

An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings PDF

18 Pages·0.32 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings

An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings ∗ Ming-Yang Kao Tak-Wah Lam Wing-Kin Sung Hing-Fung Ting † ‡ § ‡ 1 0 February 1, 2008 0 2 n a J Abstract 8 A widely used method for determining the similarity of two labeled trees is to compute a 1 maximumagreementsubtree of the two trees. Previouswork onthis similarity measureis only ] concerned with the comparison of labeled trees of two special kinds, namely, uniformly labeled V trees (i.e., trees with all their nodes labeled by the same symbol) and evolutionary trees (i.e., C leaf-labeled trees with distinct symbols for distinct leaves). This paper presents an algorithm s. for comparingtrees that are labeled in anarbitrarymanner. In additionto this generality,this c algorithm is faster than the previous algorithms. [ Anothercontributionofthispaperisonmaximumweightbipartitematchings. Weshowhow 2 to speed up the best known matching algorithms when the input graphs are node-unbalanced v orweight-unbalanced. Basedontheseenhancements,weobtainanefficientalgorithmforanew 0 matchingproblemcalledthehierarchical bipartite matchingproblem,whichisatthecoreofour 1 maximum agreement subtree algorithm. 0 1 0 1 Introduction 1 0 / s A labeled tree is a rooted tree with an arbitrary subset of nodes labeled with symbols. In recent c years, many algorithms for comparing such trees have been developed for diverse application areas : v including biology [8, 19, 23], chemistry [25], linguistics [9, 21], computer vision [18], and structured i X text databases [16, 17, 20]. r A widely used measure of the similarity of two labeled trees is the notion of a maximum a agreement subtree defined as follows. A labeled tree R is a label-preserving homeomorphic subtree of another labeled tree T if there exists a one-to-one mapping f from the nodes of R to those of T such that for any nodes u,v,w of R, (1) u and f(u) have the same label; and (2) w is the least ∗Thispapercombinesresultsofthreeconferencepapers: (1)Afasterandunifyingalgorithmforcomparingtrees. InProceedings ofthe11thSymposium onCombinatorial Pattern Matching,pages129-142,2000. (2)Unbalancedand hierarchicalbipartitematchingswithapplicationstolabeledtreecomparison. InProceedingsofthe11thInternational SymposiumonAlgorithmsandComputation,pages479-490,2000. (3)Adecompositiontheoremformaximumweight bipartite matchings with applications to evolutionary trees. In Proceedings of the 8th Annual European Symposium on Algorithms, pages 438-449, 1999. †Department of Computer Science, Yale University, New Haven, CT 06520, USA; [email protected]. Research supported in part by NSFGrant CCR-9531028. ‡Department of Computer Science and Information Systems, University of Hong Kong, Hong Kong; {twlam, hfting}@cs.hku.hk. Research supported in part byHong Kong RGCGrant HKU-7027/98E. §E-Business Technology Institute, Universityof HongKong, HongKong; [email protected]. 1 common ancestor of u and v if and only if f(w) is the least common ancestor of f(u) and f(v). Let T and T be two labeled trees. An agreement subtree of T and T is a labeled tree which is 1 2 1 2 also a label-preserving homeomorphic subtree of the two trees. A maximum agreement subtree is one which maximizes the number of labeled nodes. Let mast(T ,T ) denote the number of labeled 1 2 nodes in a maximum agreement subtree of T and T . 1 2 In the literature, many algorithms for computing a maximum agreement subtree have been developed. These algorithms focus on the special cases where T and T are either (1) uniformly 1 2 labeled trees, i.e., treeswithalltheirnodesunlabeled,orequivalently, labeledwiththesamesymbol or (2) evolutionary trees [13], i.e., leaf-labeled trees with distinct symbols for distinct leaves. We denote n as the number of nodes in the labeled trees T and T , and d as the maximum 1 2 degree of T and T . For uniformly labeled trees T and T , Chung [3] gave an algorithm to 1 2 1 2 determine whether T is a label-preserving homeomorphic subtree of T using O(n2.5) time. Gupta 1 2 and Nishimura [11] gave an algorithm which actually computes a maximum agreement subtree of T and T in O(n2.5logn) time. For evolutionary trees, Steel and Warnow [24] gave the first 1 2 polynomial-time algorithm for computing a maximum agreement subtree. Farach and Thorup [7] improved the time complexity from O(n4.5logn) to O(n1.5logn). Faster algorithms for the case d = O(1) were also discovered. The algorithm of Farach, Przytycka and Thorup [6] runs in O(√dnlog3n) time, and that of Kao [14] takes O(nd2log2nlogd) time. Cole et al. [4] gave an O(nlogn)-time algorithm for the case where T and T are binary trees. Przytycka [22] attempted 1 2 to generalize the algorithm of Cole et al. so that the degree-2 restriction could be removed with the running time being O(√dnlogn). For unrestricted labeledtrees(i.e., trees wherelabelsarenotrestricted toleaves andmaynotbe distinct), little work has been reported, but they have applications in several contexts [1, 18]. For example, labeled trees are used to represent sentences in a structural text database [16, 17, 20] and querying such a database involves comparison of trees; an XML document can also be represented by a labeled tree [1]. Instead of solving special cases, this paper gives an algorithm to compute mast(T ,T ) where T and T are unrestricted labeled trees. As detailed below, our algorithm not 1 2 1 2 only is more general but also uniformly improves or matches the previously best algorithms for subtree homeomorphism and evolutionary tree comparison. Let ∆ (or simply ∆ when the context is clear) = δ(u,v) where δ(u,v) = 1 T1,T2 u T1 v T2 if nodes u and v are labeled with the same symbol, and 0Pot∈herPwis∈e. Our algorithm computes mast(T ,T ) in O(√d∆log 2n) time. Thus, if T and T are uniformly labeled trees, then ∆ n2 1 2 d 1 2 ≤ and the time complexity of our algorithm is O(√dn2log 2n), which is faster than the Gupta- d Nishimura algorithm [11] for any d. If T and T are evolutionary trees, then ∆ n and the time 1 2 ≤ complexity ofouralgorithmisO(√dnlog 2n),whichisbetterthantheO(√dnlogn)boundclaimed d by Przytycka [22]. In particular, our algorithm can attain the O(nlogn)boundfor binary trees [4]. Also for general evolutionary trees, our algorithm runs in O(n1.5) time since √dnlog 2n = O(n1.5) d for any degree d. This is faster than the O(n1.5logn) time of the Farach-Thorup algorithm [7]. The efficiency achieved by our mast algorithm is based on improved algorithms for computing maximum weight matchings of bipartite graphs that satisfy some structural properties. Let G = (X,Y,E) be a bipartite graph with positive integer weights on its edges. Denote by n, m, N, and W the number of nodes, the number of edges, the maximum edge weight, and the total edge weight of G, respectively. The best known algorithm for computing maximum weight bipartite matchings was given by Gabow and Tarjan [10], which takes O(√nmlognN) time. For some applications where the total edge weight is small (say, W = O(m)), Kao et al. [15] gave a slightly faster algorithm that runs in O(√nW) time. Intuitively, a bipartite graph is node-unbalanced if 2 there are much fewer nodes on one side than the other. It is weight-unbalanced if its total weight is dominated by the edges incident to a few nodes; we call these nodes the dominating nodes. In this paper, we show how to enhance these two matching algorithms when the input graphs are either node-unbalanced or weight-unbalanced. The node-unbalanced property has many practical applications (see, e.g., [12]) and has been exploited to improve various graph algorithms. For example, Ahuja et al. [2] adapted several bipartite network flow algorithms such that the running times depend on the number of nodes in the smaller side of the input bipartite graph instead of the total number of nodes. Tokuyama and Nakano used this property to reduce the time complexity of the minimum cost assignment problem [27] and the Hitchcock transportation problem [26]. This paper presents similar improvements for maximumweightmatching. Specifically, weshowthattherunningtimeofthematchingalgorithms of Gabow and Tarjan [10] and Kao et al. [15] can be improved to O(min √nsmlognsN,m + { n2s.5lognsN ) and O(√nsW), respectively, where ns is the number of nodes in the smaller side of } the input bipartite graph. The weight-unbalanced property is exploited in another way. Given a weight-unbalanced bi- partite graph G, let G be the subgraph of G with its dominating nodes removed. Note that G ′ ′ has a total weight much smaller than G does. Based on the O(√nW)-time matching algorithm of Kao et al. [15], finding the maximum weight matching of G is much faster than finding one of ′ G. To take advantage of this fact, we design an efficient algorithm that finds a maximum weight matching of G from that of G. This algorithm is substantially faster than applying directly the ′ O(√nW)-time matching algorithm on G. These results for unbalanced graphs provide a basis for solving a new matching problem called the hierarchical bipartite matching problem. This matching problem is at the core of our mast algorithm and is defined as follows. Let T be a rooted tree. Denote r as the root of T. Let C(u) denote the set of children of node u. Every node u of T is associated with a positive integer w(u) and a weighted bipartite graph G satisfying the following properties: u w(u) w(v). • ≥ v∈C(u) P G = (X ,Y ,E ) where X = C(u). Each edge of G has a positive integer weight, and u u u u u u • there is no isolated node. For any node v X , the total weight of all the edges incident to u ∈ v is at most w(v). Thus, the total weight of the edges in G is at most w(u). u See Figure 1 for an example. For any weighted bipartite graph G, let mwm(G) denote a maximum weight matching of G. The hierarchical matching problem is to compute mwm(G ) for all internal u nodes u of T. Let b = max min X , Y and e = E . The problem can be solved u T u u u T u by applying directly our res∈ult{s for{|nod|e-|un|b}a}lanced graPphs∈; f|or |example, it can be solved in O( √b E logbw(u)) = O(√belogw(r)) time using the enhanced Gabow-Tarjan algorithm. u T u HoPwev∈er, th|is t|ime complexity is not yet satisfactory. When comparing labeled trees, we often encounter instances of the hierarchical bipartite matching problem with e being very large; in particular, e is asymptotically much greater than w(r). We further improve the running time to O(√bw(r) + e) by making additional use of our technique for weight-unbalanced graphs and exploiting trade-offs between the size of the bipartite graphs involved and their total edge weight. The rest of the paper is organized as follows. Section 2 details our techniques of speeding up the existing algorithms for unbalanced graphs. Section 3gives an efficient algorithm for solving the hierarchical matching problem. Finally, Section 4describes our algorithm for computingmaximum agreement subtrees. 3 1 x w(u)=20 u 2 4 y w(x)=3 w(y)=13 w(z)=3 x y z 8 z 1 T Gu Figure1: T isaninstan eofthehierar hi albipartitemat hingproblem. Gu isthebipartitegraph Figure 1: T is an instanceof thehierarchical bipartitematching problem. G is thebipartite graph asso iated with the node u. u associated with the node u. 2 Maximum weight mat hing of unbalan ed graphs 2 Maximum weight matching of unbalanced graphs Throughout this se tion, let G = (X;Y;E) be a weighted bipartite graph with no isolated nodes. TLehtronu=ghjoXutj+thjiYsjs,enctsio=n,mlientfGjX=j;jY(Xjg,,Ym,E=) bjEej,aNwebigehttheedlbaripgaesrttiteedggerawpehigwhti,thanndoWisoblaetetdhentoodteasl. Let n = X + Y , n = min X , Y , m = E , N be the largest edge weight, and W be the total edge wei|ght|. | | s {| | | |} | | edge weight. Suppose everypedge of G has a positive integer weight.pGabow and Tarjan [10℄ and Kao et al. [S1u5p℄pgaovsee aenveOry(edngmeloofg(GnNha))s-taimpeosailtgiovreitihnmtegaenrdwaenigOht(. nGWab)o-twimaenodnTeatorja nom[1p0u]teanmdwKma(oG)e,t al. [15] gave an O(√nmlog(nN))-time algorithm and an O(√nW)-time one to compute mwm(G), respe tively. respectively. 2.1 Mat hings of node-unbalan ed graphs 2.1 Matchings of node-unbalanced graphs The following theorem speeds up the omputation of mwm(G) if G is node-unbalan ed. The following theorem speeds up the computation of mwm(G) if G is node-unbalanced. Theorem 1. Theorem 1. p 1. mwm(G) an be omputed in O( nsmlog(nsN)) time. 1. mwm(G) can be computed in O(√nsmlog(nsN)) time. 2. mwm(G) an be omputed in O(m+n2s:5log(nsN)) time. 2. mwm(G) can be computed in O(m+n2.5log(n N)) time. p s s 3. mwm(G) an be omputed in O( nsW) time. 3. mwm(G) can be computed in O(√nsW) time. Proof. Withoutlossofgenerality,weassumens =jXj (cid:20)jYj. Thestatementsareprovedasfollows. Proof. Withoutlossofgenerality, weassumen = X Y . Thestatements areprovedasfollows. Statement 1. For any node v in G, let (cid:11)(v)sbe t|he|n≤um| b|er of edges in ident to v. SupposeY = Statement 1. For any node v in G, let α(v) be the number of edges incident to v. SupposeY = fy1;y2;:::;ykns+rg where k (cid:21) 1, 0 (cid:20) r < ns, and (cid:11)(y1) (cid:20) (cid:11)(y2) (cid:20) (cid:1)(cid:1)(cid:1) (cid:20) (cid:11)(ykns+r). We partition y ,y ,...,y where k 1, 0 r < n , and α(y ) α(y ) α(y ). We partition {Y 1int2o Y0 =kfnys+1;r:}::;yrg;Y1≥= fyr+≤1;:::;yrs+nsg;:::;1an≤d Yk =2 f≤yr·+·(·k(cid:0)≤1)ns+1k;n:s:+:r;yr+knsg. Note that ex ept Y0, every set has ns nodes. 4 4 Y into Y = y ,...,y ,Y = y ,...,y ,..., and Y = y ,...,y . Note 0 { 1 r} 1 { r+1 r+ns} k { r+(k−1)ns+1 r+kns} that except Y , every set has n nodes. 0 s For any Y Y, denote G(Y ) as the subgraph of G induced by all the edges incident to Y . ′ ′ ′ ⊆ SupposethatM isamaximumweightmatchingofG(Y Y Y ). LetY = y (x,y) M . i 0∪ 1∪···∪ i Mi { | ∈ i} Note that a maximum weight matching of G(Y Y ) is also one of G(Y Y Y ). Mi ∪ i+1 0 ∪ 1 ∪···∪ i+1 Therefore, we can compute mwm(G) using the following algorithm: Step 1. Compute a maximum weight matching M of G(Y ). 0 0 • Step 2. For i= 1 to k, • let Y = y (x,y) M ; Mi−1 { | ∈ i−1} compute a maximum weight matching M of G(Y Y ). i Mi−1 ∪ i Step 3. Return M . k • The runningtime is analyzed below. Let α(Y ) be the total number of edges in G(Y ). For 1 ′ ′ ≤ i k, α(Y ) α(Y ), andα(Y Y ) 2α(Y ). Usingthematching algorithm byGabow and ≤ Mi−1 ≤ i Mi−1∪ i ≤ i Tarjan[10],wecancomputemwm(G(Y Y ))inO( Y Y α(Y )log(Y Y N))time. Mi−1∪ i q| Mi−1 ∪ i| i | Mi−1∪ i| Notethat|YMi−1|≤ ns and|Yi|= ns. Hence,thewholealgorithmusesO( ki=1√nsα(Yi)log(nsN)) = O(√nsmlog(nsN)) time. P Statement 2. Since we suppose X Y , any matching of G contains at most X = n edges. s | | ≤ | | | | Thus, for every u X, we can discard the edges incident to u that are not among the n heaviest s ∈ ones; the remaining n2 edges must still contain a maximum weight matching of G. Note that we s can find these n2 edges in O(m) time, and from Statement 1, we can compute mwm(G) from them s in O(√nsn2slog(nsN)) time. The total time taken is O(m+ns2.5log(nsN)). Statement 3. The algorithm in the proof of Statement 1 can be adapted to find mwm(G) in O(√nsW) time by using the O(√nW)-time matching algorithm of Kao et al. [15] to compute each M . For any Y Y, we redefine α(Y ) to be the total weight of edges incident to Y. Then we can i ′ ′ ⊆ use the same analysis to show that the adapted algorithm runs in O( ki=1√nsα(Yi)) = O(√nsW) time. P 2.2 Matchings of weight-unbalanced graphs We show how to speed up the matching algorithm of Kao et al. [15] when the input graph G is weight-unbalanced. The key technique is stated in Lemma 2 below. The following example illustrates how this lemma can help. Suppose that G has O(1) dominating nodes. Let G be the ′ subgraph of G with the dominating nodes removed. Let W be the total edge weight of G. Since ′ ′ G is weight-unbalanced, we further assume W = o(W). To compute mwm(G), we can first use ′ Theorem1(3)tocomputemwm(G′)inO(√nsW′)timeandthenuseLemma2tocomputemwm(G) from mwm(G′) in O(mlogns) time. The total running time is O(mlogns+√nsW′) =o(√nsW), which is smaller than the running time of using the algorithm of Kao et al. [15] to find mwm(G) directly. Lemma 2. Let H = x ,x ,...,x be a subset of h nodes of X. Let G H be the subgraph 1 2 h { } − of G constructed by removing the nodes in H. Denote by E the set of edges in G H. Given ′ − mwm(G H), we can compute mwm(G) in O(E +(h2 E +h3)logn ) time. ′ s − | | | | 5 Proof. First, we show that using O(E ) time, we can find a set Υ of only O(min hE +h2,n2 ) ′ s | | { | | } edges such that Υ still contains a maximum weight matching of G. In the proof of Theorem 1(2), it has already been shown that we can find in O(E ) time a set of O(n2) edges that contains s | | mwm(G). Thus, it suffices to find in O(E ) time another set of O(hE +h2) edges that contains ′ | | | | mwm(G); Υ is just the smaller of these two sets. Let Y be the subset of nodes of Y that are ′ endpoints of E . For any x H, we select, among the edges incident to x , a subset of edges E , ′ i i i ∈ which is the union of the following two sets: (x ,y) y Y ; i ′ • { | ∈ } (x ,y) (x ,y) is among the h heaviest edges with y Y . i i ′ • { | 6∈ } Observe that E E E must contain a maximum weight matching of G, and these E ′ 1 h ′ ∪ ∪···∪ | ∪ E E =O(hE +h2) edges can be found in O(E ) time. 1 h ′ ∪··· | | | | | Bydiscardingallunnecessaryedges(i.e., edgesneitherinΥnorinmwm(G H)),wecanassume − that G has only O(min hE +h2,n2 ) edges, while still containing mwm(G) and mwm(G H). ′ s { | | } − This preprocessing requires an extra O(E ) time for finding mwm(G). | | Below, we describe a procedure which, given any bipartite graph D and any node x of D, finds mwm(D) from mwm(D x ) in O(m logm ) time, where m is the number edges of D. D D D −{ } Then, starting from G H, we can apply this procedure repeatedly h times to find mwm(G) from − mwm(G H). Since G is assumed to have only O(min hE +h2,n2 ) edges, this process takes ′ s − { | | } O(h((hE +h2)logn )) time. This lemma follows. ′ s | | Let M and M be a maximum weight matching of D and D x , respectively; denote by S x −{ } the set of augmenting paths and cycles formed in M M M M , and let σ be the augmenting x x ∪ − ∩ path in S starting from x. Note that the augmenting paths and cycles in S σ cannot improve −{ } the matching M ; otherwise, M is not a maximum weight matching of D x . Thus, we can x x −{ } transform M to M using σ. Note that σ is indeed a maximum augmenting path starting from x, x which can be found in O(m logm ) time [5]. D D 3 Hierarchical bipartite matching Throughout this section, let T be a rooted tree as defined in the definition of the hierarchical bipartite matching problem in 1. The root of T is denoted by r. For each node u of T, w(u) § and G = (X ,Y ,E ) denote the weight and the bipartite graph associated with u, respectively. u u u u Furthermore, let b = max min X , Y , and e= E . u T u u u T u Inthissection,wedescr∈ibe{ana{lg|orit|h|mf|o}r}computingPmw∈m|(G|)forallu T inO(√bw(r)+e) u ∈ time. Our algorithm is based on two crucial observations. One is that for any value x, there are at most w(r)/x graphs with its second maximum edge weight greater than x. The other is that most of these graphs have their total weight dominated by edges incident to a few nodes. For those graphs with a large second maximum edge weight, we compute their maximum weight matchings usingalessweight-sensitive algorithm. Astherearenotmanyofthem, thecomputationisefficient. For the other graphs, their weights are dominated by the edges incident to a few nodes. Thus, using Lemma 2 and a weight-efficient matching algorithm, we can compute the maximum weight matchings for these graphs efficiently. Details are as follows. Consider any subsetB of nodesof T. Let δ = min w(u). We say that B has a critical degree u B ∈ h if for every u B, u has at most h children with weight at least δ. For any internal node u, ∈ let secw(u) = 2nd-max w(v) v C(u) , i.e., the value of the second largest w(v) over all the { | ∈ } 6 childrenv ofu. Lemma3belowshowstheimportanceofsecw(u)andcritical degrees. Lemma3(1) shows that there are not many nodes u with large secw(u); for those nodes u with small secw(u), they should not have a large critical degree, and Lemma 3(2) states that the maximum weight matchings associated with these nodes can be computed efficiently. Lemma 3. 1. Let x be any positive number. Let A be the set of nodes u of T with secw(u) > x. Then A <w(r)/x. | | 2. Let B be any set of nodes of T. If B has critical degree h, then we can compute mwm(G ) u for all u B in O((√b+h3logb)w(r)+ E ) time. u B u ∈ ∈ | | P Proof. The statement are proved as follows. Statement 1. Let L be the set of nodes u in T such that (1) w(u) > x and (2) either u is a leaf or w(v) x for all children v of u. Since the subtrees rooted at the nodes of L are disjoint, ≤ w(r) w(u) > xL . Thus, L < w(r)/x. Let T be the tree in T induced by L, i.e., T u L ′ ′ contai≥nsPexa∈ctly the nod|es|of L and|th|e least common ancestor of every two nodes of L. Note that T has at most L leaves and at most L internal nodes. On the other hand, every node of A is ′ | | | | an internal node of T ; thus, A L < w(r)/x. ′ | | ≤ | | Statement 2. For every node u B, let H(u) be the set of u’s children that have a weight at ∈ least δ = min w(u) each. Let L(u) be the set of the rest of u’s children. Note that H(u) h u B ∈ | | ≤ because B has critical degree h. Since the weight of G H(u) is at most w(x) and b min X , Y , by Theorem 1 we can compute mwm(uG− H(u)) in time Px∈L(u) u u u ≥ {| | | |} − O(√b w(x)+ E ). (1) x∈L(u) | u| P Since G H(u) has at most w(x) edges and H(u) h, by Lemma 2, we can compute mwm(Gu)−from mwm(G H(Pu)x)∈Lin(ut)ime | | ≤ u u − O(E +(h2 w(x)+h3)logb) = O(h3logb w(x)+ E ). (2) | u| x∈L(u) x∈L(u) | u| P P From Equations (1) and (2), we can compute mwm(G ) for all u B in time u ∈ O (√b+h3logb) w(x)+ E . (cid:16) u∈B(cid:16) x∈L(u) | u|(cid:17)(cid:17) P P Since the subtrees rooted at some node in L(u) are disjoint, w(x) w(r). This statement follows. Su∈B Pu∈BPx∈L(u) ≤ We are now ready to compute mwm(G ) for all nodes u of T. We divide all the nodes in T into u two sets: Φ = u T secw(u) > b3 and Π= u T secw(u) b3 . { ∈ | } { ∈ | ≤ } Every node u Φ has secw(u) > b3; by Lemma 3(1), Φ w(r)/b3. Furthermore, by ∈ | | ≤ Theorem 1(2), the time for computing mwm(G ) for all u Φ is O( (b2.5log(bw(u))+ E )) = O(w(r)b2.5logw(r)+ E ) = O(w(r)loguw(r)+ ∈E ). TPhisu∈tΦime complexity is st|illuf|ar b3 u∈Φ| u| √b u∈Φ| u| from our goal as logw(rP) may be much larger than √Pb. To improve the time complexity, we first note that using the technique for proving Lemma 2, we can compute mwm(G ) in time depending u only on secw(u). Then, with a better estimation of secw(u), we can reduce the time complexity to O(w(r)+ E ). Details are given in Lemma 4(1). u∈Φ| u| P 7 For Π, we can handle the nodes u Π with w(u) > b3 easily. For nodes with w(u) < b3, we ∈ apply Lemma 3(2) to compute mwm(G ). The basic idea is to partition the nodes u Π into a u ∈ 1 constant number of sets according to w(u) such that every set has critical degree b7. This can ensure that the total time to compute all the mwm(G ) is O(√bw(r)+ E ). Details are given in Lemma 4(2). u Pu∈Π| u| Lemma 4. 1. We can compute mwm(G ) for all u Φ in O(w(r)+ E ) time. u ∈ u∈Φ| u| P 2. We can compute mwm(G ) for all u Π in O(√bw(r)+ E ) time. u ∈ u∈Π| u| P Proof. The two statements are proved as follows. Statement 1. Observe that for any u Φ, G has at most b2 edges relevant to the computation u ∈ of mwm(G ), and they can be found in O(E ) time. Let E be this set of edges. Below, we u u u′ | | assume that, for every u Φ, G has only edges in E . Otherwise, it costs O( E ) extra time to find all E and th∈e assumuption holds. u′ Pu∈Φ| u| u′ For every k 1, let Φ = u Φ 2k 1b3 < secw(u) 2kb3 . Obviously, the nonempty k − ≥ { ∈ | ≤ } sets Φ form a partition of Φ. Below, we show that for any nonempty Φ , we can compute k k mwm(G ) for all u Φ in O w(r)k/2k + E logb time. Thus, the time for computing u ∈ k (cid:16) u∈Φk| u′| (cid:17) mwm(G ) for all u Φ is O( w(r)k/P2k + E logb) = O(w(r) + E logb) = O(w(r)+u (w(r)/b3)b2∈logb) = OP(wk≥(1r)), and StatePmue∈nΦt |1 fu′o|llows. Pu∈Φ| u′| We now give the details of computing mwm(G ) for all u Φ . Let u be the child of u where u k ′ ∈ w(u) is the largest over all children of u. Since secw(u) 2kb3, every edge of G u has ′ u ′ ≤ − { } weight at most 2kb3. By Theorem 1(2) and Lemma 2, and the fact b min X , Y , we can u u ≥ {| | | |} find mwm(G ) in O(√bb2log(b2kb3)+ E logb) time. By Lemma 3(1), Φ w(r). Thus, we can u | u′| | k|≤ 2kb3 compute mwm(G ) for all u Φ in time u k ∈ O √bb2log(b2kb3)+ E logb = O w(r)b2.5(k+logb)+ E logb (cid:16) u∈Φk | u′| (cid:17) (cid:16) 2kb3 u∈Φk | u′| (cid:17) P P = O w(r)k/2k + E logb . (cid:16) u∈Φk| u′| (cid:17) P Statement 2. We partition Π as follows. Let Π be the set of nodes in Π with weight greater ′ than b3. For any 0 k 20, let Πk = u u Π and bk7 < w(u) bk+71 . Obviously, Π = ≤ ≤ { | ∈ ≤ } Π Π Π . ′ 0 20 ∪ ∪··· Since secw(u) b3 and w(u) > b3 for all nodes u in Π, Π has critical degree one. By ′ ′ ≤ Lemma 3(2), we can compute mwm(G ) for all nodes in Π using O(√bw(r)+ E ) time. Each Πk is handled as follows. Fuor every node u ′ Πk, u has at mostPbu17∈Πc′h|ildur|en with k k+1 ∈ 1 weight at least b7; otherwise w(u) > b 7 and u Πk. Thus, Πk has critical degree b7. By 6∈ L=emOm(√ab3w((2r)), +we can comEpu)tetimmwe.m(IGnus)ufmormaalrly,uw∈eΠckanincoOm((p√utbe+mbw73mlo(gGb)w)(fro)r+alPl uu∈Πk|ΠEui|n) u∈Πk| u| u ∈ O(√bw(r)+ P E ) time. u∈Π| u| P Theorem 5. We can compute mwm(G ) for all nodes u T in O(√bw(r)+e) time. u ∈ Proof. It follows from Lemma 4 and the fact that T = Φ Π and E = e. u T u ∪ ∈ | | P 8 d e f c f c a a e c c f d a b a b T T L k Figure 2: The restricted subtree T L with L = a,b,c . k { } 4 Computing maximum agreement subtrees By generalizing the work of Cole et al. [4] on binary evolutionary trees, we can easily derive an algorithm to compute a maximum agreement subtree of two labeled trees. There is, however, a bottleneck ofcomputingthemaximumweightmatchings of alargenumberofbipartitegraphswith nonconstant degrees. By using our result on the hierarchical bipartite matchings, we can eliminate this bottleneck and obtain the fastest known mast algorithm. Section 4.1 introduces basics of labeled trees. Section 4.2 uses our results on the hierarchical bipartite matchings to remove the bottleneck in our mast algorithm. Section 4.3 details our mast algorithm and analyzes its time complexity. Section 4.4 discusses the generalization of the work of Cole et al. [4]. Throughout this section, T and T denote two labeled tress with n nodes and of degree d 2. 1 2 ≥ Let ∆ = δ(u,v) where δ(u,v) = 1 if nodes u and v are labeled with the same T1,T2 u T1 v T2 symbol, and 0Poth∈erwPise∈. Also, let ∆ denote ∆ . T1,T2 4.1 Basics For a rooted tree T and any node u of T, let Tu denote the subtree of T that is rooted at u. For any set L of symbols, the restricted subtree of T with respect to L, denoted by T L, is the subtree k of T (1) whose nodes are the nodes with labels from L and the least common ancestors of any two nodes with labels from L and (2) whose edges preserve the ancestor-descendant relationship of T. Figure 2 gives an example. Note that T L may contain nodes with labels outside L. For any k labeled tree T , let T T denote the restricted subtree of T with respect to the set of symbols used ′ ′ k in T . ′ A centroid path decomposition [4] of a rooted tree T is a partition of its nodes into disjoint paths as follows. For each internal node u in T, let C(u) denote the set of children of u. Among the children of u, one is chosen as the heavy child, denoted by hvy(u), if the subtree of T rooted at hvy(u) contains the largest number of nodes; the other children of u are side children. We call the edge from u to its heavy child a heavy edge. A centroid path is a maximal path formed by heavy edges; the root centroid path is the centroid path that contains the root of T. See Figure 3 for an example. Let (T) denote the set of the centroid paths of T. Note that (T) can be constructed in D D 9 Figure 3: A centroid path decomposition of a rooted tree. O(T ) time. For every P (T), the root of P, denoted r(P), refers to the node on P that is the | | ∈ D closest to the root of T, and (P) denotes the set of the side children of the nodes on P. For any A node u on P, a subtree rooted at some side child of u is called a side tree of u, as well as a side tree of P. Let side-tree(P) be the set of side trees of P. Note that for every R side-tree(P), ∈ R Tr(P) /2. | | ≤ | | The following lemma states two useful properties of the centroid path decomposition. Lemma 6. Let T and T be two labeled trees. 1 2 1. PP∈D(T1)∆T1r(P),T2 ≤∆T1,T2logn. 2. min(d, Tr(P) )∆ √d∆ log 2n. PP∈D(T1)q | 1 | T1r(P),T2 ≤ T1,T2 d Proof. The two statements are proved as follows. Statement 1. A centroid path P is attached to another centroid path P if the root of P is the ′ child of a node on P . We define the level of a centroid path as follows. The root centroid path has ′ level zero. A centroid path has level i if it is attached to some centroid path with level i 1. Note − that any subtree attached to a centroid path with level i has size at most n/2i+1. Thus, there are at most logn different levels. Moreover, subtrees attached to centroid paths with the same level are all disjoint. For any 0 i < logn, denote by D the set of all centroid paths in (T ) with level i. Then i 1 ≤ D ∆ = ∆ logn ∆ ∆ logn. PPS∈Dta(tTe1m) enTt1r(P2.),TW2 e dPiv0id≤ei<tlhogencPenPtr∈oDidi pTa1rt(hPs),Tin2t≤o 2 grouPpsP.∈WDie fiTr1rs(tP)c,To2ns≤iderTt1h,Te2 centroid paths on level i where 0 i< log 2n. For any such i, ≤ d min d, Tr(P) ∆ √d∆ √d∆ . PXDiq { | 1 |} T1r(P),T2 ≤ PXDi T1r(P),T2 ≤ T1,T2 ∈ ∈ Thus, min d, Tr(P) ∆ √d∆ log 2n. NexPt,0≤wie<lcoogn2dsnidPerP∈thDeiqcentro{id p|a1ths|o}n lTe1rv(ePl),lTo2g≤2n +i wT1h,Te2re i d0. Note that for a path P on d ≥ level log 2n +i, Tr(P) d/2i+1. Thus, min d, Tr(P) ∆ is at most d | 1 |≤ Pi≥0PP∈Dlog2dn+iq { | 1 |} T1r(P),T2 d/2i+1∆ d/2i+1∆ = O(√d∆ ). Xi≥0P∈DXlog2dn+iq T1r(P),T2 ≤ Xi≥0q T1,T2 T1,T2 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.