CAVITY MATCHINGS, LABEL COMPRESSIONS, AND UNROOTED EVOLUTIONARY TREES∗

MING-YANG KAO†, TAK-WAH LAM‡, WING-KIN SUNG‡, AND HING-FUNG TING‡

arXiv:cs/0101031v2 [cs.CE] 27 Jan 2001

Abstract. We present an algorithm for computing a maximum agreement subtree of two unrooted evolutionary trees. It takes $O(n^{1.5}\log n)$ time for trees with unbounded degrees, matching the best known time complexity for the rooted case. Our algorithm allows the input trees to be mixed trees, i.e., trees that may contain directed and undirected edges at the same time. Our algorithm adopts a recursive strategy exploiting a technique called label compression. The backbone of this technique is an algorithm that computes the maximum weight matchings over many subgraphs of a bipartite graph as fast as it takes to compute a single matching.

1. Introduction. An evolutionary tree is one whose leaves are labeled with distinct symbols representing species. Evolutionary trees are useful for modeling the evolutionary relationship of species [1,4,6,16,17,25]. An agreement subtree of two evolutionary trees is an evolutionary tree that is also a topological subtree of the two given trees. A maximum agreement subtree is one with the largest possible number of leaves. Different models about the evolutionary relationship of the same species may result in different evolutionary trees. A fundamental problem in computational biology is to determine how much two models of evolution have in common. To a certain extent, this problem can be solved by computing a maximum agreement subtree of two given evolutionary trees [12].

Algorithms for computing a maximum agreement subtree of two unrooted evolutionary trees as well as two rooted trees have been studied intensively in the past few years. The unrooted case is more difficult than the rooted case. There is indeed a linear-time reduction from the rooted case to the unrooted one, but the reverse is not known.
Steel and Warnow [24] gave the first polynomial-time algorithm for unrooted trees, which runs in $O(n^{4.5}\log n)$ time. Farach and Thorup reduced the time to $O(n^{2+o(1)})$ for unrooted trees [10] and $O(n^{1.5}\log n)$ for rooted trees [11]. For the unrooted case, the time was improved by Lam, Sung and Ting [22] to $O(n^{1.75+o(1)})$. Algorithms that work well for rooted trees with degrees bounded by a constant have also been revealed recently. The algorithm of Farach, Przytycka and Thorup [9] takes $O(n\log^3 n)$ time, and that of Kao [20] takes $O(n\log^2 n)$ time. Cole and Hariharan [7] gave an $O(n\log n)$-time algorithm for the case where the input is further restricted to binary rooted trees.

This paper presents an algorithm for computing a maximum agreement subtree of two unrooted trees. It takes $O(n^{1.5}\log n)$ time for trees with unbounded degrees, matching the best known time complexity for the rooted case [11]. If the degrees are bounded by a constant, the running time is only $O(n\log^4 n)$. We omit the details of this reduction since Przytycka [23] recently devised an $O(n\log n)$-time algorithm for the same case.

∗ A preliminary version appeared as part of "General techniques for comparing unrooted evolutionary trees," in Proceedings of the 29th Annual ACM Symposium on Theory of Computing, 1997, pp. 54–65, and part of "All-cavity Maximum Matchings," in Proceedings of the 8th Annual International Symposium on Algorithms and Computation, 1997, pp. 364–373.
† Department of Computer Science, Yale University, New Haven, CT 06520, U.S.A., kao-ming-[email protected]. Research supported in part by NSF Grant CCR-9531028.
‡ Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong, {twlam,wksung,hfting}@csis.hku.hk. Research supported in part by Hong Kong RGC Grant HKU-7027/98E.

Our algorithm allows the input trees to be mixed trees, i.e., trees that may contain directed and undirected edges at the same time [15,18]. Such trees can handle a broader range of information than rooted and unrooted trees.
To simplify the discussion, this paper focuses on unrooted trees. Our subtree algorithm adopts a conceptually simple recursive strategy exploiting a novel technique called label compression. This technique enables our algorithm to process overlapping subtrees iteratively while keeping the total tree size very close to the original input size. Label compression builds on an unexpectedly fast algorithm for the all-cavity maximum weight matching problem [21], which asks for the weight of a maximum weight matching in $G-\{u\}$ for each node $u$ of a bipartite graph $G$ with integer edge weights. If $G$ has $n$ nodes, $m$ edges and maximum edge weight $N$, the algorithm takes $O(\sqrt{n}\,m\log(nN))$ time, which matches the best known time bound for computing a single maximum weight matching of $G$ due to Gabow and Tarjan [13].

In §2, we solve the all-cavity matching problem. In §3, we formally define maximum agreement subtrees and outline our recursive strategy for computing them. We describe label compression in §4, detail our subtree algorithm in §5, and discuss how to compute auxiliary information for label compression in §6 and §7. We conclude by extending the subtree algorithm to mixed trees in §8.

2. All-cavity maximum weight matching. Let $G=(X,Y,E)$ be a bipartite graph with $n$ nodes and $m$ edges where each edge $(u,v)$ has a positive integer weight $w(u,v)\le N$. Let $\mathrm{mwm}(G)$ denote the weight of a maximum weight matching in $G$. The all-cavity matching problem asks for $\mathrm{mwm}(G-\{u\})$ for all $u\in X\cup Y$. A naive approach to solve this problem is to compute $\mathrm{mwm}(G-\{u\})$ separately for each $u$ using the fastest algorithm for computing a single maximum weight matching [13], thus taking $O(n^{1.5}m\log(nN))$ total time. A main finding of this paper is that the matchings in different subgraphs $G-\{u\}$ are closely related and can be represented succinctly. From this representation, we can solve the problem in $O(\sqrt{n}\,m\log(nN))$ time. By symmetry, we only detail how to compute $\mathrm{mwm}(G-\{u\})$ for all $u\in X$.
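
As a point of reference, the naive approach described above can be sketched as follows. This is only an illustrative baseline under stated assumptions: the function names `mwm` and `all_cavity_naive` and the example graph are hypothetical, and a brute-force search (exponential time, suitable only for tiny graphs) stands in for the Gabow–Tarjan matching algorithm [13].

```python
def mwm(xs, ys, w):
    """Weight of a maximum weight matching of the bipartite graph (xs, ys, w).

    w maps (x, y) pairs to positive integer weights. Brute force over the
    nodes of X: exponential time, intended only for very small graphs.
    """
    xs = list(xs)
    ys = list(ys)

    def best(i, used):
        if i == len(xs):
            return 0
        result = best(i + 1, used)  # leave xs[i] unmatched
        for y in ys:
            if y not in used and (xs[i], y) in w:
                result = max(result, w[(xs[i], y)] + best(i + 1, used | {y}))
        return result

    return best(0, frozenset())


def all_cavity_naive(xs, ys, w):
    """Return {u: mwm(G - {u})} by recomputing a matching for every node u."""
    cavity = {}
    for u in xs:
        cavity[u] = mwm([x for x in xs if x != u], ys, w)
    for u in ys:
        cavity[u] = mwm(xs, [y for y in ys if y != u], w)
    return cavity


# Hypothetical example graph (not the one in Figure 2.1).
X = ["x1", "x2"]
Y = ["y1", "y2"]
W = {("x1", "y1"): 5, ("x1", "y2"): 2, ("x2", "y1"): 4}
```

The sketch recomputes one matching per removed node, which is exactly the redundancy that the rest of this section avoids.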
Below we assume $m\ge n/2$; otherwise, we remove the degree-zero nodes and work on the smaller resultant graph.

A node $v$ of $G$ is matched by a matching of $G$ if $v$ is an endpoint of an edge in the matching. In the remainder of this section, let $M$ be a fixed maximum weight matching of $G$; also let $w(H)$ be the total weight of a set $H$ of edges. An alternating path is a simple path $P$ in $G$ such that (1) $P$ starts with an edge in $M$, (2) the edges of $P$ alternate between $M$ and $E-M$, and (3) if $P$ ends at an edge $(u,v)\notin M$, then $v$ is not matched by $M$. An alternating cycle is a simple cycle $C$ in $G$ whose edges alternate between $M$ and $E-M$. $P$ (respectively, $C$) can transform $M$ to another matching $M' = P\cup M - P\cap M$ (respectively, $C\cup M - C\cap M$). The net change induced by $P$, denoted by $\Delta(P)$, is $w(M')-w(M)$, i.e., the total weight of the edges of $P$ in $E-M$ minus that of the edges of $P$ in $M$. The net change induced by $C$ is defined similarly.

The next lemma divides the computation of $\mathrm{mwm}(G-\{u\})$ into two cases.

Lemma 2.1. Let $u\in X$.
1. If $u$ is not matched by $M$, then $M$ is also a maximum weight matching in $G-\{u\}$ and $\mathrm{mwm}(G-\{u\})=\mathrm{mwm}(G)$.
2. If $u$ is matched by $M$, then $G$ contains an alternating path $P$ starting from $u$, which can transform $M$ to a maximum weight matching in $G-\{u\}$.

[Fig. 2.1. (a) A bipartite graph $G$; (b) the corresponding directed graph $D$.]

Proof. Statement 1 is straightforward. To prove Statement 2, let $M'$ be a maximum weight matching in $G-\{u\}$. Consider the edges in $M\cup M' - M\cap M'$. They form a set $S$ of alternating paths and cycles. Since $u$ is matched by $M$ but not by $M'$, $u$ is of degree one in $M\cup M' - M\cap M'$. Let $P$ be the alternating path in $S$ with $u$ as an endpoint. Let $M''$ be the matching obtained by transforming $M$ only with $P$. Since $u$ is not matched by $M''$, $M''$ is a matching in $G-\{u\}$.
$M'$ can be obtained by further transforming $M''$ with the remaining alternating paths and cycles in $S$. The net change induced by each of these alternating paths and cycles is non-positive; otherwise, such a path or cycle can improve $M$ and we obtain a contradiction. Therefore, $w(M'')\ge w(M')$, i.e., both $M'$ and $M''$ are maximum weight matchings in $G-\{u\}$.

By Lemma 2.1(2), we can compute $\mathrm{mwm}(G-\{u\})$ for any $u\in X$ matched by $M$ by finding the alternating path starting from $u$ with the largest net change. Below we construct a directed graph $D$, which enables us to identify such an alternating path for every node easily. The node set of $D$ is $X\cup Y\cup\{t\}$, where $t$ is a new node. The edge set of $D$ is defined as follows; see Figure 2.1 for an example.
• If $x\in X$ is not matched by $M$, $D$ has an edge from $x$ to $t$ with weight zero.
• If $y\in Y$ is matched by $M$, $D$ has an edge from $y$ to $t$ with weight zero.
• If $M$ has an edge $(x,y)$ where $x\in X$ and $y\in Y$, $D$ has an edge from $x$ to $y$ with weight $-w(x,y)$.
• If $E-M$ has an edge $(x,y)$ where $x\in X$ and $y\in Y$, $D$ has an edge from $y$ to $x$ with weight $w(x,y)$.
Note that $D$ has $n+1$ nodes and at most $n+m$ edges. The weight of each edge in $D$ is an integer in $[-N,N]$.

Lemma 2.2.
1. $D$ contains no positive-weight cycle.
2. Each alternating path $P$ in $G$ that starts from $u\in X$ corresponds to a simple path $Q$ in $D$ from $u$ to $t$, and vice versa. Also, $\Delta(P)=w(Q)$.
3. For each $u\in X$ matched by $M$, $\mathrm{mwm}(G-\{u\})$ is the sum of $\mathrm{mwm}(G)$ and the weight of the longest path in $D$ from $u$ to $t$.

Proof.
Statement 1. Consider a simple cycle $C=u_1,u_2,\ldots,u_k,u_1$ in $D$. Since $t$ has no outgoing edges, no $u_i$ equals $t$. By the definition of $D$, $C$ is also an alternating cycle in $G$. Therefore, $w(C)$ is the net change induced by transforming $M$ with $C$. Since $M$ is a maximum weight matching in $G$, this net change is non-positive.
Statement 2. Consider an alternating path $P=u,u_1,u_2,\ldots,u_k$ in $G$ starting from $u$. In $D$, $P$ is also a simple path.
If $u_k\in X$, then $u_k$ is not matched by $M$, and $D$ contains the edge $(u_k,t)$. If $u_k\in Y$, then $u_k$ is matched by $M$, and $D$ again contains the edge $(u_k,t)$. Therefore, $D$ contains the simple path $Q=u,u_1,u_2,\ldots,u_k,t$. The weight of $Q$ is $\Delta(P)$. The reverse direction of the statement is straightforward.
Statement 3. This statement follows from Lemma 2.1(2) and Statement 2.

Theorem 2.3. Given $G$, we can compute $\mathrm{mwm}(G-\{u\})$ for all nodes $u\in G$ in $O(\sqrt{n}\,m\log(nN))$ time.
Proof. By symmetry and Lemmas 2.1(1) and 2.2(3), we compute $\mathrm{mwm}(G-\{u\})$ for all $u\in X$ as follows.
1. Compute a maximum weight matching $M$ of $G$.
2. Construct $D$ as above and find the weights of its longest paths to $t$.
3. For each $u\in X$, if $u$ is matched by $M$, then $\mathrm{mwm}(G-\{u\})$ is the sum of $\mathrm{mwm}(G)$ and the weight of the longest path from $u$ to $t$ in $D$; otherwise, $\mathrm{mwm}(G-\{u\})=\mathrm{mwm}(G)$.
Step 1 takes $O(\sqrt{n}\,m\log(nN))$ time. At Step 2, constructing $D$ takes $O(n+m)$ time, and the single-destination longest paths problem takes $O(\sqrt{n}\,m\log N)$ time [14]. Step 3 takes $O(n)$ time. Thus, the total time is $O(\sqrt{n}\,m\log(nN))$.

3. The main result. This section gives a formal definition of maximum agreement subtrees and an overview of our new subtree algorithm.

3.1. Basics. Throughout the remainder of this paper, unrooted trees are denoted by $U$ or $X$, and rooted trees by $T$, $W$ or $R$. A node of degree 0 or 1 is a leaf; otherwise, it is internal. Adopted to avoid technical trivialities, this definition is somewhat nonstandard in that if the root of a rooted tree is of degree 1, it is also a leaf.

For an unrooted tree $U$ and a node $u\in U$, let $U^u$ denote the rooted tree constructed by rooting $U$ at $u$. For a rooted tree $T$ and a node $v\in T$, let $T^v$ denote the rooted subtree of $T$ that comprises $v$ and its descendants. Similarly, for a node $v\in U^u$, $U^{uv}$ is the rooted subtree of $U^u$ rooted at $v$, which is also called a rooted subtree of $U$.

An evolutionary tree is a tree whose leaves are labeled with distinct symbols.
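
The rooting notation $U^u$ and $U^{uv}$ introduced in §3.1 can be made concrete with a small sketch. The adjacency-list representation, the helper names `root_tree` and `subtree_nodes`, and the example tree are assumptions made for illustration; they are not part of the paper.

```python
def root_tree(adj, u):
    """Root an unrooted tree at u; return a dict mapping each node to its children.

    adj maps every node to the list of its neighbors.
    """
    children = {u: []}
    stack = [(u, None)]
    while stack:
        node, parent = stack.pop()
        for nb in adj[node]:
            if nb != parent:
                children[node].append(nb)
                children[nb] = []
                stack.append((nb, node))
    return children


def subtree_nodes(children, v):
    """Nodes of the rooted subtree consisting of v and its descendants."""
    nodes, stack = set(), [v]
    while stack:
        node = stack.pop()
        nodes.add(node)
        stack.extend(children[node])
    return nodes


# Hypothetical unrooted tree: internal nodes p, q; leaves a, b, c, d.
U = {"a": ["p"], "b": ["p"], "p": ["a", "b", "q"],
     "q": ["p", "c", "d"], "c": ["q"], "d": ["q"]}
U_p = root_tree(U, "p")          # the rooted tree U^p
U_pq = subtree_nodes(U_p, "q")   # node set of the rooted subtree U^{pq}
```

Rooting the same tree at a leaf, for example `root_tree(U, "a")`, gives a root of degree 1, which under the convention above still counts as a leaf.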
Let $T$ be a rooted evolutionary tree with leaves labeled over a set $L$. A label subset $L'\subseteq L$ induces a subtree of $T$, denoted by $T|L'$, whose nodes are the leaves of $T$ labeled over $L'$ as well as the least common ancestors of such leaves in $T$, and whose edges preserve the ancestor-descendant relationship of $T$. Consider two rooted evolutionary trees $T_1$ and $T_2$ labeled over $L$. Let $T_1'$ be a subtree of $T_1$ induced by some subset of $L$. We similarly define $T_2'$ for $T_2$. If there exists an isomorphism between $T_1'$ and $T_2'$ mapping each leaf in $T_1'$ to one in $T_2'$ with the same label, then $T_1'$ and $T_2'$ are each called agreement subtrees of $T_1$ and $T_2$. Note that this isomorphism is unique. Consider any nodes $u\in T_1$ and $v\in T_2$. We say that $u$ is mapped to $v$ in $T_1'$ and $T_2'$ if this isomorphism maps $u$ to $v$. A maximum agreement subtree of $T_1$ and $T_2$ is one containing the largest possible number of labels. Let $\mathrm{mast}(T_1,T_2)$ denote the number of labels in such a tree. A maximum agreement subtree of two unrooted evolutionary trees $U_1$ and $U_2$ is one with the largest number of labels among the maximum agreement subtrees of $U_1^u$ and $U_2^v$ over all nodes $u\in U_1$ and $v\in U_2$. Let

(3.1)    $\mathrm{mast}(U_1,U_2)=\max\{\mathrm{mast}(U_1^u,U_2^v)\mid u\in U_1,\ v\in U_2\}$.

Remark. The nodes $u$ (or $v$) can be restricted to internal nodes when the trees have at least three nodes.

We can also generalize the above definition to handle a pair of rooted tree and unrooted tree $(T,U)$. That is, $\mathrm{mast}(T,U)$ is defined to be $\max\{\mathrm{mast}(T,U^v)\mid v\in U\}$.

3.2. Our subtree algorithm. The next theorem is our main result. The size $|U|$ (or $|T|$) of an unrooted tree $U$ (or a rooted tree $T$) is its node count.

Theorem 3.1. Let $U_1$ and $U_2$ be two unrooted evolutionary trees. We can compute $\mathrm{mast}(U_1,U_2)$ in $O(N^{1.5}\log N)$ time, where $N=\max\{|U_1|,|U_2|\}$.

We prove Theorem 3.1 by presenting our algorithm in a top-down manner with an outline here.
As in previous work, our algorithm only computes $\mathrm{mast}(U_1,U_2)$ and can be augmented to report a corresponding subtree. It uses graph separators. A separator of a tree is an internal node whose removal divides the tree into connected components each containing at most half of the tree's nodes. Every tree that contains at least three nodes has a separator, which can be found in linear time.

If $U_1$ or $U_2$ has at most two nodes, $\mathrm{mast}(U_1,U_2)$ as defined in Equation (3.1) can easily be computed in $O(N)$ time. Otherwise, both trees have at least three nodes each, and we can find a separator $x$ of $U_1$. We then consider three cases.

Case 1: In some maximum agreement subtree of $U_1$ and $U_2$, the node $x$ is mapped to a node $y\in U_2$. In this case, $\mathrm{mast}(U_1,U_2)=\mathrm{mast}(U_1^x,U_2)$. To compute $\mathrm{mast}(U_1^x,U_2)$, we might simply evaluate $\mathrm{mast}(U_1^x,U_2^y)$ for different $y$ in $U_2$. This approach involves solving the mast problem for $\Theta(N)$ different pairs of rooted trees and introduces much redundant computation. For example, consider a rooted subtree $R$ of $U_2$. For all $y\in U_2-R$, $R$ is a common subtree of the trees $U_2^y$. Hence, $R$ is examined repeatedly in the computation of $\mathrm{mast}(U_1^x,U_2^y)$ for these $y$. To speed up the computation, we devise the technique of label compression in §4 to elicit sufficient information between $U_1^x$ and $R$ so that we can compute $\mathrm{mast}(U_1^x,U_2^y)$ for all $y\in U_2-R$ without examining $R$. This leads to an efficient algorithm for handling Case 1; its time complexity is stated in the following lemma.

Lemma 3.2. Assume that $U_1$ and $U_2$ have at least three nodes each. Given an internal node $x\in U_1$, we can compute $\mathrm{mast}(U_1^x,U_2)$ in $O(N^{1.5}\log N)$ time.
Proof. See §4 to §7.

Case 2: In some maximum agreement subtree of $U_1$ and $U_2$, two certain nodes $v_1$ and $v_2$ of $U_1$ are mapped to nodes in $U_2$, and $x$ is on the path in $U_1$ between $v_1$ and $v_2$. This case is similar to Case 1. Let $\tilde{U}_2$ be the tree constructed by adding a dummy node in the middle of every edge in $U_2$. Then, $\mathrm{mast}(U_1,U_2)=\mathrm{mast}(U_1^x,\tilde{U}_2^y)$ for some dummy node $y$ in $\tilde{U}_2$.
Thus, $\mathrm{mast}(U_1,U_2)=\mathrm{mast}(U_1^x,\tilde{U}_2)$. As in Case 1, $\mathrm{mast}(U_1^x,\tilde{U}_2)$ can be computed in $O(N^{1.5}\log N)$ time.

Case 3: None of the above two cases. Let $U_{1,1},U_{1,2},\ldots,U_{1,b}$ be the evolutionary trees formed by the connected components of $U_1-\{x\}$. Let $J_1,\ldots,J_b$ be the sets of labels in these components, respectively. Then, a maximum agreement subtree of $U_1$ and $U_2$ is labeled over some $J_i$. Therefore, $\mathrm{mast}(U_1,U_2)=\max\{\mathrm{mast}(U_{1,i},U_2|J_i)\mid i\in[1,b]\}$, and we compute each $\mathrm{mast}(U_{1,i},U_2|J_i)$ recursively.

    /* $U_1$ and $U_2$ are unrooted trees. */
    mast($U_1$, $U_2$)
        find a separator $x$ of $U_1$;
        construct $\tilde{U}_2$ by adding a dummy node $w$ at the middle of each edge $(u,v)$ in $U_2$;
        val = mast($U_1^x$, $U_2$);
        val′ = mast($U_1^x$, $\tilde{U}_2$);
        let $U_{1,1},U_{1,2},\ldots,U_{1,b}$ be the connected components of $U_1-\{x\}$;
        for all $i\in[1,b]$, let $J_i$ be the set of labels of $U_{1,i}$;
        for all $i\in[1,b]$, set val$_i$ = mast($U_{1,i}$, $U_2|J_i$);
        return max{val, val′, $\max_{1\le i\le b}$ val$_i$};

    Fig. 3.1. Algorithm for computing $\mathrm{mast}(U_1,U_2)$.

Figure 3.1 summarizes the steps for computing $\mathrm{mast}(U_1,U_2)$. Here we analyze the time complexity $T(N)$ based on Lemma 3.2. Cases 1 and 2 each take $O(N^{1.5}\log N)$ time. Let $N_i=|U_{1,i}|$. Then Case 3 takes $\sum_{i\in[1,b]}T(N_i)$ time. By recursion,

    $T(N)=O(N^{1.5}\log N)+\sum_{i\in[1,b]}T(N_i)$.

Since $x$ is a separator of $U_1$, $N_i\le N/2$. Then, since $\sum_{i\in[1,b]}N_i\le N$, $T(N)=O(N^{1.5}\log N)$ [5,19] and the time bound in Theorem 3.1 follows.

To complete the proof of Theorem 3.1, we devote §4 through §7 to proving Lemma 3.2.

4. Label compressions. To compute a maximum agreement subtree, our algorithm recursively processes overlapping subtrees of the input trees. The technique of label compression compresses overlapping parts of such subtrees to reduce their total size. We define label compressions with respect to a rooted subtree in §4.1 and with respect to two label-disjoint rooted subtrees in §4.2. We do not use label compression with respect to three or more trees.
As a warm-up, let us define a concept called subtree shrinking, which is a primitive form of label compression. Let $T$ be a rooted tree. Let $R$ be a rooted subtree of $T$. Let $T\ominus R$ denote the rooted tree obtained by replacing $R$ with a leaf $\gamma$. We say that $\gamma$ is a shrunk leaf. The other leaves are atomic leaves. Similarly, for two label-disjoint rooted subtrees $R_1$ and $R_2$ of $T$, let $T\ominus(R_1,R_2)$ denote the rooted tree obtained by replacing $R_1$ and $R_2$ with shrunk leaves $\gamma_1$ and $\gamma_2$, respectively. We extend these notions to an unrooted tree $U$ and define $U\ominus R$ and $U\ominus(R_1,R_2)$ similarly.

4.1. Label compression with respect to one rooted subtree. Let $T$ be a rooted tree. Let $v$ be a node in $T$ and $u$ an ancestor of $v$. Let $P$ be the path of $T$ from $u$ to $v$. A node lies between $u$ and $v$ if it is in $P$ but differs from $u$ and $v$. A subtree of $T$ is attached to $u$ if it is some $T^w$ where $w$ is a child of $u$. A subtree of $T$ hangs between $u$ and $v$ if it is attached to some node lying between $u$ and $v$, but its root is not in $P$ and is not $v$.

We are now ready to define the concept of label compression. Let $T$ and $R$ be rooted evolutionary trees labeled over $L$ and $K$, respectively. The compression of $T$ with respect to $R$, denoted by $T\otimes R$, is a tree constructed by affixing extra nodes to $T|(L-K)$ with the following steps; see Figure 4.1 for an example.

[Fig. 4.1. An example of label compression, showing $T$, $T'$, $T\otimes R$, and $T'\ominus R$.]

For each node $y$ in $T|(L-K)$, let $x$ be its parent in $T|(L-K)$.
• Let $\mathcal{A}(T,K,y)$ denote the set of subtrees of $T$ that are attached to $y$ and whose leaves are all labeled over $K$. If $\mathcal{A}(T,K,y)$ is non-empty, compress all the trees in $\mathcal{A}(T,K,y)$ into a single node $z_1$ and attach it to $y$.
• Let $\mathcal{H}(T,K,y)$ denote the set of subtrees of $T$ that hang between $x$ and $y$ (by definition of $T|(L-K)$, these subtrees are all labeled over $K$).
  If $\mathcal{H}(T,K,y)$ is non-empty, compress the parents $p_1,\ldots,p_m$ of the roots of the trees in $\mathcal{H}(T,K,y)$ into a single node $p_1$, and insert it between $x$ and $y$; also compress all the trees in $\mathcal{H}(T,K,y)$ into a single node $z_2$ and attach it to $p_1$.
The nodes $z_1$, $z_2$ and $p_1$ are called compressed nodes, and the leaves in $T\otimes R$ that are not compressed are atomic leaves.

We further store in $T\otimes R$ some auxiliary information about the relationship between $T$ and $R$. For an internal node $v$ in $T\otimes R$, let $\alpha(v)=\mathrm{mast}(T^v,R)$. For a compressed leaf $v$ in $T\otimes R$, if it is compressed from a set of subtrees $T^{v_1},\ldots,T^{v_s}$, let $\alpha(v)=\max\{\mathrm{mast}(T^{v_1},R),\ldots,\mathrm{mast}(T^{v_s},R)\}$.

Let $T_1$ and $T_2$ be two rooted evolutionary trees. Assume $T_2$ contains a rooted subtree $R$. Given $T_1\otimes R$, we can compute $\mathrm{mast}(T_1,T_2)$ without examining $R$. We first construct $T_2\ominus R$ by replacing $R$ of $T_2$ with a shrunk leaf and then compute $\mathrm{mast}(T_1,T_2)$ from $T_1\otimes R$ and $T_2\ominus R$. To further our discussion, we next generalize the definition of maximum agreement subtree for a pair of trees that contain compressed leaves and a shrunk leaf, respectively.

Let $W_1=T_1\otimes R$ and $W_2=T_2\ominus R$. Let $\gamma$ be the shrunk leaf in $W_2$. We define an agreement subtree of $W_1$ and $W_2$ similar to that of ordinary evolutionary trees. An atomic leaf must still be mapped to an atomic leaf with the same label. However, the shrunk leaf $\gamma$ of $W_2$ can be mapped to any internal node or compressed leaf $v$ of $W_1$ as long as $\alpha(v)>0$. The size of an agreement subtree is the number of its atomic leaves, plus $\alpha(v)$ if $\gamma$ is mapped to a node $v\in W_1$. A maximum agreement subtree of $W_1$ and $W_2$ is one with the largest size. Let $\mathrm{mast}(W_1,W_2)$ denote the size of such a subtree. The following lemma is the cornerstone of label compression.

Lemma 4.1. $\mathrm{mast}(T_1,T_2)=\mathrm{mast}(W_1,W_2)$.
Proof. It follows directly from the definition.

We can compute $\mathrm{mast}(W_1,W_2)$ as if $W_1$ and $W_2$ were ordinary rooted evolutionary trees [9,11,20] with a special procedure on handling the shrunk leaf.
The time complexity is stated in the following lemma. Let $n=\max\{|W_1|,|W_2|\}$ and $N=\max\{|T_1|,|T_2|\}$.

Lemma 4.2. Suppose that all the auxiliary information of $W_1$ has been given. Then $\mathrm{mast}(W_1,W_2)$ can be computed in $O(n^{1.5}\log N)$ time and afterwards we can retrieve $\mathrm{mast}(W_1^v,W_2)$ for any node $v\in W_1$ in $O(1)$ time.
Proof. We adapt Farach and Thorup's rooted subtree algorithm [11] to compute $\mathrm{mast}(W_1,W_2)$. Details are given in §A.

We demonstrate a scenario where label compression speeds up the computation of $\mathrm{mast}(U_1^x,U_2)$ for Lemma 3.2. Suppose that we can identify a rooted subtree $R$ of $U_2$ such that $x$ is mapped to a node outside $R$, i.e., we can reduce Equation (3.1) to

(4.1)    $\mathrm{mast}(U_1^x,U_2)=\max\{\mathrm{mast}(U_1^x,U_2^y)\mid y \text{ is an internal node not in } R\}$.

Note that every $U_2^y$ contains $R$ as a common subtree. To avoid overlapping computation on $R$, we construct $W=U_1^x\otimes R$ and $X=U_2\ominus R$. Then $X^y=U_2^y\ominus R$ and from Lemma 4.1, $\mathrm{mast}(U_1^x,U_2^y)=\mathrm{mast}(W,X^y)$. We rewrite Equation (4.1) as

(4.2)    $\mathrm{mast}(U_1^x,U_2)=\max\{\mathrm{mast}(W,X^y)\mid y \text{ is an internal node of } X\}$.

If $R$ is large, then $W$ and $X$ are much smaller than $U_1^x$ and $U_2$. Consequently, it is beneficial to compress $U_1^x$ and compute $\mathrm{mast}(U_1^x,U_2)$ according to Equation (4.2).

4.2. Label compression with respect to two rooted subtrees. Let $T$, $R_1$, $R_2$ be rooted evolutionary trees labeled over $L$, $K_1$, $K_2$, respectively, where $K_1\cap K_2=\emptyset$. Let $K=K_1\cup K_2$. The compression of $T$ with respect to $R_1$ and $R_2$, denoted by $T\otimes(R_1,R_2)$, is a tree constructed from $T|(L-K)$ by the following two steps. For each node $y$ and its parent $x$ in $T|(L-K)$,
1. if $\mathcal{A}(T,K,y)$ is non-empty, compress all the trees in $\mathcal{A}(T,K,y)$ into a single leaf $z$ and attach it to $y$; create and attach an auxiliary node $\bar{z}$ to $y$;
2. if $\mathcal{H}(T,K,y)$ is non-empty, compress the parents $p_1,\ldots,p_m$ of the roots of the subtrees in $\mathcal{H}(T,K,y)$ into a single node $p_1$ and insert it between $x$ and $y$; compress the subtrees in $\mathcal{H}(T,K,y)$ into a single node $z$ and attach it to $p_1$; create and insert an auxiliary node $\bar{p}_1$ between $p_1$ and $y$; create auxiliary nodes $\bar{z}$ and $\bar{\bar{z}}$ and attach them to $p_1$ and $\bar{p}_1$, respectively.
The nodes $p_1$ and $z$ are compressed nodes of $T\otimes(R_1,R_2)$. The nodes $\bar{p}_1$, $\bar{z}$, and $\bar{\bar{z}}$ are auxiliary nodes. These nodes are added to capture the topology of $T$ that is isomorphic with the subtrees $R_1$ and $R_2$ of $T'$.

We also store auxiliary information in $T\otimes(R_1,R_2)$. Let $R_+$ be the tree obtained by connecting $R_1$ and $R_2$ together with a node, which becomes the root of $R_+$.

Consider the internal nodes of $T\otimes(R_1,R_2)$. If $v$ is an internal node inherited from $T|(L-K)$, then let $\alpha_1(v)=\mathrm{mast}(T^v,R_1)$ and $\alpha_2(v)=\mathrm{mast}(T^v,R_2)$. If $p_1$ and $\bar{p}_1$ are internal nodes compressed from some path $p_1,\ldots,p_m$ of $T$, then only $p_1$ stores the values $\alpha_1(p_1)=\mathrm{mast}(T^{p_1},R_1)$, $\alpha_2(p_1)=\mathrm{mast}(T^{p_1},R_2)$, and $\alpha_+(p_1)=\mathrm{mast}(T^{p_1},R_+)$.

We do not store any auxiliary information at the atomic leaves in $T\otimes(R_1,R_2)$. Consider the other leaves in $T\otimes(R_1,R_2)$ based on how they are created.

Case 1: Nodes $z$, $\bar{z}$ are leaves created with respect to $\mathcal{A}(T,K,y)$ for some node $y$ in $T|(L-K)$. Let $\mathcal{A}(T,K,y)=\{T^{v_1},\ldots,T^{v_k}\}$. We store the following values at $z$.
• $\alpha_1(z)=\max\{\mathrm{mast}(T^{v_i},R_1)\mid i\in[1,k]\}$, $\alpha_2(z)=\max\{\mathrm{mast}(T^{v_i},R_2)\mid i\in[1,k]\}$, $\alpha_+(z)=\max\{\mathrm{mast}(T^{v_i},R_+)\mid i\in[1,k]\}$;
• $\beta(z)=\max\{\mathrm{mast}(T^{v_i},R_1)+\mathrm{mast}(T^{v_{i'}},R_2)\mid T^{v_i} \text{ and } T^{v_{i'}} \text{ are distinct subtrees in } \mathcal{A}(T,K,y)\}$.

Case 2: Nodes $z$, $\bar{z}$, and $\bar{\bar{z}}$ are leaves created with respect to the subtrees in $\mathcal{H}(T,K,y)=\{T^{v_1},\ldots,T^{v_k}\}$ for some node $y$ in $T|(L-K)$.
We store the following values at $z$:
• $\alpha_1(z)$, $\alpha_2(z)$, and $\alpha_+(z)$ as in Case 1;
• $\beta(z)=\max\{\mathrm{mast}(T^{v_i},R_1)+\mathrm{mast}(T^{v_j},R_2)\mid T^{v_i} \text{ and } T^{v_j} \text{ are distinct subtrees in } \mathcal{H}(T,K,y) \text{ that are attached to the same node in } T\}$;
• $\beta_{1\succ 2}(z)=\max\{\mathrm{mast}(T^{v_j},R_1)+\mathrm{mast}(T^{v_{j'}},R_2)\mid (j,j')\in Z\}$ and $\beta_{2\succ 1}(z)=\max\{\mathrm{mast}(T^{v_j},R_2)+\mathrm{mast}(T^{v_{j'}},R_1)\mid (j,j')\in Z\}$, where $Z=\{(j,j')\mid T^{v_j},T^{v_{j'}}\in\mathcal{H}(T,K,y) \text{ and the parent of } v_j \text{ in } T \text{ is a proper ancestor of the parent of } v_{j'}\}$.

Let $T_1$ and $T_2$ be rooted evolutionary trees. Let $R_1$ and $R_2$ be label-disjoint rooted subtrees of $T_2$. Let $W_1=T_1\otimes(R_1,R_2)$ and $W_2=T_2\ominus(R_1,R_2)$. Below, we give the definition of a maximum agreement subtree of $W_1$ and $W_2$.

Let $\gamma_1$ and $\gamma_2$ be the two shrunk leaves in $W_2$ representing $R_1$ and $R_2$, respectively. Let $y_c$ be the least common ancestor of $\gamma_1$ and $\gamma_2$ in $W_2$. Intuitively, in a pair of agreement subtrees $(W_1',W_2')$ of $W_1$ and $W_2$, atomic leaves are mapped to atomic leaves, and shrunk leaves are mapped to internal nodes or leaves. Moreover, we allow $W_2'$ to contain $y_c$ as a leaf, which can be mapped to an internal node or leaf of $W_1'$. More formally, we require that there is an isomorphism between $W_1'$ and $W_2'$ satisfying the following conditions:
1. Every atomic leaf is mapped to an atomic leaf with the same label.
2. If $W_2'$ contains $y_c$ as a leaf and thus neither $\gamma_1$ nor $\gamma_2$ is found in $W_2'$, then $y_c$ is mapped to a node $v$ with $\alpha_+(v)>0$.
3. If only one of $\gamma_1$ and $\gamma_2$ exists in $W_2'$, say $\gamma_1$, then it is mapped to a node $v$ with $\alpha_1(v)>0$.
4. If both $\gamma_1$ and $\gamma_2$ exist in $W_2'$, then any of the following cases is permitted:
   • $\gamma_1$ and $\gamma_2$ are respectively mapped to a compressed leaf $z$ and its sibling $\bar{z}$ in $W_1'$ with $\beta(z)>0$.
   • $\gamma_1$ and $\gamma_2$ are respectively mapped to a compressed leaf $z$ and the accompanying auxiliary leaf $\bar{\bar{z}}$ in $W_1'$ with $\beta_{1\succ 2}(z)>0$, or the leaves $\bar{\bar{z}}$ and $z$ in $W_1'$ with $\beta_{2\succ 1}(z)>0$.
   • $\gamma_1$ and $\gamma_2$ are respectively mapped to two leaves or internal nodes $v$ and $w$ with $\alpha_1(v),\alpha_2(w)>0$.

The way we measure the size of $W_1'$ and $W_2'$ depends on their isomorphism.
For example, if $y_c$ is mapped to some node $v$ in $W_1'$, then the size is the total number of atomic leaves in $W_1'$ plus $\alpha_+(v)$. More precisely, the size of $W_1'$ and $W_2'$ is defined to be the total number of atomic leaves in $W_1'$ plus the corresponding $\alpha$ or $\beta$ values depending on the isomorphism between $W_1'$ and $W_2'$. A maximum agreement subtree of $W_1$ and $W_2$ is one with the largest possible size. Let $\mathrm{mast}(W_1,W_2)$ denote the size of such a subtree. The following lemma, like Lemma 4.1, is a cornerstone of label compression.

Lemma 4.3. $\mathrm{mast}(T_1,T_2)=\mathrm{mast}(W_1,W_2)$.
Proof. It follows directly from the definition of $\mathrm{mast}(W_1,W_2)$.

Again, $\mathrm{mast}(W_1,W_2)$ can be computed by adapting Farach and Thorup's rooted subtree algorithm [11]. The time complexity is stated in the following lemma. Let $n=\max\{|W_1|,|W_2|\}$ and $N=\max\{|T_1|,|T_2|\}$.

Lemma 4.4. Suppose that all the auxiliary information of $W_1$ has been given. Then we can compute $\mathrm{mast}(W_1,W_2)$ in $O(n^{1.5}\log N)$ time. Afterwards we can retrieve $\mathrm{mast}(W_1^v,W_2)$ for any $v\in W_1$ in $O(1)$ time.
Proof. See §A.

5. Computing $\mathrm{mast}(U_1^x,U_2)$: Proof of Lemma 3.2. At a high level, we first apply label compression to the input instance $(U_1^x,U_2)$. We then reduce the problem to a number of smaller subproblems $(W,X)$, each of which is similar to $(U_1^x,U_2)$ and is solved recursively. For each $(W,X)$ generated, $X$ is a subtree of $U_2$ with at most two shrunk leaves, and $W$ is a label compression of $U_1^x$ with respect to some rooted subtrees of $U_2$ that are represented by the shrunk leaves of $X$. Also, $W$ and $X$ contain the same number of atomic leaves.

    /* $W$ is a rooted tree with compressed leaves. $X$ is unrooted with shrunk leaves. */
    mast($W$, $X$)
        let $y$ be a separator of $X$;
        val = mast($W$, $X^y$);
        if ($X$ has at most one shrunk leaf) or ($y$ lies between the two shrunk leaves) then
            new_subproblem($W$, $X$, $y$);
            for each $(W_i,X_i)$, set val$_i$ = mast($W_i$, $X_i$);
        else
            let $y'$ be the node on the path between the two shrunk leaves that is the closest to $y$;
            val = mast($W$, $X^{y'}$);
            new_subproblem($W$, $X$, $y'$);
            for each $(W_i,X_i)$, set val$_i$ = mast($W_i$, $X_i$);
        return max{val, $\max_{i=1}^{b}$ val$_i$};

    /* Generate new subproblems $\{(W_1,X_1),\ldots,(W_b,X_b)\}$. */
    new_subproblem($W$, $X$, $y$)
        let $v_1,\ldots,v_b$ be the neighbors of $y$ in $X$;
        for all $i\in[1,b]$
            let $X_i$ be the unrooted tree formed by shrinking the subtree $X^{v_i y}$ into a shrunk leaf;
            let $W_i$ be the rooted tree formed by compressing $W$ with respect to $X^{v_i y}$;
        compute and store the auxiliary information in $W_i$ for all $i\in[1,b]$;

    Fig. 5.1. Algorithm for computing $\mathrm{mast}(W,X)$.

5.1. Recursive computation of $\mathrm{mast}(W,X)$. Our subtree algorithm initially sets $W=U_1^x$ and $X=U_2$. In general, $W=U_1^x\otimes R$ and $X=U_2\ominus R$, or $W=U_1^x\otimes(R,R')$ and $X=U_2\ominus(R,R')$ for some rooted subtrees $R$ and $R'$ of $U_2$. If $W$ or $X$ has at most two nodes, then $\mathrm{mast}(W,X)$ can easily be computed in linear time. Otherwise, $W$ and $X$ each have at least three nodes. Let $N=\max\{|U_1|,|U_2|\}$ and $n=\max\{|W|,|X|\}$. Our algorithm first finds a separator $y$ of $X$ and computes $\mathrm{mast}(W,X)$ for the following two cases. The output is the larger of the two cases. Figure 5.1 outlines our algorithm.

Case 1: $\mathrm{mast}(W,X)=\mathrm{mast}(W,X^y)$. We root $X$ at $y$ and evaluate $\mathrm{mast}(W,X^y)$. By Lemma 4.4, this takes $O(n^{1.5}\log N)$ time.

Case 2: $\mathrm{mast}(W,X)=\mathrm{mast}(W,X^z)$ for some internal node $z\ne y$. We compute $\max\{\mathrm{mast}(W,X^z)\mid z \text{ is an internal node and } z\ne y\}$ by solving a set of subproblems $\{\mathrm{mast}(W_1,X_1),\ldots,\mathrm{mast}(W_b,X_b)\}$ where their total size is $n$ and $\max\{\mathrm{mast}(W,X^z)\mid z \text{ is an internal node and } z\ne y\}=\max\{\mathrm{mast}(W_i,X_i)\mid i\in[1,b]\}$.
Moreover, our algorithm enforces the following properties.
• If $X$ contains at most one shrunk leaf, every subproblem generated has size at most half that of $X$.
• If $X$ has two shrunk leaves, at most one subproblem $(W_{i_o},X_{i_o})$ has size greater than half that of $X$, but $X_{i_o}$ contains only one shrunk leaf. Thus, in the next recursion level, every subproblem spawned by $(W_{i_o},X_{i_o})$ has size at most half that of $X$.
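
Each level of the recursion above starts by finding a separator of $X$, in the sense defined in §3.2. The following sketch shows one standard linear-time way to find such a node via subtree sizes. It is an illustration only: the paper does not give this procedure, and the adjacency-list input and the name `find_separator` are assumptions.

```python
def find_separator(adj):
    """Return an internal node whose removal leaves components containing
    at most half of the nodes.

    adj maps each node to its neighbor list; the tree is assumed to have
    at least three nodes.
    """
    n = len(adj)
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for node in order:                 # BFS; order grows while we iterate
        for nb in adj[node]:
            if nb != parent[node]:
                parent[nb] = node
                order.append(nb)
    size = {node: 1 for node in adj}
    for node in reversed(order):       # children are processed before parents
        if parent[node] is not None:
            size[parent[node]] += size[node]
    for node in order:
        # Removing node leaves one component per child plus the rest of the tree.
        largest = n - size[node]
        for nb in adj[node]:
            if nb != parent[node]:
                largest = max(largest, size[nb])
        if 2 * largest <= n:
            return node


# Hypothetical examples: a path on five nodes and a star with four leaves.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
star = {1: ["c"], "c": [1, 2, 3, 4], 2: ["c"], 3: ["c"], 4: ["c"]}
```

On a tree with at least three nodes, removing a leaf leaves a component with all other nodes, so the node returned is always internal, matching the definition of a separator.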