Sand et al. BMC Bioinformatics 2013, 14(Suppl 2):S18
http://www.biomedcentral.com/1471-2105/14/S2/S18
PROCEEDINGS Open Access
A practical $O(n \log^2 n)$ time algorithm for
computing the triplet distance on binary trees
Andreas Sand1,2, Gerth Stølting Brodal2,3, Rolf Fagerberg4, Christian NS Pedersen1,2,5, Thomas Mailund1*
From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013)
Vancouver, Canada. 21-24 January 2013
Abstract
The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by
enumerating all subsets of three leaves and counting how often the topologies induced by the two trees are equal or
different. We present an algorithm that computes the triplet distance between two rooted binary trees in time
$O(n \log^2 n)$. The algorithm is related to an algorithm for computing the quartet distance between two unrooted
binary trees in time $O(n \log n)$. While the quartet distance algorithm has a very severe overhead in the asymptotic
time complexity that makes it impractical compared to $O(n^2)$ time algorithms, we show through experiments that
the triplet distance algorithm can be implemented to give a competitive wall-clock running time.
Background
Using trees to represent relationships is widespread in many scientific fields, in particular in biology where trees are used e.g. to represent species relationships, so called phylogenies, the relationship between genes in gene families or for hierarchical clustering of high-throughput experimental data. Common for these applications is that differences in the data used for constructing the trees, or differences in the computational approach for constructing the trees, can lead to slightly different trees on the same set of leaf IDs.
To compare such trees, distance measures are often used. Common distance measures include the Robinson-Foulds distance [1], the triplet distance [2], and the quartet distance [3]. Common for these three distance measures is that they all enumerate certain features of the trees they compare and count how often the features differ between the two trees. The Robinson-Foulds distance enumerates all edges in the trees and tests if the bipartition they induce is found in both trees. The triplet distance (for rooted trees) and quartet distance (for unrooted trees) enumerate all subsets of leaves of size three and four, respectively, and test if the induced topology of the leaves is the same in the two trees.
Efficient algorithms to compute these three distance measures exist. The Robinson-Foulds distance can be computed in time $O(n)$ [4] for trees with $n$ leaves, which is optimal. The quartet distance can be computed in time $O(n \log n)$ for binary trees [5], in time $O(d^9 n \log n)$ for trees where all nodes have degree less than $d$ [6], and in sub-cubic time for general trees [7]. See also Christiansen et al. [8] for a number of algorithms for general trees with different tradeoffs depending on the degree of inner nodes. For the triplet distance, $O(n^2)$ time algorithms exist for both binary and general trees [2,9].
Brodal et al. [5] present two algorithms for computing the quartet distance for binary trees; one running in time $O(n \log n)$, and one running in time $O(n \log^2 n)$. The latter is the most practical, and was implemented in [10,11], where it was shown to be slower in practice compared to a simple $O(n^2)$ time algorithm [12] unless $n$ is above 2000. In this paper we focus on the triplet distance and develop an $O(n \log^2 n)$ time algorithm for computing this distance between two rooted binary trees. The algorithm is related to the $O(n \log^2 n)$ time algorithm for quartet distance, but its core accounting system is completely changed. As we demonstrate by experiments, the resulting algorithm is not just theoretically efficient but also efficient in practice, as it is faster than a simple $O(n^2)$ time algorithm based on [12] already for $n$ larger than 12, and is e.g. faster by a factor of 50 when $n$ is 2900.

* Correspondence: mailund@birc.au.dk
1 Bioinformatics Research Center, Aarhus University, Denmark
Full list of author information is available at the end of the article
© 2013 Sand et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Methods
The triplet distance measure between two rooted trees with the same set of leaf IDs is based on the topologies induced by a tree when selecting three leaves of the tree. Whenever three leaves, a, b and c, are selected, a tree can induce one of four topologies: It can either group a and b, a and c, or b and c, or it can put them at equal distance to the root, i.e. all three pairs of leaves have the same lowest common ancestor (see Figure 1). For binary trees, the last case is not possible since this would require a node with at least three children.
The triplet distance is the number of triplets whose topology differs in the two trees. It can naïvely be computed by enumerating all $O(n^3)$ sets of three leaves and comparing the induced topologies in the two trees, counting how often the trees agree or disagree on the topology. Triplet topologies in a tree, however, are not independent, and faster algorithms can be constructed by exploiting this to compare whole sets of triplet topologies at once. Critchlow et al. [2] for example exploit information about the depth of shared ancestors of leaves in a tree to achieve an $O(n^2)$ time algorithm for binary trees while Bansal et al. [9] construct a table of shared leaf-sets and achieve an $O(n^2)$ time algorithm for general trees.
For the quartet distance, the analogue to the triplet distance for unrooted trees, Brodal et al. [5] construct an even faster algorithm by identifying sets of four different leaves in one tree through coloring all leaves with one of three different colors and then counting the number of topologies compatible with the coloring in the other tree. Using a variant of the so-called "smaller half trick" for keeping the number of different relevant colorings low, the algorithm manages to construct all relevant colorings with $O(n \log n)$ color changes. The number of topologies compatible with the coloring can then be counted in the other tree using a data structure called a "hierarchical decomposition tree". Maintaining the hierarchical decomposition tree, however, involves a number of polynomial manipulations that, while they can theoretically be done in constant time per polynomial, are quite time consuming in practice [10,11], making the algorithm slow in practice.
A naïve algorithm that computes the quartet distance between two unrooted trees by explicitly inspecting each of the $O(n^4)$ quartets can be modified to compute the triplet distance between two rooted trees without loss of time by adding a new leaf x above the two root nodes and limiting the inspection of quartets to the quartets containing this new leaf. However, the efficient algorithms for computing the quartet distance presented in [5,7] do not explicitly inspect every quartet and therefore cannot be modified to compute the triplet distance following this simple approach.
In the following, we develop an efficient algorithm for computing the triplet distance between two rooted binary trees $T_1$ and $T_2$ with the same set of leaf IDs. Our key contribution is to show how all triplets in one tree, say $T_1$, can be captured by coloring the leaves, and how the smaller half trick lets us enumerate all such colorings in time $O(n \log n)$. We will then construct a hierarchical decomposition tree (HDT) for $T_2$ that counts its number of compatible triplets. Unlike the algorithms for computing the quartet distance [5], where the counting involves manipulations of polynomials, the HDT for triplets involves simple arithmetic computations that are efficient in practice.

Counting shared triplets through leaf colorings
A triplet is a set $\{a, b, c\}$ of three leaf IDs. For a tree $T$, we assign each of the $\binom{n}{3}$ triplets to the lowest common ancestor in $T$ of the three leaves a, b, and c. For a node $v \in T$ we denote by $\tau_v$ the set of triplets assigned to $v$. Then $\{\tau_v \mid v \in T\}$ is a partition of the set $\mathcal{T}$ of triplets. Thus, $\{\tau_v \cap \tau_u \mid v \in T_1, u \in T_2\}$ is also a partition of $\mathcal{T}$. Our algorithm will find
\[ \mathrm{Shared}(\mathcal{T}) = \sum_{v \in T_1} \sum_{u \in T_2} \mathrm{Shared}(\tau_v \cap \tau_u), \]
where $\mathrm{Shared}(S)$ on a set $S$ of triplets is its number of triplets having the same topology in $T_1$ and $T_2$. The triplet distance of $T_1$ and $T_2$ is then $\binom{n}{3} - \mathrm{Shared}(\mathcal{T})$.
In the algorithm, we capture the triplets $\tau_v$ by a coloring of the leaves: if all leaves not in the subtree of $v$ have no color, all leaves in one subtree of $v$ are "red", and all leaves in the other subtree are "blue", then $\tau_v$ is exactly the triplets having two leaves colored "red" and one leaf colored "blue", or two leaves colored "blue" and one leaf colored "red". See Figure 2.
Figure 1 Triplet topologies. The four different triplet topologies.
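The naïve enumeration described in the Methods section can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the authors' implementation: trees are nested 2-tuples with hashable leaf labels, and the names `leaf_paths`, `lca_depth`, `topology` and `triplet_distance_naive` are hypothetical. A triplet's topology is read off from the depths of the three pairwise lowest common ancestors, so the sketch runs in $O(n^3)$ time multiplied by the tree depth.

```python
from itertools import combinations


def leaf_paths(tree, path=(), out=None):
    """Map each leaf label to its root-to-leaf path of child indices (0 or 1)."""
    if out is None:
        out = {}
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            leaf_paths(child, path + (i,), out)
    else:
        out[tree] = path
    return out


def lca_depth(p, q):
    """Depth of the lowest common ancestor: length of the common path prefix."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def topology(paths, a, b, c):
    """Return the pair of leaves grouped together, or None if unresolved."""
    dab = lca_depth(paths[a], paths[b])
    dac = lca_depth(paths[a], paths[c])
    dbc = lca_depth(paths[b], paths[c])
    if dab == dac == dbc:
        return None  # cannot happen in a binary tree
    deepest = max(dab, dac, dbc)
    if dab == deepest:
        return frozenset((a, b))
    if dac == deepest:
        return frozenset((a, c))
    return frozenset((b, c))


def triplet_distance_naive(t1, t2):
    """Count the triplets whose induced topology differs between t1 and t2."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    assert p1.keys() == p2.keys(), "trees must share the same leaf IDs"
    return sum(topology(p1, a, b, c) != topology(p2, a, b, c)
               for a, b, c in combinations(sorted(p1), 3))
```

For the trees `((("a", "b"), "c"), "d")` and `((("a", "c"), "b"), "d")`, only the triplet {a, b, c} is resolved differently, so the distance is 1.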
Figure 2 Coloring when visiting a node. Coloring of a sub-tree rooted in node v lets us count all triplets rooted in v.
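As a small sanity check of the coloring idea, the size of each set $\tau_v$ follows directly from the number of red and blue leaves below $v$: with $r$ leaves in one subtree and $b$ in the other, $|\tau_v| = \binom{r}{2} b + \binom{b}{2} r$. The sketch below assumes trees given as nested 2-tuples, and the function name `triplets_per_node` is hypothetical. It lists these sizes in post-order so that one can confirm they sum to $\binom{n}{3}$, as they must since the $\tau_v$ partition the set of triplets.

```python
from math import comb


def triplets_per_node(tree):
    """Return (leaf_count, sizes), where sizes holds |tau_v| for every
    inner node v of the nested-tuple tree, listed in post-order."""
    if not isinstance(tree, tuple):
        return 1, []
    r, left_sizes = triplets_per_node(tree[0])   # leaves that would be "red"
    b, right_sizes = triplets_per_node(tree[1])  # leaves that would be "blue"
    # Two red and one blue leaf, or two blue and one red leaf.
    here = comb(r, 2) * b + comb(b, 2) * r
    return r + b, left_sizes + right_sizes + [here]
```

For the five-leaf tree `((("a", "b"), "c"), ("d", "e"))` the sizes are 0, 1, 0 and 9, which sum to $\binom{5}{3} = 10$.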
For such a coloring according to a node $v \in T_1$, and for a node $u \in T_2$, the number $\mathrm{Shared}(\tau_v \cap \tau_u)$ could be found as follows: let $x$ and $y$ be the two subtrees of $u$, let $x^{(r)}$ and $y^{(r)}$ be the number of leaves colored red in the subtrees $x$ and $y$, respectively, and $x^{(b)}$ and $y^{(b)}$ the number of leaves colored blue. The number $\mathrm{Shared}(\tau_v \cap \tau_u)$ is then
\[ \binom{x^{(b)}}{2} \cdot y^{(r)} + \binom{y^{(b)}}{2} \cdot x^{(r)} + \binom{y^{(r)}}{2} \cdot x^{(b)} + \binom{x^{(r)}}{2} \cdot y^{(b)}. \]
We call these the triplets of $u$ compatible with the coloring.
Explicitly going through $T_1$ and coloring for each node $v$ would take time $O(n)$ per node, for a total time of $O(n^2)$. We reduce this to $O(n \log n)$ by using the smaller half trick. Going through $T_2$ for each coloring and counting the number of compatible triplets would also take time $O(n^2)$. Using a HDT we find this count in $O(1)$ time, while updating the structure takes time $O(\log n)$ after each leaf color change. The result is $O(n \log^2 n)$ total running time.
In essence, the HDT performs the inner sum of $\mathrm{Shared}(\mathcal{T}) = \sum_{v \in T_1} \sum_{u \in T_2} \mathrm{Shared}(\tau_v \cap \tau_u)$, while the coloring algorithm performs the outer sum.

Smaller half trick
We go through nodes $v$ in a depth first order while maintaining two invariants of the algorithm: 1) Before we process $v$, the entire subtree of $v$ is colored "red" and the rest of the tree has no color; 2) When we return from the depth first recursion, the entire tree has no color.
For $v$ let $S(v)$ denote the smallest subtree of $v$ and let $L(v)$ denote the largest subtree of $v$. We go through the coloring as follows (see Figure 3):

1. Color $S(v)$ "blue". Now $v$ has the coloring that enables us to count the triplets for $v$.
2. Remove the color for $S(v)$. Now we can call recursively on $L(v)$ while satisfying the invariant.
3. Returning from the recursive call, the entire tree is colorless by invariant 2.
4. Color $S(v)$ "red". Now we satisfy invariant 1 for calling recursively on $S(v)$.
5. Call recursively on $S(v)$. When we return we satisfy invariant 2 for returning from the recursive call.

Using this recursive algorithm, we go through all colorings of the tree. In each instance (not counting recursive calls), we only color leaves in $S(v)$, and only a constant number of times. Thus, a leaf $\ell$ is only colored when visiting an ancestor node $v$ where $\ell \in S(v)$, i.e. $\ell$ is in the smaller subtree of $v$. Since $\ell$ can have at most $O(\log n)$ such ancestors, each leaf will only be colored at most $O(\log n)$ times, implying a total of $O(n \log n)$ color changes.

Hierarchical decomposition tree
We build a data structure, the hierarchical decomposition tree (HDT), on top of the second tree $T_2$ in order to count the triplets in $T_2$ compatible with the coloring of leaves in the first tree $T_1$. The HDT is a balanced binary tree where each node corresponds to a connected part of $T_2$. Each node in the HDT, or component, keeps a count of the number of compatible triplets the corresponding part of $T_2$ contains, plus some additional book-keeping that makes it possible to compute this count in each component in constant time using the information stored in the component's children in the HDT.
The HDT contains three different kinds of components:

• L: A leaf in $T_2$,
• I: An inner node in $T_2$,
• C: A connected sub-part of $T_2$,

where for type C we require that at most two edges in $T_2$ cross the boundary of the component; at most one going up towards the root and at most one going down to a subtree.
The leaves and inner nodes of $T_2$ are transformed into L and I components, respectively, and constitute the leaves of the HDT. C components are then formed by
pairwise joining other components along an edge in $T_2$ by one of two compositions, see Figure 4. C components can be thought of as consisting of a path from a sub-tree below the C component going up towards the root of $T_2$, such that all trees branching off to other children along the path are contained in the component. In the following we show how the HDT of $T_2$ can be constructed in time $O(n)$, and we prove that the height of the HDT is $O(\log n)$.
The construction algorithm operates on components and edges. Each component is of one of the types L, I, or C. It has a parent pointer, pointing to its parent component in the HDT, and a bit, down_closed, indicating if the component corresponds to a complete subtree in the original tree. When a component does not yet have a parent in the HDT, we make the parent pointer point to the component itself. We use this pointer to test if an edge has been contracted in the HDT construction. Edges consist of two pointers, an up pointer and a down pointer, that point to the components above and below the edge in $T_2$.
In a single traversal of the tree, the algorithm initially builds a component for each node in the tree (an L component for each leaf and an I component for each inner node) and an edge for each edge in the tree. The parent pointer of each component is initially set to point to the component itself, and the down_closed bit is set to true for L components and false for I components. The edges are put in a list es. We then perform a series of iterations, each constructing one level of the HDT. In each iteration, edges whose neighbors can be joined to form a C component via the constructions in Figure 5 are greedily removed from es, and another list, next, is used to keep the edges that cannot be contracted in this iteration. The details of an iteration are shown in Figure 6. The process stops when es becomes empty.
In case 1, one of e's neighbors already has a parent, thus this neighbor has already been contracted into a C component in this iteration and should not be contracted again. Case 2 is the situation in Figure 5a, and case 3 is the situation in Figure 5b. In case 4, e.down is an I component or e.up is an I component and e.down does not correspond to a complete subtree, hence none of the constructions in Figure 5 apply. After removing all edges from es, the up and down pointers of the remaining edges in next are updated in lines 21-22, such that they point to either a newly created component or still point to the same component. The edges in next are finally moved back to es in line 23, and a new iteration is started. The algorithm finishes when es is empty, and the root of the HDT is the C component resulting from joining the ends of the last edge.
We now argue that the height of the HDT is $O(\log n)$, and that the construction time is $O(n)$. Each iteration of the code in Figure 6 takes time linear in the number $E = |es|$ of edges remaining to be contracted at the beginning of the iteration. The height and time bounds follow if we can argue that the number of edges decreases geometrically for each iteration. In the following we argue that the number of edges after one iteration is at most $\frac{11}{12} \cdot E$.
We first argue that the number of contractible edges at the beginning of the iteration is at least $E/4$. Note that only edges incident to I components might not be contractible, and that the number of down-closed components is at least one larger than the number of I components. If the number of I components is at most $E/4$, then at most $3 \cdot E/4$ incident edges might not be contractible, i.e. at least $E/4$ edges are contractible. Otherwise the number of I components is more than $E/4$, and therefore the number of down-closed components is more than $E/4 + 1$. Since the parent edges from all down-closed components are contractible (for $E \geq 1$), the number of contractible edges is again at least $E/4$.
Since each contracted edge can prevent at most two other edges incident to the two merged components (see Figure 5) from being contracted in the same iteration, each iteration will contract at least 1/3 of the at least $E/4$ contractible edges. It follows that an iteration reduces the number of edges by at least $E/12$.

Figure 3 Coloring algorithm. The five steps of the coloring in the smaller-half trick.

Counting triplets in the hierarchical decomposition tree
In each component we keep track of N, the number of triplets contained within the component (i.e., where the three leaves of the triplet are within the component) that are compatible with the coloring. When we change the coloring, we update the HDT to reflect this, so we can always read off the total number of compatible triplets from the root of the HDT in constant time.
By adding a little book-keeping information to each component we make it possible to compute N (and the
book-keeping information itself) for a component in constant time from the information in the component's children in the HDT. This has two consequences: it makes it possible to add the book-keeping and N to all components after the construction in linear time, and it makes it possible to update the information when a leaf changes color by updating only components on the path from the leaf-component to the root in the HDT, a path that is bounded in length by $O(\log n)$. Since we only change the color of a leaf $O(n \log n)$ times in the triplet distance algorithm, it is this property of the HDT that keeps the total running time at $O(n \log^2 n)$. For the book-keeping we store six numbers in each component in addition to N:

• $R$: The number of leaves colored red in the component.
• $B$: The number of leaves colored blue in the component.
• $\widehat{rr}$: The number of pairs of red leaves found in the same sub-tree of a C component. Let $T_i$, $i = 1, \ldots, n$, denote the sub-trees in the component, see Figure 7a, and let $r^{(i)}$ denote the number of red leaves in tree $T_i$. Then $\widehat{rr} = \sum_{i=1}^{n} \binom{r^{(i)}}{2}$.
• $\widehat{bb}$: The number of pairs of blue leaves found in the same sub-tree of a C component.
• $r{\uparrow}b$: The number of pairs of leaves with a red leaf in one sub-tree and a blue in another sub-tree, where the red leaf is in a tree further down on the path in a C component. Let $T_i$, $i = 1, \ldots, n$, denote the sub-trees in the component, see Figure 7, and let $r^{(i)}$ denote the number of red leaves in tree $T_i$ and $b^{(i)}$ the number of blue leaves in $T_i$. Then $r{\uparrow}b = \sum_{i=1}^{n} \sum_{j=i+1}^{n} r^{(i)} \cdot b^{(j)}$.
• $b{\uparrow}r$: The number of pairs of leaves with a red leaf in one sub-tree and a blue in another sub-tree, where the blue leaf is in a tree further down on the path in a C component.

Figure 4 Component types in the HDT. The three different types of components. L and I components contain a single node from the underlying tree while C components contain a connected set of nodes.

We describe how the book-keeping variables and N are computed through a case-analysis on how the components are constructed. L and I components are constructed in only one way, while C components are constructed in one of two ways (see Figure 5).
L components: For a leaf component, $R$ is 1 if the leaf is colored red and 0 otherwise, $B$ is 1 if the leaf is colored blue and 0 otherwise, and all other counts are 0.
I components: All counts are 0.
C components, case Figure 5a: Let $x$ be one of the counters $R, B, \widehat{rr}, \ldots$ listed above for a C component, and let $x^{(1)}$ and $x^{(2)}$ denote the corresponding counter in component $C_1$ and $C_2$, respectively, with $C_2$ above $C_1$ in the underlying tree. Then
\[ R = R^{(1)} + R^{(2)}, \qquad B = B^{(1)} + B^{(2)}, \]
\[ \widehat{rr} = \widehat{rr}^{(1)} + \widehat{rr}^{(2)}, \qquad \widehat{bb} = \widehat{bb}^{(1)} + \widehat{bb}^{(2)}, \]
\[ r{\uparrow}b = r{\uparrow}b^{(2)} + r{\uparrow}b^{(1)} + R^{(1)} \cdot B^{(2)}, \qquad b{\uparrow}r = b{\uparrow}r^{(2)} + b{\uparrow}r^{(1)} + B^{(1)} \cdot R^{(2)}. \]
The triplet count is then computed as
\[ N = N^{(1)} + N^{(2)} + \binom{R^{(1)}}{2} \cdot B^{(2)} + \binom{B^{(1)}}{2} \cdot R^{(2)} + \widehat{rr}^{(2)} \cdot B^{(1)} + \widehat{bb}^{(2)} \cdot R^{(1)} + R^{(1)} \cdot r{\uparrow}b^{(2)} + B^{(1)} \cdot b{\uparrow}r^{(2)}. \]
C components, case Figure 5b: Let $x^{(1)}$ denote one of the counters listed above for component $C_1$. Then
\[ R = R^{(1)}, \quad B = B^{(1)}, \quad \widehat{rr} = \binom{R^{(1)}}{2}, \quad \widehat{bb} = \binom{B^{(1)}}{2}, \quad r{\uparrow}b = b{\uparrow}r = 0. \]
Since the inner node in the composition does not contain any leaves, the triplet count is simply $N = N^{(1)}$.

Figure 5 Component compositions in the construction of the HDT. The two different ways of constructing a C component by merging two underlying components. The topmost of the components can either be a C component (a) or an I component (b) while the bottommost component, $C_1$, must be a C or L component. If the topmost component is an I component, the bottommost must be downwards closed, i.e. it cannot have a downwards edge crossing its boundary.

Results and discussion
We implemented the algorithm in C++, together with a simple $O(n^2)$ time algorithm that we used to check that it computes the correct triplet distance.
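One simple way to cross-check an implementation is to run the smaller half trick on $T_1$ while replacing the HDT by a plain linear scan of $T_2$ after each coloring. The sketch below does exactly that and is therefore quadratic in the worst case. It is our own illustration, not the authors' $O(n^2)$ algorithm based on [12]. Trees are assumed to be nested 2-tuples, leaf lists are recomputed naively, and the names `leaves`, `compatible` and `shared_triplets` are hypothetical. The returned `changes` counter only tracks leaf recolorings, the quantity bounded by $O(n \log n)$ in the text (plus one uncoloring per leaf in the base case).

```python
from math import comb


def leaves(tree):
    """List the leaf labels below a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def compatible(tree, color):
    """Return (red, blue, count) for `tree`: colored leaf counts and the
    number of triplets rooted at its inner nodes compatible with `color`."""
    if not isinstance(tree, tuple):
        c = color.get(tree)
        return int(c == "red"), int(c == "blue"), 0
    xr, xb, xn = compatible(tree[0], color)
    yr, yb, yn = compatible(tree[1], color)
    here = (comb(xb, 2) * yr + comb(yb, 2) * xr
            + comb(yr, 2) * xb + comb(xr, 2) * yb)
    return xr + yr, xb + yb, xn + yn + here


def shared_triplets(t1, t2):
    """Return (Shared(T), number of leaf color changes)."""
    color = {leaf: "red" for leaf in leaves(t1)}  # invariant 1 at the root
    changes = 0

    def paint(leaf_list, c):
        nonlocal changes
        for leaf in leaf_list:
            if c is None:
                color.pop(leaf, None)
            else:
                color[leaf] = c
        changes += len(leaf_list)

    def visit(v):
        if not isinstance(v, tuple):
            paint([v], None)  # a lone leaf: uncolor it to restore invariant 2
            return 0
        a, b = leaves(v[0]), leaves(v[1])
        if len(a) <= len(b):
            small, large, s_leaves = v[0], v[1], a
        else:
            small, large, s_leaves = v[1], v[0], b
        paint(s_leaves, "blue")              # step 1: coloring for v
        total = compatible(t2, color)[2]     # inner sum over all u in T2
        paint(s_leaves, None)                # step 2
        total += visit(large)                # steps 2-3: recurse on L(v)
        paint(s_leaves, "red")               # step 4
        total += visit(small)                # step 5: recurse on S(v)
        return total

    return visit(t1), changes
```

The triplet distance is then $\binom{n}{3}$ minus the first returned value. Replacing the call to `compatible` by a constant-time read from the root of an HDT, and `paint` by $O(\log n)$ updates per leaf, is what gives the $O(n \log^2 n)$ bound described in the Methods section.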
Figure 6 Algorithm for constructing the hierarchical decomposition tree. The listing shows the algorithm run for each level of the HDT construction. This algorithm is repeated until the es list is empty.
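The listing of Figure 6 is not reproduced in this text, so the following Python sketch reconstructs one possible version of the construction from the prose description of cases 1-4 alone. It should be read as an assumption-laden illustration rather than the authors' code: the class and function names (`Component`, `Edge`, `join`, `build_hdt`, `hdt_stats`), the nested-tuple input format and the order in which edges are placed in `es` are all our own choices.

```python
class Component:
    def __init__(self, kind, down_closed):
        self.kind = kind                # "L", "I" or "C"
        self.down_closed = down_closed  # True if it is a complete subtree
        self.parent = self              # self-pointer: no HDT parent yet
        self.children = ()


class Edge:
    def __init__(self, up, down):
        self.up, self.down = up, down


def join(up, down, down_closed):
    """Contract an edge: make a new C component with children up and down."""
    c = Component("C", down_closed)
    c.children = (up, down)
    up.parent = down.parent = c
    return c


def build_hdt(tree):
    """Return (root of the HDT, number of iterations) for a nested-tuple tree."""
    es = []

    def make(node):
        if not isinstance(node, tuple):
            return Component("L", True)
        comp = Component("I", False)
        for child in node:
            es.append(Edge(comp, make(child)))
        return comp

    top, levels = make(tree), 0
    while es:
        levels += 1
        nxt = []
        for e in es:
            up, down = e.up, e.down
            if up.parent is not up or down.parent is not down:
                nxt.append(e)                           # case 1: already merged
            elif up.kind == "C" and down.kind != "I":
                top = join(up, down, down.down_closed)  # case 2 (Figure 5a)
            elif up.kind == "I" and down.kind != "I" and down.down_closed:
                top = join(up, down, False)             # case 3 (Figure 5b)
            else:
                nxt.append(e)                           # case 4: not contractible
        for e in nxt:  # point surviving edges at the newly created components
            e.up, e.down = e.up.parent, e.down.parent
        es = nxt
    return top, levels


def hdt_stats(comp):
    """Return (number of HDT leaves, height) below a component."""
    if not comp.children:
        return 1, 0
    stats = [hdt_stats(c) for c in comp.children]
    return sum(s[0] for s in stats), 1 + max(s[1] for s in stats)
```

Since a component created in one iteration only has children created in earlier iterations, the height of the resulting HDT is at most the number of iterations, which is the quantity bounded by the geometric-decrease argument in the Methods section.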
Figure 7 Counting triplets in a C component. The two cases of counting triplets in a C component.
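The constant-time update rules for the counters can be written down almost verbatim. The sketch below is a minimal illustration under our own naming assumptions: `Counters`, `leaf`, `join_a` and `join_b` are hypothetical names, and the fields `rr`, `bb`, `rb` and `br` stand for the two same-sub-tree pair counters and the two "red below blue" / "blue below red" pair counters described in the Methods section. It only combines counters bottom-up and does not include the HDT itself or the leaf-to-root update path.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Counters:
    R: int = 0   # red leaves in the component
    B: int = 0   # blue leaves in the component
    rr: int = 0  # red-red pairs inside the same sub-tree
    bb: int = 0  # blue-blue pairs inside the same sub-tree
    rb: int = 0  # red/blue pairs in different sub-trees, red further down
    br: int = 0  # red/blue pairs in different sub-trees, blue further down
    N: int = 0   # compatible triplets contained in the component


def leaf(color=None):
    """L component: only R or B can be non-zero."""
    return Counters(R=int(color == "red"), B=int(color == "blue"))


def join_a(c1, c2):
    """Figure 5a: C component c2 directly above c1 (a C or L component)."""
    return Counters(
        R=c1.R + c2.R,
        B=c1.B + c2.B,
        rr=c1.rr + c2.rr,
        bb=c1.bb + c2.bb,
        rb=c2.rb + c1.rb + c1.R * c2.B,
        br=c2.br + c1.br + c1.B * c2.R,
        N=(c1.N + c2.N
           + comb(c1.R, 2) * c2.B + comb(c1.B, 2) * c2.R
           + c2.rr * c1.B + c2.bb * c1.R
           + c1.R * c2.rb + c1.B * c2.br),
    )


def join_b(c1):
    """Figure 5b: an I component (all counts 0) above a down-closed c1,
    which becomes a single sub-tree hanging off the new component's path."""
    return Counters(R=c1.R, B=c1.B,
                    rr=comb(c1.R, 2), bb=comb(c1.B, 2), N=c1.N)
```

As a small check, take the tree (((a, c), b), d) with a, b and c red and d blue. Composing the subtree {a, c, b} first and then joining it with the root and d, in either order, gives N = 3, matching the three triplets that contain d.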
We then verified the running time of our algorithm, see Figure 8a. As seen in the figure, the running time evens out when we divide by $n \log^2 n$, giving us confidence that the analyzed running time is correct. We also measured where in the algorithm time was spent, whether it is in constructing the HDT, in coloring the leaves in the first tree, or in updating the counts in the HDT. Figure 8b illustrates the time spent on each of these three parts of the algorithm, normalized so the running time sums to one. For small trees, constructing the HDT makes up a sizable fraction of the running time, which is not surprising since the overhead in constructing components is larger than updating them. As the size of the trees increases, more time is spent on updating the HDT, as expected since updating the HDT runs in $O(n \log^2 n)$ while the other operations are asymptotically $O(n \log n)$.
When changing the color of leaves, we spend time $O(\log n)$ updating the book-keeping in the HDT for each leaf. We only count after a complete sub-tree has changed color, however, so instead of updating the HDT for each color-change we could just mark which leaves have changed color and then update the HDT bottom-up, so each inner node would only be updated once when it is on a path from a changed leaf. We implemented this, but found that the extra book-keeping from marking nodes and then updating increased the running time by 10%-15% compared to just updating the HDT.
To assess the usefulness of the algorithm in practice, we implemented an efficient $O(n^2)$ time algorithm based on the quartet distance algorithm presented in [12]. Figure 9 shows the ratio of the running time for the $O(n^2)$ time algorithm against the $O(n \log^2 n)$ time algorithm. It is
evident that our algorithm is fastest for all practical purposes. The speed-up factor for $n = 2900$ is 46.

Figure 8 Validation of running time. (a) Total running time divided by $n \log^2 n$. (b) The percent of time used on the three parts of the algorithm. The left figure shows the total running time divided by $n \log^2 n$, showing that the theoretical running time is achieved in the implementation. The right figure shows the percent of time used on the three parts of the algorithm: constructing the HDT of the second tree, coloring the leaves in the first tree, and updating the HDT accordingly. Constructing the HDT takes a considerable part of the time, but as the trees grow, updating the HDT takes a larger part. The plots show the average over 50 experiments for each size $n$.

Figure 9 Comparison to $O(n^2)$ algorithm. Ratio between the running times for the $O(n^2)$ time algorithm and the $O(n \log^2 n)$ time algorithm.

Conclusions
We have presented an $O(n \log^2 n)$ time algorithm for computing the triplet distance between two binary trees and experimentally validated its correctness and time analysis.
The algorithm builds upon the ideas in the $O(n \log^2 n)$ time algorithm for computing the quartet distance between binary trees [13], but where the book-keeping in the quartet distance algorithm is rather involved, making it inefficient in practice, the book-keeping in the triplet distance algorithm in this paper is entirely different, and significantly simpler and faster.
Compressing the HDT during the algorithm makes it possible to reduce the running time of the quartet distance algorithm to $O(n \log n)$ and the same approach can also reduce the running time of the triplet algorithm to $O(n \log n)$. We have left for future work to experimentally test whether this method incurs too much overhead to make it practically worthwhile.
Unlike the $O(n^2)$ time algorithm of Bansal et al. [9] our algorithm does not generalize to non-binary trees. It is possible to extend the algorithm to non-binary trees by employing more colors, as was done for the quartet distance [6], but this makes the algorithm depend on the degree of nodes, and future work is needed to develop a sub-quadratic algorithm for general rooted trees.

Authors' contributions
All authors contributed to the design of the presented algorithm. AS implemented the data structures and the algorithm. AS, CNSP, and TM designed the experiments, and AS conducted these. All authors have contributed to, seen and approved the manuscript.

Competing interests
The authors declare that they have no competing interests.

Declarations
The publication costs for this article were funded by PUMPKIN, Center for Membrane Pumps in Cells and Disease, a Center of the Danish National Research Foundation, Denmark.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.

Author details
1 Bioinformatics Research Center, Aarhus University, Denmark. 2 Department of Computer Science, Aarhus University, Denmark. 3 MADALGO, Center for Massive Data Algorithms, a Center of the Danish National Research Foundation, Denmark. 4 Department of Mathematics and Computer Science, University of Southern Denmark, Denmark. 5 PUMPKIN, Center for Membrane Pumps in Cells and Disease, a Center of the Danish National Research Foundation, Denmark.

Published: 21 January 2013
References
1. Robinson DF, Foulds LR: Comparison of Phylogenetic Trees. Mathematical Biosciences 1981, 53:131-147.
2. Critchlow DE, Pearl DK, Qian CL: The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology 1996, 45(3):323-334.
3. Estabrook GF, McMorris FR, Meacham CA: Comparison of Undirected Phylogenetic Trees Based on Subtrees of Four Evolutionary Units. Systematic Zoology 1985, 34(2):193.
4. Day WHE: Optimal-Algorithms for Comparing Trees with Labeled Leaves. Journal of Classification 1985, 2:7-28.
5. Brodal GS, Fagerberg R, Pedersen CNS: Computing the quartet distance between evolutionary trees in time $O(n \log n)$. Algorithmica 2004, 38(2):377-395.
6. Stissing MS, Pedersen CNS, Mailund T, Brodal GS, Fagerberg R: Computing the quartet distance between evolutionary trees of bounded degree. Proceedings of the 5th Asia-Pacific Bioinformatics Conference (APBC) Imperial College Press; 2007, 101-110.
7. Nielsen J, Kristensen A, Mailund T, Pedersen CNS: A sub-cubic time algorithm for computing the quartet distance between two general trees. Algorithms for Molecular Biology 2011, 6:15.
8. Christiansen C, Mailund T, Pedersen CNS, Randers M, Stissing MS: Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms for Molecular Biology 2006, 1:16.
9. Bansal MS, Dong J, Fernández-Baca D: Comparing and aggregating partially resolved trees. Theoretical Computer Science 2011, 412(48):6634-6652.
10. Mailund T, Pedersen CNS: QDist-quartet distance between evolutionary trees. Bioinformatics 2004, 20(10):1636-1637.
11. Stissing MS, Mailund T, Pedersen CNS, Brodal GS, Fagerberg R: Computing the all-pairs quartet distance on a set of evolutionary trees. Journal of Bioinformatics and Computational Biology 2008, 6:37-50.
12. Bryant D, Tsang J, Kearney P, Li M: Computing the quartet distance between evolutionary trees. Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms 2000, 285-286, Society for Industrial and Applied Mathematics.
13. Brodal GS, Fagerberg R, Pedersen CNS: Computing the quartet distance between evolutionary trees in time $O(n \log^2 n)$. In Proceedings of the 12th International Symposium on Algorithms and Computation (ISAAC). Volume 2223. Springer; 2001:731-742, Lecture Notes in Computer Science.

doi:10.1186/1471-2105-14-S2-S18
Cite this article as: Sand et al.: A practical $O(n \log^2 n)$ time algorithm for computing the triplet distance on binary trees. BMC Bioinformatics 2013 14(Suppl 2):S18.