JOURNALOFCOMPUTATIONALBIOLOGY Volume6,Numbers3/4,1999 MaryAnnLiebert,Inc. Pp.405–418 An Anytime Local-to-Global Optimization Algorithm for Protein Threading in (m2n˜2) Space RICHARDH.LATHROP ABSTRACT This paper describes a novel anytime branch-and-bound or best-(cid:142)rst threading search algo- rithm for gapped block proteinsequence–structure alignment with general sequence residue pair interactions. The new algorithm (1) returns a good approximate answer quickly, (2) it- eratively improves that answer to the global optimum if allowed more time, (3) eventually producesaproofthatthe(cid:142)nalanswerfoundisindeedtheglobaloptimum,and(4)alwayster- minatescorrectlywithinaboundednumberofstepsifallowedsuf(cid:142)cientspaceandtime.Itruns in polynomial space, which is asymptotically dominated by the (m2n˜2) space required by O thelowerboundcomputation.UsingpreviouslypublisheddatasetsandtheBryant–Lawrence (1993) objective function,the algorithm found the true (proven) global optimum in less than 5 min in all search spaces size 1025 or smaller (sequences to 478 residues), and a putative (not guaranteed) optimum in less than 5 hr in all search spaces size 1060 or smaller (se- quences to 793 residues, cores to 42 secondary structure segments). The threading in the largest case studied was eventually proven to be globally optimal; the corresponding search speed in that case was the equivalent of 1.5 1056 threadings/sec, a speed-up exceeding 1025 £ over previously published batch branch-and-bound speeds, and exceeding 1050 over previ- ously published exhaustive search speeds, using the same objective function and threading paradigm. Implementation-independent measures of search ef(cid:142)ciency are de(cid:142)ned for equiv- alent branching factor, depth, and probability of success per draw; empirical data on these measures are given. The general approach should apply to other alignment methodologies and search methods that use adivide-and-conquer strategy. Key words: protein threading, inverse folding, fold recognition, sequence, structure, alignment, pair potentials, contact potentials, knowledge-based potentials, search. 1. INTRODUCTION PROTEIN STRUCTURE PREDICTION from sequence is one of the great unsolved challenges of molecular biology. Protein sequence–structure alignment, alsocalledprotein threading, attempts to model anovel sequenceusingaknownstructurebyaligningthesequencetothestructureunderanobjectivefunction.After alignment, the sequence is given a similar three-dimensional (3D) fold by assigning each sequence residue the3Dcoordinatesofthealignedstructureresidue.ForreviewsseeBowieandEisenberg(1993),Bryantand Altschul (1995), Fetrow and Bryant (1993), Jernigan and Bahar (1996), Jones and Thornton (1993), Jones DepartmentofInformationandComputerScience,University ofCalifornia, Irvine, California. 405 406 LATHROP andThornton(1996),Lemeretal.(1995),Sippl(1995),andWodakandRooman(1993),whileforcautionary notesseeCrippen(1996),LathropandSmith(1996),Moultetal.(1995),Ouzounisetal.(1993),Russelland Barton (1994), Smith etal.(1997),andThomasandDill(1996). Theproteinsequence–structurealignmentproblemconsistsofasequence,astructure[consideredasdrawn from a library of structures (Lathrop et al., 1998)], a set of legal sequence–structure alignments, and an objectivefunction thatmapseachlegalalignmenttoarealnumber.Theobjectivefunction value isthescore, pseudoenergy, or potential of any given alignment; here, by analogy to energy, lower numbers are more favorable.Thesequence,structure,andobjectivefunction togetherdeterminetheentirealignmentlandscape including the locations and values of allglobal and local minima. The accuracy of the objective function is measured by the agreement (or lack of it) between crystallographic alignments and the objective function alignmentminima(Fischeretal.,1996). 1.1. Search algorithms For purposes of this paper, assume that the protein sequence, structure, objective function, and set of legal alignments all are (cid:142)xed in advance. This (cid:142)xes the alignment landscape and produces a computational optimization task: to (cid:142)nd the alignments within the searchspace that minimize the objective function. The searchspaceisthesetoflegalalignments,thesearchspacesizeistheircardinality, and thesearchgoalisto (cid:142)nd alignments that minimize the objective function. This optimization task is the primitive computational stepinmanyapproachestoprotein threading bysequence–structurealignment. Theobjective function determinesthethreading landscapeand whetheritmatchescrystallographic align- ments, while the search algorithm determines whether the alignments that are found actually match the landscapeminimaindicatedbytheobjectivefunction.Consequently,searchalgorithmsareevaluatedonhow well and how quickly they can (cid:142)nd the objective function alignment minima. Search accuracy is measured by agreement (or lack of it) between the objective function minima and alignments returned by the search algorithm. Ef(cid:142)ciency is measured by the amount of computational resources expended. In comparisons, it is important to use a previously characterized objective function and a large diverse set of sequences and structuresbecausesearchef(cid:142)cienciesvaryconsiderablyacrossthese.TheBryant–Lawrence(1993)objective function(BryantandLawrence,1993) wasthefastestof(cid:142)veobjectivefunctionsstudied(LathropandSmith, 1996). Speeds have beenreported of 1.4 102 threadings/secfor exhaustive search (Bryant and Lawrence, £ 1993) and 6.8 1028 for branch-and-bound search (Lathrop and Smith, 1996), both on faster machines £ thanhere. 1.1.1. Categorization. Exhaustive searchcouldaccomplishtheoptimizationtaskusing anumber ofob- jectivefunctionevaluationsequaltothesearchspacesize,butthesearchspacebecomescombinatoricallylarge andsootheralgorithmshavebeensought.Proteinthreadingsearchalgorithmclassesdependoncomputational behavior andassumptions aboutprotein structure: PairInteractions:Modeledvs.Not. Ifinteractionsbetweenpairsofsequenceresiduesarenotmodeled ² then the problem is low-order polynomial using dynamic programming (Sankof and Kruskal, 1983). In contrast,ifbothalignmentgapsandpairinteractionsbetweensequenceresiduesaremodeled,thenrestricted subproblemsmaybe polynomial of varying degree(Xuand Uberbacher,1996; Xuetal., 1998) while the generalpairinteractionproblem treatedhereisNP-hard(AkutsuandMiyano,1997; Lathrop,1994). Alignment: Gaps Anywhere vs. Gapped Block. Alignment gaps may be permitted anywhere in the ² structure,orincontrastmaybecon(cid:142)nedtoloopregionsbetweendiscreteblocksofcoresecondarystructure segments.Permittinggapsanywheremayintroduceunphysicalmainchainbreaksintheinteriorofsecondary structure segments,whilecon(cid:142)ninggapstotheloops mayproduce erroneoussecondarystructuresegment lengths.Gapsanywhere resultsinmuchlargersearchspacesizesthangapped block. GlobalOptimality:Exact(Proven)vs.Approximate(NotGuaranteed). Anexactalgorithmproduces ² animplicitproofthattheanswerfoundisindeedtheglobaloptimum.Incontrast,anapproximatealgorithm produces ananswerquickly, butwithout proof ofglobal optimality; itsoptimality, ifany,isunknown. Answers:Anytime vs. Batch. An anytime algorithm produces a good answer quickly, then iteratively ² improves that answer if allowed more time. After a short initialization period it may be stopped at any time and a usable answer obtained. Approximate algorithms that rely on iterative sampling are anytime algorithms because theymaybestopped atany timeandreturn their currentbestcandidate. In contrast,a batchalgorithm producesnoansweruntiltheendoftherun,andifstoppedearlyproducesnoansweratall. ALGORITHMFORPROTEINTHREADING 407 Examplesofthesecategoriesmaybedrawnfromtheliterature;thediscussionhereassumesthatpairinter- actions aremodeled. Within the gaps anywhere paradigm, examples of approximate algorithms include the batchdoubledynamicprogrammingalgorithm(TaylorandOrengo,1989)andtheanytimefrozenapproxima- tionalgorithm(Godziketal.,1992).Withinthegappedblockparadigm,examplesofapproximatealgorithms includetheanytimeGibbsSampler(Lawrenceetal.,1993;Madejetal.,1995)andthestochasticalgorithmof Crawford(1999).Examplesofexactalgorithmsincludeexhaustivesearch(BryantandLawrence,1993),batch branch-and-bound (Lathrop andSmith,1996), andbatchdivide-and-conquer(XuandUberbacher,1996; Xu etal.,1998). 1.2. This paper The algorithm described in this paper is a gapped block alignment algorithm that models fully general sequenceresiduepairinteractionsintheobjectivefunction.Itisananytimealgorithmthatisinitiallyapproxi- matebutbecomesexactafteraboundednumberofsteps.Itrunsinpolynomialspace,whichisasymptotically dominated by the space required by the lower bound computation, (m2n˜2). The algorithmic strategy em- ployed is to adapt a batch branch-and-bound threading algorithm (Lathrop and Smith, 1996) to employ a best-(cid:142)rst anytime control structure. The result is a branch-and-bound or best-(cid:142)rst threading algorithm that returns (1) a good answer quickly, (2) the true global optimum shortly thereafter in all caseswe have been abletoverify,and(3)eventually,aproof ofoptimality ifallowedsuf(cid:142)cienttimetoruntocompletion. 1.2.1. Batchbranch and bound background. Themethodpresentedhereisamodi(cid:142)edversionofabatch branch-and-bound threading algorithm (Lathrop and Smith, 1996). An alignment may be represented as a vectort,wheret givesthesequenceresiduealignedtotheithstructurecoordinate.Thevectorspacedimension i ism,thesequencelengthisn,andthenumberofdistinctsequenceresiduesthatmayaligntoagivenstructure coordinateisn˜.Thehyperrectangle[b,d],whosecornersarethevectorsbandd,containsallalignmentstsuch thatb t d .Eachhyperrectangle corresponds toa setof partiallyinstantiated alignments,anditsvector i i i · · subspace dimension corresponds to the number of unaligned core segments. A lower bound of complexity (m2n˜2)mapsahyperrectangletoalowerboundonthepossiblevaluesoftheobjectivefunctiononanypoint in the hyperrectangle. A splitting function partitions a hyperrectangle into a small set of mutually disjoint andexhaustivesmallerhyperrectangles,atleastoneofwhichisoflowervectorsubspacedimensionthanthe parent. A priority queue (or heap) holds a sorted list of allcurrently instantiated hyperrectangles, sorted by thelowerbound. Initially the queue holds a single hyperrectangle,[1,n˜],that covers the entire search space.Ateachstep, thehyperrectanglehavingthecurrentlylowestlowerboundisremovedfromthequeue.Ifitcontainsonlyone alignmentitisreturnedasaglobaloptimum,becausenootherhyperrectangleonthequeuecanpossiblyachieve alowerscore.Otherwiseitispartitionedusingthesplittingfunction,andtheresultingsmallerhyperrectangles aremergedintothequeue orderedbythelowerbound. 2. METHODS First a new method of searching for threading landscape minima is described. Then implementation and hardwaredetailsaregiven.Nextthedatasets,objectivefunction,andrunconditionsarestated.Finally,several implementation-independentmeasuresof searchef(cid:142)ciencyarede(cid:142)ned. 2.1. A local-to-global anytime search method In this paper, the original algorithm (Lathrop and Smith, 1996) is modi(cid:142)ed to use m 1priority queues. C Queue Q holdshyperrectanglesofvectorsubspacedimensionm k.Thehyperrectanglesineachqueueare k ¡ sortedbytheirlowerbound.Figure1givesaschematicillustration.Thequeuesarearrangedinacascade(left- to-rightinFig.1)accordingtothenumberofalignedcoresegments,i.e.,indecreasingorderofhyperrectangle dimension. The idea is that if the lower bound is not too loose, then hyperrectangles that enclose favorable threadings will tend to sort to the front of the queues in Fig. 1 and from there migrate to the more fully instantiatedhigher-numberedqueues.TheAppendixgivespseudocode forthecontrol architecturedescribed inthissection. Whenahyperrectangleissplit,atleastoneresultinghyperrectangleisguaranteedtodecreaseindimension andsowillmovetoahigher-numberedqueue(arrowsinFig.1).Consequently,itisalwayspossibleto“sweep” acrossthequeuesbybeginning atthelowest-numberedoccupiedqueueandateachstep:(1)popping thebest 408 LATHROP FIG.1. Schematicillustration ofanytimequeuecascade. Queue Q0holdscompletelyuninstantiated alignments,queue Qm holds fully instantiated alignments, and intermediate queues Qk hold partial alignments corresponding to hyper- rectangles of dimension m k.Here, queue Qk holds partial alignments with exactly k aligned ((cid:142)xed) core segments. ¡ In a gap-anywhere paradigm, Qk might hold partial alignments with k aligned residues. When a partial alignment is removedfromonequeueandpartitioned, theresulting partialalignmentsarereinsertedintonewqueuescorresponding to their vector subspace dimension. Ifthedimension hasnotchanged they arereinserted into thesamequeue (notshown); otherwise theyarereinserted intotheappropriate queueholding morecompletely instantiated alignments (arrows). partialalignmentfrom thetopofthecurrentqueue,(2)splittingitandreinsertingthechildrenintoqueuesas appropriate,and(3)advancingtothenexthighest-numberedoccupiedqueue.Onesuchsweepcanbedonein (m)lowerboundevaluations,andisguaranteedtoproduceafullyinstantiatedalignmentfrom Q attheend. m In addition to sweeping across the queues, it is possible to operate opportunistically on the queue that currently appears to contain the most promising possibilities. The basic idea is simple. Ignoring outliers beyondsomez-scorethresholdabovethecurrentmean,thealgorithmestimatesinternallytherecent(decaying average)meanandstandarddeviationoftheincreaseinlowerbound thatprevioussplitshaveachievedwhen movingfromqueue Q toqueue Q .Asdescribedbelow,theseyieldaverycrudeestimateoftheprobability i i 1 C that the top hyperrectangle in each queue contains an alignment that scores better than the current anytime best.Thehyperrectanglewiththehighestcrudeestimateischosenandsplit. Although best-(cid:142)rst search usually requires both exponential time and exponential space, a polynomial space bound can be achieved for the task considered here; the time bound, however, remains exponential. The user speci(cid:142)es an arbitrary space constant, C (currently 100,000 in our implementation). A second set of m 1 priority queues is used, called the annihilation queues AQ, which mirror the primary queues Q C described above. Annihilation queue AQ also holds hyperrectangles of vector subspace dimension m k, k ¡ alsosortedbytheirlowerbound.Wheneverthetotalnumberofhyperrectanglesontheprimaryqueuesexceeds C, one hyperrectangle is removed from the primary queues and placed in its mirroring annihilation queue. Sinceeachsplitconsumesonehyperrectangleandproducesatmostthree,thenumberofhyperrectangleinthe primaryqueuescanneverexceedC 2.Wheneveranyoftheannihilationqueuescontainanyhyperrectangles, C processing of the primary queues is suspended and only hyperrectangles in the annihilation queues are processeduntiltheannihilationqueuesagainareempty.Theresultisthatthetotalnumberofhyperrectanglesin theprimaryqueuesisdecreasedbyone.Annihilationqueuesareprocessedbychoosingthehighest-numbered occupiedqueue,removing andsplitting itstophyperrectangle,and inserting itschildrenintotheappropriate annihilation queues. In this process, at most two children arereturned to the same annihilation queue from which the parent wasremoved, while atmost three children are promoted to higher-numbered annihilation queues.Onehyperrectangleultimatelycangenerateatmostn˜/3 1descendantsthatarereturnedtoitssame C annihilation queue,andsotherecanbenomorethann˜ 3hyperrectanglesinany annihilation queue.Since C there arem 1annihilation queues, the total number of hyperrectangles on allqueues is (C mn˜).Each C C hyperrectangle requires 2m integers, so the total space complexity of all queues is (Cm m2n˜). This is C dominatedbythelowerbound spacecomplexity,whichis (m2n˜2).Consequently, theentirealgorithm runs in (m2n˜2)space. 2.1.1. Overallcontrolarchitecture. Supposetheuserwishestoreturnthe K bestthreadingsinthesearch space.Normally K 1,indicating to return only theglobal optimum (calledthe “anytimebest” below),but D sometimesnear-optimalalignmentsareofinteresttoo.TheoverallcontrolarchitecturebeginswithatleastK sweepsto generate the initial K candidates,thereafter runs in opportunistic mode in an attempt to improve them,andtemporarilyshiftstoannihilation mode wheneverthetotalnumberofhyperrectanglesexceedsthe user-speci(cid:142)edspaceparameterC.Firstthequeuesareinitialized. Q holdsasinglehyperrectanglecontaining 0 the entire search space,[1,n˜],and allother queues areempty. Initially max(K,5) sweepsaredone in order toproduce K goodthreadings quicklyandtoinitializethemeanandstandarddeviationestimates.Thereafter thealgorithmopportunisticallyoperatesonthemostproductiveregionsofthesearchspace.Wheneverafully instantiatedthreading isproduced thatscoresbetterthan the worstscoring of the current K bestcandidates, itreplacesthatcandidate. ALGORITHMFORPROTEINTHREADING 409 The algorithm remembers its best K candidates so far, and presents this as its best guess if it is stopped beforeitterminatesnormally.Otherwise,ifallowedsuf(cid:142)cienttime,eventuallyitreachesastateinwhichthe top hyperrectangle of every queue has a lower bound that is equal to or greater than the actualscore of the worst scoring of the current K best candidates. This constitutes an implicit proof that the current anytime bestisin factaglobal optimum, becausenootherhyperrectangleanywhereinthesearchspacecanpossibly contain an alignment of lower score. When this occurs the algorithm announces that it has proven that the current K candidates aretheglobally optimal K bestthreadings, andthenterminatesnormally. 2.1.2. An improved splitting function. The original batch splitting function (Lathrop and Smith, 1996) partitionedhyperrectanglesaccordingtoacomplicatedfunctionofpartialprobabilitiesoverpartialalignment scores.Asimplerandmoreeffectivesplittingfunction,usedforallresultspresentedinthispaper,isasfollows: 1. Ifonesegmentinteractswithmoredifferentsegmentsthananyother,splitonthatsegment. 2. Breaktiesby splitting onthesegmentthathasthemosttotalnumberofresidueinteractions. 3. Breaktheremainingtiesbysplitting onthesegmentfurthestfromany(cid:142)xedsegment. 4. Breaktheremainingtiesbysplitting onthesegmentclosesttothemiddle ofthestructure. A second splitting function, alsomore effective than the original (Lathrop and Smith, 1996), exploits the lowerbound computationtoguidethechoiceofsplitpoint.Thischoosesthesegmentwhosesplitmaximizes thelowestlowerboundofthechildren,i.e.,themaximumoverallsegmentsoftheminimumoverallchildren resulting from splitting on that segment of the lower bound of the child. If done at each split, this requires 3m lower bound computations to choose one split point, and is impractical. However, an acceptable static compromise results from doing the full computation only once, on the very (cid:142)rst initialization sweep, and recording in a table the order in which segments were chosen to be split. Thereafter, splits are chosen by looking upthenextsplitsegmentintherecordedtable. 2.2. Implementation and hardware The algorithm is implemented in Common Lisp. The results below were obtained on a 110-MHz SPARCstation 5desktop workstation. Forverylargeproblem sizesthechoiceofalgorithm completelydom- inatesthechoiceofprogramming language ormachinehardwareasanin(cid:143)uenceonsearchspeed. 2.3. Data sets, objective function, run conditions ThealgorithmwasrunusingthepreviouslypublishedobjectivefunctionandthreadingparadigmofBryant andLawrence(1993). DatasetsweretakenfromLathrop andSmith(1996), Rostetal.,(1997), andaSCOP (Murzinetal.,1995)coredomaindatabase.ThedatasetofLathropandSmith(1996)has60sequences(length to478residues)and60cores(of3to23coresegments)yieldingsearchspacesizesupto9.4 1031.Thedata £ setof Rost et al.(1997) has 16 sequences (length to 224 residues) and 16 cores (of 4 to 19 core segments) yieldingsearchspacesizesupto2.1 1022.IntheSCOP(Murzinetal.,1995)coredomaindatabasethe(cid:142)ve £ largestcores (39 to 42 core segments) were run against all 473 sequences (length to 793 residues)yielding searchspacesizesupto2.1 1059. £ The algorithm was run to a proven global optimum on every trial in the data sets of Lathrop and Smith (1996) and Rost etal. (1997). On the sequences and cores from SCOP (Murzin et al., 1995) the algorithm wasstopped afterithadgenerated 100 candidate alignments beyond the currentanytimebestwithout either improving its score or proving itoptimal. Consequently, every anytime bestin the data sets of Lathrop and Smith(1996)andRostetal.(1997)isatrueglobaloptimum(proven),whilemanyanytimebestsintheSCOP largestcoresdatasetareonly putative (notguaranteed). Othertrialswereruntocomparethealgorithm’sbehavioracrossdifferentobjectivefunctions,aswellasto compare the anytime branch-and-bound performance to the previous batch branch-and-bound (Lathrop and Smith,1996). Thealgorithm wasrunonthedatasetofLathrop andSmith(1996) usingtwootherpreviously published objective functions (White etal.,1994; Sippl, 1993). Theseobjective functions, aswellas that of BryantandLawrence(1993),havebeencharacterizedpreviously onthisdataset(LathropandSmith,1996). 2.4. Implementation-independent metricsof search ef(cid:142)ciency The most important measure of search algorithm ef(cid:142)ciency is the number of elapsed seconds, since this is how long we must wait for an answer. However, it is convenient to have other measures for algorithm comparisonthatareimplementationindependent. 410 LATHROP Let S bethesizeof thesearchspaceand L be the numberof primitive objective function or lowerbound evaluations. Let the subscript “Exh” mean exhaustive search and “Alg” mean the search algorithm being compared.Below,“Alg”willbefurtherspecializedsothat“Any”meansthepoint atwhichthe(cid:142)nalanytime bestwasencountered and“Lim”meansthepoint atwhichsearchwasabandoned. 2.4.1. Branching factor. Suppose a uniform search tree of branching factor b and depth d, then there arebd leafnodes. For exhaustive search,bd S and d m,sothe branching factorfor exhaustive searchis b S1/m mpS.Theequivalent branchingDfactorforDthealgorithm (Pearl,1984) isb L1/m mpL . exhD D algD alg D alg 2.4.2. Searchdepth. Recentexperimentalresults(KorfandReid,1998)suggestthatperhapsthebranching factorremainsunchanged atb ,and thatheuristic searchinsteadreducestheeffectivesearchdepth. Inthis exh casethesearchdepthforexhaustive searchisd logS/logb m,andd logL /logb . exhD exhD algD alg exh 2.4.3. Probability of success. Let P be the probability per draw without replacement of guessing the optimalinoneblind guess.Then P 1/S and P 1/L . exhD algD alg 3. RESULTS Theresultsbelow areadaptedfrom apreliminarypresentation in Lathrop (1999). Figure2Ashowsahistogramoftherankofthe(cid:142)nalanytimebestinthelistofcandidates.Therankis1ifit wasthe(cid:142)rstcandidateproduced,2ifthesecond,andsoforth.Figure2Bshowsahistogram ofthemaximum FIG.2. FinalanytimebestrankandmaximumDrank.(A)Histogramoftherankintheenumeratedcandidatealignments atwhichthe(cid:142)nalanytimebestwasfound.(B)Histogram ofthemaximumchange intherankofthecurrentanytimebest atanytimeduring thesearch. Inbothhistograms, blackisthedatasetofLathropandSmith(1996), horizontal stripes is Rostetal.,(1997), andwhiteisthe(cid:142)velargestcoresfromtheSCOPdomaindatabase (Murzinetal.,1995).ForLathrop andSmith(1996) andRostetal.(1997) the(cid:142)nalanytimebestwasproven tobetheglobal optimum;forSCOP(Murzin etal.,1995)search wasabandoned after100candidates wereenumerated withoutimprovement. ALGORITHMFORPROTEINTHREADING 411 FIG.3. log10totalsecondsto(cid:142)nalanytimebest.(A)The(cid:142)velargestcoresfromtheSCOPdataset(Murzinetal.,1995). (B)ThedatasetofRostetal.(1997).(C)ThedatasetofLathropandSmith(1996).Inallgraphsthex-axisisthelog10of thesearchspacesize.Thedashedlinecorresponds to5min.Therunscorrespond toFig.2.Foreachsearch,an“x”marks thelog10ofthetotalnumberofsecondstothe(cid:142)nalanytimebest.In(A),an“L”marksthelog10ofthenumberofseconds atwhich search wasabandoned ifthe(cid:142)nal anytime best wasnotproven globally optimal (after 100 candidates without change). Absence ofan“L”impliesthatproofofglobal optimality wasobtained. Thepointcorresponding to(32.6,4.2) isaperformance outlier inwhichthe“x”isobscured by“L”s.In(B)and(C)search wasalwayscontinued untilthe(cid:142)nal anytimebestwasproven tobeglobally optimal. Drank observed. The Drank is the rank of the current anytime bestminus the rank of the previous anytime bestthatitreplaced.Thatis,itisthenumberof candidatesenumeratedfrom one change of theanytimebest tothenext. Figure 3 shows the total elapsed clock seconds until the (cid:142)nal anytime best was produced vs. the search spacesize,onlog –log axes(dashedline 5min).InFig.3A an“L”marksthetimeatwhichsearchwas 10 10 D abandoned if the global optimum was not proven. In Fig. 3B and C the search was always continued until the global optimum wasproven. Hereitproduced the true global optimum as its anytimebestwithin 5min oftotalelapsedtime(dashedlines),although ofcoursetoproduce theproof ofoptimalityrequiredadditional time. Figure4showshowmeasuresderivedfromFigs.2and3dependonsearchspacesize.Figure4Ashowsthe rank of the (cid:142)nalanytime best,and Fig.4B shows the maximum Drank observed. Figure 4C shows the total elapsedseconds/m2n˜2,i.e., the total time divided by the formalcomplexity of the lowerbound calculation, whichis (m2n˜2). Figure5showstheimplementation-independentmetricsofsearchef(cid:142)ciencydescribedabove,withexhaus- tivesearchcomparedtobranch-and-bound.Figure5Ashowsthebranchingfactorb,Fig.5Bshowsthesearch depthd,andFig.5C shows theprobability ofsuccessperdraw P. Tables 1 and 2 provide detailed analyses of the (cid:142)ve largest SCOP cores against the (cid:142)ve longest SCOP sequences.Notethat in approximately half(13/25) of the casesshown, the (cid:142)nalanytime bestwasobtained in15minor lesstotalelapsedtime. 412 LATHROP FIG.4. Searchspacesizedependencies. (A)Rankof(cid:142)nalanytimebest;thepointscorrespond toFig.2Awithalldata setspooled. (B)Maximum Drank;thepointscorrespond toFig.2Bwithalldatasetspooled. (C)log10 [totalseconds to (cid:142)nalanytimebest/(n˜2m2)];thepointscorrespond toFig.3withalldatasetspooledanddividedbytheformallowerbound complexity, (m2n˜2).Inallgraphs,thex-axisisthelog10ofthesearch spacesize.ThepointatxD32.6corresponds to the performance outlier (32.6, 4.2) in Fig.3A. Bycoincidence, its maximum Drank was 100, numerically equal to the Drankcut-off forabandoning thesearch. Table 3 compares the three data sets across the same objective function, and three different previously characterizedobjectivefunctionsacrossthesamedataset.Itprovidesmeanrankandmeanmaximumchange in rank of the anytime best, as wellas slopes and intercepts of the best-(cid:142)tting regression line to the log–log plot of elapsedtime vs.searchspacesize.Relative objective function dif(cid:142)culty, asmeasuredby theslope of the regression line, is the same for the anytime besttime,the anytime proof time,and the batch proof time. However,theslope for theanytimebestisconsistently lessthantheslope for theproofs. 4. DISCUSSION Theanytimebranch-and-boundorbest-(cid:142)rstapproachabovehasseveralusefulcharacteristics.Itappearsto returntheglobaloptimumquicklyinmanybutnotallrealisticcases.Itcanreturngoodapproximatealignments quicklywheninanearlyexploratorymode,thenlaterlockthoseinwithmoreeffortwhenapredictiveeffortis inits(cid:142)nalstages.Ithasauser-adjustablespaceboundthatpermitsspacetobetradedfortimeinacontrollable way.Becausethesearchspaceisexplicitlyrepresentedandsampledwithoutreplacement,itneverreturnsthe samecandidate twice. It is plausible that many of the putative results on the SCOP trials marked “L” in Fig. 3A, for which the anytime best wasnot proven to be the global optimum, indeed eventually would be proven to be the global optimumiftheprogramwereallowedtorunsuf(cid:142)cientlylong.Theysharemanygrosscharacteristicswiththe provenglobaloptimainthesmallertrials:theirranks,Dranks,searchtimes,branchingfactors,searchdepths, andprobabilitiesperdrawarecomparable.Indeed,thesearchwascontinuedtotheproofofglobaloptimality onboth thelargesttrialofTable1(No.25,1bgwvs.1tsp)andtheperformanceoutlierofFig.3A(32.6,4.2). In both casesthe anytime bestshown eventually wasproven to beglobally optimal, though the proof in the caseof1bgw vs.1tsprequiredover15days of computertime(datanotshown). ALGORITHMFORPROTEINTHREADING 413 FIG.5. Implementation-independent metricsofsearch ef(cid:142)ciency.(A)log10 equivalent branching factor,bexhandbany. (B)Equivalent search depth, dexh and dany.(C)log10 equivalent probability of success per draw, Pexh and Pany.In all graphs, the x-axis is the log10 of the search space size. The (cid:142)nal anytime best (any) is marked by “x” and exhaustive search (exh) is marked by “o.” The points correspond to Figs. 2 and 3, with all data sets pooled. Points with x > 32 correspond tothe large SCOPcores; their exhaustive search values separate because m islarger than inother data sets. The y-axismetricsarede(cid:142)nedinthetext.In(C),“o”follows y x andistruncated at y 10asuninformative. D¡ D¡ Itwould beprematureto conclude thatthe global optimum isknown in any unproven casewith certainty, or even with high probability. The point (32.6, 4.2) in Fig. 3A and Fig. 4 indicates that occasional extreme performance outliers may occur in realistic data (the heavy-fail phenomenon). Any approximate algorithm returns only the bestresults that could be found with limited searcheffort. Without a proof, itcan neverbe guaranteedthatthereturnedalignments areglobally optimal. Finally, the basic ideas above should apply to related divide-and-conquer algorithms. For example, both thegeneralizationofthispapertogapsanywhere,andthedivide-and-conquermethodofXuandUberbacher (1996; Xuetal.,1998) mightbeadaptabletoananytimecontrol architecture. APPENDIX A. PSEUDO-CODE This Appendix givespseudo-code forthecontrol architecturedescribedabove. A.1. Opportunistic mode pseudoprobability formulas These formulas compute a crude heuristic approximation of the probability that a given hyper-rectangle contains a threading thatwillscorebetterthan theworst-scoringof the currentbest K candidates. Although probabilistic terminology is used for convenience, the resulting quantity is not really a probability in any meaningful senseoftheword;itisjustasearchcontrol heuristic. A.1.1. Update. Whenahyper-rectangleissplit,theupdate()function iscalledoncefor eachresulting child. Lb-oldandQ-ndx-oldarethe lowerbound and queue index of theparent; Lb-newandQ-ndx-new arethelowerbound andqueue indexofthechild.Arraysareinitializedto zero. 414 LATHROP TABLE1. SCOPFIVELARGESTSEQUENCESVS.FIVELARGESTCORES:SEARCHRANKSANDTIMES Seq Core Rank Rank Max Sec Sec Threadings No. Seq Core Len Segs Size Lim Any Drank Lim Any perSecAny 1 1iphA2 1lci 597 36 9.2d 42 104 4 1 74298 348 2.7d 40 C C 2 1vnc 1lci 609 36 4.0d 43 103 3 2 84809 375 1.1d 41 C C 3 1iphA2 1cxsA2 597 42 5.1d 43 108 8 1 27221 393 1.3d 41 C C 4 1cxsA2 1lci 625 36 2.6d 44 107 7 1 39013 486 5.3d 41 C C 5 1vnc 1cxsA2 609 42 5.5d 44 135 35 25 47310 8360 6.6d 40 C C 6 1iphA2 1oen 597 39 9.5d 45 104 4 3 49661 386 2.5d 43 C C 7 1cxsA2 1cxsA2 625 42 1.1d 46 107 7 1 54124 419 2.6d 43 C C 8 1vnc 1oen 609 39 4.5d 46 107 7 1 66278 663 6.8d 43 C C 9 2sblB1 1lci 690 36 2.0d 47 107 7 3 38049 931 2.1d 44 C C 10 1cxsA2 1oen 625 39 3.3d 47 107 7 3 36084 761 4.3d 44 C C 11 1iphA2 4aahA 597 39 4.4d 49 109 9 4 32840 1878 2.3d 46 C C 12 1iphA2 1tsp 597 40 1.5d 50 111 11 5 36473 1850 8.1d 46 C C 13 1vnc 4aahA 609 39 1.7d 50 107 7 4 32091 875 1.9d 47 C C 14 2sblB1 1cxsA2 690 42 3.0d 50 107 7 1 59142 543 5.6d 47 C C 15 2sblB1 1oen 690 39 4.0d 50 116 16 10 27639 2686 1.5d 47 C C 16 1vnc 1tsp 609 40 7.9d 50 113 13 4 42485 1820 4.3d 47 C C 17 1bgw 1lci 793 36 8.5d 50 129 29 1 83361 10535 8.1d 46 C C 18 1cxsA2 4aahA 625 39 9.6d 50 109 9 1 29742 1343 7.2d 47 C C 19 1cxsA2 1tsp 625 40 6.5d 51 103 3 2 61700 445 1.5d 49 C C 20 2sblB1 4aahA 690 39 5.3d 53 107 7 3 55726 873 6.1d 50 C C 21 1bgw 1oen 793 39 3.2d 54 111 11 2 42168 3568 8.9d 50 C C 22 2sblB1 1tsp 690 40 1.3d 55 111 11 3 80361 2475 5.2d 51 C C 23 1bgw 1cxsA2 793 42 6.2d 55 103 3 2 62434 815 7.6d 52 C C 24 1bgw 4aahA 793 39 1.9d 57 107 7 2 36534 2621 7.4d 53 C C 25 1bgw 1tsp 793 40 2.1d 59 105 5 4 171377 1369 1.5d 56 C C Theentriescorrespondtosomesearchspacesizesabove1040inthe(cid:142)gures.SeqandCoreareidenti(cid:142)ersintheSCOP(Murzinetal., 1995)coredomaindatabase.SeqLenisthesequencelength,CoreSegsisthenumberofcoresecondarystructuresegments,andSizeis theresultingsearchspacesize.RankLimisthetotalnumberofcandidatealignmentsthatwereenumeratedbeforesearchwasabandoned (100candidatesbeyondthe(cid:142)nalanytimebestwithoutimprovement).RankAnyistherankofthe(cid:142)nalanytimebestinthelistofcandidate alignments.MaxDrankisthemaximum amountbywhichtherankofthecurrentanytimebestchangedduringthesearch.SecLimis thetotalnumberofsecondsatwhichthesearchwasabandoned. SecAnyisthetotalnumberofsecondsatwhichthe(cid:142)nalanytimebest candidatewasfound.ThreadingsperSecAnyistheratioofSizetoSecAny. PARAMETER MAX-Z-DELTA 2.0; PARAMETER MAX-DECAY 0.95; GLOBAL VARIABLE Total-Counts[m 1]; C GLOBAL VARIABLE Update-Counts[m 1]; C GLOBAL VARIABLE Decay[m 1]; C GLOBAL VARIABLE Mean[m 1]; C GLOBAL VARIABLE Var[m 1]; C DEFINE update(Init, Lb-old, Q-ndx-old, Lb-new, Q-ndx-new) BEGIN IF (Q-ndx-new = Q-ndx-old + 1) THEN BEGIN Total-Counts[Q-ndx-old] Total-Counts[Q-ndx-old] + 1; Ã Delta Lb-new - Lb-old; Ã IF (0 < Var[Q-ndx-old]) THEN Z-Delta (Delta - Mean[Q-ndx-old]) / sqrt(Var[Q-ndx-old]); Ã ELSE Z-Delta 0; Ã WHEN (Z-Delta < MAX-Z-DELTA) OR (Var[Q-ndx-old] < 1) OR (Init = TRUE) DO update-one(Q-ndx-old, Delta); END END
Description: