ebook img

Efficient sequential and parallel algorithms for planted motif search. PDF

1.3 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Efficient sequential and parallel algorithms for planted motif search.

NicolaeandRajasekaranBMCBioinformatics2014,15:34 http://www.biomedcentral.com/1471-2105/15/34 RESEARCH ARTICLE OpenAccess Efficient sequential and parallel algorithms for planted motif search MariusNicolae* andSanguthevarRajasekaran Abstract Background: MotifsearchingisanimportantstepinthedetectionofrareeventsoccurringinasetofDNAorprotein sequences.Oneformulationoftheproblemisknownas(l,d)-motifsearchorPlantedMotifSearch(PMS).InPMSwe aregiventwointegerslanddandnbiologicalsequences.Wewanttofindallsequencesoflengthlthatappearin eachoftheinputsequenceswithatmostdmismatches.ThePMSproblemisNP-complete.PMSalgorithmsare typicallyevaluatedoncertaininstancesconsideredchallenging.Despiteampleresearchinthearea,aconsiderable performancegapexistsbecausemanystateoftheartalgorithmshavelargeruntimesevenformoderately challenginginstances. Results: ThispaperpresentsafastexactparallelPMSalgorithmcalledPMS8.PMS8isthefirstalgorithmtosolvethe challenging(l,d)instances(25,10)and(26,11).PMS8isalsoefficientoninstanceswithlargerlanddsuchas(50,21). WeincludeacomparisonofPMS8withseveralstateoftheartalgorithmsonmultipleprobleminstances.Thispaper alsopresentsnecessaryandsufficientconditionsfor3l-merstohaveacommond-neighbor.Theprogramisfreely availableathttp://engr.uconn.edu/~man09004/PMS8/. Conclusions: WepresentPMS8,anefficientexactalgorithmforPlantedMotifSearch.PMS8introducesnovelideas forgeneratingcommonneighborhoods.Wehavealsoimplementedaparallelversionforthisalgorithm.PMS8can solveinstancesnotsolvedbyanypreviousalgorithms. Background in PCR primer design, genetic probe design, discover- This paper presents an efficient exact parallel algorithm ingpotentialdrugtargets, antisense drugdesign,finding forthePlanted MotifSearch(PMS)problemalso known unbiased consensus of a protein family, creating diag- asthe(l,d)motifproblem[1].Astringoflengthliscaller nosticprobesandmotif finding(seee.g.,[2]).Therefore, anl-mer.Thenumberofpositionswheretwol-mersuand efficientalgorithmsforsolvingthePMSproblemarevery v differ is called theirHamming distance and is denoted importantinbiologyandbioinformatics. byHd(u,v).Forany stringT, T[i..j]is the substringofT A PMS algorithm that finds all the motifs for a given starting at position i and ending at position j. The PMS inputis calledanexact algorithm.Allknownexactalgo- problemisthefollowing.GivennsequencesS ,S ,...,S rithms have an exponential worst case runtime because 1 2 n oflength meach,fromanalphabet(cid:2) and twointegers l thePMSproblemisNP-complete[2].Anexactalgorithm andd,identifyalll-mersM,M∈(cid:2)l,thatoccurinatleast canbebuiltusingtwoapproaches.Oneissampledriven: onelocationineachofthensequenceswithaHamming forall(m−l+1)npossiblecombinationsofl-merscoming distance ofatmostd. Moreformally, M isamotififand from different strings, generate the common neighbor- only if ∀i,1 ≤ i ≤ n,∃j,1 ≤ j ≤ m−l +1, such that hood. The other is pattern-driven: for all (cid:2)l possible i i Hd(M,S[j..j +m−1])≤d. l-mers checkwhichare motifs.Many algorithmsemploy i i i ThePMSproblemisessentiallythesameastheClosest a combination of these two techniques.For example, [3] Substring problem. These problems have applications and[4]generatethecommonneighborsforeverypairof l-merscomingfromtwooftheinputstrings.Everyneigh- bor is then matched against the remaining n − 2 input *Correspondence:[email protected] DepartmentofComputerScienceandEngineering,UniversityofConnecticut, stringstoconfirmorrejectitasamotif.Otheralgorithms Storrs,CT,USA ([5,6])considergroupsofthreel-mersinsteadoftwo. ©2014NicolaeandRajasekaran;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsofthe CreativeCommonsAttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse, distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page2of10 http://www.biomedcentral.com/1471-2105/15/34 PMS algorithms are typically tested on instances gen- theintersectionofthed-neighborhoodsofalll-mersinT. erated as follows (also see [1,4]): 20 DNA strings of Tocomputecommonneighborhoods,anaturalapproach length 600 are generated according to the i.i.d. (inde- is to traverse the tree of all possible l-mers and iden- pendent identically distributed) model. A random l-mer tify the common neighbors. A pseudocode is given in is chosen as a motif and planted at a random loca- Appendix1.Anodeatdepthk,whichrepresentsak-mer, tion in each input string. Every planted instance is is not explored deeper if certain pruning conditions are modified in at most d positions. For a given integer l, met.Thus,thebetterthepruningconditionsare,thefaster the instance (l,d) is defined to be challenging if d is willbethealgorithm.Wediscusspruningconditionsina the smallest integer for which the expected number latersection. of motifs of length l that occur in the input by ran- PMS8 consists of a sample driven part followed by a dom chance is ≥ 1. Some of the challenging instances pattern driven part. In the sample driven part we gen- are (13,4),(15,5),(17,6),(19,7),(21,8),(23,9),(25,10), erate tuples of l-mers originating from different strings. (26,11),etc. In the pattern driven part we generate the common The largest challenging instance solved up to now has d-neighborhoodofsuchtuples.Initiallywebuildamatrix been(23,9).Tothebestofourknowledgetheonlyalgo- R of size n × (m − l + 1) where row i contains all the rithmtosolve(23,9)hasbeenqPMS7[5].Thealgorithm l-mersinS.Wepickanl-merxfromrow1ofRandpush i in[7]cansolveinstanceswithrelativelylargel(upto48) it on a stack. We filter out any l-mer in R at a distance providedthatdisatmostl/4.However,mostofthewell greaterthan2d fromx. Thenwe pickan l-mer fromthe knownchallenginginstanceshaved > l/4.PairMotif[3] second row of R and push it on the stack. We filter out cansolveinstanceswithlargerl,suchas(27,9)or(30,9), any l-mer in R that does not have a common neighbor but these are significantly less challenging than (23,9). withthel-mersonthestack;thenwerepeattheprocess. Furthermore, for instances that current algorithms have Ifanyrowbecomesempty,wediscardthetopofthestack, beenabletosolve,theruntimesareoftenquitelarge. revert to the previous instance of R and try a different Inthispaperweproposeanewexactalgorithm,PMS8, l-mer. If the stack size is above a certain threshold (see whichcanefficientlysolvebothinstanceswithlargeland section on Memory and Runtime) we generate the com- instances with large d. The efficiency of PMS8 comes mond-neighborhoodofthel-mersonthestack.Foreach mainlyfromreducingthesearchspacebyusingtheprun- neighborMwecheckwhetherthereisatleastonel-mer ingconditionspresentedlaterinthepaper,butalsofrom uineachrowofRsuchthatHd(M,u) ≤ d.Ifthisistrue a careful implementation which utilizes several speedup then M is a motif. The pseudocode of PMS8 is given in techniquesandemphasizescachelocality. Appendix2. One of the basic steps employed in many PMS algo- Anecessaryandsufficientconditionfor3l-merstohave rithms (suchas PMSprune [4],PMS5 [8], PMS6 [9],and acommonneighborisdiscussedinthesectiononpruning qPMS7 [5]) is that of computing all the common neigh- conditions.For4ormorel-mersweonlyhavenecessary bors of three l-mers. In qPMS7, this problem is solved conditions, so we may generate tuples that will not lead using an Integer LinearProgramming (ILP) formulation. tosolutions.However,duetothewaythefilteringworks, In particular, a large number of ILP instances are solved thefollowingobservationsapply.Letthestackatanyone asapartofapreprocessingstepandatableispopulated. time contain t l-mers: r ,r ,...,r where r is at the top 1 2 t t This table is then repeatedly looked up to identify com- of the stack. Evidently, the l-mers on the stack pass the mon neighbors of three l-mers. This preprocessing step necessary pruning conditions for t l-mers. However, the takesaconsiderableamountoftimeandthelookuptable following are also true. For any two l-mers r and r on i j callsforalargeamountofmemory.Inthispaperweoffera the stack, there exists a common neighbor. This is true novelalgorithmforcomputingallthecommonneighbors becausewheneverwefilteranl-merwemakesureithas ofthreel-mers.Thisalgorithmeliminatesthepreprocess- at least one neighbor in common with the l-mer at the ingstep.Inparticular,wedon’tsolveanyILPinstance.We stop of the stack. Thus, we can prove by induction that alsodon’temployanylookuptablesandhencewereduce anytwol-mersinthestackhaveacommonneighbor.Sec- the memory usage. We feel that this algorithm will find ond, any triplet of the form r ,r ,r, 2 < i ≤ t, has at 1 2 i independentapplications.Specifically,westateandprove least one common neighbor. This is true because when necessaryandsufficientconditionsfor3l-merstohavea the stack had only the first two l-mers we filtered out commonneighbor. any l-mer which doesn’t make a compatible triplet with the two. In general, for any number p < t the (p + 1) Methods tuples of the form r ,r ,...,r ,r where p < i ≤ t 1 2 p i For any l-mer u we define its d-neighborhood as the mustpassthepruningconditionsforp+1l-mers.There- set of l-mers v for which Hd(u,v) ≤ d. For any set of fore, even though the pruning for more than 3 l-mers l-mersT wedefinethecommond-neighborhoodofT as is not perfect, the algorithm implicitly tests the pruning NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page3of10 http://www.biomedcentral.com/1471-2105/15/34 conditions on many subsets of the l-mers in the stack Findmotifsforasubsetofthestrings and thus decreases the number of false positive tuples Weusethespeeduptechniquedescribedin[10]:compute generated. themotifsforn(cid:10) <noftheinputstrings,thentesteachof themagainsttheremainingn−n(cid:10)strings.Fortheresultsin (cid:10) Speeduptechniques thispapern washeuristicallycomputedusingtheformula Sortrowsbysize providedinAppendix4. Animportantspeeduptechniqueistoreordertherowsof Rbysizeaftereveryfilteringstep.Thisreducesthenum- MemoryandRuntime beroftuplesthatweconsideratlowerstacksizes.These SincewestoreallmatricesRinthespaceofasinglematrix tuplesrequirethemostexpensivefilteringbecauseasthe theyonlyrequireO(n(m−l+1))wordsofmemory.Tothis stacksizeincreases,fewerl-mersremaintobefiltered. we add O(n2) words to storerow sizes forthe at most n matriceswhichsharethesamespace.Thebitsofinforma- Compressl-mers tionforcompatiblel-merpairstakeO((n(m−l+1))2/w) WecanspeedupHammingdistance operationsbycom- words,wherewisthenumberofbitsinamachineword. pressingall the l-mers of R in advance. For example, for The table of compressed l-mers takes O(n(m − l + 1)) DNA we store 8 characters in a 16 bit integer, divided words.Therefore,thetotalmemoryusedbythealgorithm into8groupsof2bits.Forevery16bitintegeriwestore isO(n(n+m−l+1)+(n(m−l+1))2/w). in atable the number of non-zero groupsof bitsin i.To Themoretimewespendinthesampledrivenpart,the compute the Hamming distance between two l-mers we lesstimewehavetospendinthepatterndrivenpartand first perform an exclusive or of their compressed repre- vice-versa.Ideallywewanttochoosethethresholdwhere sentations. Equal characters produce bits of 0, different weswitchbetweenthetwopartssuchthattheirruntimes characters produce non-zero bits. Therefore, one table are almost equal. The optimal threshold can be deter- lookup providesthe Hamming distance for 8 characters. mined empirically by running the algorithm on a small Onecompressedl-merrequiresl∗(cid:8)log|(cid:2)|(cid:9)bitsofstorage. subsetof thetuples. In practice, PMS8 heuristicallyesti- However,weonlyneedthefirst16bitsofthisrepresenta- mates the threshold t such that it increases with d and tionbecause thenext 16bitsare thesameas thefirst16 |(cid:2)| to avoid generating very large neighborhoods and it bitsofthel-mer8positionstotherightofthecurrentone. decreaseswithmtoavoidspendingtoomuchtimeonfil- Therefore, the table of compressed l-mers only requires tering. All the results reported in this paper have been O(n(m−l+1))wordsofmemory. obtainedusingthedefaultthresholdestimationprovided inAppendix4. Preprocessdistancesforpairsofl-mers The filtering step tests many times if two l-mers have a Parallelimplementation distanceofnomorethan2d.Thus,foreverypairofl-mers Wecanthinkofm−l+1independentsubproblems,one wecomputethisbitofinformationinadvance. for eachl-mer in the firststring. The first stringin each subproblemisanl-meroftheoriginalfirststringandthe Cachelocality rest of the strings are the same as in the original input. WecanupdateRinanefficientmannerasfollows.Every Becauseofthis,theproblemseemstobe“embarrassingly (cid:10) rowintheupdatedmatrixR isasubsetofthecorrespond- parallel.”Astraightforwardparallelizationideaistoassign ing row in the current matrix R, because some elements anequalnumberofsubproblemstoeachprocessor.This (cid:10) willbefilteredout.Therefore,wecanstoreR inthesame method has the advantage that no inter-processor com- memorylocationsasR.Todothis,ineachrow,wemove munication is necessary beyond broadcasting the input (cid:10) theelementsbelongingtoR atthebeginningoftherow. toallprocessors.Thismethodwouldworkwellforalgo- Inaddition,wekeeptrackofhowmanyelementsbelongto rithms where each subproblem is expected to have a (cid:10) (cid:10) R.TorevertfromR backtoR,werestoretherowsizesto similarruntime.However,forPMS8,theruntimeofeach theirpreviousvalues.Therowelementswillbethesame subproblem is very sensitive to the size of the neighbor- even if they have been moved withinthe row.The same hoods for various combinations of l-mers and therefore process can be repeated at every step of the recursion, someprocessorsmayendupstarvingwhileothersarestill therefore the whole “stack” of R matrices is stored in a busy. singlematrix.Thisreducesthememoryrequirementand Toalleviate the above shortcomingwe employ the fol- improves cache locality. The cache locality is improved lowingstrategy.Theprocessorwithrank0isascheduler because at every step of the recursion, in each row, we andtheothersareworkers.Theschedulerspawnsasep- accessasubsetoftheelements weaccessedintheprevi- arateworkerthreadtoavoidusingoneprocessorjustfor ous step,and thoseelements are in contiguous locations scheduling.Theschedulerreadstheinputandbroadcasts ofmemory. ittoallworkers.Theneachworkerrequestsasubproblem NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page4of10 http://www.biomedcentral.com/1471-2105/15/34 Figure1Proofoftheorem1,case1.Proofoftheorem1,case1:Thereexistsi,1≤i≤3forwhichni≥di.Withoutlossofgeneralityweassume i=1.Thetop3rowsrepresenttheinputl-mers.ThelastrowshowsacommonneighborM.Inanycolumn,identicalcolorsrepresentsmatches, differentcolorsrepresentmismatches. from the scheduler, solves it and repeats. The scheduler foranyM,thenwecanuseitasapruningcondition.Ifthe loops until all jobs have been requested and all workers lowerboundisgreaterthan|T|dthenthereisnocommon havebeennotifiedthatnomorejobsareavailable.Atthe neighbor for T. One such lower bound is the consensus end,allprocessorssendtheirmotifstothescheduler.The totaldistance. schedulerloopsthroughalltheprocessorsandcollectsthe results.Theschedulerthenoutputstheresults. Definition1. LetTbeasetofl-mers,wherek = |T|.For everyi,thesetT [i],T [i],..,T [i]iscalledthei-thcolumn 1 2 k Pruningconditions of T. Let mi be the maximum(cid:2)frequency of any character Inthissectionwepresentpruningconditionsappliedfor in column i. Then Cd(T) = i=1..lk − mi is called the consensustotaldistanceofT. filteringl-mersinthesampledrivenpartandforpruning enumerationtreesinthepatterndrivenpart. The consensus total distance is a lower bound for the Two l-mers a and b have a common neighbor M such total distance between any l-mer M and the l-mers in that Hd(a,M) ≤ d and Hd(b,M) ≤ d if and only if a b T because, regardless of M, the distance contributed by Hd(a,b) ≤ da + db. For 3 l-mers, no trivial necessary column i to the total distance is at least k − m. The i andsufficientconditionshavebeenknownuptonow.In consensus total distance for a set of two l-mers A and B [8] sufficient conditions for 3 l-mers are obtained from willbedenotedbyCd(A,B).AlsonoticethatCd(A,B) = a preprocessed table. However, as l increases the mem- Hd(A,B).Wecaneasilyprovethefollowinglemma. ory requirement of the table becomes a bottleneck. We willgive simple necessaryand sufficient conditionsfor3 Lemma 1. Let T be a set of l-mers and k = |T|. Let l-merstohaveacommonneighbor.Theseconditionsare d ,d ,...d benon-negativeintegers.Thereexistsal-mer 1 2 k alsonecessaryformorethan3l-mers. MsuchthatHd(M,Ti)≤di,∀i,onlyifCd(T)≤(cid:2)ik=1di. (cid:2)Let T be a set of l-mers and M be an l-mer. If u∈THd(M,u) > |T|d then, by the pigeonhole princi- Theorem 1. Let T be a set of 3 l-mers and d1,d2,d3 be ple,onel-mermusthaveadistancefromMgreaterthan non-negative integers. There exists a l-mer M such that d. Therefore, M cannot be a common n(cid:2)eighbor of the Hd(M,Ti) ≤ di,∀i,1 ≤ i ≤ 3ifand onlyifthefollowing l-mersinT.Ifwehavealowerboundon u∈THd(M,u) conditionshold: Figure2Proofoftheorem1,case2.Proofoftheorem1,case2:ni<diforalli,1≤i≤3.Thetop3rowsrepresenttheinputl-mers.Thelastrow showsacommonneighborM.Inanycolumn,identicalcolorsrepresentsmatches,differentcolorsrepresentmismatches. NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page5of10 http://www.biomedcentral.com/1471-2105/15/34 oftypeN .Letn,∀i,0≤i≤4bethenumberofcolumns 4 i oftypeN.Considertwocases: i Case1)Thereexistsi,1 ≤ i ≤ 3forwhichn ≥ d.We i i constructMasillustratedinFigure1.Pickd columnsof i typen.ForeachchosencolumnksetM[k]=T[k]where i j j (cid:11)= i. For all othercolumns set M[k]= T[k].Therefore i Cd(T,M) = d. For j (cid:11)= i we know that Cd(T,T) ≤ i i i j d + d from condition i) (condition i is assumed to be i j trueatthispointbecauseweareprovingthe“if”part).We alsoknowthatCd(T,M)+Cd(M,T) ≤ Cd(T,T)from i j i j the triangle inequality. It follows that Cd(M,T) ≤ d. j j SinceCd(M,T)=Hd(M,T)itmeansthatMisindeeda j j commonneighborofthethreel-mers. Case 2) For all i,1 ≤ i ≤ 3 we have n < d. We i i constructMasshowninFigure2.Forcolumnsk oftype N ,N andN wesetM[k]= T [k].Forcolumns oftype 0 2 3 1 N wesetM[k]= T [k].Foranyi,1 ≤ i ≤ 3thefollow- 1 2 Figure3Speedupoftheparallelimplementationoverthesingle ing applies. If n +n ≤ d then the Hamming distance i 4 i coreversion.Speedupofthemulti-coreversionofPMS8overthe betweenMandT islessthand regardlessofwhatchar- i i singlecoreversion,forseveraldatasets. acters we choose for M in the columns of type N . On 4 the other hand, if n + n > d then M and T have i 4 i i to match in at least n + n − d columns of type N . i) Cd(T,T)≤d +d,∀i,j,1≤i<j≤3 i 4 i 4 i j i j Thus, we pick max(0,n + n − d) columns of type N ii) Cd(T)≤d +d +d i 4 i 4 1 2 3 and for each such column k we set M[k]= T[k]. Now i weprovethatweactually have enough columns tomake Proof. The“onlyif”partfollowsfromlemma1. For the“if” part we showhow to construct acommon the above choices,inotherwords (cid:2)i3=1max(0,ni +n4 − d) ≤ n . This is equivalent to the following conditions neighborMprovidedthattheconditionshold.Wesaythat i 4 beingtrue: a column k where T [k]= T [k]= T [k] is of type N . 1 2 3 0 If T1[k](cid:11)= T2[k]= T3[k] then the column is of type N1. a) Foranyi,1≤i≤3wewantni+n4−di ≤n4.This If T1[k]= T3[k](cid:11)= T2[k]thecolumn is oftype N2 and if istruebecauseni <di. T [k]=T [k](cid:11)=T [k]thenthecolumnisoftypeN .Ifall b) Foranyi,j,1≤i<j≤3wewant 1 2 3 3 threecharactersinthecolumnaredistinct,thecolumnis (ni+n4−di)+(nj+n4−dj)≤n4.Thiscanbe Figure4Theeffectofthe“sortrowsbysize”speeduponruntime.ComparisonbetweenPMS8andPMS8withoutsortingrowsbysize(seethe sectiononSpeeduptechniques).Notethat,inbothversions,alltheotherspeedupsareenabled.Theruntimesareforsinglecoreexecution, averagedover5randominstances. NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page6of10 http://www.biomedcentral.com/1471-2105/15/34 Figure5PMS8runtimesformultiplecombinationsoflandd.PMS8runtimesfordatasetswithlupto50anddupto25averagedover5 randomdatasets.Whitebackgroundsignifiessinglecoreexecution.Bluebackgroundsignifiesexecutionusing48cores.Instancesingrayhave morethan500spuriousmotifs.Orangecellsindicateunsolvedinstances.Timeisreportedinhours(h),minutes(m)andseconds(s). rewrittenasn +n +n ≤d +d.Thelefthandside combination we report the average runtime over 5 ran- i j 4 i j isHd(T,T)whichweknowislessorequaltod +d. dom instances. For several challenging instances, in i j i j c) Wewant(cid:2)i3=1ni+n4−di ≤n4.Thiscanberewritten Figure 3 we present the speedup obtained by the paral- asn +n +n +2n ≤d +d +d .Thelefthand lelversionoverthesinglecoreversion.Forp = 48cores 1 2 3 4 1 2 3 sideisCd(T)whichweknowislessthand +d +d . thespeedupisclosetoS = 45and thustheefficiency is 1 2 3 E=S/p=94%. Oneofourreviewerskindlypointedoutthattheabove Table1ComparisonbetweenqPMS7andPMS8 proofissimilartoanalgorithmin[11]. Instance qPMS7 PMS81 PMS816 PMS832 PMS848 (13,4) 29s 7s 3s 2s 2s Resultsanddiscussion PMS8 is implemented in C++ and uses OpenMPI for (15,5) 2.1m 48s 5s 4s 3s communicationbetweenprocessors.PMS8wasevaluated (17,6) 10.3m 5.2m 22s 12s 9s on the Hornet cluster in the Booth Engineering Center (19,7) 54.6m 26.6m 1.7m 52s 37s forAdvancedTechnology(BECAT)atUniversityofCon- (21,8) 4.87h 1.64h 6.5m 3.3m 2.2m necticut. The Hornet cluster consists of 64 nodes, each (23,9) 27.09h 5.48h 21.1m 10.7m 7.4m equippedwith12IntelXeonX5650Westmerecoresand (25,10) - 15.45h 1.01h 30.4m 20.7m 48GBofRAM.ThenodesuseInfinibandnetworkingfor MPI.Inourexperimentsweemployedatmost48coreson (26,11) - - - - 46.9h atmost4nodes. ComparisonbetweenqPMS7andPMS8onchallenginginstances.PMS8Pmeans We generated random (l,d) instances according to [1] PMS8usedPCPUcores.Bothprogramshavebeenexecutedonthesame hardwareandthesamedatasets.Thetimesareaverageruntimesover5 and as described in the introduction. For every (l,d) instancesforeachdataset. NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page7of10 http://www.biomedcentral.com/1471-2105/15/34 are solved efficiently using a single CPU core. For more challenging instances we report the time taken using 48 cores. A comparison between PMS8 and qPMS7 [5] on chal- lenging instances is shown in Table 1. Both programs have been executed on the Hornet cluster. qPMS7 is a sequentialalgorithm.PMS8wasevaluatedusingupto48 cores. The speedup of PMS8 single core over qPMS7 is showninFigure6.Thespeedupishighforsmallinstances because qPMS7 has to load an ILP table. For larger instances the speedup of PMS8 sharply increases. This is expected because qPMS7 always generates neighbor- hoodsfortuplesof3l-mers,whichbecomeverylargeasl anddgrow.Ontheotherhand,PMS8increasesthenum- berofl-mersinthetuplewiththeinstancesize.Witheach Figure6SpeedupofPMS8singlecoreoverqPMS7.Ratioof runtimesbetweenqPMS7andPMS8runningonasinglecore.Both l-mer added to the tuple, the size of the neighborhood programshavebeenexecutedonthesamehardwareandthesame reduces exponentially, whereas the number of neighbor- datasets.Thetimesareaverageruntimesover5instancesforeach hoodsgeneratedincreasesbyalinearfactor.TheILPtable dataset. precomputationrequiressolvingmanyILPformulations. Thetable thenmakesqPMS7 lessmemoryefficientthan InFigure4weshowtheeffectofthefirstspeeduptech- PMS8.ThepeakmemoryusedbyqPMS7forthechalleng- nique (sorthing rows by size) on the runtime. Note that ing instances in Table 1 was 607 MB whereas for PMS8 all otherspeedupsareenabled, onlysortingrowsbysize it was 122 MB. Furthermore, due to the size of the ILP is varied. The figure shows that sorting the rows by size table, qPMS7 is not able to solve any instances where improvesthespeedofPMS8by25%to50%. l>23.PMS8isthefirstalgorithmtosolvethechallenging Theruntime ofPMS8oninstanceswithlupto50and instances(25,10)and(26,11). d up to 21 is shown in Figure 5. Instances which are Some recent results in the literature have also focused expected to have more than 500 motifs simply by ran- on instances other than the challenging ones presented domchance(spuriousmotifs)areexcluded.Theexpected above. A summary of these results and a comparison number of spurious motifs was computed as described with PMS8 is presented in Table 2, starting with the in Appendix 3. Instances where d is small relative to l mostrecentresults.Theseresultshavebeenobtainedon Table2 ComparisonbetweenPMS8andrecentresultsintheliterature Previousalgorithm Instance Time Cores PMS8time PMS8cores Abbasetal.2012[12],PHEP_PMSprune (21,8) 20.42h 8 6.5m 1 Yuetal.2012[3],PairMotif (27,9) 10h 1 4s 1 (24,6) 347s 1 1s 1 DesarajuandMukkamala2011[7] (48,12) 188s 1 1s 1 (21,8) 3.7h 16 6.5m 16 Dasarietal.2011[13],mSPELLER/gSPELLER (21,8) 2.2h 6.5m 16 4GPUsx240cores Dasarietal.2010[14],BitBased (21,8) 1.1h 6.5m 16 DasariandDesh2010[15],BitBased (21,8) 6.9h 16 6.5m 16 Sahooetal.2011[16] (16,4) 106s 4 1s 1 Sunetal.2011[17],TreeMotif (40,14) 6h 1 6s 1 Heetal.2010[18],ListMotif (40,14) 28,087s 1 6s 1 Faheem2010[19],skip-BruteForce (15,4) 2934s 96nodes 1s 1 (24,8) 4h 1 5s 1 Hoetal.2009[6],iTriplet (38,12) 1h 1 1s 1 (40,12) 5m 1 1s 1 SidebysidecomparisonbetweenPMS8andrecentresultsintheliterature.Timeisreportedinseconds(s),minutes(m)orhours(h).Notethatthehardwareis different,thoughwetriedtomatchthenumberofprocessorswhenpossible.Also,theinstancesarerandomlygeneratedusingthesamealgorithm,howeverthe actualinstancesusedbythevariouspapersaremostlikelydifferent.ForPMS8,thetimesareaveragesover5randomlygeneratedinstances. NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page8of10 http://www.biomedcentral.com/1471-2105/15/34 Table3RuntimecomparisonbetweenPMS8andqPMS7onrealdatasetsfrom[20] Dataset n Totalno.bases l d PMS8time qPMS7time dm01r 4 6000 21 4 1 55 dm01r 4 6000 23 5 1 6 dm04r 4 8000 21 4 1 5 dm04r 4 8000 23 5 1 5 hm01r 18 36000 21 6 10 14 hm01r 18 36000 23 7 25 40 hm02r 9 9000 21 6 1 11 hm02r 9 9000 23 7 4 35 hm03r 10 15000 21 6 3 24 hm03r 10 15000 23 7 14 146 hm04r 13 26000 21 6 6 44 hm04r 13 26000 23 7 15 39 hm05r 3 3000 21 4 1 6 hm05r 3 3000 23 5 1 46 hm08r 15 7500 17 5 1 7 hm08r 15 7500 17 6 46 251 hm19r 5 2500 23 5 1 5 hm19r 5 2500 23 6 1 5 hm20r 35 70000 21 6 27 32 hm20r 35 70000 23 7 56 136 hm26r 9 9000 23 6 1 5 hm26r 9 9000 23 7 5 46 mus02r 9 9000 21 6 1 11 mus02r 9 9000 23 7 2 45 mus04r 7 7000 21 6 1 15 mus04r 7 7000 23 7 2 22 mus05r 4 2000 21 5 1 79 mus05r 4 2000 23 6 1 5 mus07r 4 6000 21 5 1 79 mus07r 4 6000 23 5 1 6 mus10r 13 13000 21 6 2 56 mus10r 13 13000 23 7 2 70 mus11r 12 6000 21 7 8 150 mus11r 12 6000 23 8 23 938 yst01r 9 9000 21 6 2 14 yst01r 9 9000 23 7 8 63 yst02r 4 2000 21 5 1 5 yst02r 4 2000 23 6 1 6 yst03r 8 4000 21 6 1 8 yst03r 8 4000 23 7 1 19 yst04r 6 6000 21 4 1 5 yst04r 6 6000 23 5 1 5 yst05r 3 1500 21 4 1 5 yst05r 3 1500 23 5 1 5 NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page9of10 http://www.biomedcentral.com/1471-2105/15/34 Table3 RuntimecomparisonbetweenPMS8andqPMS7onrealdatasetsfrom[20](Continued) yst06r 7 3500 21 6 1 6 yst06r 7 3500 23 7 2 12 yst08r 11 11000 21 5 1 6 yst08r 11 11000 23 6 1 6 yst09r 16 16000 21 6 2 17 yst09r 16 16000 23 7 6 68 Foreachdatasetwetestedtwocombinationsoflandd.ForqPMS7wesetq=n.BothalgorithmswereexecutedonasingleCPUcore.Timeisreportedinseconds, roundeduptothenextsecond. various types of hardware: single core, multi-core, GPU, GenerateNeighborhood(T,r,p) grid. In the comparison, we try to match the number if(p≤l)then of processors whenever possible. The speed difference if(notprune(T,r))then is of several orders of magnitude in some cases which forα ∈(cid:2)do indicatesthatthepruningconditionsemployedbyPMS8 x :=α p exponentiallyreducethesearchspacecomparedtoother for(i=1..|T|)do algorithms. T(cid:10) :=T[2..|S|] i i i We also compared PMS8 with qPMS7 on the real r(cid:10) :=r; i i datasetsdiscussedin[20].Weexcludeddatasetswithless if(T[0](cid:11)=α)thenr(cid:10) :=r(cid:10)−1; i i i than 4 input sequences because these are not very chal- endfor lenging. For each dataset we chose two combinations of GenerateNeighborhood(T(cid:10),r(cid:10),p+1) l and d. These combinations were chosen on a dataset endfor basisbecauseforlargevaluesofdthenumberofreported endif motifsisexcessiveandforsmallvaluesofdtheinstanceis else notverychallenging. Tomake qPMS7 behave likePMS8 reportl-merx wesetthequorumpercentto100%(q=n).InTable3we endif report the dataset name, the total number of sequences, thetotalnumberofbasesineachdataset,thelanddcom- Appendix2PMS8pseudocode binationandtheruntimesofthetwoalgorithms.Notethat Algorithm2.PMS8(T,d) both algorithms are exact algorithms and therefore the for(i=1..n)doR ={u|u∈S} i i sensitivityandspecificityarethesame.Similartothecom- stack ={} parison on synthetic data, the comparison on real data GenerateMotifs(1,stack,R) revealsthatPMS8outperformsqPMS7. GenerateMotifs(p,stack,R) Conclusions for(u∈R )do p We have presented PMS8, an efficient algorithm for the stack.push(u) PMSproblem.PMS8isabletoefficientlygenerateneigh- R(cid:10) :=filter(R,stack) borhoods for t l-mers at a time, by using the pruning if(R(cid:10).size>0)then conditions presented in this paper. Previous algorithms if(ThresholdCondition)then generate neighborhoods for only up to three l-mers at N :=GenerateNeighborhood(stack,d) a time whereas in PMS8 the value of t is increased as for(m∈N)do (cid:10) theinstancesbecomemorechallengingandthereforethe if(isMotif(m,R))thenoutputm; exponential explosion is postponed. The second reason else fortheefficiencyofPMS8comesfromthecarefulimple- GenerateMotifs(p+1,R(cid:10)) mentationwhichemploysseveralspeeduptechniquesand stack.pop() emphasizescachelocality. endfor Appendix Appendix1Generatingneighborhoods Appendix3Challenginginstances For a fixed l, as d increases, the instance becomes more Algorithm1.GenerateNeighborhood(T,d) challenging.However,asdincreases,thenumberoffalse for(i=1..|T|)dori :=d; positivesalsoincreases,becausemanymotifswillappear GenerateNeighborhood(T,r,1) simply by random chance. The expected number of NicolaeandRajasekaranBMCBioinformatics2014,15:34 Page10of10 http://www.biomedcentral.com/1471-2105/15/34 spuriousmotifsinarandominstancecanbeestimatedas DiscreteAlgorithms,17-19January1999,Baltimore,Maryland,ACM/SIAM follows(seee.g.,[4]).Thenumberofl-mersintheneigh- 1999:633–642. borhoodofagivenl-merMisN((cid:2),l,d)=(cid:2)id=0(ld)(|(cid:2)|− 3. aYulgQor,iHthumoHfo,ZrhpalnagntYe,dG(ulo,dH):DPaNiArMmootitfi:fasenaerwchp.aPtLtoeSrnO-NdEri2v0e1n2, 1)d. The probabilitythat M is a d-neighbor of a random 7(10):e48442. l-merisp((cid:2),l,d) = N((cid:2),l,d)/|(cid:2)|l.Theprobabilitythat 4. DavilaJ,BallaS,RajasekaranS:Fastandpracticalalgorithmsfor planted(l,d)motifsearch.IEEE/ACMTransComputBioland Mhasatleastoned-neighboramongthel-mersofastring Bioinformatics2007,4(4):544–552. oflengthmisthusq(m,(cid:2),l,d)=1−(1−p((cid:2),l,d))m−l+1. 5. DinhH,RajasekaranS,DavilaJ:qPMS7:AFastAlgorithmforFinding TheprobabilitythatMhasatleastoned-neighborineach (l,d)-MotifsinDNAandProteinSequences.PLoSONE2012, 7(7):e41425. of n random strings of length m is q(m,(cid:2),l,d)n. Finally, 6. HoE,JakubowskiC,GundersonS,etal.:iTriplet,arule-basednucleic the expected number of spurious motifs in an instance acidsequencemotiffinder.AlgoMolBiol2009,4:14. with n strings of length m each is: |(cid:2)|lq(m,(cid:2),l,d)n. 7. DesarajuS,MukkamalaR:Multiprocessorimplementationof modelingmethodforPlantedMotifProblem.In2011WorldCongress In this paper we consider all combinations of l and d onInformationandCommunicationTechnologies(WICT),Dec11-14,2011, wherel isatmost50and thenumberofspuriousmotifs Mumbai,India,IEEE2011:524–529. (expectedbyrandomchance)doesnotexceed500.Note 8. DinhH,RajasekaranS,KundetiV:PMS5:anefficientexactalgorithmfor that for a fixed d, if we can solve instance (l,d) we the(l,d)-motiffindingproblem.BMCBioinformatics2011,12:410. 9. BandyopadhyayS,SahniS,RajasekaranS:PMS6:Afastalgorithmfor can also solve all instances (l(cid:10),d) where l(cid:10) > l, because motifdiscovery.InIEEE2ndInternationalConferenceonComputational theyarelesschallengingthan(l,d). AdvancesinBioandMedicalSciences,ICCABS2012,LasVegas,NV,USA, February23-25,2012,IEEE2012:1–6. 10. RajasekaranS,DinhH:Aspeeduptechniquefor(l,d)-motiffinding algorithms.BMCResNotes2011,4:54.[http://www.biomedcentral.com/ Appendix4Heuristicsfortandn 1756-0500/4/54] In the methods section we mentioned that we heuristi- 11. GrammJ,NiedermeierR,RossmanithP:Exactsolutionsforclosest callyestimatethethresholdtatwhichweswitchfromthe stringandrelatedproblems.InAlgorithmsandComputation,Volume 2223ofLectureNotesinComputerScience.EditedbyEadesP,TakaokaT. pattern driven to the sample driven part. The exact for- BerlinHeidelberg:Springer;2001:441–453. mula(cid:3)used by the algorithm to compute t was t = max 12. AbbasM,AbouelhodaM,BahigH:Ahybridmethodfortheexact (2,(cid:12) 2(d+1)log(cid:2)−logm(cid:13)).Thisfollowstheintuition planted(l,d)motiffindingproblemanditsparallelization.BMC that t should increase with (cid:2) to avoid large neighbor- Bioinformatics2012,13(Suppl17):S10.[http://www.biomedcentral.com/ 1471-2105/13/S17/S10] hoods and decrease withm to avoid spending too much 13. DasariN,RanjanD,ZubairM:Highperformanceimplementationof timeonfiltering. plantedmotifproblemusingsuffixtrees.In2011International ConferenceonHighPerformanceComputing&Simulation,HPCS2012, In the Speedup techniques section we mentioned a Istanbul,Turkey,July4-8,2011,IEEE2011:200–206. speedupwherewecomputemotifsforasubsetofn(cid:10) < n 14. DasariN,DeshR,ZubairM:SolvingplantedmotifproblemonGPU.In strings. By default, the algorithm heuristically computes InternationalWorkshoponGPUsandScientificApplications,GPUScA2010, n(cid:10) asn(cid:10) = min(n,t+n/4−logt).Thesesimpleheuristics Vienna,Austria,September11,2010,DepartmentofScientificComputing, UniversityofVienna,TR-10-32010. workedwellenoughonallourtestcases,howevertheuser 15. DasariNS,DeshR,ZubairM:Anefficientmulticoreimplementationof caneasilyoverridethem. plantedmotifproblem.InProceedingsofthe2010International ConferenceonHighPerformanceComputing&Simulation,HPCS2010,June 28-July2,2010,Caen,France.IEEE2010:9–15. 16. SahooB,SouravR,RanjanR,PadhyS:Parallelimplementationofexact Competinginterests algorithmforplantedmotifsearchproblemusingSMPcluster. Theauthorsdeclarethattheyhavenocompetinginterests. EuropeanJScientificRes2011,64(4):484–496. 17. SunH,LowM,HsuW,TanC,RajapakseJ:Tree-structuredalgorithmfor Authors’contributions longweakmotifdiscovery.Bioinformatics2011,27(19):2641–2647. MNandSRdesignedandanalyzedthealgorithms.MNimplementedthe 18. SunHQ,LowM,HsuWJ,RajapakseJ:ListMotif:Atimeandmemory algorithmsandcarriedouttheempiricalexperiments.MNandSRanalyzedthe efficientalgorithmforweakmotifdiscovery.In2010International empiricalresultsanddraftedthemanuscript.Allauthorsreadandapproved ConferenceonIntelligentSystemsandKnowledgeEngineering(ISKE),15-16 thefinalmanuscript. November2010,Hangzhou,China.IEEE2010:254–260. 19. FaheemHM:Acceleratingmotiffindingproblemusinggrid Acknowledgements computingwithenhancedbruteforce.InProceedingsofthe12th TheauthorswouldliketothankProf.Chun-Hsi(Vincent)Huang,Dr.HieuDinh, internationalconferenceonAdvancedcommunicationtechnology, TianMiandGabrielSebastianIlieforhelpfuldiscussions.Thisworkhasbeen ICACT’10,Piscataway,NJ,USA:IEEEPress2010:197–202. supportedinpartbythefollowinggrants:NSF0829916andNIH 20. TompaM,LiN,BaileyTL,ChurchGM,DeMoorB,EskinE,FavorovAV, R01-LM010101. FrithMC,FuY,KentWJ,etal.:Assessingcomputationaltoolsforthe discoveryoftranscriptionfactorbindingsites.NatureBiotechnol2005, Received:15March2013 Accepted:27January2014 23:137–144. Published:31January2014 doi:10.1186/1471-2105-15-34 References Citethisarticleas:NicolaeandRajasekaran:Efficientsequentialandparallel 1. PevznerP,SzeS,etal.:Combinatorialapproachestofindingsubtle algorithmsforplantedmotifsearch.BMCBioinformatics201415:34. signalsinDNAsequences.InProceedingsoftheEighthInternational ConferenceonIntelligentSystemsforMolecularBiology,August19-23,2000, LaJolla/SanDiego,CA,USA,Volume8,AAAI2000:269–278. 2. LanctotJ,LiM,MaB,WangS,ZhangL:Distinguishingstringselection problems.InProceedingsoftheTenthAnnualACM-SIAMSymposiumon

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.