Algorithms for Sequence Alignment

David Richard Powell B.Sc (Hons)
School of Computer Science and Software Engineering
Monash University
Australia

Submitted for the degree of Doctor of Philosophy
August 2001

Abstract

Sequence alignment is an important tool for describing relationships between sequences. Many sequence alignment algorithms exist, differing in efficiency, and in their models of the sequences and of the relationship between sequences. The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences. The algorithms are discussed with particular emphasis on space and time complexity.

A divide-and-conquer method is presented for use with a number of different alignment algorithms. This method may be used to reduce the space complexity of an alignment algorithm with little or no effect on the time complexity. The advantages of this divide-and-conquer method include its simplicity and the ease with which it can be applied to many different alignment algorithms. These advantages are demonstrated by using the divide-and-conquer method in conjunction with several known alignment algorithms.

An efficient alignment algorithm is presented for the important problem of optimally aligning three sequences using a linear function for costing gaps in the alignment. For sequences of length n, and a minimum edit cost of d, this new algorithm has a time complexity of O(d^3 + n). The algorithm is further developed by using the aforementioned divide-and-conquer method to improve its space complexity. This combination results in a time and space efficient algorithm, while also illustrating the usefulness of the divide-and-conquer method.

It is important when aligning sequences to correctly account for any non-randomness that is significant in the sequences. For example, if certain statistical patterns appear throughout sequences from a certain family, it is important to make use of this information when aligning sequences from this family.
Common, unsurprising patterns provide less evidence for the relatedness of sequences than more surprising regions provide. A new algorithm is presented to align optimally two non-random sequences. For a particular sequence model, this new algorithm apportions weight to every part of the alignment dependent on the importance of that part as determined by the sequence model. This algorithm is then developed further so that it can be used to infer whether two non-random sequences are related.

Declaration

This thesis contains no material that has been accepted for the award of any other degree in any university or other institution. To the best of my knowledge, this thesis contains no material previously published or written by another person, except where due reference is made in the text of the thesis.

David R. Powell
August 2001

Acknowledgements

I would like to thank my supervisors, Trevor Dix and Lloyd Allison, both of whom gave important feedback after assiduously proofreading this thesis. I have always been amazed at their willingness to give their time to assist me. They have given useful advice throughout my candidature on topics both related and unrelated to research. The many interesting discussions we had were both productive and enjoyable. They have shown me more patience than I deserved.

I would also like to thank the members of the School of Computer Science and Software Engineering for providing an enjoyable atmosphere in which to work. My time at Monash would not have been the same without this group of friendly and talented people.

It has been a pleasure to share an office with Torsten Seemann. I am deeply indebted to him for the help he has given me by always being willing to listen. My thanks also go to him for his effort in proofreading this manuscript.

Finally, I would like to thank my family. Without their continual love and support this thesis would never have been written.

Contents

1 Introduction
  1.1 Sequence Alignment
  1.2 Finding an Optimal Alignment
    1.2.1 The Basic DPA
    1.2.2 The DPA for Linear Gap Costs
    1.2.3 Alignment in O(n) Space
  1.3 Ukkonen’s Algorithm
    1.3.1 Ukkonen’s Algorithm with Linear Gap Costs
  1.4 Three-way Alignment
    1.4.1 DPA for Three Sequences
    1.4.2 Other Three-way Alignment Algorithms
  1.5 Aligning Non-random Sequences
  1.6 Taxonomy of Alignment Algorithms
2 Sequence Alignment in Linear Space
  2.1 Introduction
  2.2 The Basic DPA
    2.2.1 Check-Pointing
  2.3 DPA with Linear Gap Costs
  2.4 Ukkonen’s Algorithm
    2.4.1 Ukkonen’s Algorithm in Linear Space
    2.4.2 Complexity of Ukk2s and Ukk2s cp
  2.5 Ukkonen’s Algorithm with Linear Gap Costs
  2.6 Aligning Three or More Sequences
  2.7 Conclusion
3 Aligning Three Sequences with Linear Gap Costs
  3.1 Introduction
  3.2 Alignment of Three Sequences
  3.3 Linear Gap Costs
  3.4 Alignment of Three Sequences with Linear Gap Costs
    3.4.1 Using a DPA
    3.4.2 Using Ukkonen’s Algorithm
  3.5 Results
  3.6 Memory Management
  3.7 Conclusion
4 Check-Pointing on Ukkonen’s Algorithm for Three Sequences
  4.1 Introduction
  4.2 The Ukk3l cp Algorithm
    4.2.1 The Details
  4.3 Complications
    4.3.1 The Free Transition
    4.3.2 The Back-loop
    4.3.3 Correspondence Between the U and D matrices
  4.4 Testing
  4.5 Time and Space Usage
  4.6 Conclusion
5 Alignment of Low Information Sequences
  5.1 Introduction
  5.2 Standard Sequence Alignment
  5.3 Costing an Alignment
    5.3.1 Specifying the Model for Data Generation
    5.3.2 Model M of R
  5.4 Encoding R, S1, S2 and their Alignment
  5.5 Search for Optimal R and Alignment (0th Order MM)
    5.5.1 Handling Inserts
  5.6 First Order Markov Model
  5.7 Null Encoding
  5.8 Measuring the Relatedness of Two Sequences
    5.8.1 Null Encoding
  5.9 Results
    5.9.1 Results of Testing Relatedness
  5.10 Conclusion
6 Conclusion
A Sample Alignment of the Transthyretin Gene
B Simple Costs Alignment of the Transthyretin Genes
C Pair-wise Alignments of the Transthyretin Genes

Listings

1.1 The basic DPA or DPA2s algorithm.
1.2 The DPA2l algorithm.
1.3 Ukkonen’s algorithm for Levenshtein costs (Ukk2s).
1.4 Ukkonen’s algorithm for linear gap costs.
1.5 The DPA3s algorithm.
2.1 The DPA2s cp algorithm: simple costs DPA with check-pointing to determine the alignment.
2.2 The Ukk2s cp algorithm: the check-point method on Ukkonen’s algorithm with Levenshtein mutation costs.
3.1 Calculation of a cell for a DPA to find an optimal 3-way alignment with linear gap costs.
3.2 Calculating all cells for a DPA for 3-way alignment with linear gap costs.
3.3 The Ukk() function implementing a memo-array
3.4 Calculation of a cell for Ukkonen’s algorithm with three sequences and linear gap costs.
3.5 The extendDiagonal() function used in the UKKcalcCell function.
4.1 The Ukk() function for the Ukk3l cp algorithm.
4.2 The UkkInLimits() function for the Ukk3l cp algorithm.
5.1 The calculation to determine the matrix cell at (i,j).

List of Figures

1.1 An example of the DPA matrix after aligning the sequences ACCGGTCGGC and TGGTCGCCC.
1.2 An illustration of Hirschberg’s algorithm.
1.3 An example of the U matrix after aligning the sequences ACCGGTCGGC and TGGTCGCCC.
1.4 All-pairs costs and star costs.
1.5 All-pairs costs, star costs and tree costs for four sequences.
2.1 An illustration of the new check-point method on the DPA.
2.2 An example of check-pointing on the DPA matrix for sequences CGCA and AAGT, using Levenshtein costs.
2.3 Comparison of the number of iterations of the loop in the basic DPA against that of the DPA with check-pointing.
2.4 An example of the DPA matrix when two check-point rows are kept instead of one.
2.5 The DPA matrix for aligning the sequences ATAGA and AGAGCGTAGC. The shaded cells correspond to the cells of the U matrix that are computed by Ukkonen’s algorithm.
2.6 An example of the Ukk2s cp algorithm
2.7 Comparison of iterations of the outer loop for the Ukk2s and Ukk2s cp algorithms.
2.8 Comparison of iterations of the inner loop for the Ukk2s and Ukk2s cp algorithms.
2.9 Running time of the Ukk2s algorithm and the Ukk2s cp algorithm.
3.1 Numbering of diagonals of the D matrix for Ukkonen’s algorithm for two sequences.
3.2 The main diagonal of the D matrix for a three sequence algorithm.
3.3 Numbering of diagonals for three-way alignment with Ukkonen’s algorithm
3.4 A mutation machine and a generation machine.
3.5 A 3-state Finite State Machine to produce one sequence from another.
3.6 Example of a different alignment found by the Ukk3l algorithm and Gotoh’s algorithm
3.7 Log-log plot of the number of calls to UKKcalcCell against edit cost.
3.8 Plot of the inner loop iterations against edit cost for sequences of approximately 2000 characters.
3.9 Log-log plot of running time versus edit cost.
3.10 Visualisation of the memory needed by the Ukk3l algorithm (an octahedron) inside a bounding cube.
3.11 Plot of memory allocated versus edit cost for the Ukk3l program. Note ‘c’ is a constant.
3.12 Plot of memory allocated versus edit cost for the Ukk3l program which does not recover an alignment. Note ‘c’ is a constant.
4.1 Log-log plot of the number of calls to UKKcalcCell against edit distance.
4.2 Plot of the inner loop iterations against edit distance for sequences of approximately 2000 characters.
4.3 Log-log plot of memory usage versus edit distance.
5.1 Northernmost and southernmost optimal alignments
5.2 The finite state machine model for generating a sequence S1, as a ‘noisy’ copy of a parent sequence R.
5.3 The combined finite state machine model for sequences S1 and S2 as independent noisy observations of a parent sequence R disallowing insertions.
5.4 The combined finite state machine model for sequences S1 and S2 as independent noisy observations of a parent sequence R with insertions.