MULTIPLE BIOLOGICAL SEQUENCE ALIGNMENT
Scoring Functions, Algorithms and Applications

KEN NGUYEN
XUAN GUO
YI PAN

Wiley Series on Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Nguyen, Ken, 1975- author. | Guo, Xuan, 1987- author. | Pan, Yi, 1960- author.
Title: Multiple biological sequence alignment : scoring functions, algorithms and applications / Ken Nguyen, Xuan Guo, Yi Pan.
Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016004186 | ISBN 9781118229040 (cloth) | ISBN 9781119273752 (epub)
Subjects: LCSH: Sequence alignment (Bioinformatics)
Classification: LCC QH441 .N48 2016 | DDC 572.8–dc23
LC record available at http://lccn.loc.gov/2016004186

Cover image courtesy of Getty Images/Oktal Studio

Typeset in 10/12pt Times LT Std by SPi Global, Chennai, India

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

CONTENTS

Preface

1 Introduction
    1.1 Motivation
    1.2 The Organization of this Book
    1.3 Sequence Fundamentals
        1.3.1 Protein
        1.3.2 DNA/RNA
        1.3.3 Sequence Formats
        1.3.4 Motifs
        1.3.5 Sequence Databases

2 Protein/DNA/RNA Pairwise Sequence Alignment
    2.1 Sequence Alignment Fundamentals
    2.2 Dot-Plot Matrix
    2.3 Dynamic Programming
        2.3.1 Needleman–Wunsch’s Algorithm
        2.3.2 Example
        2.3.3 Smith–Waterman’s Algorithm
        2.3.4 Affine Gap Penalty
    2.4 Word Method
        2.4.1 Example
    2.5 Searching Sequence Databases
        2.5.1 FASTA
        2.5.2 BLAST

3 Quantifying Sequence Alignments
    3.1 Evolution and Measuring Evolution
        3.1.1 Jukes and Cantor’s Model
        3.1.2 Measuring Relatedness
    3.2 Substitution Matrices and Scoring Matrices
        3.2.1 Identity Scores
        3.2.2 Substitution/Mutation Scores
    3.3 GAPS
        3.3.1 Sequence Distances
        3.3.2 Example
    3.4 Scoring Multiple Sequence Alignments
        3.4.1 Sum-of-Pair Score
    3.5 Circular Sum Score
    3.6 Conservation Score Schemes
        3.6.1 Wu and Kabat’s Method
        3.6.2 Jores’s Method
        3.6.3 Lockless and Ranganathan’s Method
    3.7 Diversity Scoring Schemes
        3.7.1 Background
        3.7.2 Methods
    3.8 Stereochemical Property Methods
        3.8.1 Valdar’s Method
    3.9 Hierarchical Expected Matching Probability Scoring Metric (HEP)
        3.9.1 Building an AACCH Scoring Tree
        3.9.2 The Scoring Metric
        3.9.3 Proof of Scoring Metric Correctness
        3.9.4 Examples
        3.9.5 Scoring Metric and Sequence Weighting Factor
        3.9.6 Evaluation Data Sets
        3.9.7 Evaluation Results

4 Sequence Clustering
    4.1 Unweighted Pair Group Method with Arithmetic Mean – UPGMA
    4.2 Neighborhood-Joining Method – NJ
    4.3 Overlapping Sequence Clustering

5 Multiple Sequences Alignment Algorithms
    5.1 Dynamic Programming
        5.1.1 DCA
    5.2 Progressive Alignment
        5.2.1 Clustal Family
        5.2.2 PIMA: Pattern-Induced Multisequence Alignment
        5.2.3 PRIME: Profile-Based Randomized Iteration Method
        5.2.4 DIAlign
    5.3 Consistency and Probabilistic MSA
        5.3.1 POA: Partial Order Graph Alignment
        5.3.2 PSAlign
        5.3.3 ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment
        5.3.4 T-Coffee: Tree-Based Consistency Objective Function for Alignment Evaluation
        5.3.5 MAFFT: MSA Based on Fast Fourier Transform
        5.3.6 AVID
        5.3.7 Eulerian Path MSA
    5.4 Genetic Algorithms
        5.4.1 SAGA: Sequence Alignment by Genetic Algorithm
        5.4.2 GA and Self-Organizing Neural Networks
        5.4.3 FAlign
    5.5 New Development in Multiple Sequence Alignment Algorithms
        5.5.1 KB-MSA: Knowledge-Based Multiple Sequence Alignment
        5.5.2 PADT: Progressive Multiple Sequence Alignment Based on Dynamic Weighted Tree
    5.6 Test Data and Alignment Methods
    5.7 Results
        5.7.1 Measuring Alignment Quality
        5.7.2 RT-OSM Results

6 Phylogeny in Multiple Sequence Alignments
    6.1 The Tree of Life
    6.2 Phylogeny Construction
        6.2.1 Distance Methods
        6.2.2 Character-Based Methods
        6.2.3 Maximum Likelihood Methods
        6.2.4 Bootstrapping
        6.2.5 Subtree Pruning and Re-grafting
    6.3 Inferring Phylogeny from Multiple Sequence Alignments

7 Multiple Sequence Alignment on High-Performance Computing Models
    7.1 Parallel Systems
        7.1.1 Multiprocessor
        7.1.2 Vector
        7.1.3 GPU
        7.1.4 FPGA
        7.1.5 Reconfigurable Mesh
    7.2 Exiting Parallel Multiple Sequence Alignment
    7.3 Reconfigurable-Mesh Computing Models – (R-Mesh)
    7.4 Pairwise Dynamic Programming Algorithms
        7.4.1 R-Mesh Max Switches
        7.4.2 R-Mesh Adder/Subtractor
        7.4.3 Constant-Time Dynamic Programming on R-Mesh
        7.4.4 Affine Gap Cost
        7.4.5 R-Mesh On/Off Switches
        7.4.6 Dynamic Programming Backtracking on R-Mesh
    7.5 Progressive Multiple Sequence Alignment on R-Mesh
        7.5.1 Hierarchical Clustering on R-Mesh
        7.5.2 Constant Run-Time Sum-of-Pair Scoring Method
        7.5.3 Parallel Progressive MSA Algorithm and Its Complexity Analysis

8 Sequence Analysis Services
    8.1 EMBL-EBI: European Bioinformatics Institute
    8.2 NCBI: National Center for Biotechnology Information
    8.3 GenomeNet and Data Bank of Japan
    8.4 Other Sequence Analysis and Alignment Web Servers
    8.5 SeqAna: Multiple Sequence Alignment with Quality Ranking
    8.6 Pairwise Sequence Alignment and Other Analysis Tools
    8.7 Tool Evaluation

9 Multiple Sequence for Next-Generation Sequences
    9.1 Introduction
    9.2 Overview of Next Generation Sequence Alignment Algorithms
        9.2.1 Alignment Algorithms Based on Seeding and Hash Tables
        9.2.2 Alignment Algorithms Based on Suffix Tries
    9.3 Next-Generation Sequencing Tools

10 Multiple Sequence Alignment for Variations Detection
    10.1 Introduction
    10.2 Genetic Variants
    10.3 Variation Detection Methods Based on MSA
    10.4 Evaluation Methodology
        10.4.1 Performance Metrics
        10.4.2 Simulated Sequence Data
        10.4.3 Real Sequence Data
    10.5 Conclusion and Future Work