Computational Biology
Dan DeBlasio · John Kececioglu
Parameter Advising for Multiple Sequence Alignment
Computational Biology
Volume 26
Editors-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA
Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA
The Computational Biology series publishes the very latest, high-quality research
devoted to specific issues in computer-assisted analysis of biological data. The main
emphasis is on current scientific developments and innovative techniques in
computational biology (bioinformatics), bringing to light methods from mathematics,
statistics and computer science that directly address biological problems
currently under investigation.
The series offers publications that present the state-of-the-art regarding the
problems in question; show computational biology/bioinformatics methods at work;
and finally discuss anticipated demands regarding developments in future
methodology. Titles can range from focused monographs, to undergraduate and
graduate textbooks, and professional text/reference works.
More information about this series at http://www.springer.com/series/5769
Dan DeBlasio
Computational Biology Department
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA

John Kececioglu
Department of Computer Science
The University of Arizona
Tucson, Arizona, USA
ISSN 1568-2684
Computational Biology
ISBN 978-3-319-64917-7    ISBN 978-3-319-64918-4 (eBook)
https://doi.org/10.1007/978-3-319-64918-4
Library of Congress Control Number: 2017955035
© Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To all my friends and family
DD
To Dimitri, Lorene, and Zoe
JK
Preface
While multiple sequence alignment is essential to many biological analyses, its
standard formulations are all NP-complete. Due to both its practical importance and
computational difficulty, a plethora of heuristic multiple sequence aligners are in
use in bioinformatics. Each of these tools has a multitude of parameters which must
be set, and that greatly affect the quality of the output alignment. How to choose
the best parameter setting for a user's input sequences is a basic question, and most
users simply rely on the aligner's default setting, which may produce a low-quality
alignment of their specific sequences.
In this monograph, we present a new general approach called parameter advising
for finding a parameter setting that produces a high-quality alignment for a given
set of input sequences. In this framework, a parameter advisor is a procedure
that automatically chooses a parameter setting for the aligner, and has two main
ingredients: (a) the set of parameter choices considered by the advisor, and (b) an
estimator of alignment accuracy used to rank alignments produced by the aligner. On
coupling a parameter advisor with an aligner, once the advisor is trained in a learning
phase, the user simply inputs sequences to align and receives an output alignment
from the aligner, where the advisor has automatically selected the parameter setting.
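The advising step described above can be illustrated with a minimal sketch. It assumes only what the paragraph states: the advisor runs the aligner under each candidate parameter setting, scores each resulting alignment with an accuracy estimator, and returns the top-ranked one. The names `advise`, `toy_aligner`, and `toy_estimator`, and the settings `"gaps_left"` and `"gaps_right"`, are hypothetical stand-ins chosen for illustration; they are not the aligners, parameter sets, or estimator developed in this book.

```python
from typing import Callable, Iterable, List, Tuple

Alignment = List[str]  # one gapped row per input sequence


def advise(
    sequences: List[str],
    parameter_choices: Iterable[str],
    aligner: Callable[[List[str], str], Alignment],
    estimator: Callable[[Alignment], float],
) -> Tuple[str, Alignment, float]:
    """Return the (setting, alignment, estimate) with the highest estimate."""
    best = None
    for setting in parameter_choices:
        alignment = aligner(sequences, setting)   # align under this setting
        estimate = estimator(alignment)           # estimated accuracy
        if best is None or estimate > best[2]:
            best = (setting, alignment, estimate)
    if best is None:
        raise ValueError("parameter_choices must not be empty")
    return best


def toy_aligner(sequences: List[str], setting: str) -> Alignment:
    """Hypothetical aligner: pads every row with gaps on one side only."""
    width = max(len(s) for s in sequences)
    if setting == "gaps_left":
        return [s.rjust(width, "-") for s in sequences]
    return [s.ljust(width, "-") for s in sequences]


def toy_estimator(alignment: Alignment) -> float:
    """Hypothetical estimator: fraction of gap-free, fully identical columns."""
    columns = list(zip(*alignment))
    conserved = sum(
        1 for col in columns if "-" not in col and len(set(col)) == 1
    )
    return conserved / len(columns)


setting, alignment, estimate = advise(
    ["ACGT", "CGT"], ["gaps_right", "gaps_left"], toy_aligner, toy_estimator
)
print(setting, alignment, estimate)
```

In this sketch the two ingredients named above appear as separate arguments, the set of parameter choices and the estimator, so either can be swapped without changing the advising loop itself.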
The book is organized in two parts: the first lays out the foundations of parameter
advising, and the second provides applications and extensions of advising. The
content examines formulations of parameter advising and their computational
complexity, develops methods for learning good accuracy estimators, presents
approximation algorithms for finding good sets of parameter choices, and assesses
software implementations of advising that perform well on real biological data.
Also explored are applications of parameter advising to adaptive local realignment,
where advising is performed on local regions of the sequences to automatically
adapt to varying mutation rates; and ensemble alignment, where advising is applied
to an ensemble of aligners to effectively yield a new aligner of higher quality
than the individual aligners in the ensemble. Finally, future directions in advising
research are offered.
This work arose from a series of joint research papers by the coauthors, that
initiated and developed the theory and practice of parameter advising, and that
formed the basis of the first author's doctoral dissertation.
Parameter advising is a general technique, with the potential to be of broad utility
beyond sequence alignment. We hope this monograph encourages others to explore
this fruitful area of investigation.
Dan DeBlasio, Pittsburgh, Pennsylvania
John Kececioglu, Tucson, Arizona
October 2017
Acknowledgements
The authors gratefully acknowledge funding from the US National Science Foundation,
through Grant IIS-1217886 to John Kececioglu, and by a PhD fellowship
to Dan DeBlasio from the University of Arizona IGERT Program in Comparative
Genomics through Grant DGE-0654435, which made this research possible. While
a postdoctoral fellow at Carnegie Mellon University, Dan DeBlasio also received
support from Carl Kingsford through Gordon and Betty Moore Foundation Grant
GBMF4554, NSF Grant CCF-1256087, and NIH Grant R01HG007104.
Contents
1 Introduction and Background
  1.1 Multiple Sequence Alignment
  1.2 Parameter Advising
  1.3 Related Approaches
    1.3.1 Accuracy Estimation
    1.3.2 A Priori Advising
    1.3.3 Meta-alignment
    1.3.4 Column Confidence Scoring
    1.3.5 Realignment Methods
  1.4 Background on Protein Structure
  1.5 Overview
Part I Foundations of Parameter Advising
2 Alignment Accuracy Estimation
  2.1 Constructing Estimators from Feature Functions
  2.2 Learning the Estimator from Examples
    2.2.1 Fitting to Accuracy Values
    2.2.2 Fitting to Accuracy Differences
3 The Facet Accuracy Estimator
  3.1 Feature Functions of an Alignment
  3.2 Secondary Structure Blockiness
  3.3 Secondary Structure Agreement
  3.4 Gap Coil Density
  3.5 Gap Extension Density
  3.6 Gap Open Density
  3.7 Gap Compatibility
  3.8 Substitution Compatibility
  3.9 Amino Acid Identity
Description: This book develops a new approach called parameter advising for finding a parameter setting for a sequence aligner that yields a quality alignment of a given set of input sequences. In this framework, a parameter advisor is a procedure that automatically chooses a parameter setting for the input, an