Letter Reconstructing large regions of an ancestral mammalian genome in silico Mathieu Blanchette,1,4,5 Eric D. Green,2 Webb Miller,3 and David Haussler1,5 1HowardHughesMedicalInstitute,UniversityofCalifornia,SantaCruz,California95064,USA;2NationalHumanGenome ResearchInstitute,NationalInstitutesofHealth,Bethesda,Maryland20892,USA;3DepartmentofBiology, PennsylvaniaStateUniversity,UniversityPark,Pennsylvania16802,USA It is believed that most modern mammalian lineages arose from a series of rapid speciation events near the Cretaceous-Tertiary boundary. It is shown that such a phylogeny makes the common ancestral genome sequence an ideal target for reconstruction. Simulations suggest that with methods currently available, we can expect to get 98% of the bases correct in reconstructing megabase-scale euchromatic regions of an eutherian ancestral genome from the genomes of ∼20 optimally chosen modern mammals. Using actual genomic sequences from 19 extant mammals, we reconstruct 1.1 Mb of ancient genome sequence around the CFTR locus. Detailed examination suggests the reconstruction is accurate and that it allows us to identify features in modern species, such as remnants of ancient transposon insertions, that were not identified by direct analysis. Tracing the predicted evolutionary history of the bases in the reconstructed region, estimates are made of the amount of DNA turnover due to insertion, deletion, and substitution in the different placental mammalian lineages since the common eutherian ancestor, showing considerable variation between lineages. In coming years, such reconstructions may help in identifying and understanding the genetic features common to eutherian mammals and may shed light on the evolution of human or primate-specifictraits. [Supplementalmaterialisavailableonlineatwww.genome.organdhttp://genome.ucsc.edu/ancestors.] Followingcompletionofthehumangenomesequence,thereis inferancestralgenomesequences(Austinetal.1997;Marotaand now considerable interest in obtaining a more comprehensive Rollo2002).Thesecondislackofthenecessarybiotechnologyto understanding of its evolution (International Human Genome synthesizelargegenomicregionsfrommanysmallpieces.While SequencingConsortium[IHGSC]2001;InternationalMouseGe- thereisrecentprogressinovercomingthesecondobstacle(Smith nome Sequencing Consortium [IMGSC] 2002; Rat Genome Se- et al. 2003), the problem of loss of information appears to be quencingProjectConsortium[RGSPC]2004).Patternsofevolu- insurmountableforspeciesfrom,say,theJurassicorCretaceous tionaryconservationareusedtoscreenhumanDNAmutations periodsthathaveleftbehindfewmoderndescendants.However, topredictthosethatwillbedeleterioustoproteinfunction(Su- for ancient species with many different modern descendants, nyaev et al. 2001; Ng and Henikoff 2002) and to identify non- thereisstillthepossibilitythatlargeregionsoftheirgenomescan codingsequencesthatareundernegativeselection,andhence, beapproximatelyinferredfromthegenomesofmodernspecies mayperformregulatoryorstructuralfunctions(Hardison2000; usingamodelofmolecularevolution.Onasmallerscale,such Boffelli et al. 2003; Cooper et al. 2003; Margulies et al. 2003; ancestralreconstructionshavebeenperformedforproteinfami- Bejeranoetal.2004).Longperiodsofconservationfollowedby lies including rhodopsin (Chang et al. 2002), ultraviolet vision sudden change may provide clues to the evolution of new hu- geneSWS1(ShiandYokoyama2003),ribonucleases(Jermannet man traits (Goodman et al. 1971; Challem 1997; Enard et al. al. 1995; Zhang and Rosenberg 2002), Tu elongation factors 2002). All of these efforts depend, directly or indirectly, on re- (Gaucheretal.2003),steroidreceptors(Thorntonetal.2003)(for constructingtheevolutionaryhistoryofthebasesinthehuman review, see Chang and Donoghue 2000; Thornton 2004), for genome,andhence,onreconstructingthegenomesofourdis- transposons (Adey et al. 1994; Smit and Riggs 1996; Ivics et al. tantancestors. 1997;Jurka2000),andforsmallgenomeslikeHIV(Hillisetal. Thehopeoflearningaboutlongextinctspeciesbyrecover- 1994),inwhichcasethepredictedancestralsequenceswerecom- ingandcloningtheirDNAhasengagedthepopularaswellasthe paredwiththeknownones.However,studiesoflarge-scalecom- scientific imagination, but the reality of such endeavors falls putationalgenomereconstruction,anundertakingthatmightbe shortofexpectationsontwogrounds.Thefirstislackofinfor- termedcomputational“paleogenomics”(Birnbaumetal.2000), mation;thereisnotenoughintactDNAinthemodernremains havebeenlimitedtohigher-levelgenomepropertiessuchasgene of species that have been extinct for many millions of years to order (Blanchette et al. 1999; El-Mabrouk and Sankoff 1999; Pevzner and Tesler 2003; Bourque et al. 2004) or karyotype (Graphodatskyetal.2002;Yangetal.2003). 4Present address: McGill University, Montreal, Quebec H3A 2B4 Canada. Maximum likelihood algorithms for the reconstruction of 5Correspondingauthors. ancestral amino acids or DNA bases have been developed and [email protected];fax(831)459-4829. used by several groups (Yang et al. 1995; Koshi and Goldstein [email protected];fax(514)398-3387. 1996;Cunninghametal.1998;Schultzetal.1996;Pupkoetal. Article and publication are at http://www.genome.org/cgi/doi/10.1101/ gr.2800104. 2000,2002).Themaximallikelihoodapproachappearstowork 2412 Genome Research 14:2412–2423©2004byColdSpringHarborLaboratoryPress;ISSN1088-9051/04;www.genome.org www.genome.org Ancestral genome reconstruction betterthanparsimonymethods(ZhangandNei1997).Bayesian methodsthattakeintoaccountuncertaintiesinthetree,branch lengths,andmodelparametershavealsobeenexplored(Schultz andChurchill1999;HuelsenbeckandBollback2001),although these involve more computationally expensive Markov Chain Monte Carlo sampling methods. With few exceptions (Hein 1989; Fredslund et al. 2003), algorithms have been limited to pure substitution models, and have not considered reconstruc- tioninthepresenceofinsertionsanddeletions. We argue that a good target species for a genomic recon- struction is one that has generated a large number of indepen- dent, successful descendant lineages through a rapid series of ancestral speciation events. In this case, the problem can be viewed as attempting to reconstruct an original from many in- dependentnoisycopies.Inthelimitofaninstantaneousradia- tion,theaccuracyofthereconstructionapproaches100%expo- nentiallyfastasthenumberofcopiesincreases(seeDiscussion). From the Cretaceous period, a good choice for reconstruction wouldbethegenomeoftheeutherianancestor,asthisspeciesis believed to have spawned the relatively rapid radiation of the differentlineagesofmodernplacentalmammals(seeEiziriketal. 2001fortheradiationmodelusedinthisstudy,andSpringeret al.2003foralternatehypothesesaboutthepaceofthemamma- lianradiation).Thisancientspeciesalsohastheaddedadvantage ofbeingahumanancestor,soitsreconstruction,howeverspecu- lative,mayshedadditionallightonourownevolution,perhaps helping to explain features of the human and other modern Figure 1. Estimated reconstructability of ancestral mammalian se- mammalian genomes. This study uses computational simula- quences.Averagebase-by-baseerrorrateinthereconstructionofeach simulated ancestral sequence. The error rate shown is the sum of the tionstoshowthatlargepartsoftheeuchromaticgenomeofthat percentagesofbasesthataremissing,added,ormismatchedasaresult earlyeutherian,includingmanyofitsnoncodingregions,could oferrorsinthereconstruction,averagedover100simulationsofsetsof beaccuratelyreconstructedifsufficientlymanywell-chosenex- orthologoussequencesoflength∼50kb.Errorratesaregivenfirstforall tantmammaliangenomeswereavailable. regions,andinparenthesesfornonrepetitiveregionsonly.TheBoreoeu- therianancestor,whichistheancestorthatcanbestbereconstructed,is indicatedbythearrow.Branchescompletelylocatedinsidetheboxare Results called“earlybranches”(seetext).Thespeciesnamesattheleavesonly indicate what organisms we simulated; no actual biological sequences were used here. The tree topology and branch lengths are derived di- Simulations rectlyfromEiziriketal.(2001). Toassessthereconstructabilityofancestralmammaliangenomic sequences, we performed a series of computational simulations sequence alignment tool called TBA (Blanchette et al. 2004) of the neutral evolution of a hypothetical ∼50 Kb ancestral ge- based on the well-established pair-wise alignment program nomicregionintoorthologousregionsin20modernmammals BLASTZ (Schwartz et al. 2003). Given TBA’s multiple sequence (Fig. 1). Simulation parameters for substitution, deletion, and alignmentofthesoft-repeat-maskedextantsequencesandaphy- insertionwerebasedontheanalysisof∼1.8Mbofdatafromnine logenetic tree relating these sequences, whose topology is as- mammalsintheregionsorthologoustothehumanCFTRlocus sumedtobeknown,butwhosebranchlengthsareinferredusing (Margulies et al. 2003; Thomas et al. 2003), as well as on a ge- theHKYmodel(Hasegawaetal.1985)andthePHYMLprogram nome-widecomparisonofthehumanandmousegenomes(Kent (Guindon and Gascuel 2003), we predicted which positions of etal.2003),andonarecentlyderivedphylogenetictree(Eizirik the alignment correspond to ancestral bases and which corre- et al. 2001). The simulations included insertion of lineage- spondtonucleotidesinsertedaftertheancestor.Here,weuseda specific transposons and increased rates of substitution in CpG greedy algorithm that seeks to explain the observed alignment dinucleotides.Foreachpairoforthologoussequencesgenerated, using a set of insertions and deletions of maximum likelihood weverifiedthattheaveragenumberofsubstitutions,insertions, (see Methods). The identity of the nucleotide at each ancestral anddeletionsareclosetothoseobservedintheneutrallyevolv- position was then predicted using a context-dependent maxi- ingregionsofthegreaterCFTRregion.Wealsoverifiedthatthe mum-likelihoodestimation.Theonlydataavailabletothealign- distributionofthesizesofinsertionsanddeletions,aswellasthe mentandreconstructionprocedurewerethesequencesofextant frequencyandagedistributionofeachtypeofrepetitiveelement species.Noinformationaboutthesimulationprocess(neitherits are close to those previously reported (IHGSC 2001; IMGSC parametersnoritsrealization)wasusedtoinformorsetthepa- 2002).Furtherdetailsofthesimulationprocessanditsvalidation rameters of the reconstruction process apart from the assumed aregiveninMethodsandinBlanchetteetal.(2004). common knowledge of the phylogenetic tree topology, the pa- A crucial first step toward reconstructing ancestral se- rametersoftheHKYsubstitutionmodel,andtheknownclasses quencesistobuildanaccuratemultiplealignmentoftheextant oftransposons. sequences,thusestablishingorthologyrelationshipsamongthe We compared the actual ancestral sequence used in our nucleotides of each sequence. To this end, we used a multiple- simulations with the predicted ancestral sequence by aligning Genome Research 2413 www.genome.org Blanchette et al. themandcountingthenumberofmissingbases(thosepresent theriskofmisalignmentincreases,whiletheunavoidablelossof in the actual ancestor, but not in the reconstruction), added information in the early branches persists (dashed box, Fig. 1; bases(presentinthereconstruction,butnotintheactualances- alsoseeDiscussion).However,furtherimprovementstothemul- tor), and mismatch errors (positions in the reconstruction as- tiplealignmentmethodologymightchangethis. signedtheincorrectnucleotide).Thesumoftheratesofallthree Theaccuracyofthereconstructiondependscruciallyonthe typesoferrorswascalculatedseparatelyateachancestralnodein length of the early branches. Additional simulations (Supple- thephylogenetictree(Fig.1).Theresultsshowedthatunderthis mentalFig.S1)revealedthatifthemajorplacentallineageshad phylogenetic tree with a relatively rapid placental mammalian divergedinstantaneously(earlybranchesoflengthzero,seeFig. radiation, the neutral nonrepetitive regions of the Boreoeuthe- 1),wewouldbeabletoreconstructthesimulatedBoreoeutherian rianancestralgenomethathaveevolvedlikethoseinoursimu- ancestralsequence,includingrepetitiveregions,with<1%error. lationscanbereconstructedwithabout99%base-by-baseaccu- Incontrast,iftheearlybranchlengthsinferredbyEiziriketal. racy from the genomes of 20 present-day mammals. Repetitive (2001)turnedouttounderestimatetheactuallengthsbyafactor regions are not reconstructed as accurately, because they are oftwo,theerrorratewouldjumpto3%,andto6%iftheywere more often involved in misalignments, which can result in in- underestimatedbyafactoroffour. correctpredictions.Nonetheless,evencountingerrorsinrepeti- Theaccuracyofthereconstructionislessdependentonthe tiveregions,thetotalaccuracyis>98%.Ifareconstructedbaseis overall branch length, within reasonable limits. If the neutral chosen at random, chances are it lies at least within a 343-bp substitutionandindelratesusedinthemodelareincreasedby error-freesequence,showingthatreconstructionerrorsareoften 25%,whichisconsiderablymorethanthetypical10%regional clusteredtogether,leavinglargeerror-freeregions.Thesimulated neutralratefluctuationsobservedindifferentgenomicregionsin and reconstructed sequences, as well as statistics validating the human–mousegenomecomparisons(Hardisonetal.2003),the simulation process, are available at http://genome.ucsc.edu/ accuracyofreconstructiononlydecreasesto97.5%.Ontheother ancestors.Thesimulationssuggestthateveninthenonrepetitive hand, if the rates are uniformly half of the neutral rate, which regions,muchofthedifficultyofthereconstructionproblemlies correspondsroughlytotheratesobservedforcodingregions(Eiz- inthecomputationofthemultiplealignment,asareconstruc- iriketal.2001),thereconstructedbasesare>99.8%correct,with tion based on the correct multiple alignment derived from the mosterrorsduetoincorrectalignmentinthevicinityofrepeti- simulationitself(andthusunavailableforactualsequences)had tiveelements.Ifthetrueevolutionaryratesvaryfromsitetosite lessthanhalfthenumberofreconstructionerrors. betweentheseextremes,wewouldthusexpecttheoverallaver- Lookingatthereconstructabilityinotherancestralspecies agereconstructionaccuracyofaregiontobe>97.5%,withsig- inthetree,astrong“localtreetopologyeffect”isseen,whereby nificantlyhigherlocalaccuracyforthemoreevolutionarilycon- ancestral sequences at the center of rapid radiations are much strainedsubregions. morereconstructablethanthosewithlongerincidentbranches. Animportantassumptioninourreconstructionprocedure Thiseffectissostrongthatsequencesofearlyeutherianslivingin is that the topology of the phylogenetic tree is known in ad- times of rapid radiation can be reconstructed more accurately vance. Since the early branches of the eutherian tree are very thanthoseofmostofthemorerecentancestors. short,thereremainssomeuncertaintyabouttheprecisebranch- Examining reconstructions made using smaller subsets of ingorderofthemainmammalianphyla.Moreover,insituations thissetof20species,itwasfoundthat,includingrepetitivere- ofrapidspeciation,differentregionsofthegenomemayactually gions,anaccuracyofabout97%canbeachievedusingonly10 havedifferentphylogenetictreesbecauseofincompletelineage specieschosentosamplemostmajormammalianlineages(Fig. sortingduetodifferentrecombinationhistories(Shedlocketal. 2).Samplingonlyfiveofthemostslowlyevolvinglineagesyields 2000). To assess the consequences of using an incorrect tree as an accuracy of about 94%. Little is gained with our current re- inputtothereconstructionprocedure,werepeatedoursimula- constructionproceduresbyaddingmorethan10species,because tionusingtheoriginaltreetogeneratethesequences,butusing theincorrecttree(Xenartha,Laurasiatheria,Primates),(Rodents, Afrotheria)forthereconstructionoftheancestor.Wefoundthat thepseudo-“Xenartha-Laurasiatheria-Primates”ancestorinferred wasanapproximationofthetrueBoreoeutherianancestorthat was still 98.4% accurate. The robustness of the reconstruction withrespecttochangesinearlybranchingordermaybedueto therelativelysmallnumberofmutationaleventsontheseshort branches of the tree. However, similar robustness of ancestral reconstructiontominortree-topologychangeshasalsobeenob- servedinsimulationsofaminoacidevolutionformoregeneral kindsoftrees(ZhangandNei1997). Finally, in addition to estimates of the overall accuracy of the reconstruction, the simulations also suggest how we may estimate the confidence in the reconstruction of the ancestral base at a given site based on properties of the local alignment Figure2. EstimatedreconstructabilityoftheBoreoeutherianancestor. Fraction of the simulated Boreoeutherian ancestral sequence recon- containingthatsite.Inasituationwherethephylogenetictree structedincorrectlyasafunctionofthenumberofextantspeciesusedfor andsequencealignmentareknowntobecorrectandthereareno the reconstruction. For each number of species used, results are given insertionsordeletions,theposteriorprobabilitiesofeachofthe countingallbases(leftcolumns)andonlynonrepetitivebases(rightcol- four possible ancestral nucleotides can be explicitly computed umns).Speciesareaddedinthefollowingorder:human,cat,chipmunk, sloth,manatee,rousettebat,mole,pig,beaver,treeshrew,horse,pango- using standard substitution models (Yang et al. 1995), which lin,mouse,armadillo,aardvark,okapi,dog,mole-rat,rabbit,andlemur. readilyprovidestheprobabilityofreconstructionerror.However, 2414 Genome Research www.genome.org Ancestral genome reconstruction inthepresenceofindelsorwithanuncer- tainalignment,theanalogouserrorcalcula- tionbecomesproblematic,evenforafixed tree (Hein et al. 2000; Huelsenbeck 2001; Lunteretal.2003). Here, we take a heuristic approach to estimating the confidence of the recon- structedbaseatagivensite.Theprobability that an individual reconstructed base is a mismatcherrororanaddedbaseisempiri- Figure 3. (A) Estimates of the expected number of substitutions per site between a repeat consensus C, it human descendent H, and the reconstructed ancestor A*, based on a Kimura callyestimatedbasedonlocalpropertiesof 2-parametermodelandaveragedoverallhumanancestralrepeatsoftheregionconsidered.The thealignmentatandaroundthatposition trueancestorAcannotbeobserved,butadistanceof0.026substitutionspersitebetweenitand (seeMethods).Testingthisapproachinour A*isestimatedfromthethreeotherdistances.(B)Starphylogenywithnindependentdescendants. simulations, we find that about 98.5% of (C)Atreewithbifurcatingroot.IrrevocableinformationlossoccursbetweenRanditsdescendants AandB. the nucleotides of our simulated Boreoeu- therian ancestral sequence can be recon- structed with at least 90% confidence that they are not mis- BoreoeutheriansequenceforthistransposonrelicandletA*be matchesoraddedbases,andabout95%,withatleast99%,con- thereconstructedsequence.SinceAstandsontheevolutionary fidence.Anadditional1%ofthebasesoftheancestralsequence pathbetweenCandH,wewouldexpecttohaved(C,H)≈d(C, aremissingfromthereconstructedsequence,butthelocationsof A)+d(A,H),whered(C,A)andd(A,H)aretheexpectedsubsti- theseomissionscannotbeaccuratelypredicted. tutionspersitebetweenCandA,andbetweenAandH,respec- tively.ReconstructionerrorsinA*wouldbeexpectedtotakethis sequenceawayfromthetrueevolutionarypath,resultingind(C, ReconstructionofanancestralregionintheCFTRlocus H)<d(C,A*)+d(A*,H).Figure3Ashowstheaveragedistances Following our simulations, we applied the reconstruction observed for ancestral repeats of the CFTR region. It indicates method to actual high-quality sequence data from a 1.87-Mb thatd(C,A*)+d(A*,H)exceedsd(C,H)by0.04substitutionsper region containing the human CFTR locus, using 18 additional site,whichcanbeverifiedtocorrespondtoamismatcherrorrate orthologousmammaliangenomicregions(Table2,below)gen- in the reconstructed sequence A* of about 2.6%. This roughly eratedbytheNISCComparativeSequencingProgram(Thomaset confirmsourestimateof96%overallaccuracy,sincemismatch al.2003)(seewww.nisc.nih.gov).Wereconstructedanapproxi- errorsareexpectedtoaccountforabouthalfofthebase-by-base mation to all ancestral sequences of the CFTR locus for which errors made by our method in this case and errors are concen- orthologoussequencewasavailableinatleast16ofthe19spe- tratedinrepetitiveregions. cieslistedinTable2,below.Inhuman,thiscorrespondstosev- Figure4illustratesthereconstructioninanoncodingregion eraldiscontinuoussegmentscoveringatotalof1.274Mb.Simu- oftheCFTRlocusthatexhibitsatypicallevelofsequencecon- lationsonsyntheticdatalikethosedescribedaboveindicatethat servation.Thisregionislocatedina32-KbintronoftheCAV1 forthetopologyandsetofbranchlengthsforthese19species, gene,about13Kbfromthe5(cid:1)exon.Thebasesinthisregionare theancestralsequencethatcanbemostaccuratelyreconstructed relicsleftoverfromtheinsertionofaMER20transposonsome- basedonthesequencesavailableistheBoreoeutherianancestor, timepriortothemammalianradiation,andarethusunlikelyto andthatneutrallyevolvingregionsofthisancestralgenomecan beunderselectivepressure. be reconstructed with an accuracy of about 96%. Notice that Notice that despite the fact that the alignment of certain although we are using sequences from 19 mammals, the pre- species(inparticular,mouse,rat,andhedgehog)appearssome- dictedaccuraciesobtainedarelowerthanthosereportedinFig- what unreliable, the inference of the presence or absence of a ure 2, because not as many major lineages or outgroups are Boreoeutherian ancestral base at a given position is quite sampled.Onasite-specificbasis,simulationssuggestthat>90% straightforwardgiventhealignment,andtoalesserextent,sois ofthebasesofthepredictedancestorcanbeassignedconfidence thepredictionoftheactualancestralbaseitself.TheMER20con- values>99%.Thereconstructedancestorandsite-specificconfi- sensusisshownforcomparison.Mostpositionswheretherecon- dence estimates are available at http://genome.ucsc.edu/ structedBoreoeutherianancestralbasedisagreeswiththeMER20 ancestors. consensusarelikelyduetosubstitutionsinthisMER20relicthat Weconfirmedthatthe96%accuracyestimateisreasonable predated the Boreoeutherian ancestor, since the support of the by analysis of transposable elements whose insertion predated reconstructed base is very strong in the extant species. If the the Boreoeutherian ancestor (“ancestral repeats”) (Fig. 3A). For MER20consensussequenceisusedasanoutgroupintherecon- each family of ancestral repeats, a consensus sequence is avail- structionprocedure,onlytwobases(indicatedbyalongerarrow) able,obtainedfromthemanycopiesoftheseelementsscattered are reconstructed differently, indicating that the reconstructed inthegenome.Theconsensussequenceisthoughttorepresent ancestral sequence is very stable and most of it is likely to be thetransposonsequenceatthetimeofitsinsertionintothisand correct. otherregionsoftheancestralgenome(Jurka2000).Wealigned Because the reconstructed Boreoeutherian ancestral se- theextantsequenceHofeachtransposonrelicidentifiedinthe quenceisevolutionarilyclosertotheoldermammalianancestral humanCFTRregionbyRepeatMasker(SmitandGreen1999)to genomes that existed at the time of the insertions of ancestral the consensus sequence C for its ancestral repeat family, and transposons,itissuperiortothehumangenomesequenceforthe estimatedtheexpectednumberofsubstitutionspersitebetween recognitionoftheseelements.Inessence,itactsasanobserva- consensusandhumanrelic,d(C,H),usingaKimura2-parameter tory that allows us to see even farther back in time. When Re- model (Kimura 1980). Let A be the true (unknown) ancestral peatMasker is run on the inferred Boreoeutherian ancestor, an- Genome Research 2415 www.genome.org Blanchette et al. eee erethalbashatth positionswhdtheancestrwn)verifiest eeo dicathangotsh ncn owsihavedata on.Arrwould(DNA retrotransposussequencenonrepetitive MER20onsensnking derivedfromaoftheMER20cnmentoftheflaal.(2001). oussequencesknowledgehe34).ThealigdEiziriketfrom nactualorthologpositionswheret9,899(NCBIbuilederiveddirectly sequencebasedooeutherianLongerarrowsindicatetheischr7:115,739,755-115,73Thetreeandbranchesarus. onstructionofanancestralBoreersfromtheMER20consensus.thehumansequencedisplayedntspeciesare,infact,orthologo mpleofrecncestordiffpositionofthediffere Exaure4.onstructedadiction.Theuencesfrom Figrecpreseq 2416 Genome Research www.genome.org Ancestral genome reconstruction cientrepeatfamiliessuchasL2LINESandMIRsaredetectedin onthemoreoptimallychosensetof20species,butisconsistent significantly larger fraction than when RepeatMasker is run on withwhatwewouldexpectfromthesuboptimalsetof19species thehumansequence,becausetheyaremuchlessdecayed[Table usedinthisreconstruction.Interestingly,attwoofthepositions 1,column(b)].Thisimprovedabilitytodetectveryoldrepeats wherethereconstructedancestralCFTRproteindiffersfromhu- resultsinanincreaseof2.7%intheestimatedtotalfractionof manCFTR,thereconstructedancestralaminoacidisassociated thehumanCFTRregionthatderivesfromatransposoninsertion withcysticfibrosiswhenitoccursasahumanmutation(http:// (from37.7%to40.4%). www.genet.sickkids.on.ca/cftr/):Phe→Leuataminoacidposi- More importantly, reconstructed ancestral genome se- tion 87 (Bienvenu et al. 1994) and Met → Ile at position 1028 quencesallowustomakeinferencesaboutthespecificevolution- (Onayetal1998).Thesedisease-causinghumanvariantsarethe ary path of functional elements such as protein-coding regions wild-type amino acids in several other species, as has been ob- (Jermannetal.1995;Sunyaevetal.2001;Changetal.2002;Ng served for other human disease proteins as well (Schaner et al. and Henikoff 2002; Zhang and Rosenberg 2002; Gaucher et al. 2001).Thatthedisease-causingaminoacidvariantwaswildtype 2003;ShiandYokoyama2003;Thorntonetal.2003;Thornton inoureutherianancestorisverylikelyintheformercase,butthe 2004).About5995ofthe6026codonsfromtheknownhuman reconstruction is less clear in the latter case, because so many genesintheregionusedforthisreconstructionarealsoclearly differentsubstitutionsoccuredindifferentlineages. codingintheotherextantspecies.Allofthese5995codonswere SensiblereconstructionofhypothesizedstructuralRNAswas reconstructed without introducing an in-frame stop codon or also obtained. Two regions of the CFTR locus in introns of the frame shift, despite the fact that the reconstruction algorithm ST7 gene that appear to form stable RNA secondary structures usedneitherpriorknowledgeaboutexonpositionsnormodelof (Marguliesetal.2003)arepredictedtofoldinanearlyidentical codonevolution.Thisindirectlysuggeststhattheaccuracyofthe fashioninthereconstructedancestor. reconstruction is quite high for elements of the genome that The reconstructed ancestral sequence can also be used to havebeenunderpurifyingselection. gatherstatisticsontheratesofgainandlossofDNAindifferent The accuracy of the inferred ancestral CFTR protein se- eutherian lineages, and the shifts in substitution spectra. After quence was verified by comparing it to outgroups like chicken reconstruction of the Boreoeutherian ancestral sequence from and the marsupial Didelphis virginiana (opposum). Of the 1481 the 19 present-day genomic sequences, we compared it with aminoacidsoftheancestralCFTRprotein,1276aremostlikely those sequences to derive these statistics (Table 2). The recon- correct by virtue of a quasi-unanimity within eutherian mam- structed ancestral sequence had a size (1124 Kb) about 10% mals. Of the remaining 205 amino acids where the reconstruc- smaller than those of extant old-world monkeys (1260 Kb on tion is not completely obvious, 137 amino acids are strongly average, with most of the difference due to Alu insertions) and confirmed by a match in either chicken or opposum, and 29 alsosmallerthanthoseofmostotherspecies,withtheexception otherscouldonlybeweaklyconfirmedbyamatchineitherfrog ofthetwolemurs.Thenumberofinsertedanddeletedbasesin orFugu.Ontheotherhand,15aminoacidscouldbeincorrectly primatesislowcomparedwithmanyothermammals(Thomaset reconstructed as indicated by the failure of the two tests above al.2003),whilethoseofrodents(butnotrabbit)arehigh.Sub- andbyamatchbetweenoneoftheeutherianaminoacidsand stitutionratesfollowasimilarpattern(Cooperetal.2003).Over- either Didelphis or chicken. Overall, this gives an estimated ac- all,theancestralsequenceismostcloselyrelatedtothatofpri- curacyof∼99%attheaminoacidlevelforthereconstructionof mates,andperhaps,surprisingly,tothatofhorse. theancestralCFTRprotein.Thiscorrespondstoan∼99.5%accu- Itispredictedthatthehumansequencediffersfromthatof racy at the base level, because roughly 2/3 of random base theBoreoeutherianancestorin30.3%ofitsbases,21.7%result- changesarenonsynonymous,andthereisa3:1ratioofbasesto ing from insertions, and thus not present in the ancestor, and aminoacids.Thisisnotasgoodasthe99.8%accuracyexpected, 8.6% resulting from substitutions. In addition, the human se- basedonsimulationsofregionsevolvingathalftheneutralrate quencehaslostabout11.3%oftheancestralbases.Mostdiffer- encesbetweenthehumanandancestralsequencesderivefrom primate lineage insertions of transposons, in agreement with Table1. Detectedrepetitivecontentofthereconstructed Boreoeutherianancestorandofhuman otherrecentstudies(IMGSC2002).Incontrast,rodentsdifferin about 55%–60% of their bases and have lost about 39% of the PreBoreoeutherianancestralrepeats ancestralbases,whilehedgehogdiffersfromtheancestorin58% of its bases and has lost 50%. Though this high mutation rate Detectableinhuman Detectableinancestor andancestor(kb)a only(kb)b makesthesespeciesveryusefulfordetectingfunctionalregions throughcomparativegenomics(Marguliesetal.2003),itmakes Alu 0 0 themoflessuseforreconstructingancestralsequences.Because LINEL1 83.5 9.1 ofthedifficultyofaligningsuchrapidlyevolvingsequences,the LINEL2 61.5 15.3 accuracy of these estimates for rodents and hedgehog remains LINEL3 2.4 0.7 DNA 23.7 2.3 uncertain. MIR 40.3 5.6 Thesetof19speciesweusedisnotauniformsamplingof LTR 38.3 1.8 theeutherianphylogenetictree,butratherisbiasedtowardclose Others 5.0 0.4 human relatives, containing seven old-world monkeys. To en- Total 254.7 35.2 surethatthenumberofcloselyrelatedspeciesdoesnotunduly affectthereconstructedancestorbybiasingittowardthehuman aNumberofhumankilobaseslabeledbyRepeatMaskerasbelongingto sequence, we repeated the reconstruction procedure, removing thegivenfamilyandpresentintheBoreoeutherianancestor. all primates but human and lemur. The new reconstructed an- bNumberofhumankilobasesthatarenotdetectedasrepetitiveinhu- man,butthataredetectedassuchinthecorrespondingancestralregion. cestor was not significantly farther from the human sequence, AllnumberswerecalculatedusingthesensitivemodeofRepeatMasker. with0.113expectedsubstitutionspersite(comparedwith0.111 Genome Research 2417 www.genome.org Blanchette et al. Table2. Comparisonofmodernsequencestopredictedancestor Deletions Insertions Substitutions %of %ofextantspecies’bases %ofextantspecies’bases Sizeofregion Nonrepetitive ancestor acquired(nonrepetitive changed(expected# Species (kb)(a) %GC-content(b) lost(c) only)(d) substitutionspersite)(e) ReconstructedBoreoeutherian ancestor 1124 37.0 N/A N/A N/A Human 1274 37.1 11.3 21.7(2.0) 8.6(11.1) Chimpanzee 1278 37.1 11.5 21.8(1.8) 8.7(11.1) Gorilla 1247 37.1 12.9 21.6(1.9) 8.7(11.1) Baboon 1260 37.3 12.6 21.2(2.1) 9.1(10.7) Orangutan 1268 37.1 11.7 21.2(1.8) 8.6(11.2) Vervet 1229 37.2 13.5 20.7(2.0) 9.1(11.8) Macaque 1255 36.4 12.2 21.0(2.0) 9.1(11.7) Lemur 1071 37.7 19.1 11.6(2.8) 9.0(10.9) Mouse-lemur 1085 37.5 18.0 14.5(3.8) 9.3(11.6) Mouse 1110 39.2 39.1 38.3(12.0) 17.5(34.3) Rat 1239 39.5 38.8 44.4(10.1) 15.9(35.1) Rabbit 1348 42.7 29.4 37.9(28.9) 10.5(21.3) Cat 1206 37.2 24.5 29.6(6.9) 11.3(16.5) Dog 1122 39.4 26.4 22.5(6.4) 13.5(19.2) Cow 1324 37.1 30.9 41.5(7.7) 11.1(20.9) Pig 1158 36.8 33.7 29.6(7.5) 10.9(19.7) Horse 1102 38.5 20.2 17.5(8.0) 12.1(13.3) Hedgehog 1379 39.7 50.0 48.9(38.6) 8.9(28.5) Armadillo 1339 39.4 28.9 34.2(18.1) 9.9(20.2) Listedaresomepropertiesofsequencesoftheextantspeciesinthegreater-CFTRlocusandthepredictedchangestheyincurredduringevolutionfrom theBoreoeutherianancestralsequence.(a)Lengthofsequence.(b)FractionofnonrepetitivebasesthatareGorC.(c)Deletions:percentageofthe ancestralsequencelostineachspecies.(d)Insertions:percentageofextantspecies’sequencethatwasinsertedsincethereconstructedancestor(in parentheses,percentageofextantspecies’sequencethatresultedfrominsertionsofnonrepetitivesequences,usingRepeatMaskertoidentifyrepetitive sequences.)ThehighfractionofnonrepetitiveinsertedbasesinrabbitandhedgehogismostlikelyduetolackofcompleteRepeatMaskerlibrariesfor thetransposonsspecifictothesespecies.(e)Substitutions:percentageofextantspecies’basesthatwerederivedfromanancestralbasebutdifferfrom thatbase(thisisdifferentfromthestandardpercentageidentitymeasure,whereonlyalignedbasesareconsidered).Inparentheses,theexpected numberofsubstitutionspersiteunderaKimura2-parametermodel(Kimura1980)isgiven,hereusingonlythealignedbases. previously),10.8%deletions(comparedwith11.3%previously), outliershedgehog(28timeshigher)andpre-mouse–rat-splitro- and23.4%insertions(comparedwith21.7%previously). dents(26timeshigher). The availability of predicted ancestral sequences at every internalnodeofthetreeoffersauniqueperspectiveonthede- letionandinsertionprocessesatworkalongeachbranchofthe Discussion tree. Focusing on a 280-kb region where sequences from all 19 One of the nonintuitive results of this study is the observation mammalswereavailable,thenumberofmicrodeletionsandmi- thatmoreancientancestralgenomescanoftenbereconstructed croinsertions(oflengthatmost10bp)alongeachbranchofthe more accurately than those of their more recent descendants. treewasestimated(Fig.5).Wedidnotattempttoestimatethe Whyexactlyisthisso?Forsimplicity,considerthecaseofrecon- indelratesalongthefourdeepestbranchesofthetreebecause(1) structingasinglebinaryancestralcharacterstateintherootspe- for the two deepest branches of the tree, deletions cannot be cies (e.g., purine vs. pyrimidine at a given site) under a simple distinguishedfrominsertions,and(2)forthetwobranchesinci- modelinwhichthepriorprobabilitydistributionontheances- dentupontheBoreoeutherianancestor,deletionsandinsertions tralcharacterisuniform,substitutionratesareknown,symmet- are crucially determined by the presence or absence of aligned ric,homogeneous,andnottoohigh,andthetotalbranchlength basesinarmadillo,whichisoftenunreliablyaligned.Amongthe in the phylogenetic tree from the root ancestor to each of the branches where indels can be accurately counted, the rate of modernspeciesisthesame(i.e.,assumeamolecularclock).Here, deletionsisconsistentlytwotothreetimeshigherthantherate eachofnmodernspecieshasastatethatdiffersfromtheances- ofinsertions,withthelowestdeletion/insertionratiosfoundin tralonewiththesameprobabilityp<1/2.Ifthetreeexhibitsa thedogandtheprosimianlineages,andthehighestonesfound star topology (Fig. 3B), in which each of the modern species inthepre-mouse–rat-splitrodents,horse,andcowlineages.De- derives directly from the ancestor on an independent branch, letionandinsertionratesaredefinitelynotfollowingamolecular clock,withratesinprimates∼2.5timeslowerthanthoseinro- thenitisclearthatthemaximumlikelihoodandBayesianmaxi- mum a posteriori reconstructions of the ancestral character dents and 1.3–1.5 times lower than those in artiodactyls and agree,andthereconstructedstateistheonethatismostoften carnivores.Theresultsforhumanversusrodentsareinrelatively observedinthenmodernspecies.Theprobabilityofanerrorin closeagreementwiththoseobtainedfromastudyofthewhole reconstructionis: human, mouse, and rat genomes (RGSPC 2004). Deletion and itnhseeertxipoenctreadtensuarmebcelorsoeflysucbosrtrietluattieodnwspitehrssuitbesbtiettuwtieoennr1a5teas,nwdi2th0 (cid:1)n (cid:1)nk(cid:2)pk(cid:3)1−p(cid:4)n−k timeshigherthanthedeletionrate(SupplementalFig.S3),with k=(cid:1)n(cid:2)2(cid:2) 2418 Genome Research www.genome.org Ancestral genome reconstruction thesametimetothecommonancestor),becausethestartopol- ogy maximizes the mutual information between the residues at the leaves and at the root (Schultz et al. 1996; Schultz and Churchill1999).Thishasbeenrigorouslyprovenforasymme- tric substitution model in the case of binary characters (Evans et al. 2000, Theorem 6.1). However, there are counterexamples with many-valued characters, e.g., amino acids, where for suf- ficiently long branches, the star topology does not provide the best ancestral reconstruction, i.e., the highest mutual in- formation(B.LucenaandD.Haussler,inprep.).Thus,thepre- cise relationship between tree topology and reconstructability of the ancestral state appears to be rather subtle in the general case. While suggestive that reconstruction of a reasonable ap- proximation to an eutherian ancestral euchromatic genome maybewithinourreach,oursimulationresultshaveanumber ofimportantlimitationsasfollows:(1)Theratesofsubstitutions, deletions,andsmallinsertionsareassumedtobeconstantacross sequence position and homogeneous across branches, with branchlengthsproportionaltothoseinaparticulartree(Eizirik etal.2001),scaledtofitratesestimatedfromaparticularregion (theCFTRregion)(Thomasetal.2003).Ifthesubstitutionrates weregrosslyunderestimated,ortherewereverystrongclustering of mutations or “hotspots,” i.e., regions whose mutation rate was, say, double the average nonfunctional parts of the CFTR Figure5. Frequencyofmicrodeletions(1–10bp)(left)andmicroinser- locus, there would be more genome positions where key in- tions (right) during eutherian evolution. Indel rates for the branches formation was irrevocably lost in the early branches, and the shownwithdashedlinescannotbeaccuratelyestimated.Estimatesare accuracy of the reconstruction would be reduced. (2) Different basedonasetofregionstotalingabout280kb,forwhichsequencedata isavailableforall19mammals. modes of selection are not modeled, including specific types of purifying selection in codons and other functional regions, andpositiveselectionfornewfunctions.Theformerislikelyto whichisatmost[4p(1(cid:2)p)]n/2(Hoeffding1963;LeCam[Lemma help reconstruction, but the latter may inhibit the ability to 5p.479]1986).Thiserrorapproacheszeroexponentiallyfastasn accurately reconstruct the ancestor in certain critical sites. (3) increases.Whennistoosmall,theancestorisprobablynotre- Somenucleotide-levelmutationalprocesseslikeDNApolymerase constructable(Mossel2003). slippage effects (Nishizawa and Nishizawa 2002) or gene con- In contrast, a non-star topology (Fig. 3C) such as a binary version are not included in the simulation. These may change tree that has the same total root-to-leaf branch length and the patterns of molecular evolution in some areas and reduce our samenumbernofmodernspeciesattheleaveshastwononzero ability to infer ancestral states. Nonallelic gene conversion in lengthbranchesfromtherootancestorRleadingtointermediate particular could, in principle, make it difficult to apply the ancestorsAandB,andinformationisirrevocablylostalongthese reconstruction method we use to find ancestral versions of re- two branches. No matter how large the number n of modern petitive regions in some cases. However, we saw no evidence descendantspeciesderivedfromAandB,onecandonobetterat that this is a serious problem in our analysis of the alignment reconstructingthestateatRthanifoneknewforcertainthestate of ancestral repeats, such as the MER20 shown in Figure 4. (4) initsimmediatedescendantsAandB.Evenwiththisknowledge, Large-scale mutational processes like tandem and segmental theaccuracyofreconstructionofRfromAandBwillbestrictly duplication,inversion,andtranslocationarenotincludedinthe <100% for all reasonable models and nonzero branch lengths. simulations. The alignment of the one multimegabase mam- Thereconstructiongetspoorerthelongerthebranchlengthsare maliangenomeregionwherewehavedatafrommanyspecies, toAandB.ThisextendstothecasewheretheancestorRbeing the CFTR region, shows a dearth of such changes. However, it reconstructedhasaboundednumberofindependentimmediate is estimated that perhaps 10% of the euchromatic human ge- descendantsandtothecasewheredescendantsofanearlieran- nome has been subject to recent duplications (Samonte and cestorofR(outgroups)arealsoavailable.Thelongbranchescon- Eichler 2002) and/or an excess of rearrangements (Kent et al. necting them to the rest of the tree are why some more recent 2003;PevznerandTesler2003),suggestingthatatleastasimilar ancestralsequencesinthetreeofFigure1arelessreconstructable proportionoftheancestraleuchromaticgenomewouldbediffi- thantheBoreoeutherianancestor,whichactsalmostliketheroot cult to reconstruct without additional data and better tech- ofastartopology. niques. Theaboveanalysisshowsthatthestartreeisalwaysthebest Despite these shortcomings, our validation of the recon- topologyforreconstructioninthelimitasthenumbernofob- struction by both simulation and ancestral repeat and codon served species becomes large, while the time to the common analysisonactualdatasuggeststhatforregionslikeCFTR,which ancestor remains fixed. A stronger claim is that for every n arelikelytobetypical,theaboveissuesarenotsevereenoughto and every time to the common ancestor, the star tree with n preventareasonablyaccuratereconstruction. leaves is always more favorable for ancestral reconstruction Moresignificanttechnicalchallengesremainifwewishto than any branching tree that has internal “shared” nodes (but conduct in vivo functional tests of reconstructed ancestral ge- Genome Research 2419 www.genome.org Blanchette et al. nomicregions,eitherincelllinesorinmousemodels.Multikilo- letionsareinitiatedatarateabout0.056timesthesubstitution base sequences of transgenic DNA can be inserted into mouse rate, their length is chosen according to a previously reported embryonic stem cells via homologous recombination (“knock- empiricaldistribution(Kentetal.2003)thatrangesbetweenone in”) methods (Prosser and Rastan 2003; Robertson et al. 2003) and5000nucleotides,andtheirstartingpointisuniformlydis- andBACtransgenics(YangandSeed2003).“Humanized”mice, tributed. Insertions occur randomly according to a mixture which have specific individual genes replaced by their human model. Small insertions (of size between 1 and 20 nt) occur at versions,havebeenproducedbythesemethods.Multimegabase half the rate of deletions, their size distribution is empirically determined (Kent et al. 2003) and their content is a random transgenicsequenceshavebeenintroducedinmammalianarti- sequence where each nucleotide is chosen independently from ficialchromosomes,e.g.,forthehumanCFTRlocus(Auricheet the background distribution. We also simulate the insertion of al. 2002). However, these methods of introducing foreign DNA retrotransposons.Forthis,weusedalibraryof15differenttypes areexpensiveevenwhenusingavailablegenomicsequences,and of transposable elements chosen to cover the large majority of new methods for synthesizing large segments of DNA de novo repetitive elements observed in well-studied mammals (Jurka wouldbeneededtoapplythemtoancestralgenomicreconstruc- 2000).Eachinsertionwasaccomplishedbyrandomlyselectinga tion, e.g., to produce what might be called “retrovolved” mice partoftheconsensussequenceforagiventypeofelementand thatharbortheancestralversionsofspecificgeneloci.Further- insertingitatarandomlocationinthesequence,i.e.,wedidnot more, multiple loci would have to be changed to explore co- model preferences for particular insertion locations (IHGSC evolvingsetsofgenes.However,iftheseobstaclescanbeover- 2001).Therateofinsertionofeachrepeatvariesfrombranchto come, it would be quite interesting to attempt in vivo tests of branch, so that certain retrotransposons (like ALUs, SINEs B2, reconstructed ancestral genomic regions in a mouse model, es- BOV)arelineagespecific,whileothers(L1,LTR,DNA)areboth peciallyincaseswherephenotypicdifferencesbetweenmiceand presentinthesequenceattherootofthetree(witharangeof theplacentalancestorarehypothesized. decayinglevel)andcanbeinsertedalonganybranch.Carewas Extanteutherianspeciesarevariationsonacommon“mam- taken to ensure that the rate of repeat insertion yields a set of maliantheme.”Accuratereconstructionoflargegenomicregions sequences whose repetitive content and repeat age distribution of an eutherian ancestor may help us identify and understand resembles closely those previously reported for human (IHGSC the common functional elements of that theme, as well as the 2001)andmouse(IMGSC2002),andresemblesthoseobservedin lineage-specificevolutionaryinnovationsthatledtothemodern thegreaterCFTRregionforothermammals.Incaseswhereno variationsonit.Becausedistancesarereducedanddirectionof lineage-specificrepeatinformationwasavailable,weusedrepeti- tive element consensi of species not used in this study (mono- change can be resolved, much can be learned by comparing tremes and marsupials) and used an insertion rate equal to the mammaliangenomestotheircommonancestorratherthanpair- insertionrateofALUsintheprimatelineage. wiseamongthemselves.Becausethenumberofsubstitutionsper We use the above methods to simulate evolution from an siteleadingfromtheplacentalancestralgenometothehuman ancestral mammalian sequence forward to modern versions of genome is only one third of that from the ancestor to mouse thatsequence,simulatingspeciationeventsatthebranchpoints (Cooper et al. 2003; Thomas et al. 2003; RGSPC 2004), the an- of the tree, and substitutions, insertions, and deletions along cestral genome is much closer to our own genome than is the eachbranch.Toinitiatesuchasimulation,wefirstneedtogen- mousemodel.Whilethepresentworkisonlyasmallfeasibility erateahypotheticalancestralmammaliansequencetogoatthe study,inthelongrun,weexpectthatanaccurateancestralre- root of the tree. This is the sequence that we will later try to constructionoftheeuchromaticgenomicregionsofourplacen- reconstruct from the sequences at the leaves of the tree. This talancestorwillproveextremelyvaluableforstudyingtheevo- hypotheticalancestralmammaliansequenceisgeneratedbyan- lutionaryprocessesandspecificevolutionaryeventsthatshaped other simulation, i.e., starting with a repeat-free 40% GC-rich ourowngenome,aswellasthegenomesofothermodernmam- randomsequence,wesimulateitsevolutionforatimeandata malianspecies. ratesimilartothosebetweenhumanandmouse,usingthesame setofmutationaloperationsaspreviouslydescribed,butinsert- ingtransposonsthatarebelievedtopredatethemammalianra- diation.Thissimulatedancestralsequencethushasarepeatcon- Methods tent and age distribution that should approximate that of the actualancestralmammaliangenome. Simulationprocedure We built a simulation procedure, based on the Rose program Alignmentandreconstruction (Stoye et al. 1997), that mimics the evolution of mammalian sequencesundernoselectivepressure.Thesimulationsarebased Aftergeneratingasetofsimulatedsequences,thesequencesare onthephylogenetictreeinferredbyEiziriketal.(2001)onaset first soft-repeat-masked using RepeatMasker (Smit and Green of genes for a large set of mammals. The branch lengths are 1999)andthenalignedusingtheThreadedBlockAligner(TBA) uniformly scaled by a factor of K=2.1, chosen to fit as closely multiple-alignment program (Schwartz et al. 2003; Blanchette as possible the substitution rates observed in neutral DNA of a et al. 2004). The TBA alignment of the CFTR region can be in- 1.87-Mb region of human chromosome 7 with orthologous se- teractively explored on the human genome browser (Kent et quencesineightothermammals(SiepelandHaussler2003;Tho- al.2002)athttp://genome.ucsc.edu(undertheENCODEtracks), mas et al. 2003). Given this phylogenetic tree, we simulate se- and is updated as new species become available. An archival quence evolution by performing random substitutions, dele- versionofthealignmentusedinthisstudyisavailableathttp:// tions, and insertions along each branch, in proportion to its genome.ucsc.edu/ancestors.Theancestralsequenceispredicted length.Substitutionsfollowacontext-independentHKYmodel based on this multiple alignment. To determine which posi- (Hasegawa et al. 1985) with Ts/Tv=2, p(a)=p(t)=0.3, and tions in the multiple alignment correspond to bases that were p(c)=p(g)=0.2,exceptthatsubstitutionratesofCpGpairsare in the common ancestor and which represent lineage-specific 10timeshigherthanotherrates(SiepelandHaussler2003).De- insertions we start by using RepeatMasker (Smit and Green 2420 Genome Research www.genome.org Ancestral genome reconstruction 1999)tosoft-maskrepetitiveregions.Lineage-specificlikeALUs seen in Supplemental Figure S3, (a), p turns out to be an ex- e are excised as their insertion came after the Boreoeutherian cellent predictor of mismatch errors, but a poor predictor of ancestor. The repeat-masked multiple alignment is then fed to added bases. On the other hand, Figure S3, (b) shows that n id agreedyalgorithmthatattemptstoexplaintheremainingindels isgoodatpredictingaddedbases,butquiteinefficientatpredict- with the plausible scenario. For the most part, the algorithm ingmismatcherrors.Theprobabilityoferrorofeachtypecanin assumesthatthealignmentisphylogeneticallycorrect,i.e.,that factbeestimatedjointlyforeachpair(p,n ),whichprovidesa e i) two bases are aligned if and only if they derive from a com- reasonable confidence estimate for both types of errors at any monancestralbase.Seebelowforadeparturefromthisassump- reconstructed base, making it possible to identify high- tion.Originally,allofthegapsinthealignmentaremarkedas confidence or low-confidence bases in the reconstructed se- unexplained. The algorithm iteratively selects the insertion quence. ordeletion,performedalongaspecificedgeofthetreeandspan- ning one or more columns of the alignment, that yields the largestnumberofalignmentgapsexplainedperunitofcost.The Acknowledgments number of gaps explained by a deletion is the number of un- explainedgapsinthesubtreeabovewhichthedeletionoccurs. WethankJimKent,ArianSmit,AdamSiepel,GillBejerano,Elliot The number of gaps explained by an insertion is the number Margulies, Brian Lucena, Leonid Chindelevitch, and Ron Davis of unexplained gaps in the complement of the subtree above forhelpfuldiscussionsandsuggestions.W.M.wassupportedby which the insertion occurs. The costs are defined heuristically. grant HG-02238 from the National Human Genome Research Thecostofadeletionisgivenby1+0.01log(L)(cid:2)0.01b,where Institute, E.G. was supported by NHGRI, D.H. and M.B. were Listhelengthofthedeletionandbisthelengthofthebranch supported by NHGRI Grant 1P41HG02371 and the Howard along which the event takes place. The cost of an insertion HughesMedicalInstitute.Finally,wethanktheNISCCompara- isgivenby1+0.01log(L)(cid:2)0.01b(cid:2)r,whereLandbaredefined tiveSequencingProgramforprovidingmultispeciescomparative as above and r is a term that takes value 0.5 if the repeti- sequencedata. tive content of the segment inserted is >90%. Once the best insertion or deletion has been identified, its gaps are marked as “explained.” This does not preclude them from being part References of other indels, but they will not count in their evaluation. Finally,heuristicsareusedtoreduceerrorsduetoincorrectalign- Adey,N.B.,Tollefsbol,T.O.,Sparks,A.B.,Edgell,M.H.,and ment, in particular to reduce the problems caused by two re- HutchisonIII,C.A.1994.Molecularresurrectionofanextinct petitive regions from two distantly related species mistakenly ancestralpromoterformouseL1.Proc.Natl.Acad.Sci.91: aligned to each other, with other species having gaps in that 1569–1573. Auriche,C.,Carpani,D.,Conese,M.,Caci,E.,Zegarra-Moran,O., region. More precisely, a subtree containing at least six leaves, Donini,P.,andAscenzioni,F.2002.FunctionalhumanCFTR <20% of which have bases at a given position, will never be producedbyastableminichromosome.EMBORep.3:862–868. predicted to have an ancestral base at that position. See Austin,J.J.,Ross,A.J.,Smith,A.B.,Fortey,R.A.,andThomas,R.H.1997. Fredslund et al. (2003) for an alternative approach to the same Problemsofreproducibility—DoesgeologicallyancientDNAsurvive inamber-preservedinsects.Proc.R.Soc.Lond.BBiol.Sci. problem. After having established which positions of the mul- 264:467–474. tiplealignmentcorrespondtobasesintheancestor,wepredict Bejerano,G.,Pheasant,M.,Makunin,I.,Stephen,S.,Kent,W.J.,Mattick, which nucleotide was present at each position in the ancestor J.S.,andHaussler,D.2004.Ultraconservedelementsinthehuman using the standard posterior probability approach (Yang et al. genome.Science304:1321–1325. Bienvenu,T.,Petitpretz,P.,Beldjord,C.,andKaplan,J.C.1994.A 1995)basedonasimpledinucleotidesubstitutionmodelwhere missensemutation(F87L)inexon3ofthecysticfibrosis substitutions at the two positions are independent except for transmembraneconductanceregulatorgene.Hum.Mutat. CpG, whose substitution rate to TpG is 10 times higher than 3:395–396. thoseofothertransitions(SiepelandHaussler2003).Thebranch Birnbaum,D.,Coulier,F.,Pebusque,M.J.,andPontarotti,P.2000. Paleogenomics:Lookinginthepasttothefuture.J.Exp.Zool. lengths are inferred from the data using PHYML (Guindon 288:21–22. and Gascuel 2003). The equilibrium frequencies are estimated Blanchette,M.,Kunisawa,T.,andSankoff,D.1999.Geneorder from the data. The only parameter given to the reconstruc- breakpointevidenceinanimalmitochondrialphylogeny.J.Mol. tion algorithm is the transition/transversion ratio of the HKY Evol.49:193–203. Blanchette,M.,Kent,W.J.,Riemer,C.,Elnitski,L.,Smit,A.F.,Roskin, model,whichissetat2. K.M.,Baertsch,R.,Rosenbloom,K.,Clawson,H.,Green,E.D.,etal. Inexperimentsusingactualsequencedatafrompresentday 2004.Aligningmultiplegenomicsequenceswiththethreaded mammals,thesimulationstepsareomitted,andthesamealign- blocksetaligner.GenomeRes.14:708–715. mentandreconstructionprocedureisfollowed. Boffelli,D.,McAuliffe,J.,Ovcharenko,D.,Lewis,K.D.,Ovcharenko,I., Pachter,L.,andRubin,E.M.2003.Phylogeneticshadowingof primatesequencestofindfunctionalregionsofthehumangenome. Science299:1391–1394. Base-by-baseconfidenceestimates Bourque,G.,Pevzner,P.A.,andTesler,G.2004.Reconstructingthe genomicarchitectureofancestralmammals:Lessonsfromhuman, Theprobabilitythatagivenancestralbaseisincorrectduetoa mouse,andratgenomes.GenomeRes.14:507–516. mismatchoraddedbasecanbeapproximatedempiricallybased Challem,J.J.1997.Didthelossofendogenousascorbatepropelthe ontwoindicatorsofreconstructionerrors.Thefirstindicatoris evolutionofAnthropoideaandHomosapiens?Med.Hypotheses thetheoreticalsubstitution-basedreconstructionerrorprobabil- 48:387–392. Chang,B.S.andDonoghue,M.J.2000.Recreatingancestralproteins. ityp calculatedasthesumoftheposteriorprobabilitiesofthe e TrendsEcol.Evol.15:109–114. threeleastlikelyancestralnucleotidesatthatposition(Yanget Chang,B.S.,Jonsson,K.,Kazmi,M.A.,Donoghue,M.J.,andSakmar,T.P. al. 1995). The second indicator n is the number of insertion 2002.Recreatingafunctionalancestralarchosaurvisualpigment. id anddeletioneventsthatspanthesite,asestimatedbyourrecon- Mol.Biol.Evol.19:1483–1489. Cooper,G.M.,Brudno,M.,Green,E.D.,Batzoglou,S.,andSidow,A. struction method. Each reconstruction error observed during 2003.Quantitativeestimatesofsequencedivergenceforcomparative the simulation was recorded, together with pe (rounded to the analysesofmammaliangenomes.GenomeRes.13:813–820. closestpercentagepoint)andn forthecorrespondingsite.As Cunningham,C.W.,Omland,K.E.,andOakley,T.H.1998. id Genome Research 2421 www.genome.org
Description: