Estimating Divergence Time and Ancestral Effective Population Size of Bornean and Sumatran Orangutan Subspecies Using a Coalescent Hidden Markov Model Thomas Mailund1*, Julien Y. Dutheil1, Asger Hobolth1,2, Gerton Lunter3, Mikkel H. Schierup1,4 1BioinformaticsResearchCentre,AarhusUniversity,Aarhus,Denmark,2DepartmentofMathematicalSciences,AarhusUniversity,Aarhus,Denmark,3TheWellcome TrustCentreforHumanGenetics,Oxford,UnitedKingdom,4DepartmentofBiology,AarhusUniversity,Aarhus,Denmark Abstract Duetogeneticvariationintheancestoroftwopopulationsortwospecies,thedivergencetimeforDNAsequencesfrom two populations is variable along the genome. Within genomic segments all bases will share the same divergence— becausetheyshareamostrecentcommonancestor—whennorecombinationeventhasoccurredtosplitthemapart.The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effectivepopulationsizeoftheancestralpopulation,aswellasdemographiceffectsandselection.Thus,inferenceofthese parametersmaybepossibleifwecandecodethedivergencetimesalongagenomicalignment.Here,wepresentanew hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate, and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the powerandconsistencyofthemodelwithcoalescentsimulationsandthenapplyittothewhole-genomesequencesofthe twoorangutansub-species,Bornean(P.p.pygmaeus)andSumatran(P.p.abelii)orangutansfromtheOrangutanGenome Project.Weestimatethespeciationtimebetweenthetwosub-speciestobe334+145thousandyearsagoandtheeffective populationsizeoftheancestralorangutanspeciestobe26,800+6,700,consistentwithrecentresultsbasedonsmallerdata sets. We also report a negative correlation between chromosome size and ancestral effective population size, which we interpretas asignature ofrecombination increasing the efficacy ofselection. Citation:MailundT,DutheilJY,HobolthA,LunterG,SchierupMH(2011)EstimatingDivergenceTimeandAncestralEffectivePopulationSizeofBorneanand SumatranOrangutanSubspeciesUsingaCoalescentHiddenMarkovModel.PLoSGenet7(3):e1001319.doi:10.1371/journal.pgen.1001319 Editor:JonathanK.Pritchard,UniversityofChicagoHowardHughesMedicalInstitute,UnitedStatesofAmerica ReceivedNovember17,2009;AcceptedJanuary25,2011;PublishedMarch3,2011 Copyright:(cid:2)2011Mailundetal.Thisisanopen-accessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense,whichpermits unrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalauthorandsourcearecredited. Funding:ThisworkisfundedbyEuropeanResearchAreainPlantGenomics(ERA-PG)ARelativesandTheDanishResearchCouncil(FNU272-05-0283).The fundershadnoroleinstudydesign,datacollectionandanalysis,decisiontopublish,orpreparationofthemanuscript. CompetingInterests:Theauthorshavedeclaredthatnocompetinginterestsexist. *E-mail:[email protected] Introduction exploitedbyTakahata[8]inordertoderiveasimpleestimatorof theancestraleffectivepopulationsize,butwhenappliedtohuman Thereisagrowingawarenessthatthegenomicsequencesnow andchimpanzeetheestimatewasassociatedwithalargevariance, available for closely related species or sub-species may provide because of the limited divergence of these species. Takahata [9] detailed information on the population genetics process in the showedthatincludinganoutgroupimprovedtheresults,andYang ancestors of these species and about the speciation process itself [10]showedthat mutationrate heterogeneityisconfoundedwith [1].Thisisbecausethedivergencepatternsofasetof(sub-)species theestimate.Theseearlyapproachesestimatedafixedphylogeny vary along their genomes due to polymorphism in the ancestral fordifferentsequencefragments.Laterapproacheshaveexploited species. Different parts of the genome have different histories the fact that if speciation times are sufficiently close together, because recombination has brought together the genome from incomplete lineage sorting (cases where the gene tree is different different ancestors. Viewed back in time, two sequences are from the species tree) may occur. Wall [11] allowed for therefore, at any given point in the genome, a sample of two recombination and several species in the likelihood estimation of individuals in the ancestral species and the identity of these population parameters and Patterson et al. [2], Hobolth et al. [5], individualsvaryalongthesequence,asillustratedinFigure1.The andDutheiletal.[12]madesimplemodelsofchangesingenealogy varyingsequence-divergencetimesprovideinformationthatallows along a multi-species alignmentwith incompletelineage sorting. ustoexaminethespeciationprocess[2,3]andenableinferenceof DetailedmodelingusingMCMCofthegenealogiessupporting parameters for the population genetics of the ancestral species, a data set [13], or the ancestral recombination graph [3] in a such as the effective population size or speciation-divergence or modelthatincludesrecombination,hastheadvantageofallowing population-divergence times[4–6]. modeling more complex aspects of the data such as gene flow Coalescent theory [7] tells us that the variation in coalescence through migration, but scaling these to whole genome data is time in an ancestral population is directly proportional to the challenging. In contrast, approaches based on hidden Markov effective population size of the ancestral population. This was models(HMMs),suchasHobolthetal.[5],andDutheiletal.[12], PLoSGenetics | www.plosgenetics.org 1 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM becomeone(acoalescenceevent).Wemodelthetwospeciesusing Author Summary an isolation model (recall Figure 1). Initially, the sequences from WepresentahiddenMarkovmodelthatusesvariationin the two populations evolve independently, meaning that coales- coalescence times between two distantly related popula- cence events always involves sequences from within the same tions, or closely related species, to infer population population. After the population split event, the two populations genetics parameters in ancestral population or species. become one, and the two (fragmented) sequences can now start Themodelinfersthedivergencetimesinsegmentsalong findingmostrecentcommonancestors.Asaresultoftheseevents, thealignment.Usingcoalescentsimulations,weshowthat the sequence divergence varies as we scan along the genomic the model accurately estimates the divergence time sequences. The patterns of sequence divergence, and the betweenthetwopopulationsandtheeffectivepopulation distribution of recombination events separating segments with size of the ancestral population. We apply the model to constantdivergence,areinformativeofthecoalescentprocessesin the recently sequenced orangutan sub-species and esti- theseparate populations andthat oftheancestor. mate their divergence time and the effective population Themodelisparameterizedwiththesplittimebetweenthetwo size oftheirancestorpopulation. populations, the effective population size of the ancestral population, and the recombination rate, which is assumed to be provide computationally fast inference algorithms, which are constant along the segment of the genome analysed. The model scalable to whole-genome data sets. Nevertheless, modeling assumes that no migration occurs after the population split. We complex demographic models in terms of transition matrices validatethemodelbysimulationsandthenapplyittotherecently and emissionprobabilities ismathematically challenging. sequencedgenomesofthetwoorangutansubspeciesBornean(P.p. Here we present a new approach for constructing transition pygmaeus) and Sumatran (P. p. abelii) [17]. We estimate the sub- matricesforHMMapproachesanduseittocapturethevariation speciesdivergencetimetobe334+145thousandyearsago(kya), incoalescencetimebetweenthegenomesoftwoindividualsfrom andtheeffective populationsize oftheancestral orangutan tobe two divergent populations or two closely related species. The 26,800+6,700 estimated from the autosomal chromosomes, and model traces the ancestry of the two sequences using the about 3=4 ofthat for theXchromosome: 20,400+7,400. coalescenceprocesswithrecombination[14–16],andthisprocess determinestheswitchingprobabilitiesfromonecoalescenttimeto Results another along the pair of sequences. Considering a junction between two nucleotides, and going back in time, two types of A coalescent time hidden Markov model events may occur: a sequence can split up in two fragments (a We have developed a new coalescent hidden Markov model recombination event) and two fragments can merge again to (CoalHMM). The properties characterizing CoalHMMs [5,12] Figure1.Theancestryoftwogenomicsequences.Thefigureillustratestheancestryoftwogenomicsequences(inredandblue)fromtwo differentpopulations.Tracingtheirancestrybackintime,thefirsteventweseeisarecombination(attimer andsequencepositionb )withinthe 1 1 ‘‘bluepopulation’’.Thisisfollowedbythepopulationsplit(attimet ),andthusthe‘‘redgenome’’enterstheancestralpopulationasacontiguous 1 segment, while the ‘‘blue genome’’ enters the ancestral population in two fragments. The process in the ancestral population now under goes coalescenceevents(attimesc ,c ,andc )andarecombinationevent(attimer andsequencepositionb ).Aconsequenceofthecoalescence 1 2 3 2 2 process with recombination is that the coalescence times, and thus the sequence divergence, changes along the genomic alignment at the recombinationbreakpoints(illustratedontheright). doi:10.1371/journal.pgen.1001319.g001 PLoSGenetics | www.plosgenetics.org 2 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM are that they model the coalescent process as a Markov process event per 2N generations, where N is the effective population e e along an alignment, and conditional on the alignment can infer size),andtherateofchangefromthesecondstatetothefirstisthe theunobservedgenealogiesthatleadtothealignment,orintegrate rate of recombinationevents. overthedistributionofgenealogies.Whilethecoalescentprocessis Assuming that the two nucleotides are initially linked, i.e. that only Markovian in time but not in space [16,18], we have the CTMC is initially in state 1, we can calculate the probability previously shown that we can reasonably approximate it as a that it isin state 1 or state 2 for anypoint in time (see Figure 3). Markov process along the sequence [12,19,20] and doing this This probability distribution will tend toward an equilibrium enablesustodevelopveryefficientinferencealgorithmscompared defined by the ratio between the coalescent rate and the to the sampling based approaches usually needed to capture the recombination rate. As the coalescence rate is inversely propor- true coalescent process. tional to the effective population size, the larger the effective The crux of developing a CoalHMM is specifying the populationsizethelesslikelythatthenucleotidesarelinkedatthe transition probabilities of the hidden Markov model in terms of equilibrium, although in general the CTMC is much more likely the coalescent process parameters, e.g. recombination rates and to be in the linked state than the unlinked as in general the effective population sizes. In Hobolth et al. [5], this issue was recombination rate for two neighboring nucleotide is orders of largely ignored and coalescent parameters were obtained from magnitude smaller than thecoalescence rate. post-processingofinferredhiddenMarkovmodelparameters.In For two sequences, the coalescence process we consider Dutheil et al. [12] the transition probabilities were derived from containstwoadjacentnucleotidesfromeachsequence,andthe the coalescent process through a set of rather complicated state space consists of all possible ways that these four equations and simplifying assumptions. The approach we nucleotides can be combined: linked or unlinked between left describe in this paper greatly simplifies the computation of and right nucleotides and having coalesced into their most transition probabilities and does so without simplifying assump- recent common ancestor or not for nucleotides at the same tions beyond assuming that the process is Markovian along the position.InFigure4wehavesummarizedthestatespaceofthe alignment. system. The 15 states correspond to the various ways the two The key insight behind our new approach, also observed in genomes (the top and the bottom row) can link the left and Dutheil et al. [12], is that since we assume that the process is right nucleotides or be merged in common ancestors (white Markovian along the alignment we need only consider the dots). genealogies of pairs of adjacent nucleotides. The new approach When modeling the ancestry of one sequence for each of two differsfromDutheiletal.inthewayweexploitthisinsight.Here, populations, the system will first evolve independently as two we explicitly consider the coalescent with recombination process single-sequenceCTMCsbackintimeandthenmergeintoasingle forpairsofnucleotidesandderivetheexacttransitionprobabilities two-sequenceCTMCwithinitialprobabilitydistributiongivenby fromthismodel. theend-statesofthesinglesequenceCTMCs.Thetwosequences The new approach that allows us to calculate the exact from the single-sequence CTMCs can, after entering the two- transition probabilities from the coalescence process with recom- sequence CTMC, recombine and coalesce as in the single- binationdiffersfromthepreviousCoalHMMswehavedeveloped, sequence CTMC, but now with the possibility of linking the left- where thisprocess has beenapproximated. nucleotidefromonepopulationtotheright-nucleotideoftheother Modeling genealogies as a two-nucleotide coalescent population.Figure5showstheprobabilitydistributionofthetwo- process. In our new CoalHMM approach, we use two sequence CTMC states for states 1 to 7 (see Figure 4) that different Markov models: one that models the coalescence times corresponds to the states where left and right nucleotides are along the sequences as a discrete space Markov model, and one recombining andcoalescing. that models the ancestry of two neighboring nucleotides back in In addition to these events, it is possible for the two left- time as a continuous time, finite state Markov model. The first nucleotides or the two right-nucleotides to coalesce into the most model is used as the hidden Markov model when estimating recent common ancestor of the two original sequences. These parameters, while the second is used to compute the transition eventsareirreversibleintheCTMCandeventuallybothleftand probabilities of thefirst. rightnucleotideshavefoundamostrecentcommonancestorand Thecoalescentprocess,whenviewedasaprocessintimerather the two-sequence CTMC essentially reduces to the single- than along the alignment, is a continuous time Markov chain sequence CTMC. (CTMC) on a finite state space. For a single sequence, the two Calculating transition probabilities from the two- nucleotideCTMChasjusttwostates:thetwoadjacentnucleotides nucleotide coalescent process. From the two CTMC can be linked, i.e. sitting on the same haploid chromosome, or systems we can compute the probability distribution of unlinked,i.e.sittingintwodifferentchromosomes(seeFigure2).A genealogies of adjacent nucleotides back in time. We use this to recombinationeventchangestheCTMCfromthefirststatetothe construct the secondMarkov model, a Markovprocess along the second and a coalescent event changes the CTMC from the alignment, inthefollowingway: 1)wesplitthecoalescencetimes second state back to the first. The rate of change from the first backintimeintoafinitesetoftimeintervalsthatwillbethestates statetothesecondistherateofcoalescenceevents(onecoalescent of the hidden Markov model along the alignment, 2) for time intervals i and j we use the CTMC system to compute the probabilitythattheleftnucleotidesfindtheirmostrecentcommon ancestor in interval i while the right nucleotides find their most recent common ancestor in interval j, Pr(L[i,R[j) and 3). We calculatethetransitionprobabilityofmovingfrominterval/statei Pr(L[i,R[j) tointerval/state j as . Pr(L[i) Figure2.Statesforthesinglespeciessystem.Forasinglespecies Ofthesesteps,1)and3)aretrivialtoachieve.Step2)isachieved wehavetwonucleotides,oneleftandoneright,andthesecaneither belinkedorunlinked. byconsideringfourdifferentsub-setsofthetwo-sequenceCTMC: doi:10.1371/journal.pgen.1001319.g002 VB, VL, VR and VE. The first set consists of the states where PLoSGenetics | www.plosgenetics.org 3 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure3.Stateprobabilitiesforthesinglesequencetwo-nucleotidecoalescentprocessasafunctionoftime.Thefigureshowsthe probabilityofthetwoneighboringnucleotidesbeinglinkedinthesingle-sequenceCTMCandhowthisprobabilityevolvesovertime.Theprobability forbeingunlinkedis,ofcourse,oneminustheprobabilityofbeinglinked,sincetheCTMChasonlytwostates.Thetwodifferentlinescorrespondto twodifferentcoalescencerates.Forpopulation1,thecoalescencerateissetto1(sothex-axisisinunitsof2N generationsforthisspecies),whilefor e population 2 the coalescence rate is half that, corresponding to population 2 having twice the effective population size of population 1. The recombinationrateissetto2:10{4.Assuming2N of20,000forpopulation1and40,000forpopulation2,andagenerationtimeof20years,this e correspondsto1cM/Mbpand1unitonthex-axiscorrespondsto400,000years.Theseparationtimeweinferfortheorangutansub-speciesis around0:75onthex-axis,wherethesystemisstillfarfromequilibrium. doi:10.1371/journal.pgen.1001319.g003 neither left nor right nucleotide have reached their most recent Togettheprobabilitye.g.fori~jwecalculatetheprobabilitythat commonancestor,thenexttwoconsistofthestateswhereonlythe the system is in V until it reaches interval i and then is in V B E leftoronlytheright,respectively,havereachedtheirmostrecent when it leaves interval i (see Figure 7, left). For ivj (and common ancestor, and finally the last set consists of the states symmetricallyforjvi)wecalculatetheprobabilitythatthesystem where both left and right nucleotides have reached their most isinVB untilitreachesintervali,leavesintervaliinVL andstays recent common ancestor. Tracing the history of the nucleotides inVLuntilitreachesintervaljwhichitleavesinVE (seeFigure7, back in time, all histories begin in a state in V and end in V . right). B E MosthistoriesgodirectlyfromV toV (wherebothleftandright B E nucleotides coalesceat thesame timebybeinglinkedat thetime Model validation: simulation study they reach their most recent common ancestor) but some find a The power and consistency of the model can be evaluated mostrecentcommonancestorfirstattheleftnucleotide,astatein through simulations. The HMM is designed to approximate the VL,orfirstattherightnucleotide,astateinVR,beforereaching coalescent with recombination process. Thus, we simulate data VE. The probability of being in the four classes of states as a usingacoalescentwithrecombinationandthencompareinferred functionof timeisshown inFigure 6. parameters from the HMM model with the values used in the Fromtheprobabilitydistributionofthefourstateclassesbackin simulations. time, we can construct the joint probability of having the left We first examined the fit between the coalescent process with nucleotide finding its most recent common ancestor in interval i recombination and the Markov approximation by examining the andtherightnucleotidefindingitsmostrecentcommonancestor empiricaldistributionoftimespentineachtimeintervalwiththe inintervaljinastraightforwardmanner(seeMethodsfordetails). Markovcalculations(seeSupplementalSection1.1ofProtocolS1). Figure4.Statesforthetwospeciessystem.Forthetwospeciessystem,wehaveoneortwoleftandoneortworightnucleotides.Onewhen the left or right nucleotides have found their MRCA (open circles) and two otherwise (filled circles). Left nucleotides can be linked with right nucleotidesornot. doi:10.1371/journal.pgen.1001319.g004 PLoSGenetics | www.plosgenetics.org 4 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure5.EvolutionofV statesinthetwo-sequenceCTMC.Thefigureshowsthebeginningstatesinthetwo-sequenceCTMCandhowtheir B probabilitiesevolveovertime.Theasymmetrybetweenwhichoriginalsequenceislinkedversusunlinked,whenonlyonesequenceislinked(the secondplotfromthetop)iscausedbythedifferencesineffectivepopulationsizewithinthesingle-sequenceCTMCsystemusedfordeterminingthe initialdistributionofthetwo-sequencedistribution.TheinitialprobabilityistakenfromthetwopopulationsinFigure3attime10(therightedgeof that figure) and the rates in the two-sequence system correspond to those for population 1 in Figure 3, i.e. a coalescence rate of 1 and a recombinationrateof2:10{4. doi:10.1371/journal.pgen.1001319.g005 Hereafter we estimated the probability of moving from one time the ancestral species of 25,000, and 3) a recombination rate of intervaltoanotherbasedonsimulationsandcomparedthattothe 1.5 cM/Mb. We then explored how well the model estimates transitionprobabilitiesintheHMM(seeSupplementalSection1.2 these parameters. Since the model uses intervals of coalescence ofProtocolS1),andgenerallyagoodmatchbetweenthetwowas times to estimate the parameters, we expect that the number of found.Nextwecomputedthelikelihoodsurfaceforourthreemain states in the HMM will affect the estimates. We explored this by parametersandinspectedthesemanually.Generallywefoundthe analyzing the simulated data sets with 5, 10 and 15 states. The maximumlikelihoodnearthetruevalueforallthreeparameters, resultsare shown inFigure 8. but with a rather flat likelihood for the recombination rate Asisevidentfromthefigure,thereisabiasintheestimates:we parameter (see Supplemental Section1.3 ofProtocol S1). tend to underestimate the split time and at the same time Most important for the model is the estimation accuracy. To overestimate the ancestral population size. Similar biases were examine this, we simulated 100 data sets 500kbp in length with observedinourpreviousmodelbasedonincompletelineagesorting the following parameters: 1) a sub-species divergence time of rather than changes in divergence time [12]. Both biases are 335kya,2)aneffectivepopulationsizeforthetwosub-speciesand probablycausedbythesamemodelingartifact:Forthemodeltofit PLoSGenetics | www.plosgenetics.org 5 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure6.Evolutionofthetwo-sequenceCTMC.Thefigureshowshowtheprobabilityofbeinginoneofthefourclassesofstatesevolveover time.Initially,thesystemisinV withprobability1,butthisprobabilitydropsexponentially.Witharelativelysmallprobabilitythesystemwillgo B throughastateinV orV beforeendingupinV butmainlyV statesmovedirectlytoV states.Therateparametersusedarethesameasthose L R E B E inFigure5. doi:10.1371/journal.pgen.1001319.g006 thedata,itmustpredictasequencedivergenceclosetotheobserved coalescentprocesswiththeMarkovassumptionbutcontinuoustime value,soifthesplittimeisdecreased,theeffectivepopulationsize intervals[20]weseethebiasontherecombinationratedisappear. mustincreasetoreachthesameaveragedivergence,andviceversa. Based on the simulation study, we believe that the bias in the Therecombinationrateissomewhatunderestimatedforallruns. recombinationrateiscausedbytheMarkovassumptionalongthe Toelucidatethesourceofthebiaseswedidanextensivesetof alignment, and that the bias in divergence time and effective simulationswithdifferentsubsetsofourmodelassumptionsmet,see populationsizeiscausedbythediscretizationofcoalescencetimes ProtocolS1.SimulatingdirectlyfromtheHMM,whereallmodel intofixedintervals,andweobservethatthebiasindeeddecreases assumptionsaremet,wefindnobiasinanyofthethreeparameters. as we increase the number of intervals and thus more accurately Simulating from the coalescent process with recombination but capture thetrue distribution. discretizingtimetothemeanofeachtimeinterval,weonlyseethe With10states,thebiasesonsplittimeandeffectivepopulation bias on the recombination rate but no longer the biases on sizearereduced comparedto5 states,andthegainofadding an divergence time and effective population size. Simulating from a additional 5 states is minor in comparison. For computational PLoSGenetics | www.plosgenetics.org 6 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure7.TransitionprobabilitiescalculatedfromtheCTMCsystem.ThehiddenMarkovmodeltransitionprobabilitiesarecalculatedby consideringtheprobabilities,inthetwonucleotideCTMCsystem,thateitherbothleftandrightnucleotidefindsamostrecentcommonancestorin thesametimeinterval(left)orthattheleftnucleotidefindsamostrecentcommonancestorinonegiveninterval,i,andtherightinanother,j(right). doi:10.1371/journal.pgen.1001319.g007 efficiency (since the algorithm scales quadratically in the number interval. The genome-wide average is 334+145 thousand years of states), weused10states intheorangutan analysis. ago(kya). Ancestral effective population sizes. Figure 10 and Analysis of the orangutan sub-species Supplemental Figure 4 of Protocol S1 show the estimates of the We aligned the two orangutan genomes and divided the ancestraleffectivepopulationsizeforeachchromosomeandforthe alignment into 2,689 1 Mbp segments, obtaining independent entiregenome,respectively.SincetheXchromosomeisexpectedto maximum likelihood estimates of the divergence time, effective haveaneffectivepopulationsizeof3=4ofthatoftheautosomes,we populationsizeoftheancestralspecies,andrecombinationratefor estimate the genome-wide effective population size from the eachsegment.Theestimatesarebasedonamutationrateof10{9 autosomes only and obtain 26,800+6,700. The estimate for the per year and a generation time of 20 years, giving m~2:10{8 Xchromosomeis20,400+7,400,closetotheexpected3=4ofthe substitutions per generation. For the effective population size of estimatefromtheautosomes. the two sub-species we used estimates from Becquet and As for the estimates of the divergence time, the estimates are Przeworski [3]: 2N ~10,000 for Bornean orangutans and consistentbetweenchromosomes.Again,thegenomicmeanisnot e 2N ~17,000 for Sumatran orangutans. The genome-wide contained within the 50% confidence interval, but is within the e estimates aresummarized inTable 1. 95%interval.Thisislikelytoberelatedtotheobservationforthe Afterinferringparametersforeachsegmentweremovedoutlier divergence time estimates in Figure 9, as the parameter for segments where the inferred divergence time was below 5 divergencetimeandforancestralpopulationsizeareconfounded. thousand years or above 1 million years; the effective population In general, if we underestimate the divergence time we size was below 5,000 or above 100,000; and where the overestimate the effective population size, leaving the average recombination rate was below 0.1 or above 10. In total 203 divergence lessaffected. segments were removed, leaving 2486. Removing these outliers Recombinationrate. Figure11andSupplementalFigure5of had verylittleeffect on thegenome-wideestimates (see Table 1). Protocol S1 show the estimates of the recombination rate for each Divergence time. We obtained an independent estimate of chromosomeandfortheentiregenome,respectively.Weestimatethe the sub-species divergence time for each of the 1 Mbp segments. genome-wide recombination rate to be 0:95+0:72 cM/Mb. The Figure 9 shows the distribution of these estimates for each absolutevalueshouldbeinterpretedwithcaution,however,sincewe chromosomeandSupplementalFigure3ofProtocolS1showsthe knowfromthesimulationstudythattherecombinationrateislikelyto distributionfortheentiregenome.Ingeneral,theestimatesonthe be under-estimated. The estimated recombination rate correlates different chromosomes are consistent. For chromosome 21, the positively with the equilibrium GC content, as estimated from the genomicmeanestimateisnotwithinthe50%confidenceinterval substitutionmodel(Figure12)ashasalsobeenobservedforhuman (theblueboxinthebox-plot),butitiswithinthe95%confidence polymorphismdata. PLoSGenetics | www.plosgenetics.org 7 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure8.Estimationaccuracyasafunctionofthenumberofhiddenstates.Theboxplotsshowtheestimatedparameters(divergencetime, ancestral effective population size, and recombination rate) for 100 simulated data sets. The true value is showed as the blue dashed line. The numberofstatesintheHMM,i.e.thenumberofcoalescencetimeintervalsusedfortheestimationtakesthevalues5,10and15.Thereisaclearbias intheestimateswherewetendtounderestimatethedivergencetimeandoverestimatetheeffectivepopulationsize.Thisbiasiscausedbythe discretisationof continuous coalescencetimes into fixed intervals, and the bias is reducedas the numberof states (i.e. intervals)increases.The recombinationrateisunder-estimated,whichisaconsequenceoftheMarkovassumption. doi:10.1371/journal.pgen.1001319.g008 Differences between chromosomes. Figure 13 shows the Robustnessoftheresults. Theestimatesareconditionalon averageestimatesofthekeyparametersforeachchromosomeasa theparameterswehavekeptfixedinthemodel:themutationrate, function of the chromosome size (measured as the number of thegenerationtime,theeffectivepopulationsizeofthetwopresent segments analyzed). We find no correlation between the day populations, andthenumber of hiddenstates. divergence time and the chromosome size, but a negative Sincecoalescencetimeisscaledintimeunitsof2N generations e correlation between inferred recombination rate and size. This andemissionprobabilitiesaregivenbythemutation(substitution) suggests higher recombination rate per base pair at small rate times the divergence time, changing either the assumed chromosomes as is also observed in the human genome. generationtime(20yearspergeneration)ortheassumedmutation Interestingly, the effective population size is also significantly rate (2:10{8 mutations per generation) will change the estimated higher onthesmaller andmore recombiningchromosomes. divergence time linearly: since time in the model is measured in PLoSGenetics | www.plosgenetics.org 8 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM number of states did not change the estimated parameters Table1. Genome-wideparameterestimates. significantly.Fordetails,seeSupplementalSection2.3ofProtocolS1. Genome-wideestimate Includingoutliers Discussion Sub-speciationtime 334,000+145,000 314,000+168,000 Wepresentanewmodelfortheanalysisoftwogenomesfrom AncestralNe(autosomes) 26,800+6,700 27,500+7,200 diverged populations or closely related species, with the aim of AncestralNe(X) 20,400+7,400 20,700+8,800 estimating divergence time, the effective population size of the Recombinationrate 0:95+0:72 0:97+0:86 ancestral population, andtherecombination rate. Simulation results show that the model infers the key Genome-wideestimatesofthekeyparameters,withandwithoutoutlier parameters without much bias when the number of coalescent estimatesremoved.Theancestraleffectivepopulationsizeisestimated states is sufficiently high (we recommend at least 10 states). The separatelyfortheautosomalchromosomesandtheXchromosomesincetheX estimatesarenotmuchaffectedbyassumptionsoftheeffectivesize chromosomebycoalescencetheoryisexpectedtohaveaneffectivepopulation sizeof3=4oftheautosomes. of thepresentday populations, whichimpliesthat themodel can doi:10.1371/journal.pgen.1001319.t001 be used for analysis of population pairs where only the order of magnitude of these quantities are known. That the model produces consistent results may seem surprising since the model generations, halving the generation time would halve the isessentiallyonlymodelingwhathappensfromonenucleotideto divergence time when measured in years. Similarly, assuming the next and not any higher order correlations. This is very that the mutation rate is twice as high, the inferred divergence fortunatesinceitistheMarkovpropertythatenablescalculations time wouldbehalf aslong ago. tobesufficiently efficientfor genome-wideanalysis. Less obvious is the dependency on the present day effective populationsizesandthenumberofstatesintheHMM.Totestthis wevariedthefixedeffectivepopulationsizesbyafactorof10inboth Consequences of migration direction and alternatively tried constraining the three effective The present model assumes a simple population split or an population sizes in the model to be equal. We found the resulting allopatricspeciation.Totesttheconsequencesofthisassumption, changesintheestimatestobeinsignificant.Similarly,changingthe we simulated data sets where a single population first splits into Figure9.Distributionofsub-speciationtimeestimatesforeachchromosome.Theboxplotshowsthedistributionoftheestimatedsub- speciationtimeoneachchromosome.Thedashedlineshowsthegenome-wideaverage.Ingeneral,theestimatesarereasonablyconsistentbetween thechromosomes,althoughchromosome21hasaslightlysmallervalue. doi:10.1371/journal.pgen.1001319.g009 PLoSGenetics | www.plosgenetics.org 9 March2011 | Volume 7 | Issue 3 | e1001319 ACoalescenceTimeHMM Figure10.Distributionofeffectivepopulationsizeestimatesforeachchromosome.Theboxplotshowsthedistributionoftheestimated ancestraleffectivepopulationoneachchromosome.Thedashedlineshowsthegenome-wideaverage.Ingeneral,theestimatesarereasonably consistent between the chromosomes, with one exception being chromosome 21. Chromosome X is within the expected range of 3=4 of the autosomalchromosomes. doi:10.1371/journal.pgen.1001319.g010 two populations with some gene-flow between the populations, theprocesshasreachedequilibrium,andthustheunknownphase and sometimelater thegene-flow stops(see Protocol S1). could potentially affectour estimates. Inthissetting,thedivergencetimeweinferisbetweenthefirst However, varying the present day effective population size andthesecondtimepoint;usuallyclosertothetimepointwhere parameters an order of magnitude in either direction (see gene-flow stopped unless the migration rate is very low. This Supplemental Figure 10 of Protocol S1) gave essentially identical makessensesincethesplittimemodeledinourapproachisexactly results for the parameter estimates, so we trust that treating the theterminationofgeneticflowbetweenthetwolineages,andnot reference genome sequences as if they were present day haploid theintroduction of structure in theancestral population. genomes isnot a sourceofbias. Onlythesplittimeparameterseemedtobeaffectedtoalarger extend by gene-flow. The effective population size was slightly Orangutan analysis over-estimated when gene-flow is present, more so if the time Ourestimatesoneffectivesizesandspeciationtimeareinclose intervalandmigrationratearelarge,buttheeffectissmallrelative agreement with an analysis based on SNP frequencies from full to the variance on the estimates. The recombination rate is genome sequencing of five individuals of each of the subspecies estimated to be somewhat higher with gene-flow, again higher (includedintheOrangutanGenomepaper[17]).Resultsdisagree when the time interval and mutation rate are high, but is still with Bequet and Przeworski [3] who reported a split time of 1.4 biased andestimated below thesimulatedrate. million years and an ancestral effective size of 86,900 (C.I. 52,400{{143,000). They used MIMAR to infer parameters. Consequences of unknown phase The same authors have shown that MIMAR is quite sensitive to Our model assumes that the input data is an alignment of population structure in the ancestral species. Furthermore, their twohaploidgenomeswhereneighboringnucleotidesarelinkedon parameters correspond to an average divergence time of the two the same chromosome, but in the analysis we use reference subspecies of 1.4my plus 2N years. If we assume 20 years per e genome sequences that each are a mosaic of at least two haploid generation then 2N years will be 2:86000:20~3:4 million years e genomes. andatotaldivergenceof4:8millionyears.Thisdoesnotseemto Aconsequenceofthisisthattheassumptionwemakeaboutthe besupportedbytheaveragedivergenceofthesubspecieswhichis two neighbouring nucleotides being linked at the present day is estimatedcloseto1:1millionyears[17],thereforewebelievethere incorrect (see also Figure 3). If the recombination/coalescence must bea biasin theestimates byBecquet andPrzeworski. process is close to equilibrium, assumptions about the starting TheXchromosomeisfoundtohaveaneffectivepopulationsize point are irrelevant, but with an effective size on the order of almost exactly 3/4 of the autosome average. This suggests that 10,000andspeciationtimeof330,000yearswedonotexpectthat selection has not affected the X chromosomes in a different way PLoSGenetics | www.plosgenetics.org 10 March2011 | Volume 7 | Issue 3 | e1001319
Description: