bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. GENETICS|INVESTIGATION An Accurate Genetic Clock, GENETICS DavidH.Hamilton∗,1 ∗DepartmentofMathematics,UniversityofMaryland,CollegePark ABSTRACTOurmethodfor“Timetomostrecentcommonancestor”TMRCAofgenetictreesforthefirsttimedealswith naturalselectionbyapriorimathematicsandnotasarandomfactor. Bioprocessessuchas“kinselection”generateafew overrepresented“singularlineages”whilealmostallotherlineagesterminate. Thisnon-uniformbranchinggivesgreatly exaggeratedTMRCAwithcurrentmethods. Thusweintroduceaninhomogenousstochasticprocesswhichwilldetect singularlineagesbyasymmetries,whose“reduction”thengivestrueTMRCA.Thisgivesanewphylogeneticmethodfor computingmutationrates,withresultssimilarto“pedigree”(meiosis)data. Despitetheselowrates,reductionimplies youngerTMRCA,withsmallererrors. Weestablishaccuracybyacomparisonacrossawiderangeoftime,indeedthisis onlyy-clockgivingconsistentresultsfor500-15,000ybp. InparticularweshowthatthedominantEuropeanY-haplotypes R1a1a&R1b1a2,expandfromc4000BC,notreachingAnatoliabeforec3800BC.Thiscontradictspreviousclocksdating R1b1a2toeithertheNeolithicNearEastorPaleo-Europe. HoweverourdatesmatchR1a1a&R1b1a2foundinYamnaya cemetariesofc3300BCbyNielsenetal(2015), Pääboetal(2015), togetherprovingR1a1a&R1b1a2originatesinthe RussianSteppes. KEYWORDS Molecularclock;TMRCA;SNP;STR; requirescalibrationofmutationrates,eitherbycountinggenetic Introduction differencesoveralineage,i.ethephylogeneticmethod,orby directobservationofparent-offspringpairs,i.e.meiosis.Forthe Zuckerkandl&Pauling(1962)madetheempiricalobservation humanY-chromosomeBurgellaandNavascue(2011)combined that“timetomostrecentcommonancestor”(TMRCA)ofage- 80meiosisstudies,whichforour33markersaveragesatabout netictreeseemedproportionaltogeneticvariance.Thusbegan .0043mutationspergeneration,whiletheiralternativeregres- the molecular clock, a recent account can be found in Wake- sion method averages at .0026. Chandler(2006) combines 20 ley(2013): averyshortlistofapplicationsmightbethediver- meiosisstudiestogetanaverageof.0032.Unfortunatelythese genceoftheeukaryotes(KimuraandOhta,1973)...theorigins rateswhenusedinKAPZdonotcorrelateverywellwithknown of human immunodeficiency virus (Korber et al., 2000). The genealogy.SoFTDNAusedtheirhugedatabasetogiveanaver- Y(chromosome)clockalsogivesanimportanttoolofarcheology ageof.00635.IndeedfortheY-clockingeneral“phylogenetic” sincepioneeringworkbyL.L.Cavalli-Sforzaetal(1978)corre- ratesareabout2timeslargerthanthosefrommeiosis.Whilethis latedgeneticvariancestoageofhumanlineages.Althoughin givesplausibleresultsforgenealogyitmakesancientlineages broadagreementwiththefossilrecordor14Cdating,verylarge seemfartooyoung. Zhivotovsky(2000)conceivedofamuch confidenceintervalsmeantthatquestionssuchas“peoplingthe smaller“longtermevolutionaryrate",computingonaverage Americas"ortheoriginofthedominantEuropeanY-haplotype .0007fromgeneticlineagesofremotePolynesianpopulations. R1b1a2couldnotbeaccuratelyresolved. Myresetal(2011)usedthisinaKAPZgiving9000BCforthe MotoKimura’stheoryofneutralmutations(1968)predicted originofR1b1a2,i.e.rightaftertheIceAge. amolecularclockwithconstantrate.Thishasbeenappliedin manyversions,wecallKAPZ,seeNielsen(2005).Actualtiming However a long term constant rate has proved untenable as estimating rates from different genetic lineages of known Copyright ©2015bytheGeneticsSocietyofAmerica agesgaverateswithsignificantdiscrepancies,seeNieandKu- doi:10.1534/genetics.XXX.XXXXXX mar(2000). Its variability was attributed to changing genera- Manuscriptcompiled:Friday4thSeptember,2015% tion times, metabolism etc. and of course natural selection, 1DepartmentofMathematics,UniversityofMaryland,CollegePark,md20742, see Ohta(2002), Lynch(2010). Some sophisticated empirical [email protected] Genetics,Vol.XXX,XXXX–XXXX September2015 1 bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. approachesallowingvariablerateshavebeendeveloped,see 15,000ybp. Howeverthelargedifferencebetweenrates,upto Nielsen(2005).HoweverconfoundingexamplesconvincedAy- a factor of 100, implies that in the very long term that these ala(1997)tochallengetheveryideaofmolecularclockhypoth- may vary, probably as the geometry of the genome evolves. esis, reasoningthatvariableratesaretheresultoftheunpre- Furthermoreweshallseethattheessentialequalityofourrates dictableeventsofnaturalselection.Nevertheless,whilefuture withthosefrommeiosisimpliesapriori,agenerationlengthof naturalselectioncannotbepredicted,weshowhowsingular 30years. ThisagreeswithFenner(2005)whofounda30year eventsofthepastcanbedetected. generationfromfieldworkamongHunterGatherers.Allofthis providestheoreticalvalidation. HumanpopulationshavebeenclassifiedbyY(chromosome) Howevertheoreticallyplausiblethetruetestisaccuracy,in- haplotypesdefinedbyaSNP(singlenucleotidepolymorphism) deed our original motivation was to develop a Y-clock with mutation,afamousexamplebeingR1b1a2. Nowoneislook- accuracy approaching 14C dating. This is verified by a com- ing at mutations of “short tandem repeat” (STR) at different parisonacrossawiderangeoftime,thefirstSTR-clockgiving DNA Y-chromosome Segments(DYS).Inapplicationsofthe consistentresultsfor500-15,000ybpwithaccuracyabout11% Y-clock researchers have stuck with constant mutation rates. (oneSD).Theusualsampleerror,uncertaintyinmutationrates However,comparingratesatthesameDYSmarkersweseean contributestothetotalaccuracy.However∼1/2isinherentin average 50% variation over eight major Y-haplotypes, i.e the themethoditself,thisbeingdeterminedbylargescalesimula- samephenomenaseenfortheverylongtermalleleclock.Now tionsofrandomtreeswithknownTMRCA.Itiseasytoforgot asahugevariationinhumangenerationtimesetcoverrecent toincludethis,e.g.rarelycomputedforKAPZmethods.Com- time does not seem plausible reseachers have fallen back on parisonof7,15,29markersleadsustoexpectabout5%(oneSD) thestochasticprocessitselftogenerateunlikelydistributions. for120markers. TheseBayesianmethodsuseMonte-Carlosimulationsofallpos- WeshowourdatescorrelatewithEuropeanhistory.Inpartic- siblegenealogicaltreesgivingthepresentsampledata,thenfind ularthedominantEuropeanY-haplotypeR1b1a2,expandsfrom TMRCAbyamaximumlikehoodestimate.Anexampleofthis c4000BC,notreachingAnatoliabeforec3800BC.Thiscontradicts fortheY-clockisthealgorithmBATWING(Wilsonetal,2003). previousclocksdatingR1b1a2toeithertheNeolithicNearEast NowforBATWINGtheTMRCAisoftengreaterthantheKAPZ, orPaleo-Europe. HoweverourdatesmatchR1b1a2foundin e.g. for the Cinnioglu et al(2004) study of Anatolian R1b1a2 Yamnayacemetariesofc3300BC,seeRamusNielsenetal(2015), bothmethodswereappliedtothesamedataandmutationrates. SvantePääboetal(2015).Ourtimesandtheirplacesalmostcer- TheKAPZgivesTMRCA9800BCcomparedwith18,000BCfor tainlyimpliesR1b1a2originatesintheRussianSteppes,disprov- BATWING.Balaresqueetal(2010)usedBATWINGtogivean ingthe“outofAnatoliahypothesis”andinsteadgivingstrong originforR1b1a2inNeolithicAnatoliac.6000BC,howeverdis- evidencefortheIndo-EuropeanhypothesisofMarijaGimbu- putedbyBusby(2012).AnotherBayesianalgorithmisBEAST, tas(1956). A different example is the dominant Amerindian seeDrummondandRambaut(2007),whichhasbeenusedfor Y-haplotype,Q1.Bortolinietal(2003)used8markerstogivea mitochondrialclockwherefindingthephylogenetictreeiseasier. TMRCA=11600BC(SD3000).UsingrecentdataofBattagliaet Unequalfertilitycanberepresentedbynon-uniformbranching al(2013)wefindtheTMRCA=13300BCwitha90%chanceof butinBayesianModelsbeinglessprobableincreasesTMRCA. occurringbeforetheCanadianIceCorridoropens,i.e.theinitial Instead, we deal with natural selection by apriori mathe- migrationwasalongthecoast. maticsandnotasarandomfactor. Bioprocessessuchas“kin Very recently Batini et al (2015) computed TMRCA using selection”generateafewoverrepresented“singularlineages” thealgorithmBEAST,withmutationratesforMYSsitesonthe whilealmostallotherlineagesterminate. Thesesingularlin- Y-chromosomefoundfromIcelandicgenealogybyHelgasonet eagesareextremelyunlikelysocannotbepredicted. Instead al(2015). Thisisthecompleteoppositeofourmethod,using we introduce an inhomogenous stochastic process where we ∼ 3×106 MSYsitescomparedwithour29DYS,butsamples haveconstantmutationratesbutvariablesourcesrepresenting sometimesassmallas N = 8 whilewehave N ∼ 1000. We nonuniformfertility,i.e.manyvirtualpatriarchsgivingaquasi- compareallresultsinTABLE10,seeappendix.TheirTMRCA mixedpopulation.Tofindthesesourcesisaninverseproblem, aregenerallysimilartoours,sotogetherwesupportrecentdates howeverinversionisnotstableorevenunique:thepopulation forR1b1a2(andalsoverifyingtheIcelandicrates).Forexample couldhavebeencreatedyesterday!Howeverwearesavedby BatinifindsthatR1b1a2hasTMRCA=3550BC,comparedwith findingthepresentpopulationisdominatedbyarelativelyfew our4000BC.ButtheirsampleofR1b1a2(unlikeours)isM246* “singularlineages”.Ourmodelhasmanyextincttwigswitha , i.e. majorbrancheslikeL11, P312areremoved. Sothereis fewsuccessfulbranches. Inparticularabout50%oftheDYS little branching and their results using BEAST and a KAPZ markersshowasingularsidebranch.Thisrequiresanewtech- methodRHIOareverysimilar. Howeverwhenthereislarge nique:singularbranchescanbedetectedbyasymmetriesinthe scalebranchingourtheorypredictstheirBayesianmethodwill presentdistribution(takingmutationbiasintoaccount),then giveolderresults.ThisisseenfortheSNPJ2whereourTMRCA reducedrevealingtheoriginalTMRCA. isabout5000yearsyounger.OurdatesforJ2andG2a2correlate Weaccountforthetendencyofbioprocessestoexaggerate withtheirexpansionfromtheNeolithicRevolution. Another theapparentmutationratebyusingreductionofsingularlineages exampleisforJ2b,whichispresentlyconcentratedisEastern to give a new phylogenetic method for computing mutation Europe.NowNielsenetal,ReichetalusedADMIXTUREtofind rates.Thisgivesanaverageof.0031mutationspergeneration, theinitialIndo-EuropeanexpansionintoCentralEuropeduring right between the meiosis results of Burgella and Navascue, theCordedWareCulture(3000-2000BC)wasmainlyofSiberian Chandler.Indeedourtheoreticalphylogeneticratesaresimilar typeDNAbutwithsignificantMiddleEasterncomponentnot toactuallaboratorymeiosisrates,seeTable8.Despiteourlow foundintheoriginalYamnayaCulture,probablyfrommixing mutationrates,“reductionofsingularlineages”impliesyounger intheWesternUrkraine.ThiscorrelateswithourfindingforJ2b, TMRCA.AlsoourlowSDverifiestheKimuratheorythatrates TMRCA=3286BC(cf.10,000BCforBatini). are constant, at least for the human Y-Chromosome for 500 - 2 D.H.Hamilton etal. bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. processmodelingBayesiantrees).Webelievethatthestandard stochasticprocessisperturbedby“improbable”biologicalpro- KAPZ TMRCA Years before present cesses. First,theWatson-Galtonprocessintroducedtoexplainthe extinctionofaristocraticlineages(seeKendall,1975)implieslin- eages almost certainly die out. Conversely, natural selection causes some branches to flourish, e.g. the “kin selection” of W.D.Hamilton(1964),showskinco-operationgivesgeneticad- true TMRCA vantages.ConsiderthreeexampleswithwelldevelopedDNA projects.GroupAoftheHamiltonshasapproximately100,000 descendedfromaWalterFitzgilbertc1330AD.GroupAofthe Macdonalds has about 700,000 descendants from Somerfeld c1100AD, and Group A of the O’Niall has over 6 million de- Singular scendantsfromNialloftheSevenHostages,c300AD.Theseare Lineage elitegroupswithallthesocialadvantages. Oneseeslinesof chieftains,oftenpolygamous. Weemphasizekinselectionbe- causeitseemsdominantovernaturalselectionforrecenthuman lineages,certainlynoonehasclaimedtheO’Niallaregenetically superior!Naturalselectionwouldcausesimilarbranchingover longertimescales.Thesesingularlineagescannotbepredicted bythestochasticmodelitself. Present genetic distribution ReductionofSingularLineages Figure1SingularLineagesinRandomTrees Althoughourmethodisforgeneralmolecularclockstobespe- cificwefocusontheY-clock. ConsiderDNA Y-chromosome Segments(DYS)countingthe“shorttandemrepeat”(STR)num- berofnucleotides.OneusestheseDYSmicro-satellites,marked SingularLineages by j = 1,...n, each individual i, 1 = 1,..N, has STR number Afundamentalproblemisthatpresentpopulationshavehighly xi,j. The Y-chromosome is unchanged except for mutations overrepresentedbrancheswecallsingularlineages.Awellknown xi,j →xi,j±1occurringatrateµjpergeneration. exampleistheSNPL21whichisabranchofR1b1a2. Individ- Modellingsingularlineagesrequiresanewstochasticsystem. ualsidentifiedasL21areoftenexcludedfromR1b1a2analysis Thestandardsystemis:attimettheprobabilityPthattheDYS becausetheyskewtheresults.Suchasingularlineagecausesthe markershavegivenSTRvectorXisgovernedbyahomogenous variancetobemuchgreater,eventhoughtheoriginalTMRCA systemofdiscretediffusionequations remainsunchanged,seefigure1. ∂P ForBayesianmethodssuchlineagesareveryunlikelygiving =µ∆P ∂t anevengreaterapparentTMRCA.Howeveronecannotdeal withsingularbranchesbyexcludingthem. Foronething,our allowingvariablemutationmatrixµ=µ(t,X).Ourmodelwill 2. methodwillshowthat 50%ofmarkersshowevidenceofsin- haveconstantmutationratematrixµinadiscreteinhomogenous (cid:56)2,19(cid:60) gularsidebr(cid:56)a2,n2c0h(cid:60)es, i.e. morethanaSDfr(cid:56)o2,m21(cid:60)expected. For equation 1.0 instancefora1l.a0rgesampleofR1b1a2weplott1e.0dthedistribution ∂P =µ∆P+δ 0.8 aboutthemod0.e8 forDYSmarkersGATAH4an0.d8 YCAIIa.Forthe ∂t 0.6 laterthereisl0i.t6tledifferencebetweenthetheo0r.6eticaldistribution butwithunknownsourcesδ=δ(X,t)modelingunequalfertil- (red)andtheactualdata(blue),howeverforGATAH4onesees ity. Solvingeithersystemforunknownµorδisaninversion 0.4 0.4 0.4 thelargediscrepancy(seeSI1forall33DYSdistributions,in problem.Butinversionisunstableforsuchsystems,alsothere 0.2 0.2 0.2 Supplement). Theidealdistributionisnotsymmetricbecause isnouniquesolution. (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:56)2,22(cid:60) GATA H4 YCAI IIa 0.10 1.0 1.0 1.0 0.8 0.8 0.8 0.08 0.6 0.6 0.6 0.4 0.4 0.4 0.06 0.2 0.2 0.2 0.04 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 mode 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 mode 1 2 3 (cid:56)2,25(cid:60) (cid:56)2,26(cid:60) (cid:56)2,27(cid:60) 1.0 Figure2Biase1.d0 distributions 1.0 0.02 0.8 0.8 0.8 0.6 wehavealso0.a6ccountedforbiasinthemut0a.6tionprocess, see 0 100 200 300 400 500 600 0.4 Wakeley(19940.)4. Thesesingularlineagesare0.4very(mathemati- Figure3Distributionofbranchingtimes cally)unlikelytoarisefromthestochasticsystemwhichisthe 0.2 0.2 0.2 mathematical basis of KAPZ (or the equivalent Monte-Carlo (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:56)2,28(cid:60) (cid:56)2,29(cid:60) (cid:56)2,30(cid:60) 1.0 1.0 1.0 GENETICSJournalTemplateonOverleaf 3 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:56)2,31(cid:60) (cid:56)2,32(cid:60) (cid:56)2,33(cid:60) 1.0 1.0 1.0 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 (cid:45)3 (cid:45)2 (cid:45)1 1 2 3 bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. However in practise, up to a standard deviation, the in- lineageandreduceit. Ofcoursenotknowingtheexactasym- homogenous system is a perturbation of the KAPZ system, metricratiomeansthatbootstrapmethodsareusedextensively δ(X,t) = 0, so that “singular branches” are clear enough to forsingularreduction,bothtocomputevaluesandSD. bebecomputedapriori withoutempiricalconsiderations.These Thesemethodsalsoimplyanewwayofcomputingmutation singularbranchesarethenreducedrevealingtheoriginallineage. rates.Previously,thereweremethodsbasedonmeiosisdataor Wethencomputeabranchingtimet foreachDYSmarker phylogeneticstudiesoffamilyDNAprojects(whichgavequite j j. Nowthenonuniformbranchingprocesscausesthet tobe differentrates). Webeginwith8verylargeSNPprojectsfrom j randomlydistributedsotheirmeanisnottheTMRCA.Figure FTDNAusing37markers, ofcoursewithunknownTMRCA. 3forR1b1a2,showstheerrorspreadforeacht over29DYS Wefirstreducesingularlineages.Thentakingasymmetryinto j sites(notetheoutliers!).Onecancombinethedistributionsfor accountwefindmutationratesarethefixedpointsofaniterative t givethe“BranchingSpectrum”(dottedcurve). Nowlarge process.Thistakesabout3iteratestoconverge.Thesemutation j errorsandoutliersmeansonecannotsimplytakethemaxt to ratesarenormallydistributedwithmeanandSD.Discarding j betheTMRCA.Insteadstochasticsimulationsofthebranching markerswithmutationSD > 33%leavesuswith29markers. process,usingrobuststatisticstoavoidoutliers,findthemost We find this advanced phylogenetic method gives mutation likely TMRCA. The effect of reduction is dramatic, e.g. the ratesclosetothoseobtainedfrommeiosisandnearly1/2the TMRCA for R1b1a2 changes from 7150BC(KAPZ) to 4000BC valuesobtainedfromtheusualphylogeneticmethod. Further aftersingularreduction,usingthesamemarkersandmutation validationcomesfromfindingthattheequivalenceofourrates rates,seeFigure4foreffectontheBranchingSpectrum. withmeiosisimpliesaprioriahumangenerationofc30years. Results 0.10 AccuracyisverifiedbycheckingforconsistencyoverallofEu- 0.08 ropeanhistorybeginningwithGroupAofthreegenealogical projectsofmedievalage: 0.06 Table1:Medievalcomparison 0.04 Group TMRCA [SD] Oldestpatriarch Hamilton 1399AD [180] Fitzgilbert d1330AD 0.02 Macdonald 1060AD [270] Somerfeldc1100−1150AD 100 200 300 400 500 O(cid:48)Niall 100AD [260] Niall300AD/Conn100AD ArcheologicalfindsconvincedMarijaGimbutas(1956)toat- Figure4BranchingSpectrumbeforereduction(blue)andafter. tributeProtoIndo-European(PIE)totheYamnayaCulturec3000- 4000BCoftheRussianSteppes. Thisisconsistentwithmain- InSI6wedisplaysimilarforeightdifferentSNP.Someof streamlinguistictheory,someevenwroteoflinguisticDNA,see thesearelikeatightnormaldistributionshowingthattheexpan- Anthony(2007). Butactualgeneticswasignoredbecausecur- sionoccuredquickly,e.g.L21.OtherslikeJ2showthebranching rentgeneticclocksforR1b1a2pointedtotheRenfrewHypoth- occursover100generations. esis(1987)thatPIEspreadfromNeolithicAnatolia, c6000BC. Or Mesolithic or Paleolithic, depending on the genetic clock. AccurateMutationrates Howeverno-onecheckediftheirclockworkedoverthewhole rangeoftimefordifferentlineages. Byrelyingonasymmetriesofthedistributiontofindsingular ThenexttableshowstheexpansiontimesofthedominantEu- lineageswehavetobeawarethatthemutationprocessitself ropeanY-haplotypesR1b1a2&R1a1a.ThisdataisfromFTDNA mightnotbesymmetric,thisiscalledbias(Wakeley,1994). If projectsforregionXonlyusingindividualswithnamedancestor ignoredwemightbejustdetectingtheseasymmetries.Changing fromX.Theseindependentresultsagreewithinthestandard themodel,theprobabilityofamutationpergenerationis deviation,withdatesmatchingtheCordedWareCulture,asemi- nomadicpeoplewithwagonsandhorseswhoexpandedwest Pr[xi,j →xi,j+1]=µj,+1,Pr[xi,j →xi,j−1]=µj,−1. fromtheUrkrainec3500BC.Thisisconsistentwiththeoldest R1b1a2, R1a1a skeletons being from the Yamnaya Culture, c Ifthismarkerisfreefromsingularlineagesweobservethatthe 3300BC,seeR.Nielsenetal(2015)S.Pääboetal(2015). ratioofthefrequenciestotheleftandrightofthemodeis P (t) µ Table2:Indo-EuropeanSNP j,1 j,1 = . Pj,−1(t) µj,−1 Region R1b1a2[SD] N R1a1a [SD] N which is time independent. So using eight very large SNP All 4004BC[700] 677 4012BC[720] 1270 projectswefindenoughmarkersfreeofsingularitiestocom- Russia NA 3417BC[780] 337 putetheseratiosandtheirstandarddeviations,seeTable9in AppendixandSIforcomputations.Inparticularabout33%of Poland 4030BC[1070] 65 4470BC[830] 876 DYSmarkersshowasymmetricratiosaresignificant,i.emore Germany 3160BC[610] 438 3630BC[870] 190 thantwoSDfromratio1.Theseasymmetricratiosplayavery importantrole,forthisratioisallyouneedtodetectasingular Scandinavia 2700BC[690] 153 4740BC[1100] 140 4 D.H.Hamilton etal. bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Theregionaldifferencesseemrandomsuggestingrapidmoment Howeveronceyougetdownto7markerstheconfidenceinterval ofwholepeoples.Averagingoverall(weightedby1/Varianceas becomestoolargeforaccuracy.Alsoitbecomesdifficulttodeal thebestlinearestimate)givesR1b1a2hasTMRCA4011BC(SD withoutliers.Wemightexpect5%accuracywith120markers. 580),R1a1ahasTMRCA3980BC(SD380). An example with few markers is the R1b1a2 data of Aninterestingintermediatestepoccursbetweenthemedieval Balaresqueetal(2010). Ourmethod(thistimewith7useable and eneolithic. The mythical Irish Chronicles relate that the markers) gave SD ∼ 30%. Now Balaresque used BATWING O’NialldescenddirectlyfromthefirstGaelicHighKings,which tosuggestaNeolithicorigininAnatolia. WiththesameCin- traditiondatedc1500BC.TheO’Niallhavetheuniquemutation nioglu(2004)dataourmethodgivesforTurkishR1b1a2anything M222whichisabranchofthehaplotypeDF13whichisturna fromtheIceAgetotheIronAgeasseenin branchofL21andthatofP312.ThemostlikelyTMRCAare Table6:BalaresqueData Table3:Pre-CelticSNP R1b1a2 N TMRCA SD SNP TMRCA SD Eire 75 2046BC 1372 P312 2728BC 376 C.Europe 207 6383BC 2600 L21 2216BC 294 Turkey 69 5150BC 2700 DF13 1867BC 328 Fortunately,onceagain,wefindgooddatafromFTDNA:the ThesearedatesforforearlytolateBellBeakerCultureofWest- ArmenianDNAproject,seeTable7.BytraditiontheArmenians ernEurope,2800-1800BC.IndeedtheonlyknownBellBeaker enteredAnatoliafromtheBalkansc1000BCsotheymightnot genomewasfoundtobeP312with14Cdate2300BC. seemagoodexampleofancientAnatolianDNA.Butsome100 generationsofgeneticdiffusionhasresultedinanArmenian Table4:NonFTDNAdata distributionofHaplotypesJ2,G2a2b,R1b1a2closelymatching data Haplotype N TMRCA [SD] thatofallAnatolians,thereforerepresentiveoftypicalAnato- lianDNA.WeseethatAnatolianR1b1a2arrivedafterc3800BC, Underhill R1a1a1 974 3000BC [800] rulingouttheNeolithicexpansionc6000BC.Whendealingwith Rootsi G2a 536 21000BC [4200] regionalhaplotypes,e.g. R1b1a2inAnatolia,the TMRCA is onlyaupperboundforthearrivaltimes,forthegeneticspread Ourmethodrequireslargedatasetsandmanymarkerswhich maybecarriedbymovementsofwholepeoplesfromsomeother meanswehavetorelyondatafromFTDNA,finding29useable region. Thismeansonehastobecarefulinterpretingregional markersoutofstandard37theyuse. Infactmanyresearchers data,e.g.theTMRCAfortheR1b1a2(USA)isc4000BCbutno- haveusedFTDNAdata,e.gBalaresqueatal(2010).Wethinkour bodythinksitarrivedthen. methodofreductionwithrobuststatisticssolvesanyproblems withthisdata. TotestthisweusedourmethodwithR1a1a1 Table7:AnatolianSNP dataobtainedfromUnderhill(2015)withN =974and15use- ablemarkers,andG2afromRoostsi(2012)withadifferent15 Armenian N TMRCA [SD] markers,seeTable4. With29markerstheTMRCAforR1a1a1 R1b1a2 99 3780BC [1000] issomewhatafterourdateforR1a1a,whileG2apredatesours for G2a2(Table 5 are Western European only, see Table 7 for G2a2b 46 10000BC [2500] Anatolian). J2 97 12600BC [2500] Table5:TMRCAybpfor29,15,7markers Observethatour TMRCA forArmenianG2a2b(formerly G2a3)andJ2correlateswiththefirstNeolithicfarmersinthe SNP N T29 SD T15 SD T7 SD MiddleEast. FromTable5weseeJ2,G2a2bforallofWestern Europe(non-Armeniandata).OurdatesshowJ2wasexpanding G2a2 1221 7654 1007 10800 2340 6580 2021 at the end of the Ice Age. Modern J2 is still concentrated in R1b1a2 660 6004 700 6800 1075 8264 2217 thefertilecrescent,butalsoindisconnectedregionsacrossthe Mediterranean.Theoldgeneticmodelpredictedacontinuous R1a1a 1270 6012 720 5270 840 5643 1665 waveofNeolithicfarmerssettlingEurope(L.L.Cavalli-Sforzaet I1 2898 4109 460 5114 1010 5799 1775 al,2004).Butyoucannothaveacontinuousmaritimesettlement: itmustbeleap-frog. AlsorepeatedresettlementfromtheEast- L21 1029 3927 415 4052 503 4252 882 ernMediterraneanhasmixedancientJ2populations,andour U106 1533 4883 534 3043 710 4243 947 methodgivestheoldestdate.OntheotherhandG2a2bshows exactlythedatesexpectedfromacontinuouswaveofNeolithic J2 1241 14490 2000 21060 3860 8814 2302 farmersacrossCentralEurope. Ourdatesareconsistentwith P312 971 4449 490 5340 748 5145 1128 recent findings that the majority of early Neolithic skeletons foundinWesternEuropeareG2a2,c5000BC,whereastheoldest MeanSD 12% 18% 26% R1b1a2foundinWesternEuropeisBellbeakerc2300BC. Table5showstheresultsofextensivesimulationsusingrandom TheproblemofthepeoplingoftheAmericashasreceivedas subsetsofourFTDNAdata,for29,15(usingUnderhillmarkers) muchattention.Y-chromosomeanalysisof(subarctic)Amerindi- and7markers(Balaresquemarkers).InafactourdataforR1a1a ansshowshaplotypeQ1isdominant. Bortolinietal(2003)ap- for15markersisratherclosetotheUnderhilldataforR1a1a1. pliedKAPZwith8markerstoasampleof N = 24togivea GENETICSJournalTemplateonOverleaf 5 bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. TMRCAforQ1ofc11600BC,(3000SD).Morerecentstudies,e.g boththesehaplotypesexpandingatessentiallythesametime Reichetal(2012),havebeencarefulnottoestimateaTMRCA. c4000BC.ThisandourlaterdateforAnatolia,combinedwith Weuse22markerdatafromBattagliaetal(2013)wholikeRiech Pääboetal,impliesthatR1b1a2andR1a1amusthaveoriginated sawonlyoneQ1migration,butgavenodates.Oursare intheYamnayaCulture. Incheckingaccuracyweranintothequestionoftheorigins Table8:BattagliaAmerindiandata ofIndo-European.Althoughtherearegenesforlanguagethere iscertainlynonespecifictoIndo-Europeanlanguage.Thusinfer- SNP N TMRCA SD enceshavetobeindirect.MarijaGimbutas(1956)sawpatterns insymbolismandburialritualssuggestingtheYamnayaCul- allQ1 437 12600BC 2000 turewasthecradleofIndo-European. Alsotheirphysiology Q1-L54 73 13300BC 2800 wasrobustlyEuropeanoidunlikethegracileskeletonsofNe- olithicEurope,butthiscouldbenutritionandnotgenetic.From Q1-M3 335 10500BC 1770 theaboveweconcludethatthespreadofthisrobusttypeinto Thisissupportedby29markerFTDNAdatafromtheirNew WesternEuropeinthelateNeolithicmarkedaninfluxofSteppe Mexicoproject: nomads. NowifR1b1a2hadbeenshowntospreadfromAna- toliac6000BCitwouldhavebeentakenasstrongevidencefor Table9:NewMexicoFTDNA "outofAnatolia"becauseoftheassociationofR1b1a2,R1a1with Indo-Europeanlanguages.Butouraccuracycheckshowedthat SNP N TMRCA SD itwasG2,J2thatspreadwiththeNeolithicExpansionfromAna- NMQ1 110 16200BC 3000 tolia.NowthesehavebeenassociatedwithCaucasianlanguages orSemitic,butneverwithIndo-European. NMM3 71 11030BC 2300 Thus we are 90% certain that the entry of Q1 into the Amer- Methods icasoccurredbeforetheCanadianIce-Corridorwaspassable, Thisworkisbiomathematicaltheoryvalidifiedbydatafrom c9500BC,seeMandryk,C.A.S.etal(2001).TheQ1-M3subgroup publishedsources.Wegivefullmathematicaldetailsinthefol- isfoundthroughmostoftheAmericas,infactisamajorityof lowingappendix.AlsoseeSupplementaryInformation(SI)for AmerindianQ1. Itslaterdateisveryinterestingasitisabout detailedalgorithmsandMATHEMATICAworksheets,exam- thetimetheIce-Corridoropens. Whileitisprobablethatthe plesandsomedata.Toverifythetheoryandcomputemutation subbranchL54arrivedwithQ1weare90%certainthatM3did ratesweusediversedata,fromFTDNAy-haplotypeprojects notarrivewiththeoriginalmigration.(Wegot6000BCforthe forG2a2b,R1b1a2,R1a,I1,L21,U106,J2,P312. Alsoweused haplotypeC3whichprobablyarrivedwiththelaterNa-Dene regionalprojectsforGermany,Scandinavia,PolandandRussia migration). fortheirR1b1a2,R1a1adata.TheArmenianDNAprojectwas importantforitsR1b1a2,J2andG2a2bdata.WealsousedDNA Discussion projectsM222(O’Niall),Macdonald(GroupAwhichisR1a1a), Archeology, evolutionary biology, not to mention epidemiol- Hamilton(groupAwhichisI1).Thiswascomparedwithnon ogy,forensicsandgenealogyarejustsomeoftheapplications FTDNAdatafromBalaresque,Underhill,RootsiandBattaglia of molecular clocks. Unfortunately current clocks have been etal. foundtogiveonly“ballpark”estimates.Ourmethodgivesac- curatetime,atleastforthehumany-chromosomeverifiedover theperiod500−15,000ybp. Ourmainideasholdforgeneral molecularclocks,howeverourY-clockhasfeaturesspecificto STRmutations.Itwouldbeimportanttodevelopreductionof singularlineagesforotherclocks. Somegeneticiststhoughtnaturalselectionmakesmutation ratestoovariabletobeuseful.Theproblemisconfusionbetween the actual biochemistry giving mutations and superimposed processeslikekinselectionproducingapparentlygreaterrates. NoticethattheSDforourmutationratesisabout14%,previ- ouslymanyauthorsclaimedabout10%yethadlargedifferences withotherstudies. Manyapplicationstogenetics,forensics,genealogyrequire the TMRCA between just two individuals, or between two species, a classic method was given by Walsh (2001). While weareaccuratefor“bigdata",forthis“two-bodyproblem”one cannotdeterminewhatsingularlineagesthebranchinghasbeen through.Justusingournewasymmetricmutationrateswillnot work.Soitwouldbeimportanttofindanaccuratemethod. Pääboetal(2015),Nielsenetal(2015)observedall6skele- tons from Yamnaya sites, c 3300BC by 14C dating, are either R1a1b2andR1a1a.Thisinvolvesverydifficultgeneticanalysis offossilswhichmaynotalwaysbeavailable.Alsosuchanalysis cannotdatetheoriginofR1a1b2andR1a1a.OurTMRCAshows 6 D.H.Hamilton etal. bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Literaturecited Hamilton, W.D. (1964), The genetical evolution of social be- haviour.I,II,JournalofTheoreticalBiology7(1)pp1-52. Ammerman,A.J.&L.L.Cavalli-Sforza(1984)Neolithictransition Helgason,A.etal. (2015),TheY-chromosomepointmutation andthegeneticsofpopulationsinEuropePrincetonUniversity rateinhumans.,Nat.Genet.doi:doi:10.1038/ng.3171 Press. Ho,S.Y.,M.J.Phillips,A.Cooper,A.J.Drummond(2005),Time Anthony,D.W.(2007)TheHorse,theWheelandLanguage,Prince- dependency of molecular rate estimates and systematic tonUniv.Press,Princeton overestimationofrecentdivergencetimes,MolecularBiol- Ayala,F.J.(1997),Vagariesofthemolecularclock,ProcNatl ogy&Evolution22(7):pp1561-1568. AcadSciUSA94:7776-7783. Hoffecker,J.F.andS.A.Elias(2007),HumanecologyofBeringia, Ayala,F.J.etal(2001),Erraticoverdispersionofthreemolecular ColumbiaUniversityPress,NewYork. clocks: GPDH,SOD,andXDH,ProcNatlAcadSciUSA Lynch,M.(2010),Rate,molecularspectrum,andconsequences vol.98no.20pp11405-11410 ofhumanmutation,ProcNatlAcadSciUSAvol.107,no. Balaresque,P.etal(2010),ApredominantlyNeolithicOriginfor 3:961-968 EuropeanpaternalLineages,PLoSBiol;8:e1000285. Kendall,D.G.(1975),TheGenealogyofGenealogyBranching BatiniC.etal.(2015),Large-scalerecentexpansionofEuropean Processesbefore(andafter)1873,BulletinoftheLondon patrilineagesshownbypopulationresequencing. Nature MathematicalSociety7(3)pp225-253. Communications6:7152doi:10.1038 Kimura, M. (1968) , Evolutionary rate at the molecular level, Battaglia,V.etal.(2013),TheFirstPeoplingofSouthAmerica: NatureVol.217,624-626,doi:10.1038/217624a0 NewEvidencefromY-ChromosomeHaplogroupQ,PLoS KimuraM,andT.Ohta(1973),Mutationandevolutionatthe ONE8(8):e71390.doi:10.1371/journal.pone.0071390 molecularlevel.Genetics73(Suppl):19-35 Beaumont,M.A.etal(2002).ApproximateBayesiancomputa- Knoll,A.H.etal,(2011),Estimatingthetimingofearlyeukary- tioninpopulationgeneticsGenetics162:2025?2035 oticdiversificationwithmultigenemolecularclocks,Proc Bortolini,M.C.etal,(2003),Y-ChromosomeEvidenceforDiffer- NatlAcadSciUSAvol.108,no.33:13624-13629 ingAncientDemographicHistoriesintheAmericas,Am.J. Korber,B.,etal(2000)TimingtheancestoroftheHIV-1pan- Hum.Genet.73:524?539 demicstrains.Science,288:1789-1796. Burgarella,C.andM.Navascue(2011),Mutationrateestimates Lacana,M.etal(2011),AncientDNAsuggeststheleadingrole for 110 Y-chromosome STRs combining population and playedbymenintheNeolithicdissemination,ProcNatl father-sonpairdata,EuropeanJournalofHumanGenetics AcadSciUSAvol.108,no.45,18255-18259 19:70-75;doi:10.1038/ejhg.2010.154 Mandryk,C.A.S.etal(2001),LateQuaternarypaleoenviron- Busby,G.(2012),ThepeoplingofEuropeandthecautionarytale mentsofNorthwesternNorthAmerica: implicationsfor ofYchromosomelineageR-M269,PhilosTransRSocLond. inlandversuscoastalmigrationroutes,QuaternaryScience BiolSci.;279(1730):884-92.doi:10.1098/rspb.2011.1044 Reviews20:301-314. John F. Chandler, J.F. (2006), Estimating Per-Locus Mutation Menozzi,P.,A.Piazza,L.L.Cavalli-Sforza(1978),Syntheticmaps Rates,JournalofGeneticGenealogy2:27-33 ofhumangenefrequenciesinEuropeans,Science201:786- Cavalli-Sforza,L.L.etal(2009),Ychromosomediversity,human 792 expansion,drift,andculturalevolution,ProcNatlAcadSci Myre,N.etal(2011),AmajorY-chromosomehaplogroupR1b USA106:20174-20179. HoloceneerafoundereffectinCentralandWesternEurope, Cinnioglu,C.etal.(2004),ExcavatingY-chromosomehaplotype EuropeanJournalofHumanGenetics19(1)95-101 stratainAnatolia,HumGenet.114:127-148. Nei,M.andS.Kumar(2000)MolecularEvolutionandPhylogenet- Drummond,A.J.,A.Rambaut(2007),BEAST:Bayesianevo- ics.OxfordUniversityPress,NewYork(333pp). lutionaryanalysisbysamplingtrees,BMCEvol. Biol. 7, Nielsen, R. and J. Wakeley (2001), Distinguishing migration 214 fromisolation:aMarkovchainMonteCarloapproach,Ge- Drummond,A.Jetal(2005),Bayesiancoalescentinferenceof netics158:885-896 pastpopulationdynamicsfrommolecularsequences,Mol. Nielsen,R.(Editor)(2005)StatisticalMethodsinMolecularEvolu- Biol.Evol.22:1185-1192 tion,Springer. Durrett,R.(2002) ProbabilityModelsforDNASequenceEvolution Olver,F.W.J.(1964) Besselfunctionsofintegerorder355-434 Hand- Springer,NewYork bookofMathematicalFunctions editedbyM.Abramowitz Edmonds,C.,A.Lillie,L.L.Cavalli-Sforza,(2004),Mutationsaris- NationalBureauofStandards,Washington. inginthewavefrontofanexpandingpopulation,ProcNatl Ohta,T.(2002),Near-neutralityinevolutionofgenesandgene AcadSciUSA10:975-979. regulation. ProceedingsoftheNationalAcademyofSci- Elango, N., J.W.Thomas, S.V.Yi(2006), Variablemolecular ences99(25):16134?16137 clocks in hominoids, Proc Natl Acad Sci USA vol. 103, Pääbo,S.etal(2014),Ancienthumangenomessuggestthree no.5,1370-1375 ancestralpopulationsforpresent-dayEuropeans,Nature Epstein,C.andR.Mazzeo(2013)DegenerateDiffusionOperators 513,pp409-413. ArisingfromPopulationBiology,Princeton Reich,D.etal(2012),ReconstructingNativeAmericanpopula- Fenner.J.N. (2005) , Cross-Cultural Estimation of the Human tionhistory,Nature488:370-375 GenerationIntervalforUseinGenetics-BasedPopulation Reich,D.etal(2015),Massivemigrationfromthesteppewas DivergenceStudies,AmericanJournalofPhysicalAnthro- asourceforIndo-EuropeanlanguagesinEurope,Nature pology128,415-428. 14317,doi:10.1038/nature1431 Fu,Y.X.,W.H.Li(1993),Statisticaltestsofneutralityofmuta- Nielsen, R.etal(2015), PopulationgenomicsofBronzeAge tions,Genetics133:693-709 EurasiaNature522:167?172doi:10.1038/nature14507 Gimbutas, M.,(1956) The Prehistory of Eastern Europe, Part 1, Renfrew, A. C. (1987) Archaeology and Language:The Puzzle of Cambridge:AmericanSchoolofPrehistoricResearch#20. GENETICSJournalTemplateonOverleaf 7 bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Indo-EuropeanOrigins,London:Pimlico. Appendix RootsiS.etal(2012),Distinguishingtheco-ancestriesofhap- Biomathematicaltheory logroupGY-chromosomesinthepopulationsofEurope andtheCaucasus,EurJHumGenet. 20(12)1275-82. doi: Ourmethodrequiresareturntobasicprinciples. 10.1038/ejhg.2012.86. FundamentalSolutions Underhill, P.A. et al, (2015), The phylogenetic and ge- ographic structure of Y-chromosome haplogroup R1a TheY-chromosomehasDYSmarkedby j = 1,...n,whereone , European Journal of Human Genetics 23:124-131; cancounttheSTRnumberxj.ConsidertheprobabilityPj,k(at doi:10.1038/ejhg.2014.50 timetgenerations)thatatmarkerjwehavexj =k.Thissatisfies Wakeley,J.(1994),Substitution-ratevariationamongsitesand thehomogenousstochasticsystem theestimationoftransitionbias,Mol.Biol.Evol.11:436-442. P (t) WakCeloemy,pJ.a(n2y00P8u)bClisohaleerssc,enGtrTeehneowryo:oAdnVIinlltargoed,uCctoiolnoraRdoob.erts& j,dkt =−µjPj,k+m∑>0µj,−mPj,k−m+µj,mPj,k+m Wakeley,J.(2013),Coalescenttheoryhasmanynewbranches, This homogenous system gives a uniform expansion from a Theoret.Pop.Biol.87:1-4. singlepatriarch. Walsh,B.(2001),EstimatingtheTimetotheMostRecentCom- ThesystemisessentiallythemodelofWehrhahn(1975)who monAncestorfortheYchromosomeorMitochondrialDNA foraPairofIndividuals,Genetics158:897-912 hadµj,−1 =µj,1.Weintroduceasymmetricmutationswithtotal rate Wehrhahn,C.F.(1975),Theevolutionofselectivelysimilarelec- ∑ trophoreticallydetectableallelesinfinitenaturalpopula- µj = µj,−m+µj,m m>0 tions,Genetics80:375-394. I.J.Wilson,M.E.Weale,D.J.Balding(2003),Inferencesfrom About 50% of DYS markers show asymmetric mutations, i.e. DNA data: population histories, evolutionary processes µj,−1 (cid:54)=µj,1. and forensic match probabilities, J Roy Statist Soc A. Weusethegeneratorfunctionwithcomplexvariablez (2003);166:1-33. ∞ Zhivotovsky,L.A.(2000),EstimatingDivergenceTimewiththe G(z,t)= ∑P zk , j,k UseofMicrosatelliteGeneticDistances,MolecularBiology −∞ andEvolution18,No5:705-709 (seeDurrett,202),andnormalizedinitialcondition x = 0or Zuckerkandl,E.andL.Pauling(1965)Evolutionarydivergence j P (0)=1: and convergence in proteins. In Bryson, V.and Vogel, H.J. j,0 (editors). EvolvingGenesandProteins. AcademicPress, G(z,t)=Exp[−µjt+t ∑ µj,−mzm+µj,mz−m] NewYork.pp.97-166. m>0 Zuckerkandl,E.andL.Pauling,L.B.(1962),Moleculardisease, evolution,andgenicheterogeneity,inKasha,M.andPull- ThenGcanbeexpandedinpowersofztogivePj,k(t).Nowfor man,B(editors).HorizonsinBiochemistry.AcademicPress, thesimplestasymmetriccase,withonlyonestepmutations,we NewYork.pp.189?225. haveG(z,t)=e−µjtetµj,−1zetµj,1/z = (cid:40) ∞ µm (cid:41)(cid:40) ∞ µm (cid:41) e−µjt ∑ j,−1(zt)m ∑ j,−1(t/z)m m! m! m=0 m=0 so using the Hyperbolic Bessel Function of Order k ≥ 0, see Olver(1964) ∑∞ u2m+k I [u]= , k 22m+km!(m+k)! m=0 weseethatthehomogenoussystemhasfundamentalsolution (cid:32) (cid:33)k/2 µ Pj,k(t)=e−µjt µj,j−,11 I|k|[2t(cid:112)µj,−1 µj,1] Fromthefundamentalsolutionwefind,independentlyoftime P (t) µ j,1 j,1 = , Pj,−1(t) µj,−1 whichwecallthe asymmetricratio.Itwillberepeatedlyused. Ofcoursetheactualinitialvalueisnotx =0butwasusually j takentobethemodem whichwasassumedtobethevaluefor j originalpatriarch.Fromthepresentdistributionofdataweuse thefrequency Count(x (i)=k) j f(j,k)= . n 8 D.H.Hamilton etal. bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. OneproblemwiththeKAPZformulaisthathigherfrequencies ReductionofSingularLineages f(j,k),|k| =2,3...areoverrepresentedintheactualdata. This Welocatethesesingularlineagesbylookingforasymmetries isbecausetheprobabilityofaspontaneoustwostepmutation inthedistribution.Forauniformflowfromasinglepatriarch is much higher then the product of two one step mutations. thefrequencyofSTRvaluekisgivenby f(j,k) ∼ P (t). The j,k So instead we use the frequency to solve the transcendental asymmetricratio: equationfortheunknownt f(j,1) ∼ Pj,1(t) = µj,1 , f(j,0)∼Pj,0(t)=e−µjtI0[2t(cid:112)µj,−1 µj,1] f(j,−1) Pj,−1(t) µj,−1 iscompletelyindependentoftimet.Thereforeifsay Thisnonlinearequationiseasilysolvedviamathematicalsoft- waresuchasMATHEMATICA(Iusedversion9runningona f(j,1) µj,1 boosted2014iMacwhichhasaccuratehyperbolicBesselfunc- >> , f(j,−1) µj,−1 tions.EarlierversionsonolderiMacsgaveinaccuraciessoone had to compile one’s own functions). Using this formula re- wehaveasingularlineageatk=+1.Thustheexcessatk=+1 solvessomeotherproblemswiththeKAPZmethodevenfor is auniformflow,e.g. µj,−1 (cid:54)= µj,1givesanextraquadraticterm f(j,+1)− f(j,−1) µj,1 whichifignoredcauseslargeerrors. µj,−1 Tofirstorderapproximationthenfrequency f(j,+2)isdueto Heterogeneousdiffusionequation thissingularityatj=+1whichthereforegaveacontribution However the main problem is singularities in the stochastic µj,−1 process.Forauniformstochasticprocess,1−P (t)∼1−f(j,0) f(j,+2) istheprobabilityofsomemutation. Sotheejx,0pectedvariance µj,+1 is f(j,0)(1− f(j,0)). Thusiftheactualdatavariance Vj >> tok=0.Thusremovingtheeffectofthesingularityatk=+1 f(j,0)(1− f(j,0)) we are not uniform. Now a sublineage of leadstonewfrequencies veryhighfertilityincreasesvariance,givingapparentlygreater TMRCAalthoughitisunchanged.Onefindssimilarresultsfor f∗(j,−1) = f(j,−1) Bayesianmethods. Thecorrectapproachtononuniformityassumesattimest i (generationsago)acertainproportion0≤ρi ≤1ofthepresent f∗(j,0) = f(j,0)− f(j,+2)µj,−1 populationoriginatedfroma“virtualpatriarch”withaninitial µj,+1 STRvaluem.Theresultingsystem: i f∗(j,+1) = f(j,+1)− f(j,−1) µj,1 p (t) µj,−1 j,dkt =−µjpj,k+ ∑ µj,−mpj,k−m+µj,mpj,k+m+dρ Theseofcoursearenolongernormalizedsowerescaletoobtain m>0 therenormalizedfrequencyF(j,k),e.g. i.e. dρareatomsofweightρi withSTRvaluemi occurringat f∗(j,0) timeti. Asthesystemislinearandisotropicthesolutionisa F(j,0)= f∗(j,0)+ f∗(j,−1)+ f∗(j,+1) combination of fundamental solutions P of the homogenous system.Thusthepresentdistribution f(j,k)is whichwillbeusedtocomputetheexpansiontimeformarkerj. Therearesimilarformulaeifthesingularitywasatk=−1. ∑ f(j,k)∼ pj,k(t)= ρi Pj,k−mi(ti) Howeverthereissamplingerrorbothinthefrequenciesand i theµj,1,µj,−1.Sowebootstraptakingintoaccounttheseuncer- tainties,runningthecomputationthousandsoftimes.Generally Thisallowsustoconsiderpopulationsmixedbyhavingsingular wefindthebranchsingularityisalwaysoneofk = 0,+1,−1 lineagesfromoverfertilepatriarchs,orbyactualimmigration withnoSD.Inafewcasesthesingularitymayseemtowander fromtheoutside.ForexampleNielsen,R.andJ.Wakeley(2001) betweenk=0,+1,−1.Sointhecaseofawanderingsingularity consideredmixedpopulationswithtwosources. Howeverin weobtainadistributionoverk=0,+1,−1withameanandSD. ourgeneralcasehasnoobviouslimitonthenumberofsources. Inthesecaseswefindthesingularityisrelativelysmallanddoes The inverse problem seeks to find singularities from present notmakemuchdifferencetothefinalresult.Howevertohavea data. Unfortunately,ingeneral,inversionisillposedforsuch stablemethodwedonotthrowoutthesewanderingsingulari- systemsliketheheatequation.Thisinstabilityproducespoorac- tiesbutinthealgorithmusethemeantoaveragebetweenk=0 curacy.Furthermorethereisnouniquesolution,e.g.thepresent andk = ±1,e.g. ifthemeanisk =0thenweusetheoriginal distributioncouldhavebeencreatedyesterday. unreducedfrequency. Howeverwefindthat∼50%oftheDYSmarkersshowno Noticethatweassumeatmostonesidebranch. Intheory significantdifferencefromtheuniformexpansionofasingle therecouldbemanyandsolvingfortheseproduceevenbetter patriarch,i.e.thedatavarianceV isclosetotheexpectedvari- approximationstothepresentdata.Infactyoucouldgetperfect j ance f(j,0)(1− f(j,0)). Theothermarkersshowatmostone matchingbutfindtheatomswerecreatedyesterday!Thething significantsidebranch,i.e.thereisanoriginalbranchstartingat isthatwhilemanymarkersshowsignificantdeviationfroma timet withSTRm andasecondonewithSTRm =m ±1 uniformflowfromasinglepatriarch,afterwehavecarriedout j,0 0 1 0 attimet < t withsignificant0 < ρ < ρ . Yhuswehave reductionforonepossiblesidebranchwefindnosignificant j,1 j,0 1 0 maneable1-2sourcespermarker. differencefromauniformflow,i.e.thedifferenceiswithinthe GENETICSJournalTemplateonOverleaf 9 bioRxiv preprint first posted online Sep. 8, 2015; doi: http://dx.doi.org/10.1101/026286. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. SD.Thisisofcourseanapproximation,thenextlevelbeyond ofT,β. InSI2weshowresultsfor“pseudoR1b1a2”,i.ewith ZuckerkandlandPauling,butgiventhenoiseinthedataper- TMRCAT = 200generations. Thislooksliketheactualspec- hapsthebestwecando. Laterwefurtherreducetheeffectof trumforR1b1a2for1/β∼20generations,i.e.alopsidedbell outliersbyusingrobuststatistics. curveshowingfastexpansion. For1/β ∼100thecurvelooks Reducingthesingularlineagesincreasesthefrequency f(j,0) likeourcurveforJ2,i.espreadoutindicatingmuchbranching. ofthemodeanddecreasesthecomputed TMRCA.Butasthe EstimatingtheparameterTforanexponentialdistribution method of reducing singularities does not respect higher fre- isawellknownproblemofstatistics.Kendallprovedthebest quencies f(j,k)itfollowstheKAPZformulacannotbeusedand estimateforTwouldbemaxt.Unfortunatelythereisalsocon- j insteadweusetheprobabilityofnomutations,i.e.solve siderableerrorλ%forthemutationratesµ. Laterwegivea j j methodforreducingthiserror,evensowefindtheSDinthe F(j,0)=e−µjtI0[2t(cid:112)µj,−1 µj,1] range10%−30%whichgivescorrespondingrangeinerrorfor eacht.Weunderstandthatthet arebeinggeneratedbythedis- This is donefor each DYS marker j , givingexpansion times j j tributiondτ butsuperimposedonthisisafurtheruncertainty t ,...t for each marker, with computed CI. (An extra fixed 0 1 N duetomutationratesetc. Inparticularthelargest t maybe sourceoferroristheuncertaintyinthemutationrateswhichwe j wildlyinaccurate.Alsowefoundthatsimplytakingtheaverage dealwithlater). Wefindthereductionofsingularitiesmakes consistentlyunderestimatesthe TMRCA byawidemargin. strikingdifferencetothet oftheeffectedmarkers,oftenare- j Assumingthemutationrateshavenormaldistributionwith ductionof∼50%for TMRCA. meanµ andvarianceλ2µ2,thet haveSDt λ.Thustheactual Now the existence of side branches implies that the main j j j j j j branch could itself have been the side branch for an earlier datafort∗j hasprobabilitydensityfunctionfors>0 branchthatdidnotsurvive.Thuswedonotexpecttheexpan- siontimes t1,...tn foreachmarkertobeessentiallyequal., i.e dτ(s)=(cid:90) T e(t−T)/β e√−(t2−νs)2 dt. theyarenotwithintheSDofeachother. Indeedweseethat 0 β 2πν thedistributionofthetimest fordifferentmarkersarealmost j certainlynotrandomlyarrangedaboutasingleTRMCA Tbut The variance ν depends on two sources. First from the un- distributed from T to the present. This is seen whether you certaintyinmutationrates, foreachofthe n markersweget usereductionornot,orourmutationratesornot.(Foragiven varianceλ2,givingtotal j populationonecouldscalemutationratestogetequalt ,but j thenapplyingtheseadhocmutationratestootherpopulations ν = 1 ∑λ2 doesnotyieldthesamevalues).Thespreadoutdistributionof 1 n j j survivingbranchesisanotherverificationofourtheoryofmany extinctions,fewsurvivors. Thedistributionofthetimest for Howeverasmallsamplealsohasinherenterrorfromsampling. j differentmarkerswecallthebranchingdistribution,whichis Wearemeasuringtheprobabilitythatthereisamutation.This nowdiscussed. isbinomialwithprobabilityHj(t)givenby (cid:32) (cid:33)k/2 TheBranchingDistribution µ Thetimestjfordifferentmarkersaresortedfromtheyoungestto Hj =1−Pj,0(t)=1−e−µjt µj,j−,11 I0[2t(cid:112)µj,−1 µj,1] theoldest,formingasequencet∗,...t∗.Thegenerationofthese 1 n branchesisbyanunknownprobabilitydistribution dτ0 over HenceforsamplesizeNthereisvarianceHj(1−Hj)/N,sothe [0,T].Wemodeldτ0byassumingasurvivinglineageisgener- varianceintimeduetothisisscaledbythederivativegiving: atedatrandomwithprobabilityβ∆tintimeperiod[t,t+∆t], multipliedbytheprobabilitythatthebranchinghasn’talready Hj(1−Hj) ν = occurred.Theconstantβaveragesfertilityandextinctionrates, 2 N(H(cid:48))2 j thechanceofanewlineagesurviving.Asβ→∞wegetcurrent theorywherealllineagesoriginatefromasinglepatriarchat ThefunctionH(cid:48)hasactuallytobecomputedasaninversefunc- j timeT.Simulationswiththedatashowthatβvariesintherange tiondependingon H. Thereforethetotalvarianceaveraged j 1to∞.Wemakenoaprioriestimateofβ,unlikeBayesianmeth- overallnmarkersisν = ν +ν . Althoughforlargesamples 1 2 odswhereanoverallfertilityrateisapredeterminedparameter. (N > 1000)thesecondtermisinsignificantitdoeseffectthe Insteadourstochasticsimulationwillfindthemostlikelyβ,T resultsonceyougettoN = 100. Inouralgorithmthebranch- ineachcase. Assumingindependence,thenthegenerationof ingdistributionisusedtogeneratelargenumbersofrandom branchesfollowsthewellknownexponentialdistribution: branchingtimessoastobootstraperrorestimates.Inturnsout muchfastertocompilethedistributionfunctionasatablewhich τ [t]=Exp[β(t−T)]UnitStep[T−t] 0 canberepeatedlycalledon. Noticethisimpliesafiniteprobabilitythatsomemarkershave EstimatingTMRCAbyRobustStatistics essentiallyzeromutations. Thisisactuallyseeninexamples. BoththeHamiltonGpAandMacdonaldGpAhavenumber Inaccurate large values of t∗j are mitigated by using “robust” ofindividuals N > 100. Forthetimescaleof> 700yearswe statistics with quintiles instead of means/variances. Using donotexpectthereismorethanonemarkeroutof33which FTDNAdatawebeganwith37markers.Howeverthe4markers showsabsolutelynomutationsfromthemode.Infactinboth ofDYS464areunorderedandcannotbeused.Alsowefindthat casesthereare8markerswhereallNindividualshaveexactly markersDYS19/394,385b,459b,CDYbhaveerrors>33%in thesameSTRvalue. mutation rates so are not used. (These are some of the most Wealsotestedthistheoryinsimulationswhereweobtained popularonesintheliterature!). Wefindifwejustusethe15 randombranchingtimesbasedonthedensityforvariousvalues markerswitherrors<15%thefinalSDisabout13%compared 10 D.H.Hamilton etal.
Description: