Gene Genealogies, Variation and Evolution
A Primer in Coalescent Theory

Jotun Hein
University of Oxford, UK

Mikkel H. Schierup and Carsten Wiuf
University of Aarhus, Denmark

Published in the United States by Oxford University Press Inc., New York
© Jotun Hein, Mikkel H. Schierup and Carsten Wiuf, 2005
First published 2005
ISBN 0-19-852995-3 (hbk.)
ISBN 0-19-852996-1 (pbk.)
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain on acid-free paper by Biddles Ltd., King’s Lynn, Norfolk

Preface

Coalescent theory has gone from an obscure corner of population genetics to a central concept for anyone who studies variation at the sequence level. Besides filling the obvious need for a book on this subject, we wish to present this theory in a straightforward and elementary manner that could dispel the misconception that coalescent theory is inherently very difficult and requires a strong mathematical background to understand. The key issues needed for data analysis require only basic combinatorics and probability theory.

Despite the present prominence of coalescent theory, it also belongs to the future. From an application point of view, human evolution and association mapping/fine scale mapping are two areas that are bound to grow enormously in the next few years. To make optimal use of the coming flood of data, theoretical advances will be needed. There are areas where present theory fails (or is impractically slow) in the presence of real data, and if empirical researchers are to use coalescent-based methods, there are plenty of challenges for the theoretician, both in modelling and in the improvement of simulation algorithms.

The present book is definitely not exhaustive, but is only meant to provide a good basis for further study. Chapter 1 provides the basics for understanding the assumptions behind and derivation of the basic coalescent model.
Chapter 2 introduces the models of alleles and sequences and associated mutation processes, and Chapter 3 gives some examples of quantities that can be calculated on coalescent genealogies. The basic coalescent is naive for many applications, and Chapters 4 and 5 relax the assumptions of the basic coalescent to allow for population size changes, population structure, various forms of selection, and recombination. Chapter 6 changes the emphasis from describing the coalescence structure to inferring parameters from data using knowledge of this structure. Chapters 7 and 8 introduce two areas of much current interest in which coalescent theory is likely to play a major role. Chapter 7 discusses the use of coalescent theory in the field of linkage disequilibrium mapping, which aims at locating genes underlying common diseases. Chapter 8 relates the potential and use of coalescent theory to human evolution. An appendix gives a brief introduction to the web-based tools developed in connection with this book. The collection of tools can be found at http://www.coalescent.dk.

Contents

1 The basic coalescent
  1.1 Introduction
  1.2 A Y-chromosome data set
  1.3 Data and theory
  1.4 The Wright–Fisher model
    1.4.1 Assumptions of the Wright–Fisher model
    1.4.2 The number of descendants of a gene in one generation
    1.4.3 An example
  1.5 The geometric distribution
  1.6 The exponential distribution
  1.7 The discrete-time coalescent
    1.7.1 Coalescence of a sample of two genes
    1.7.2 Coalescence of a sample of n genes
    1.7.3 Example: Effect of approximations
  1.8 The continuous time coalescent
  1.9 Calculating simple quantities on a coalescent tree
    1.9.1 The height of a tree
    1.9.2 The total branch length of a tree
    1.9.3 The effect of sampling more sequences
  1.10 The effective population size
  1.11 The Moran model
  1.12 Robustness of the coalescent

2 From genealogies to sequences
  2.1 Mathematical models of alleles
    2.1.1 The infinite alleles model
    2.1.2 The infinite sites model
    2.1.3 Finite sites model
  2.2 The Wright–Fisher model with mutation
  2.3 Algorithms for simulating sequence evolution
  2.4 The probability of a sample configuration
    2.4.1 Infinite alleles model
    2.4.2 Infinite sites model
    2.4.3 Impossible ancestral states
  2.5 Quantities related to the infinite sites model
    2.5.1 The number of segregating sites
    2.5.2 Haplotypes
    2.5.3 Pairwise mismatch distribution
    2.5.4 Estimators of θ and Tajima’s D
  2.6 Evolutionary versus sampling variance
    2.6.1 Example 1: The variable S_n
    2.6.2 Example 2: Tajima’s estimator π̂

3 Trees and topologies
  3.1 Some terminology
    3.1.1 The jump process and the waiting time process
    3.1.2 The coalescent and phylogenetic trees
  3.2 Counting trees and topologies
  3.3 Gene trees
    3.3.1 How to build a gene tree
  3.4 Nested subsamples
  3.5 Hanging subtrees
    3.5.1 Unbalanced trees
    3.5.2 Example: Neanderthal sequences
  3.6 A single lineage
  3.7 Disjoint subsamples
    3.7.1 Examples
  3.8 A sample partitioned by a mutation
    3.8.1 Unknown ancestral state
    3.8.2 The age of the MRCA for two sequences
  3.9 The probability of going from n ancestors to k ancestors

4 Extensions to the basic coalescent
  4.1 Introduction
  4.2 The coalescent with fluctuating population size
    4.2.1 Stochastic and systematic changes
    4.2.2 How to model population changes in the coalescent
  4.3 Exponential growth
    4.3.1 The genealogy under exponential growth
  4.4 Population bottlenecks
    4.4.1 Genealogical effect of bottlenecks
  4.5 Effective population size revisited
  4.6 The coalescent with population structure
    4.6.1 The finite island model
    4.6.2 The coalescent tree in the finite island model
    4.6.3 General models of subdivision
    4.6.4 Non-equilibrium models
  4.7 Coalescent with balancing selection
    4.7.1 Two allele balancing selection
    4.7.2 Multiallelic balancing selection
  4.8 Coalescent with directional selection
    4.8.1 The ancestral selection graph
  4.9 Summary

5 The coalescent with recombination
  5.1 Introduction
  5.2 Data example with recombination
  5.3 Modelling recombination
    5.3.1 Hudson’s model of recombination
    5.3.2 Biological features of recombination
  5.4 The Wright–Fisher model with recombination
  5.5 Algorithms
    5.5.1 The ancestral recombination graph
    5.5.2 Sampling ARGs: Not back in time, but along sequences
    5.5.3 Efficiency of different algorithms
  5.6 The effect of a single recombination event
  5.7 The number of recombination events
  5.8 The probability of a data set
  5.9 The number of segregating sites
  5.10 The coalescent with gene conversion
  5.11 Gene trees with recombination: from incompatibilities to minimal ARGs
    5.11.1 Recombination as subtree transfer
    5.11.2 Recombination inferred from haplotypes
    5.11.3 From local to global bounds
    5.11.4 Minimal ARGs
    5.11.5 Topologies, recombination, and compatibility

6 Getting parameters from data
  6.1 Introduction
  6.2 Estimators of θ
    6.2.1 Watterson’s estimator
    6.2.2 Tajima’s estimator
    6.2.3 Fu’s two estimators
  6.3 Estimators of ρ
    6.3.1 Estimators based on summary statistics
    6.3.2 Pseudo-likelihood estimators
  6.4 Monte Carlo methods
    6.4.1 The likelihood curve
    6.4.2 Monte Carlo integration and the coalescent
    6.4.3 Markov chain Monte Carlo

7 LD mapping and the coalescent
  7.1 The potential of LD mapping
  7.2 Linkage versus LD mapping
  7.3 Complex disease aetiology
  7.4 Formulating the task
  7.5 A role for the coalescent
  7.6 Genealogical trees around a disease mutation
    7.6.1 Qualitative measures
    7.6.2 An example
    7.6.3 Quantifying genealogical tree differences
  7.7 The genealogical process reflected in data
  7.8 Linkage disequilibrium (LD)
    7.8.1 Testing for LD
    7.8.2 Accounting for population admixture
    7.8.3 Differences between human populations
  7.9 Measuring association using single markers
  7.10 Haplotype LD mapping
  7.11 Model based LD mapping
    7.11.1 Star shaped genealogy
    7.11.2 Coalescent based genealogy
    7.11.3 An example
    7.11.4 Further challenges

8 Human evolution
  8.1 Introduction
  8.2 Our phylogenetic position and ancestral population genetics
    8.2.1 The number of genetic ancestors to a genome
  8.3 Human migrations and population structure
    8.3.1 Our relationship to the Neanderthaler
    8.3.2 Population growth
    8.3.3 Structure within global modern human populations
    8.3.4 Specific histories
    8.3.5 Empirical pedigrees and the coalescent
    8.3.6 Other genealogical issues
    8.3.7 Tracing genetic material within the parent genealogy

Appendix: Web based tools
Bibliography
Index

1 The basic coalescent

1.1 Introduction

In this chapter, we first motivate the need for a mathematical model that can describe the process generating genetic data, with specific emphasis on human variation. We assume that genetic data are in the form of DNA sequence data. The sequences or genes are all homologous copies of the same genetic region in the genome of a species. Whether one or both copies of a gene in an individual are sampled does not matter here; what matters is their number and genetic type.
Such data are collected from one or several present-day populations of a single species, and from this sample we want to infer details about the evolutionary processes that created the data. (A population is here best understood as a population of genes, rather than a population of individuals, because we focus at the level of genes.) The inferential analysis is retrospective; we seek to understand aspects of the sample’s (and the population’s) evolutionary past through analysis of the present-day sample. Below we sketch an analysis of a data set from humans to make clearer the different types of questions that coalescent theory seeks to answer.

The human genome consists of twenty-two autosomal chromosomes, the two sex chromosomes X and Y, and the mitochondrion. One representative sequence of the human genome has recently been approximately determined, and is presently subject to refinement. Additionally, a major effort has now been directed towards determining the population variation in the human genome through determination of single nucleotide polymorphisms (SNPs). These are positions that vary within or between human populations. Table 1.1 shows the size of each chromosome and the number of positions where variation has so far been detected. The human genome and the variation that is observed are the result of interaction among evolutionary forces, such as mutational and selective processes, mixing of variation through recombination, and demographic factors, such as the size, history, and geographical structure of the populations. Effects of demography are illustrated by the major colonisations of the globe (Figure 1.1).

Table 1.1 The human genome^a

Chromosome  Size (Mb)  Length (cM)  Genes  SNPs (thousands)  SNP density (per kb)
1           245        293          1,945  426               1.74
2           243        277          1,283  396               1.63
3           199        233          1,049  317               1.59
4           191        212          765    318               1.66
5           181        198          879    323               1.78
6           171        201          1,053  309               1.79
7           158        184          952    282               1.78
8           146        166          717    256               1.75
9           134        167          755    263               1.96
10          135        182          756    250               1.85
11          135        156          1,294  249               1.84
12          133        169          1,006  213               1.60
13          114        118          341    166               1.46
14          105        129          647    148               1.41
15          100        110          592    166               1.66
16          90         131          900    183               2.03
17          82         129          1,121  144               1.76
18          78         124          267    138               1.77
19          64         110          1,303  110               1.72
20          64         97           631    232               3.63
21          47         60           231    88                1.87
22          49         58           485    121               2.47
X           152        198          750    45                1.61
Y           50         1            94     30                1.60

^a The second column shows the length of each chromosome in million bases (Mb), the third the length in centiMorgans (cM), the fourth the estimated number of genes for each chromosome, the fifth the number of currently identified SNPs (thousands), and the last column shows the current density of SNPs per kilobase (kb). The detected number of SNPs and thus also the SNP density will go up within the next few years. The data is from www.ensembl.org, release 17.33.1 (July 2003), except for the genetic map lengths, which are taken from the Genethon genetic map.

Figure 1.1 The world and historical migrations. The approximate dates of mass migrations (years ago) have mainly been determined by dating fossils found at different locations. Adapted from Cavalli-Sforza (2001). [Map labels: 15–35,000; 40,000; 50–60,000; 100,000; >40,000.]

The present human population migrated out of East Africa approximately 100,000 years ago