The Geography of Recent Genetic Ancestry across Europe

Peter Ralph*¤, Graham Coop*

Department of Evolution and Ecology & Center for Population Biology, University of California, Davis, California, United States of America

Abstract

The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2,257 Europeans (in the Population Reference Sample [POPRES] dataset) to conduct one of the first surveys of recent genealogical ancestry over the past 3,000 years at a continental scale. We detected 1.9 million shared long genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 2–12 genetic common ancestors from the last 1,500 years, and upwards of 100 genetic ancestors from the previous 1,000 years. These numbers drop off exponentially with geographic distance, but since these genetic ancestors are a tiny fraction of common genealogical ancestors, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1,000 years. There is also substantial regional variation in the number of shared genetic ancestors. For example, there are especially high numbers of common ancestors shared between many eastern populations that date roughly to the migration period (which includes the Slavic and Hunnic expansions into that region). Some of the lowest levels of common ancestry are seen in the Italian and Iberian peninsulas, which may indicate different effects of historical population expansions in these areas and/or more stably structured populations.
Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.

Citation: Ralph P, Coop G (2013) The Geography of Recent Genetic Ancestry across Europe. PLoS Biol 11(5): e1001555. doi:10.1371/journal.pbio.1001555
Academic Editor: Chris Tyler-Smith, The Wellcome Trust Sanger Institute, United Kingdom
Received July 16, 2012; Accepted March 27, 2013; Published May 7, 2013
Copyright: © 2013 Ralph, Coop. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: GC: Sloan Foundation Fellowship, www.sloan.org. PR: Ruth L. Kirschstein Fellowship, NIH #F32GM096686, grants.nih.gov. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Abbreviations: IBD, identity by descent; SNP, single nucleotide polymorphism; ya, years ago.
* E-mail: [email protected] (PR); [email protected] (GC)
¤ Current address: Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America

PLOS Biology | www.plosbiology.org | May 2013 | Volume 11 | Issue 5 | e1001555

Author Summary

Few of us know our family histories more than a few generations back. It is therefore easy to overlook the fact that we are all distant cousins, related to one another via a vast network of relationships. Here we use genome-wide data from European individuals to investigate these relationships over the past 3,000 years, by looking for long stretches of genome that are shared between pairs of individuals through their inheritance from common genetic ancestors. We quantify this ubiquitous recent common ancestry, showing for instance that even pairs of individuals from opposite ends of Europe share hundreds of genetic common ancestors over this time period. Despite this degree of commonality, there are also striking regional differences. Southeastern Europeans, for example, share large numbers of common ancestors that date roughly to the era of the Slavic and Hunnic expansions around 1,500 years ago, while most common ancestors that Italians share with other populations lived longer than 2,500 years ago. The study of long stretches of shared genetic material promises to uncover rich information about many aspects of recent population history.

Introduction

Even seemingly unrelated humans are distant cousins to each other, as all members of a species are related to each other through a vastly ramified family tree (their pedigree). We can see traces of these relationships in genetic data when individuals inherit shared genetic material from a common ancestor. Traditionally, population genetics has studied the distant bulk of these genetic relationships, which in humans typically date from hundreds of thousands of years ago (e.g., [1,2]). Such studies have provided deep insights into the origins of modern humans (e.g., [3]), and into recent admixture between diverged populations (e.g., [4,5]).

Although most such genetic relationships among individuals are very old, some individuals are related on far shorter time scales. Indeed, given that each individual has $2^n$ ancestors from n generations ago, theoretical considerations suggest that all humans are related genealogically to each other over surprisingly short time scales [6,7]. We are usually unaware of these close genealogical ties, as few of us have knowledge of family histories more than a few generations back, and these ancestors often do not contribute any genetic material to us [8]. However, in large samples we can hope to identify genetic evidence of more recent relatedness, and so obtain insight into the population history of the past tens of generations. Here we investigate such patterns of recent relatedness in a large European dataset.

The past several thousand years are replete with events that may have had significant impact on modern European relatedness, such as the Neolithic expansion of farming, the Roman empire, or the more recent expansions of the Slavs and the Vikings. Our current understanding of these events is deduced from archaeological, linguistic, cultural, historical, and genetic evidence, with widely varying degrees of certainty. However, the demographic and genealogical impact of these events is still uncertain (e.g., [9]). Genetic data describing the breadth of genealogical relationships can therefore add another dimension to our understanding of these historical events.

Work from uniparentally inherited markers (mtDNA and Y chromosomes) has improved our understanding of human demographic history (e.g., [10]). However, interpretation of these markers is difficult since they only record a single lineage of each individual (the maternal and paternal lineages, respectively), rather than the entire distribution of ancestors. Genome-wide genotyping and sequencing datasets have the potential to provide a much richer picture of human history, as we can learn simultaneously about the diversity of ancestors that contributed to each individual's genome.

A number of genome-wide studies have begun to reveal quantitative insights into recent human history [11]. Within Europe, the first two principal axes of variation of the matrix of genotypes are closely related to a rotation of latitude and longitude [12–14], as would be expected if patterns of ancestry are mostly shaped by local migration [15]. Other work has revealed a slight decrease in diversity running from south-to-north in Europe, with the highest haplotype and allelic diversity in the Iberian peninsula (e.g., [14,16,17]), and the lowest haplotype diversity in England and Ireland [18]. Recently, progress has also been made using genotypes of ancient individuals to understand the prehistory of Europe [19–21]. However, we currently have little sense of the time scale of the historical events underlying modern geographic patterns of relatedness, nor the degrees of genealogical relatedness they imply.

In this article, we analyze those rare long chunks of genome that are shared between pairs of individuals due to inheritance from recent common ancestors, to obtain a detailed view of the geographic structure of recent relatedness. To determine the time scale of these relationships, we develop methodology that uses the lengths of shared genomic segments to infer the distribution of the ages of these recent common ancestors. We find that even geographically distant Europeans share ubiquitous common ancestry within the past 1,000 years, and show that common ancestry from the past 3,000 years is a result of both local migration and large-scale historical events. We find considerable structure below the country level in sharing of recent ancestry, lending further support to the idea that looking at runs of shared ancestry can identify very subtle population structure (e.g., [22]).

Our method for inferring ages of common ancestors is conceptually similar to the work of [23], who use total amount of long runs of shared genome to fit simple parametric models of recent history, as well as to [3] and [24], who use information from short runs of shared genome to infer demographic history over much longer time scales. Other conceptually similar work includes [25] and [26], who used the length distribution of admixture tracts to fit parametric models of historical admixture. We rely less on discrete, idealized populations or parametric demographic models than these other works, and describe continuous geographic structure by obtaining average numbers of common ancestors shared by many populations across time in a relatively nonparametric fashion.

Definitions: Genetic Ancestry and Identity by Descent

We can only hope to learn from genetic data about those common ancestors from whom two individuals have both inherited the same genomic region. If a pair of individuals have both inherited some genomic region from a common ancestor, that ancestor is called a “genetic common ancestor,” and the genomic region is shared “identical by descent” (IBD) by the two. Here we define an “IBD block” to be a contiguous segment of genome inherited (on at least one chromosome) from a shared common ancestor without intervening recombination (see Figure 1A). A more usual definition of IBD restricts to those segments inherited from some prespecified set of “founder” individuals (e.g., [8,27,28]), but we allow ancestors to be arbitrarily far back in time. Under our definition, everyone is IBD everywhere, but mostly on very short, old segments [29]. We measure lengths of IBD segments in units of Morgans (M) or centiMorgans (cM), where 1 Morgan is defined to be the distance over which an average of one recombination (i.e., a crossover) occurs per meiosis. Segments of IBD are broken up over time by recombination, which implies that older shared ancestry tends to result in shorter shared IBD blocks.

Figure 1. The spread of genetic ancestry. (A) A hypothetical portion of the pedigree relating two sampled individuals, which shows six of their genealogical common ancestors, with the portions of ancestral chromosomes from which the sampled individuals have inherited shaded grey. The IBD blocks they have inherited from the two genetic common ancestors are colored red, and the blue arrow denotes the path through the pedigree along which one of these IBD blocks was inherited. (B) Cartoon of the spatial locations of ancestors of two individuals; circle size is proportional to likelihood of genetic contribution, and shared ancestors are marked in grey. Note that common ancestors are likely located between the two, and their distribution becomes more diffuse further back in time.
doi:10.1371/journal.pbio.1001555.g001

Sufficiently long segments of IBD can be identified as long, contiguous regions over which the two individuals are identical (or nearly identical) at a set of single nucleotide polymorphisms (SNPs) that segregate in the population. Formal, model-based methods to infer IBD are only computationally feasible for very recent ancestry (e.g., [30]), but recently, fast heuristic algorithms have been developed that can be applied to thousands of samples typed on genotyping chips (e.g., [31,32]).

The relationship between numbers of long, shared segments of genome, numbers of genetic common ancestors, and numbers of genealogical common ancestors can be difficult to envision. Since everyone has exactly two biological parents, every individual has exactly $2^n$ paths of length n meioses leading back through their pedigree, each such path ending in a grand$^{n-1}$parent. However, due to Mendelian segregation and limited recombination, genetic material will only be passed down along a small subset of these paths [8]. As n grows, these paths proliferate rapidly and so the genealogical paths of two individuals soon overlap significantly. (These points are illustrated in Figure 1.) By observing the number of shared genomic blocks, we learn about the degree to which their genealogies overlap, or the number of common ancestors from which both individuals have inherited genetic material.

At least one parent of each genetic common ancestor of two individuals is also a genetic common ancestor, so the number of genetic common ancestors at each point back in time is strictly increasing. A more relevant quantity is the rate of appearance of most recent common genetic ancestors. This quantity can be much more intuitive, and is closely related to the coalescent rate [33], as we demonstrate later. For this reason, when we say “genetic common ancestor” or “rate of genetic common ancestry,” we are referring to only the most recent genetic common ancestors from which the individuals in question inherited their shared segments of genome.

Results

We applied the fastIBD method, implemented in BEAGLE v3.3 [31], to the European subset of the Population Reference Sample (POPRES) dataset (dbgap accession phs000145.v1.p1, [34]), which includes language and country-of-origin data for several thousand Europeans genotyped at 500,000 SNPs. Our simulations showed that we have good power to detect long IBD blocks (probability of detection 50% for blocks longer than 2 cM, rising to 98% for blocks longer than 4 cM), and a low false positive rate (discussed further in the Materials and Methods section). We excluded from our analyses individuals who reported grandparents originating from non-European countries or more than one distinct country (and refer to the remainder as “Europeans”). After removing obvious outlier individuals and close relatives, we were left with 2,257 individuals who we grouped using reported country of origin and language into 40 populations, listed with sample sizes and average IBD levels in Table 1. For geographic analyses, we located each population at the most populous city in the appropriate region. Pairs of individuals in this dataset were found to share a total of 1.9 million segments of IBD, an average of 0.74 per pair of individuals, or 831 per individual. The mean length of these blocks was 2.5 cM, the median was 2.1 cM, and the 25th and 75th quantiles are 1.5 cM and 2.9 cM, respectively. The majority of pairs sharing some IBD shared only a single block of IBD (94%). The total length of IBD blocks an individual shares with all others ranged between 30% and 250% (average 128%) of the length of the genome (greater than 100% is possible as individuals may share IBD blocks with more than one other at the same genomic location).

Table 1. Populations, abbreviations, sample sizes (n), mean number of IBD blocks shared by a pair of individuals from that population (“self”), and mean IBD rate averaged across all other populations (“other”), sorted by regional groupings described in the text.

Group               Abbreviation    n      Self    Other
E group
  Albania           AL              9      14.5    v
  Austria           AT              14     1.3     0.9
  Bosnia            BO              9      4.1     1.6
  Bulgaria          BG              1      –       1.3
  Croatia           HR              9      2.8     1.6
  Czech Republic    CZ              9      2.1     1.3
  Greece            EL              5      1.8     0.9
  Hungary           HU              19     1.9     1.2
  Kosovo            KO              15     9.9     1.7
  Montenegro        ME              1      –       1.8
  Macedonia         MA              4      2.5     0.4
  Poland            PL              22     3.8     1.5
  Romania           RO              14     2.1     1.2
  Russia            RU              6      4.3     1.4
  Slovenia          SI              2      5.0     1.3
  Serbia            RS              11     2.7     1.5
  Slovakia          SK              1      –       0.7
  Ukraine           UA              1      –       1.5
  Yugoslavia        YU              10     3.4     1.5
TC group
  Cyprus            CY              3      2.7     0.4
  Turkey            TR              4      2.2     0.5
N group
  Denmark           DK              1      –       0.9
  Finland           FI              1      –       1.2
  Latvia            LV              1      –       1.6
  Norway            NO              2      2.0     0.8
  Sweden            SE              10     3.4     1.0
W group
  Belgium           BE              37     1.1     0.6
  England           EN              22     1.3     0.7
  France            FR              86     0.7     0.5
  Germany           DE              71     1.1     0.9
  Ireland           IE              60     2.6     0.6
  Netherlands       NL              17     1.9     0.7
  Scotland          SC              5      2.2     0.7
  Swiss French      CHf             839    1.3     0.6
  Swiss German      CHd             103    1.6     0.6
  Switzerland       CH              17     1.1     0.5
  United Kingdom    UK              358    1.2     0.7
I group
  Italy             IT              213    0.6     0.5
  Portugal          PT              115    1.9     0.5
  Spain             ES              130    1.5     0.4
doi:10.1371/journal.pbio.1001555.t001

The observed genomic density of long IBD blocks (per cM) can be affected by recent selection [35] and by cis-acting recombination modifiers. We find that the local density of IBD blocks of all lengths is relatively constant across the genome, but in certain regions the length distribution is systematically perturbed (see Figure S1), including around certain centromeres and the large inversion on chromosome 8 [36], also seen by [35]. Somewhat surprisingly, the MHC does not show an unusual pattern of IBD, despite having shown up in other genomic scans for IBD [35,37]. However, there are a few other regions where differences in IBD rate are not predicted by differences in SNP density. Notably, there are two regions, on chromosomes 15 and 16, which are nearly as extreme in their deviations in IBD as the inversion on chromosome 8, and may also correspond to large inversions segregating in the sample. These only make up a small portion of the genome, and do not significantly affect our other analyses (and so are not removed); we leave further analysis for future work.

Figure 2. Substructure in (A) Italian and (B) U.K. samples. The leftmost plots of (A) show histograms of the numbers of IBD blocks that each Italian sample shares with any French-speaking Swiss (top) and anyone from the United Kingdom (bottom), overlaid with the expected distribution (Poisson) if there was no dependence between blocks. Next is shown a scatterplot of numbers of blocks shared with French-speaking Swiss and U.K. samples, for all samples from France, Italy, Greece, Turkey, and Cyprus. We see that the numbers of recent ancestors each Italian shares with the French-speaking Swiss and with the United Kingdom are both bimodal, and that these two are positively correlated, ranging continuously between values typical for Turkey/Cyprus and for France. Figure (B) is similar, showing that the substructure within the United Kingdom is part of a continuous trend ranging from Germany to Ireland. The outliers visible in the scatterplot of Figure 2B are easily explained as individuals with immigrant recent ancestors: the three outlying U.K. individuals in the lower left share many more blocks with Italians than all other U.K. samples, and the individual labeled “SK” is a clear outlier for the number of blocks shared with the Slovakian sample.
doi:10.1371/journal.pbio.1001555.g002

Substructure and Recent Migrants

We should expect significant within-population variability, as modern countries are relatively recent constructions of diverse assemblages of languages and heritages. To assess the uniformity of ancestry within populations, we used a permutation test to measure, for each pair of populations x and y, the uniformity with which relationships with x are distributed across individuals from y. Most comparisons show statistically significant heterogeneity (Figure S2), which is probably due to population substructure (as well as correlations introduced by the pedigree). A notable exception is that nearly all populations showed no significant heterogeneity of numbers of common ancestors with Italian samples, suggesting that most common ancestors shared with Italy lived longer ago than the time that structure within modern-day countries formed.

Two of the more striking examples of substructure are illustrated in Figure 2. Here, we see that variation within countries can be reflective of continuous variation in ancestry that spans a broader geographic region, crossing geographic, political, and linguistic boundaries. Figure 2A shows the distinctly bimodal distribution of numbers of IBD blocks that each Italian shares with both French-speaking Swiss and the United Kingdom, and that these numbers are strongly correlated. Furthermore, the amount that Italians share with these two populations varies continuously from values typical for Turkey and Cyprus, to values typical for France and Switzerland. Interestingly, the Greek samples (EL) place near the middle of the Italian gradient. It is natural to guess that there is a north-south gradient of recency of common ancestry along the length of Italy, and that southern Italy has been historically more closely connected to the eastern Mediterranean.

In contrast, within samples from the United Kingdom and nearby regions, we see a negative correlation between numbers of blocks shared with Irish and numbers of blocks shared with Germans. From our data, we do not know if this substructure is also geographically arranged within the United Kingdom (our sample of which may include individuals from Northern Ireland). However, an obvious explanation of this pattern is that individuals within the United Kingdom differ in the number of recent ancestors shared with Irish, and that individuals with less Irish ancestry have a larger portion of their recent ancestry shared with Germans. This suggests that there is variation across the United Kingdom, perhaps a geographic gradient, in terms of the amount of Celtic versus Germanic ancestry.

The first two principal components of the matrix of genotypes, after suitable manipulations, can reproduce the geographic positions of European populations (e.g., [12–14]). Therefore, it is natural to compare the structure we see within populations in terms of IBD sharing to the positions on the principal components map. (A PCA map of these populations, produced by EIGENSTRAT [38], is shown in Figure S4.) It is not known what the geographic resolution of the principal components map is, but if relative positions within populations are meaningful, then comparison of IBD to PCA can stand in for comparison to geography. Indeed, as seen in Figures S5 and S6, the substructure of Figure 2 correlates well with the position on certain principal components, further suggesting that the structure is geographically meaningful. Conversely, since the substructure we see is highly statistically significant, this demonstrates that the scatter of positions within populations on the European PCA map is at least in part signal, rather than noise.

Europe-Wide Patterns of Relatedness

Individuals usually share the highest number of IBD blocks with others from the same population, with some exceptions. For example, individuals in the United Kingdom share more IBD blocks on average, and hence more close genetic ancestors, with individuals from Ireland than with other individuals from the United Kingdom (1.26 versus 1.09 blocks at least 1 cM per pair, Mann-Whitney $p < 10^{-10}$), and Germans share similarly more with Polish than with other Germans (1.24 versus 1.05, $p = 5.7 \times 10^{-6}$), a pattern which could be due to recent asymmetric migration from a smaller population into a larger population. In Figure 3 we depict the geography of rates of IBD sharing between populations, that is, the average number of IBD blocks shared by a randomly chosen pair of individuals. Above, maps show the IBD rate relative to certain chosen populations, and below, all pairwise sharing rates are plotted against the geographic distance separating the populations. It is evident that geographic proximity is a major determinant of IBD sharing (and hence recent relatedness), with the rate of pairwise IBD decreasing relatively smoothly as the geographic separation of the pair of populations increases. Note that even populations represented by only a single sample are included, as these showed a surprisingly consistent signal despite the small sample size.

Superimposed on this geographic decay there is striking regional variation in rates of IBD. To further explore this variation, we divided the populations into the five groups listed in Table 1, using geographic location and correlations in the pattern of IBD sharing with other populations (shown in Figure S7). These five groupings are defined as: Europe “E,” lying to the east of Germany and Austria; Europe “N,” lying to the north of Germany and Poland; Europe “W,” to the west of Germany and Austria (inclusive); the Iberian and Italian peninsulas “I”; and Turkey/Cyprus “TC.” Although the general pattern of regional IBD variation is strong, none of these groups have sharp boundaries; for instance, Germany, Austria, and Slovakia are intermediate between E and W. Furthermore, we suspect that the Italian and Iberian peninsulas likely do not group together because of higher shared ancestry with each other, but rather because of similarly low rates of IBD with other European populations. The overall mean IBD rates between these regions are shown in Table 2, and comparisons between different groupings are colored differently in Figure 3G–I, showing that rates of IBD sharing between E populations and between N populations average a factor of about three higher than other comparisons at similar distances. Such a large difference in the rates of IBD sharing between regions cannot be explained by plausible differences in false positive rates or power between populations, since this pattern holds even at the longest length scales, where block identification is nearly perfect.

To better understand IBD within these groupings, we show in Figure 3G–I how average numbers of IBD blocks shared, in three different length categories, depend on the geographic distance separating the two populations. Even without taking into account regional variation, mean numbers of shared IBD blocks decay exponentially with distance, and further structure is revealed by breaking out populations by the regional groupings described above. The exponential decays shown for each pair of groupings emphasize how the decay of IBD with distance becomes more rapid for longer blocks. This is expected under models where migration is mostly local, since as one looks further back in time, the distribution of each individual's ancestors is less concentrated around the individual's location (recall Figure 1B). Therefore, the expected number of ancestors shared by a pair of individuals decreases as the geographic distance between the pair increases; and this decrease is faster for more recent ancestry.

This wider spread of older blocks can also explain why the decay of IBD with distance varies significantly by region even if dispersal rates have been relatively constant. For instance, the gradual decay of sharing with the Iberian and Italian peninsulas could occur because these blocks are inherited from much longer ago than blocks of similar lengths shared by individuals in other populations.

Conversely, there is a high level of sharing for “E–E” relationships over a broad range of distances. This is especially true for our shortest (oldest) blocks: individuals in our E grouping share on average more short blocks with individuals in distant E populations than do pairs of individuals in the same W population. We argue below that this is because modern individuals in these locations have a larger proportion of their ancestors in a relatively small population that subsequently expanded.

Having seen the continent-wide patterns of IBD in Figure 3, it is natural to wonder if similar information is contained in single-site summaries of relatedness, such as mean Identity by State (IBS) values across European populations. The mean IBS between populations x and y is defined as the probability that two randomly chosen alleles from x and y are identical (“By State”), averaging over SNPs and individuals. In the analogous plot of IBS against geographic distance (Figure S9), some of the patterns seen in Figure 3 are present, and some are not. For instance, there is a continuous decay with geographic distance (linear, not exponential), and comparisons to the southern “I” group and to Cyprus/Turkey are even more well-separated below the others. On the other hand, the “E–E” comparisons do not show higher IBS than the bulk of the remaining comparisons.

Ages and Numbers of Common Ancestors

Each block of genome shared IBD by a pair of individuals represents genetic material inherited from one of their genetic common ancestors. Since the distribution of lengths of IBD blocks differs depending on the age of the ancestors (for example, older blocks tend to be shorter), it is possible to use the distribution of lengths of IBD blocks to infer numbers of most recent pairwise genetic common ancestors back through time averaged across pairs of individuals. For this inference, we restricted to blocks longer than 2 cM, where we had good power to detect true IBD blocks. We obtain dates in units of generations in the past, and for ease of discussion convert these to years ago (ya) by taking the mean human generation time to be 30 years [39].

Figure 3. Geographic decay of recent relatedness. In all figures, colors give categories based on the regional groupings of Table 1. (A–F) The area of the circle located on a particular population is proportional to the mean number of IBD blocks of length at least 1 cM shared between random individuals chosen from that population and the population named in the label (also marked with a star). Both regional variation of overall IBD rates and gradual geographic decay are apparent. (G–I) Mean number of IBD blocks of lengths 1–3 cM (oldest), 3–5 cM, and >5 cM (youngest), respectively, shared by a pair of individuals across all pairs of populations; the area of the point is proportional to sample size (number of distinct pairs), capped at a reasonable value; and lines show an exponential decay fit to each category (using a Poisson GLM weighted by sample size).
Comparisons with no shared IBD are used in the fit but not shown in the figure (due to the log scale). “E–E,” “N–N,” and “W–W” denote any two populations both in the E, N, or W grouping, respectively; “TC-any” denotes any population paired with Turkey or Cyprus; “I-(I,E,N,W)” denotes Italy, Spain, or Portugal paired with any population except Turkey or Cyprus; and “between E,N,W” denotes the remaining pairs (when both populations are in E, N, or W, but the two are in different groups). The exponential fit for the N–N points is not shown due to the very small sample size. See Figure S8 for an SVG version of these plots where it is possible to identify individual points.
doi:10.1371/journal.pbio.1001555.g003

Nature of the results on age inference. There are two major difficulties to overcome, however. First, detection is noisy: we do not detect all IBD segments (especially shorter ones), and some of our IBD segments are false positives. This problem can be overcome by careful estimation and modeling of error, described in the Materials and Methods section. The second problem is more serious and unavoidable: the inference problem is extremely “ill conditioned” (in the sense of [40]), meaning in this case that there are many possible histories of shared ancestry that fit the data nearly equally well. For this reason, there is a fairly large, unavoidable limit to the temporal resolution, but we still obtain a good deal of useful information.

Table 2. Rates of IBD within and between each geographic grouping given in Table 1.

IBD Rate    E       I       N       TC      W
E           2.57    0.44    0.99    0.62    0.53
I           0.44    0.80    0.43    0.41    0.45
N           0.99    0.43    2.62    0.33    0.86
TC          0.62    0.41    0.33    1.43    0.25
W           0.53    0.45    0.86    0.25    0.93
doi:10.1371/journal.pbio.1001555.t002

We deal with this uncertainty by describing the set of histories (i.e., historical numbers of common genetic ancestors) that are consistent with the data, summarized in two ways. First, it is useful to look at individual consistent histories, which gives a sense of recurrent patterns and possible historical signals. Figure 4 shows for several populations both the best-fitting history (in black) and the smoothest history that still fits the data (in red). We can make general statements if they hold across all (or most) consistent histories. Second, we can summarize the entire set of consistent histories by finding confidence intervals (bounds) for the total number of common ancestors aggregated in certain time periods. These are shown in Figure 5, giving estimates (colored bands) and bounds (vertical lines) for the total numbers of genetic common ancestors in each of three time periods, roughly 0–500 ya, 500–1,500 ya, and 1,500–2,500 ya (“ya” denotes “years ago”). Figure S12 (and S13) is a version of Figure 5 with more populations (in coalescent units, respectively), and plots analogous to Figure 4 for all these histories are shown in Figure S16. For a precise description of the problem and our methods, see the Materials and Methods section. We validated the method through simulation (details in Text S1), and found that it performed well to the temporal resolution discussed here. We note that in simulations where the population size changes smoothly, the maximum likelihood solution is often overly peaky, whereas the smoothed solution can smear out the signal of rapid change in population size. In light of that we encourage the reader to view truth as lying somewhere between these two solutions, and to not overinterpret specific peaks in the maximum likelihood, which may occur due to numerical properties of the inference. That said, there are a number of sharp peaks in common ancestry shared across many population comparisons older than 2,000 ya, which may potentially indicate demographic events in a shared ancestral population. A more thorough investigation of these older shared signals would potentially need a more model-based approach, so we restrict ourselves here to talking about the broad differences between the distribution of common shared ancestors between regions.

histories. The reader should also bear in mind that we do not depict the dependence of uncertainty between intervals.

Results of age inference. In Figure 4 we show how the age and number of shared pairwise genetic common ancestors changes as we move away from the Balkans (left column) and the United Kingdom (right column), along with two examples of how the observed block length distribution is composed of ancestry from different depths. [The average number of shared pairwise genetic common ancestors from generation n is the probability that the most recent common ancestor of a pair at a single site lived in generation n (i.e., the coalescent rate) multiplied by the expected number of segments that recombination has broken a pair of individuals' genomes into that many generations back, as shown in the Materials and Methods section.] More plots of this form are shown in Figure S16, and coalescent rates between pairs of populations are shown in the (equivalent) Figure S15.

Most detectable recent common ancestors lived between 1,500 and 2,500 years ago, and only a small proportion of blocks longer than 2 cM are inherited from longer ago than 4,000 years. Obviously, there are a vast number of genetic common ancestors older than this, but the blocks inherited from such common ancestors are sufficiently unlikely to be longer than 2 cM that we do not detect many. For the most part, blocks longer than 4 cM come from 500–1,500 years ago, and blocks longer than 10 cM from the last 500 years.

In most cases, only pairs within the same population are likely to share genetic common ancestors within the last 500 years. Exceptions are generally neighboring populations (e.g., United Kingdom and Ireland). During the period 500–1,500 ya, individuals typically share tens to hundreds of genetic common ancestors with others in the same or nearby populations, although some distant populations have very low rates. Longer ago than 1,500 ya, pairs of individuals from any part of Europe share hundreds of genetic ancestors in common, and some share significantly more.

Regional variation: Interesting cases. We now examine some of the more striking patterns we see in more detail.

There is relatively little common ancestry shared between the Italian peninsula and other locations, and what there is seems to derive mostly from longer ago than 2,500 ya. An exception is that Italy and the neighboring Balkan populations share small but significant numbers of common ancestors in the last 1,500 years, as seen in Figures S16 and S17. The rate of genetic common ancestry between pairs of Italian individuals seems to have been fairly constant for the past 2,500 years, which combined with significant structure within Italy suggests a constant exchange of migrants between coherent subpopulations.

Patterns for the Iberian peninsula are similar, with both Spain and Portugal showing very few common ancestors with other populations over the last 2,500 years. However, the rate of IBD sharing within the peninsula is much higher than within Italy: during the last 1,500 years the Iberian peninsula shares fewer than two genetic common ancestors with other populations, compared to roughly 30 per pair within the peninsula; Italians share on average only about eight with each other during this period.
The higher rates of IBD between populations in the ‘‘E’’ The time periods we use for these bounds are quite large, but groupingshowninFigure3seemtoderivemostlyfromancestors this is unavoidable, because of a trade-off between temporal living1,500–2,500ya,butalsoshowincreasednumbersfrom500– resolutionanduncertaintyinnumbersofcommonancestors.Also 1,500 ya, as shown in Figure 5 and Figure S17. For comparison, note that the lower bounds on numbers of common ancestors the IBD rate is high enough that even geographically distant during each time interval are often close to zero. This is because individuals in these eastern populations share about as many onecan(roughlyspeaking)obtainahistorywithequallygoodfitby commonancestorsasdotwoIrishortwoFrench-speakingSwiss. moving ancestors from that time interval into the neighboring ByfarthehighestratesofIBDwithinanypopulationsisfound ones,resultinginpeaksoneithersideoftheselectedtimeinterval betweenAlbanianspeakers—around90ancestorsfrom0–500ya, (seeFigureS14),eventhoughthesedonotgenerallyreflectrealistic andaround600ancestorsfrom500–1,500ya(sohighthatweleft PLOSBiology | www.plosbiology.org 7 May2013 | Volume 11 | Issue 5 | e1001555 GeographyofRecentGeneticAncestry Figure4.Estimatedaveragenumberofmostrecentgeneticcommonancestorspergenerationbackthroughtime.Estimatedaverage numberofmostrecentgeneticcommonancestorspergenerationbackthroughtimesharedby(A)pairsofindividualsfrom‘‘theBalkans’’(former Yugoslavia, Bulgaria, Romania, Croatia, Bosnia, Montenegro, Macedonia, Serbia, and Slovenia, excluding Albanian speakers) and shared by one individualfromtheBalkanswithoneindividualfrom(B)Albanian-speakingpopulations,(C)Italy,or(D)France.Theblackdistributionisthemaximum likelihoodfit;showninredissmoothestsolutionthatstillfitsthedata,asdescribedintheMaterialsandMethods.(E)showstheobservedIBDlength distributionforpairsofindividualsfromtheBalkans(redcurve),alongwiththedistributionpredictedbythesmooth(red)distributionin(A),asa 
stackedareaplotpartitionedbytimeperiodinwhichthecommonancestorlived.Thepartitionswithsignificantcontributionarelabeledontheleft verticalaxis(ingenerationsago),andthelegendin(J)givesthesamepartitions,inyearsago;theverticalscaleisgivenontherightverticalaxis.The secondcolumnoffigures(F–J)issimilar,exceptthatcomparisonsarerelativetosamplesfromtheUnitedKingdom. doi:10.1371/journal.pbio.1001555.g004 themoutofFigure5;seeFigureS12).Beyond1,500ya,therates Europe, These differences reflect the impact of major historical of IBD drop to levels typical for other populations in the eastern and demographic events, superimposed against a background of grouping. localmigrationandgenerallyhighgenealogicalrelatednessacross Therearecleardifferencesinthenumberandtimingofgenetic Europe. We now turn to discuss possible causes and implications common ancestors shared by individuals from different parts of of these results. PLOSBiology | www.plosbiology.org 8 May2013 | Volume 11 | Issue 5 | e1001555 GeographyofRecentGeneticAncestry Figure 5. Estimated average total numbers of genetic common ancestors shared per pair of individuals in various pairs of populations, in roughly the time periods 0–500 ya, 500–1,500 ya, 1,500–2,500 ya, and 2,500–4,300 ya. We have combined some populationstoobtainlargersamplesizes:‘‘S-C’’denotesSerbo-CroatianspeakersinformerYugoslavia,‘‘PL’’denotesPoland,‘‘R-B’’denotesRomania andBulgaria,‘‘DE’’denotesGermany,‘‘UK’’denotestheUnitedKingdom,‘‘IT’’denotesItaly,and‘‘Iber’’denotesSpainandPortugal.Forinstance,the greenbarsintheleftmostpanelstellusthatSerbo-CroatianspeakersandGermansmostlikelyshare0–0.25mostrecentgeneticcommonancestor fromthelast500years,3–12fromtheperiod500–1,500yearsago,120–150from1500–2,500ya,and170–250from2,500–4,400ya.Althoughthe lowerboundsappeartoextendtozero,theyaresignificantlyabovezeroinnearlyallcasesexceptforthemostrecentperiod0–540ya. 
doi:10.1371/journal.pbio.1001555.g005

Discussion

Genetic common ancestry within the last 2,500 years across Europe has been shaped by diverse demographic and historical events. There are both continental trends, such as a decrease of shared ancestry with distance; regional patterns, such as higher IBD in eastern and northern populations; and diverse outlying signals. We have furthermore quantified numbers of genetic common ancestors that populations share with each other back through time, albeit with a (unavoidably) coarse temporal resolution. These numbers are intriguing not only because of the differences between populations, which reflect historical events, but also because of the high degree of implied genealogical commonality between even geographically distant populations.

Ubiquity of common ancestry. We have shown that typical pairs of individuals drawn from across Europe have a good chance of sharing long stretches of identity by descent, even when they are separated by thousands of kilometers. We can furthermore conclude that pairs of individuals across Europe are reasonably likely to share common genetic ancestors within the last 1,000 years, and are certain to share many within the last 2,500 years. From our numerical results, the average number of genetic common ancestors from the last 1,000 years shared by individuals living at least 2,000 km apart is about 1/32 (and at least 1/80); between 1,000 and 2,000 ya they share about one; and between 2,000 and 3,000 ya they share above 10. Since the chance is small that any genetic material has been transmitted along a particular genealogical path from ancestor to descendant more than eight generations deep [8] (about 0.008 at 240 ya, and $2.5 \times 10^{-7}$ at 480 ya), this implies, conservatively, thousands of shared genealogical ancestors in only the last 1,000 years even between pairs of individuals separated by large geographic distances. At first sight this result seems counterintuitive. However, as 1,000 years is about 33 generations, and $2^{33} \approx 10^{10}$ is far larger than the size of the European population, so long as populations have mixed sufficiently, by 1,000 years ago everyone (who left descendants) would be an ancestor of every present-day European. Our results are therefore one of the first genomic demonstrations of the counterintuitive but necessary fact that all Europeans are genealogically related over very short time periods, and lend substantial support to models predicting close and ubiquitous common ancestry of all modern humans [7].

The fact that most people alive today in Europe share nearly the same set of (European, and possibly world-wide) ancestors from only 1,000 years ago seems to contradict the signals of long-term, albeit subtle, population genetic structure within Europe (e.g., [13,14]). These two facts can be reconciled by the fact that even though the distribution of ancestors (as cartooned in Figure 1B) has spread to cover the continent, there remain differences in degree of relatedness of modern individuals to these ancestral individuals. For example, someone in Spain may be related to an ancestor in the Iberian peninsula through perhaps 1,000 different routes back through the pedigree, but to an ancestor in the Baltic region by only 10 different routes, so that the probability that this Spanish individual inherited genetic material from the Iberian ancestor is roughly 100 times higher. This allows the amount of genetic material shared by pairs of extant individuals to vary even if the set of ancestors is constant.

Relation to single-site summaries. Other work has studied fine-scale differentiation between populations within Europe based on statistics such as $F_{ST}$, IBS (e.g., [14,18]), or PCA [13], that summarize in various ways single-marker correlations, averaged across loci. Like rates of IBD, these measures of differentiation can be thought of as weighted averages of past coalescent rates [41–44], but take much of their information from much more distant times (tens of thousands of generations). As expected, we have seen both strong consistency between these measures and IBD (e.g., the decay with geographic distance), as well as distinct patterns (e.g., higher sharing in eastern Europe). These results highlight the fact that long segments of IBD contain information about much more recent events than do single-site summaries, information that can be leveraged to learn about the timing of these events.

Limitations of sampling. A concern about our results is that the European individuals in the POPRES dataset were all sampled in either Lausanne or London. This might bias our results, for instance, if an immigrant community originated mostly from a particular small portion of their home population, thereby sharing a particularly high number of recent common ancestors with each other. We see remarkably little evidence that this is the case: there is a high degree of consistency in numbers of IBD blocks shared across samples from each population, and between neighboring populations. For instance, we could argue that the high degree of shared common ancestry among Albanian speakers was because most of those sampled originated from a small area rather than uniformly across Albania and Kosovo. However, this would not explain the high rate of IBD between Albanian speakers and neighboring populations. Even populations from which we only have one or two samples, which we at first assumed would be unusably noisy, provide generally reliable, consistent patterns, as evidenced by, for example, Figure S3.

Conversely, it might be a concern that individuals sampled in Lausanne or London are more likely to have recent ancestors more widely dispersed than is typical for their population of origin. This is a possibility we cannot discard, and if true, would mean there is more structure within Europe than what we detect. However, by the incredibly rapid spread of ancestry, this is unlikely to have an effect over more than a few generations and so does not pose a serious concern about our results about the ubiquitous levels of common ancestry. Fine-scale geographic sampling of Europe as a whole is needed to address these issues, and these efforts are underway in a number of populations (e.g., [45–48]).

Finally, we have necessarily taken a narrow view of European ancestry as we have restricted our sample to individuals who are not outliers with respect to genetic ancestry, and when possible to those having all four grandparents drawn from the same country. Clearly the ancestry of Europeans is far more diverse than those represented here, but such steps seemed necessary to make best initial use of this dataset.

Ages of particular common ancestors. We have shown that the problem of inferring the average distribution of genetic common ancestors back through time has a large degree of fundamental uncertainty. The data effectively leave a large number of degrees of freedom unspecified, so one must either describe the set of possible histories, as we do, and/or use prior information to restrict these degrees of freedom.

A related but far more intractable problem is to make a good guess of how long ago a certain shared genetic common ancestor lived, as personal genome services would like to do, for instance: if you and I share a 10 cM block of genome IBD, when did our most recent common ancestor likely live? Since the mean length of an IBD block inherited from five generations ago is 10 cM, we might expect the average age of the ancestor of a 10 cM block to be from around five generations. However, a direct calculation from our results says that the typical age of a 10 cM block shared by two individuals from the United Kingdom is between 32 and 52 generations (depending on the inferred distribution used). This discrepancy results from the fact that you are a priori much more likely to share a common genetic ancestor further in the past, and this acts to skew our answers away from the naive expectation: even though it is unlikely that a 10 cM block is inherited from a particular shared ancestor from 40 generations ago, there are a great number of such older shared ancestors. This also means that estimated ages must depend drastically on the populations' shared histories: for instance, the age of such a block shared by someone from the United Kingdom with someone from Italy is even older, usually from around 60 generations ago. This may not apply to ancestors from the past very few (perhaps less than eight) generations, from whom we expect to inherit multiple long blocks; in this case, we can hope to infer a specific genealogical relationship with reasonable certainty (e.g., [49,50]), although even then care must be taken to exclude the possibility that these multiple blocks have been inherited from distinct common ancestors.

Although the sharing of a long genomic segment can be an intriguing sign of some recent shared ancestry, the ubiquity of shared genealogical ancestry only tens of generations ago across Europe (and likely the world, [7]) makes such sharing unsurprising, and assignment to particular genealogical relationships impossible. What is informative about these chance sharing events from distant ancestors is that they provide a fine-scale view of an individual's distribution of ancestors (e.g., Figure 3), and that in aggregate they can provide an unprecedented view into even small-scale human demographic history.

Where do your nth cousins live? Our results also offer a way to understand the geographic location of individuals of a given degree of relatedness. The values of Figure 5 (and Figure S12) can be interpreted as the distribution of distant cousins for any reference population; for instance, the set of bars for Poland ("PL") in the top row shows that a randomly chosen distant cousin of a Polish individual with the common ancestor living in the past 500 years most likely lives in Poland but has a reasonable chance of living in the Balkan peninsula or Germany. Here "randomly chosen" means chosen randomly proportional to the paths through the pedigree: concretely, take a random walk back through the pedigree to an ancestor in the appropriate time period, and then take a random walk back down. If one starts in Poland, then the chance of arriving in, say, Romania is proportional to the average number of (genetic) common ancestors shared by a pair from Poland and Romania, which is exactly the number estimated in Figure 5.
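
As an illustration of this interpretation, the following Python sketch turns per-pair mean numbers of shared genetic common ancestors into a distribution over where a distant cousin lives. The population labels follow Figure 5, but the numerical values are hypothetical placeholders chosen only for illustration, not estimates read from the figure, and the function names are likewise illustrative.

```python
import random

# HYPOTHETICAL mean numbers of genetic common ancestors (0-500 ya) shared by
# one Polish individual with one individual from each population. These are
# placeholder values for illustration, NOT estimates taken from Figure 5.
shared_ancestors = {"PL": 1.0, "S-C": 0.4, "R-B": 0.3, "DE": 0.2, "UK": 0.05}


def cousin_location_probs(mean_shared):
    """Normalize mean shared-ancestor counts into location probabilities.

    Following the text, the chance that the random walk through the pedigree
    ends in a population is taken to be proportional to the mean number of
    genetic common ancestors shared with that population.
    """
    total = sum(mean_shared.values())
    if total <= 0:
        raise ValueError("need at least one positive shared-ancestor count")
    return {pop: count / total for pop, count in mean_shared.items()}


def sample_cousin_location(mean_shared, rng):
    """Draw one cousin location with probability proportional to the counts."""
    pops = list(mean_shared)
    weights = [mean_shared[pop] for pop in pops]
    return rng.choices(pops, weights=weights, k=1)[0]


probs = cousin_location_probs(shared_ancestors)
for pop, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{pop}: {p:.3f}")

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
sample = sample_cousin_location(shared_ancestors, rng)
```

With real inputs, `shared_ancestors` would hold one column of Figure 5 for the chosen reference population and time period; the normalization step is the only modeling assumption made here.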
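
The argument about the age of a 10 cM block, made under "Ages of particular common ancestors," can be sketched numerically. The Python example below is a toy calculation under stated assumptions, not the calculation behind the 32–52 generation figure quoted in the text: it assumes block lengths from an ancestor n generations back are exponentially distributed with rate 2n per Morgan (ignoring chromosome ends and detection error), and it uses a hypothetical history in which the number of genetic common ancestors grows linearly with n, as it would under a constant coalescent rate. All function names are illustrative.

```python
import math


def mean_block_length_cm(generations):
    """Mean IBD block length in cM from an ancestor `generations` back.

    A pair is separated by 2 * generations meioses, so block lengths are
    roughly exponential with rate 2 * generations per Morgan.
    """
    return 100.0 / (2 * generations)


def posterior_mean_age(block_cm, ancestors_at, max_gen=2000):
    """Mean ancestor age (in generations) for a block of length `block_cm`.

    The weight of generation n is proportional to
        ancestors_at(n) * 2n * exp(-2n * x),
    i.e. the number of genetic common ancestors at generation n (the prior)
    times the exponential density of a block of length x Morgans.
    """
    x = block_cm / 100.0  # convert cM to Morgans
    gens = range(1, max_gen + 1)
    weights = [ancestors_at(n) * 2 * n * math.exp(-2 * n * x) for n in gens]
    total = sum(weights)
    return sum(n * w for n, w in zip(gens, weights)) / total


def constant_coalescent_rate(n):
    # HYPOTHETICAL history: with a constant coalescent rate, the number of
    # genetic common ancestors grows roughly linearly with n, because
    # recombination breaks the genome into about 2n segments per Morgan.
    return float(n)


print(mean_block_length_cm(5))  # naive view: 5 generations gives 10 cM blocks
print(posterior_mean_age(10.0, constant_coalescent_rate))  # about 15
```

Even under this simple prior the expected age is about three times the naive five generations, because the many older ancestors outweigh the low chance that any one of them transmits a 10 cM block. A history in which the number of shared ancestors rises faster into the past pushes the answer older still, which is in line with the larger values the text reports for the inferred European histories.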