Ancestral Components of Admixed Genomes in a Mexican Cohort

Nicholas A. Johnson1¤a, Marc A. Coram2, Mark D. Shriver3, Isabelle Romieu4¤b, Gregory S. Barsh5,6*, Stephanie J. London7,8, Hua Tang5*

1 Department of Statistics, Stanford University, Stanford, California, United States of America, 2 Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, United States of America, 3 Department of Anthropology, Pennsylvania State University, University Park, Pennsylvania, United States of America, 4 National Institute of Public Health, Cuernavaca, Mexico, 5 Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America, 6 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America, 7 Epidemiology Branch, National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, North Carolina, United States of America, 8 Laboratory of Respiratory Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, North Carolina, United States of America

Abstract

For most of the world, human genome structure at a population level is shaped by interplay between ancient geographic isolation and more recent demographic shifts, factors that are captured by the concepts of biogeographic ancestry and admixture, respectively. The ancestry of non-admixed individuals can often be traced to a specific population in a precise region, but current approaches for studying admixed individuals generally yield coarse information in which genome ancestry proportions are identified according to continent of origin. Here we introduce a new analytic strategy for this problem that allows fine-grained characterization of admixed individuals with respect to both geographic and genomic coordinates. Ancestry segments from different continents, identified with a probabilistic model, are used to construct and study "virtual genomes" of admixed individuals. We apply this approach to a cohort of 492 parent–offspring trios from Mexico City. The relative contributions from the three continental-level ancestral populations (Africa, Europe, and America) vary substantially between individuals, and the distribution of haplotype block length suggests an admixing time of 10–15 generations. The European and Indigenous American virtual genomes of each Mexican individual can be traced to precise regions within each continent, and they reveal a gradient of Amerindian ancestry between indigenous people of southwestern Mexico and Mayans of the Yucatan Peninsula. This contrasts sharply with the African roots of African Americans, which have been characterized by a uniform mixing of multiple West African populations. We also use the virtual European and Indigenous American genomes to search for the signatures of selection in the ancestral populations, and we identify previously known targets of selection in other populations, as well as new candidate loci. The ability to infer precise ancestral components of admixed genomes will facilitate studies of disease-related phenotypes and will allow new insight into the adaptive and demographic history of indigenous people.

Citation: Johnson NA, Coram MA, Shriver MD, Romieu I, Barsh GS, et al. (2011) Ancestral Components of Admixed Genomes in a Mexican Cohort. PLoS Genet 7(12): e1002410. doi:10.1371/journal.pgen.1002410

Editor: Gregory P. Copenhaver, The University of North Carolina at Chapel Hill, United States of America

Received July 4, 2011; Accepted October 20, 2011; Published December 15, 2011

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: The research was supported by a Sloan Foundation Research Fellowship and NIGMS grant GM073059 to HT and by the Division of Intramural Research, National Institute of Environmental Health Sciences, National Institutes of Health, DHHS, to SJL. NAJ is supported by the Stanford Genome Training Program (T32 HG000044). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (HT); [email protected] (GB)

¤a Current address: Google, Mountain View, California, United States of America
¤b Current address: International Agency for Research on Cancer, Lyon, France

PLoS Genetics | www.plosgenetics.org | December 2011 | Volume 7 | Issue 12 | e1002410

Author Summary

Admixed individuals, such as African Americans and Latinos, arise from mating between individuals from different continents. Detailed knowledge about the ancestral origin of an admixed population not only provides insight regarding the history of the population itself, but also affords opportunities to study the evolutionary biology of the ancestral populations. Applying novel statistical methods, we analyzed the high-density genotype data of nearly 1,500 Mexican individuals from Mexico City, who are admixed among Indigenous Americans, Europeans, and Africans. The relative contributions from the three continental-level ancestral populations vary substantially between individuals. The European ancestors of these Mexican individuals genetically resemble Southern Europeans, such as the Spaniards and the Portuguese. The Indigenous American ancestry of the Mexicans in our study is largely attributed to the indigenous groups residing in the southwestern region of Mexico, although some individuals have inherited varying degrees of ancestry from the Mayans of the Yucatan Peninsula and other indigenous American populations. A search for signatures of selection, focusing on the parts of the genomes derived from an ancestral population (e.g. Indigenous American), identifies regions in which a genetic variant may have been favored by natural selection in that ancestral population.

Introduction

During the past decade, data generated by high-throughput genotyping technologies have enabled studies probing into two central questions in human evolutionary biology: the characterization of human population genetic structure, and the search for the molecular signature of natural selection. Insights gleaned from these studies have provided important clues for understanding the phenotypic diversity of our species, and variables representing population structure are routinely incorporated as covariates in genome-wide association studies of complex traits and diseases. At a global level, as well as within a continent or even a sub-continental region, geography has been shown to act as the leading driving force in shaping the pattern of genetic variation that we observe today [1–5]. In parallel, analyses based on European, African and East Asian populations have revealed that recent positive selection is a prevalent phenomenon throughout the genome [6–8]. Using data from the Human Genome Diversity-CEPH Panel (HGDP), a recent and comprehensive survey suggests that, while adaptation to local environment is a common theme throughout human evolution, the genetic loci involved in adaptation show little overlap among non-contiguous geographic regions [9].

While geography poses a significant reproductive barrier, multiple waves of massive trans-continental migration have occurred during the past centuries, giving rise to admixed populations. The ancestry of non-admixed individuals can often be traced to precise regions based solely on genetic data, but characterizing the sub-continental ancestry origins of an admixed individual has not been demonstrated to date. For example, the two largest minority groups in North America, Latinos and African Americans, both arose as a result of mating among populations that had been in historical reproductive isolation. The "Hispanic" or "Latino" populations include the ethnically diverse groups of Latin America; although significant genetic contributions can be traced to Indigenous American, European and West African populations, it has been challenging to determine whether one's Indigenous American ancestors originate from North, Central, or South America. Solving this problem has implications for both a deeper understanding of human evolution and for human disease, since genetic diversity between Latino populations is characterized both by variation in continent-level ancestry (e.g. Mexicans on average have lower African ancestry than Puerto Ricans) and by the population structure among the ancestral Indigenous American populations [4,10].

The assessment of the precise ancestral origin and the quantification of genetic structure within an ancestry component are limited, in part, by analytic challenges. Principal Component Analysis (PCA) is a classic technique for multivariate data analysis, which aims to project high-dimensional data to a much lower dimension while capturing the greatest level of variation [11]. This approach has gained popularity in genetic analyses due to both computational efficiency and interpretability: when the underlying population structure is driven mainly by reproductive isolation and subsequent genetic differentiation, the principal components (PCs) mirror the geographic origins of individuals [3]. By itself, however, PCA is not well suited for studying admixed populations: while leading PCs usually represent the relative contributions of continentally-divided ancestral populations, subsequent PCs may be simultaneously influenced by structures within one or more of the ancestral populations, and are consequently difficult to interpret.

We tackled this problem by employing an analytic strategy that works backwards according to the temporal nature of demographic events that underlie human admixture: genomes are first separated into the major and most recent components that reflect inter-continental migration, then each of those components is further investigated separately. As described below, we apply a probabilistic method for inferring locus-specific ancestry along the chromosome, followed by a variant of PCA to further investigate each of the ancestry-specific genomic components, which we term "virtual genomes". This hierarchical strategy yields a fine-scale view of genetic structure in admixed populations, and provides insight into the population history of nonextant ancestral populations. As an example, we study a cohort of 492 parent-offspring trios recruited from Mexico City. Our results confirm the a priori expectation that the most significant European contributors to the Mexican gene pool are populations from the Iberian Peninsula, but reveal that the Indigenous American component of the Mexican genomes is more complex.

Studying the genetic structure of admixed genomes also offers the unique opportunity to probe the adaptive landscape of the ancestral populations. This is particularly powerful for studying the Indigenous American populations, for which limited genotype data is available. As proof of principle, we report a novel application of the extended haplotype homozygosity test for recent positive selection to the European and Indigenous American "virtual genomes" evident in the Mexican cohort, and identify numerous loci as potential targets of positive selection.

Results

Overview

Our analytic strategy for studying population structure in admixed populations is shown in Figure 1; details of the approach are described in what follows, and in the Materials and Methods section. This approach first applies a model-based clustering method, frappe, to the intact genotype matrix, identifying components that correspond to variation in continental-level admixture proportions, and estimating the relative proportion of those components for each individual. Locus-specific continental ancestry along a genome is then inferred using SABER+, an extension of a Markov-Hidden Markov Model method [12] that partitions each genome into ancestral haplotype segments or "virtual genomes". Finally, within-continent population structure is determined by applying PCA to the virtual genomes, treating the rest of the genome as missing. To account for the large amount of missing data resulting from the continent-specific genomes, we implement a variation of the subspace PCA (ssPCA) algorithm [13].

Most of the results described here are from a panel of 492 Mexican parent-offspring trios recruited from Mexico City (MEX1) as part of a previous genome-wide association study using genotype data from the Illumina 550K platform [14]. For comparison, we also examined data from 23 HapMap Phase 3 Mexican trios recruited from Los Angeles, California (MEX2; http://hapmap.org). Reference populations for inferring continental-level ancestry were taken from HapMap (CEU, YRI), and additional sources as described below and in Table S1.

Figure 1. Schematic describing the analytic framework for characterizing continental-level and within-continent population structure in an admixed population. For virtual genomes, haplotypes from all but one ancestral population are "masked" as missing data.
doi:10.1371/journal.pgen.1002410.g001

Continental-level ancestry: Global and local estimates

Among the 984 parents of the Mexico City trios (MEX1), we used frappe to estimate median ancestry proportions of 65% Indigenous American, 31% European, and 3% African; the corresponding statistics in the 46 HapMap Mexican individuals from Los Angeles (MEX2) are 45%, 49%, and 5%, respectively (Figure 2A). The distribution of Indigenous American ancestry in the Mexico City population is shifted upward compared to the Los Angeles population (Figure 2B), which may reflect differences in the extent of European admixture. African ancestry is low in both cohorts, although the distribution is skewed to the right, reaching over 40% for some individuals.

We next used SABER+ to estimate recombination breakpoints between ancestral chromosomes and thus locus-specific ancestral origin (Indigenous American, European, or African) in individuals from the MEX1 and MEX2 cohorts. For the work described here, the primary goal of SABER+ is to partition the Mexican genomes into haplotype segments according to continental ancestry that can be used for subsequent analysis. However, the output of SABER+ can also be used as an independent means of assessing global ancestry, simply by averaging locus-specific ancestries across all markers, and yields estimates that are highly correlated (r > 0.99) with frappe (Figure S1).

To facilitate the analyses of sub-continental genetic structure, we constructed virtual genomes by retaining haplotype segments from a single continental-ancestral population, while masking (i.e. setting to missing) segments from all other ancestral populations; for example, MEX1AMR and MEX1EUR denote the sets of Indigenous American and European haplotype segments from the Mexico City individuals, respectively.

In a principal component analysis of this data that includes the YRI and CEU HapMap populations, the Indigenous American, European, and African virtual genomes mark vertices of a triangle (Figure 2C) in which the intact genomes of the MEX1 and MEX2 individuals are distributed broadly along an Indigenous American-European axis represented by PC1. The exact position of the intact MEX1 and MEX2 genomes depends on admixture proportions; individuals with the greatest level of African ancestry, which corresponds to PC2, mostly lie at intermediate positions along the Indigenous American-European axis. Importantly, the MEXEUR and MEXAFR virtual genomes (red and blue points, respectively) form discrete clusters whose locations coincide with those of the HapMap CEU and YRI, respectively, and, while there is no reference population in this analysis for Indigenous American, the MEXAMR virtual genomes also form a discrete cluster at a vertex of the triangle. These observations suggest that the ability of SABER+ to assign local ancestry to a specific continental origin is highly accurate, which is essential for subsequent analyses.

History of admixture

The distribution of the length of ancestry blocks is shaped by population history since admixture. When two individuals from different parental populations mate, the first generation offspring inherits exactly one chromosome from each parental population. In subsequent generations, recombination events in an admixed individual generate mosaic chromosomes of smaller ancestry segments. Intuitively, more recent admixing gives rise to longer ancestry blocks than older admixture. Furthermore, conditioning on the time since admixing within an individual's pedigree, block length distribution also depends on the individual level ancestry proportions: e.g., an individual with 90% European ancestry tends to have long European ancestral blocks because recombination events in the person's genealogy are likely to have joined two European haplotypes, and therefore fewer ancestry changes are expected.

A likelihood-based model has been proposed that can estimate several aspects of admixture history [15]. However, the admixing rates in Mexicans from the European, Indigenous American and African ancestral populations are likely dependent and difficult to model with this likelihood-based method; therefore, we attempted to estimate admixing time using a different approach. We first computed the theoretical number of ancestry blocks for individuals according to their ancestral proportions, and carried out that computation assuming a series of different admixing times (5–25 generations, dotted lines in Figure 2D). The parabolic shape of these curves conforms to the intuitive idea outlined above that the number of blocks peaks at an intermediate ancestry proportion.

We then superimposed the observed number of ancestry blocks in each MEX1 individual onto the theoretical curves; these results suggest an admixing time of 10–15 generations ago (Figure 2D). The admixing time of the European component appears slightly longer than that for the Indigenous American component (15 generations vs. 12); one potential explanation is that some mixing occurred between the European and the African ancestral individuals prior to admixing with the Indigenous American populations.

Figure 2. Continental-level ancestry proportions and admixing time. (A) Individual ancestry proportions (red = European; yellow = Indigenous American; blue = African). (B) Histogram comparing the Indigenous American ancestry in the Mexico City cohort (Mex1) and Los Angeles cohort (Mex2). (C) Principal component analysis of the Mexican individuals from Mexico City and Los Angeles, CA. Red and blue circles indicate the location of the HapMap CEU and YRI individuals, respectively. Gray points represent admixed individuals; virtual genomes, MexAMR (yellow), MexAFR (blue) and MexEUR (red), are projected to the PC plot. (D) Admixing times estimated by the number of ancestry blocks. The number of European (vs non-European) ancestry blocks is plotted against European ancestry (red); analogously, the numbers of Indigenous American (vs non-Indigenous American) blocks are plotted against the Indigenous American ancestry (yellow). Each curve represents the expected number of such blocks for a specific admixing time (in generations).
doi:10.1371/journal.pgen.1002410.g002

Regional ancestry of virtual genomes

With the MEXEUR uncoupled from the MEXAMR genomes, we investigated structure within each of these virtual genomes separately. (We did not investigate the MEXAFR virtual genomes due to their small sample size.) Because there is a large amount of missing data, e.g. the virtual genome of one individual may cover very different loci from the virtual genome of other individuals, we used the ssPCA approach as described in Materials and Methods. To help evaluate the robustness of our approach, we carried out simulation experiments, in which the effects of random error in the inference of continental locus-specific ancestry were measured with regard to their impact on accuracy of within-continent substructure estimates. Results summarized in Materials and Methods and Figure S2 indicate that European substructure can be well separated in the presence of up to 5% error, i.e. Indigenous American alleles mistakenly included in the European virtual genomes, which is well above the level of uncertainty (<2%) associated with the SABER+ approach.

In addition to the HapMap CEU, who are mostly of Northern European ancestry, we used individuals recruited from Dublin (Ireland), Warsaw (Poland), Rome (Italy) and Porto (Portugal) to provide references for different areas within Europe. The first two PCs provide good separation of these reference populations, and correspond roughly to North-South and West-East gradients (Figure 3A). Both the MEX1EUR and MEX2EUR virtual genomes are most closely related to intact genomes from Porto, which we interpret as a surrogate for populations from the Iberian Peninsula [3], consistent with the historical record that the first European migrants to Mexico were Spaniards.

For analysis of the MEXAMR virtual genomes, we introduced 129 individuals representing 8 different Indigenous American populations as reference genomes (Table S1) [16]. Initially, we also

Figure 3. Population structure within the European and Indigenous American components of Mexican genomes. (A) Principal component analysis of the European chromosomal segments traces the ancestry origin closest to Portuguese. (B) Principal component analysis of Indigenous American segments, HapMap CEU and various Indigenous American populations. (C) Same as (B), with CEU individuals removed. (D) Same as (B), with CEU, Surui, Karitiana and Pima removed. The arrow highlights a Mexican individual whose ancestry is traced to the South American group of Aymara.
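The masking step that produces a virtual genome, and the reason the resulting missing data complicates a comparison between individuals, can be illustrated with a minimal Python sketch. It assumes local ancestry has already been called for every marker on a phased haplotype; the labels "AMR", "EUR" and "AFR", the function names, and the toy data are all hypothetical and do not reflect the SABER+ output format or the authors' implementation. The sketch masks alleles from other ancestries, recovers global ancestry by averaging the local calls, and measures how many markers two virtual genomes share.

```python
from typing import List, Optional, Sequence

# Hypothetical representation: one allele (0/1) and one ancestry label per marker.
Allele = Optional[int]  # None marks an allele masked as missing


def virtual_genome(alleles: Sequence[int], ancestry: Sequence[str], keep: str) -> List[Allele]:
    """Retain alleles on segments whose local ancestry is `keep`; mask all others."""
    if len(alleles) != len(ancestry):
        raise ValueError("alleles and ancestry must have one entry per marker")
    return [a if anc == keep else None for a, anc in zip(alleles, ancestry)]


def global_ancestry(ancestry: Sequence[str], population: str) -> float:
    """Global ancestry as the average of locus-specific calls across markers."""
    return sum(anc == population for anc in ancestry) / len(ancestry)


def shared_fraction(vg_a: Sequence[Allele], vg_b: Sequence[Allele]) -> float:
    """Fraction of markers that are non-missing in both virtual genomes."""
    both = sum(a is not None and b is not None for a, b in zip(vg_a, vg_b))
    return both / len(vg_a)


# Toy example: two haplotypes typed at the same 10 markers.
alleles_1 = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
ancestry_1 = ["AMR"] * 4 + ["EUR"] * 3 + ["AMR"] * 3
alleles_2 = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
ancestry_2 = ["EUR"] * 2 + ["AMR"] * 3 + ["EUR"] * 2 + ["AFR"] + ["AMR"] * 2

eur_1 = virtual_genome(alleles_1, ancestry_1, "EUR")
eur_2 = virtual_genome(alleles_2, ancestry_2, "EUR")

print(eur_1)
print(global_ancestry(ancestry_1, "EUR"), global_ancestry(ancestry_2, "EUR"))
print(shared_fraction(eur_1, eur_2))
```

In this toy case the two haplotypes carry 30% and 40% European ancestry, yet their European virtual genomes overlap at only 2 of the 10 markers. Sparse and uneven overlap of this kind is what motivates a PCA variant that tolerates missing data rather than one that relies on markers observed in both individuals.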
doi:10.1371/journal.pgen.1002410.g003

included the HapMap CEU based on previous results in which some Indigenous American individuals from the Human Genome Diversity Panel (HGDP) were observed to have non-negligible levels of European ancestry [2]. Indeed, the first two PCs for this analysis occur along European-Indigenous American and within-America axes (Figure 3B), and reveal varying levels of European ancestry in the Mayan, Quechua and Colombian populations. In this analysis and subsequent ones carried out in which certain reference populations were removed (CEU removed from Figure 3C; CEU, Surui, Karitiana and Pima removed from Figure 3D), the MEXAMR virtual genomes are most closely related to intact genomes of individuals from the southwestern Mexican state of Guerrero (Guerr), which includes Nahua, Mixtec and Tlapanec indigenous groups. Although the Guerrero individuals and the Pima individuals cluster together in Figure 3C, they are separable on PC 3 (Figure S3), along which the Guerrero, but not Pima individuals, cluster with MEXAMR. The Indigenous American virtual genomes of Mexicans from Mexico City (MEX1AMR) are similar to those from Los Angeles (MEX2AMR); further, we observe a gradient with varying contribution from Mayans, with some Mexicans deriving their Indigenous American ancestry predominantly from Mayans. One individual from Mexico City has an Indigenous American virtual genome that is localized with the Quechua (arrow, Figure 3C) and therefore is likely to have a source of Indigenous American ancestry that is distinct from that of the other Mexicans.

Investigating natural selection in ancestors of admixed genomes

The ability to accurately construct ancestral virtual genomes from admixed genomes provides a number of opportunities in the areas of human evolution and genetic anthropology. As an example of how such data can be used more generally, we examined the Mexican ancestral components for regions of extended haplotype homozygosity, which mark loci that have undergone recent positive selection. We used the integrated haplotype score (iHS) statistic [8], with a modified normalization procedure so as to fit a standard normal distribution.

For the virtual genome SNPs that show the strongest evidence of positive selection, the degree of overlap between the Europeans and Indigenous Americans is similar to that expected by chance (Figure 4A). Specifically, we considered SNPs with |iHS| > 2.5, which represent approximately the top 1% scores in either component; 3874 and 3931 SNPs meet this criterion in MEXAMR and MEXEUR, respectively, with 57 SNPs overlapping between the two sets (expected overlap = 40, p = 0.094). Similarly, we found little overlap between the iHS scores in MEXAMR and those computed based on the HapMap populations [8] (Figure 4B). In contrast, the correlation is much higher between the iHS in MEXEUR and those from HapMap CEU (r = 0.79), which reflects shared population and adaptive histories of Southern Europeans (the MEXEUR) and the CEU (mostly from Northern and Central Europe). Specifically, of 3257 and 3460 SNPs with |iHS| > 2.5 in MEXEUR and CEU, respectively, 655 are overlapping (expected overlap = 32, p < 2.2×10^-16) (Figure 4C). These findings are consistent with previous observations that intact genomes from the HGDP collection exhibit histories of positive selection that differ according to continent [9].

We also asked what genes might underlie the strongest signatures of positive selection. Towards this end, we grouped SNPs into 50 kb windows, selected regions with at least 20 SNPs and at least 10% of SNPs with |iHS| > 2.5, and ranked windows by the maximum |iHS| score. The top 10 regions within the MEXEUR and MEXAMR components are shown in Table 1 and Table 2. The only genomic location that features in both lists is the HLA region on chr 6p, a region known to have experienced strong selection [17]; however, the precise variants that show high iHS scores differ between the Europeans and Indigenous Americans. Outside the HLA region, the most prominent signal in the European component coincides with APBA2 on chr 15q, which is in close proximity to a known pigmentation gene, OCA2. In the Indigenous American component, the strongest signal occurs in chr 6p12.3-2; this region harbors numerous genes, including IL17A, which is associated with chronic inflammatory diseases such as rheumatoid arthritis, and PKHD1, which is associated with polycystic kidney disease [18,19].

Figure 4. Signatures of recent positive selection. (A) Overlap of top 1% SNPs (with |iHS| > 2.5) in MexEUR and MexAMR. (B) Overlap of SNPs with |iHS| > 2.5 in MexAMR, HapMap CEU and YRI [8]. (C) Overlap of SNPs with |iHS| > 2.5 in MexEUR, HapMap CEU and YRI. Numbers in red denote significant enrichment in overlap.
doi:10.1371/journal.pgen.1002410.g004

Table 1. Regions showing strongest evidence of recent positive selection in the European component.

Cytological Position    Genes (Number)            Size (kb)    Number of SNPs with |iHS| > 2.5    Max |iHS|
6p21.33-32              TRIM31, HLA (88)          2000         113/581                            5.71
15q11-q12               APBA2 (1)                 194          30/37                              4.41
17q24                   RGS9 (1)                  268          33/54                              4.36
18p11.23                PTPRM (1)                 192          24/34                              4.31
15q26.1                 none                      154          20/50                              4.02
5q22.1                  TMEM232, SLC25A46 (2)     309          19/26                              4.00
9p22.3                  SMARCA2 (1)               87           29/47                              3.95
3q22.1                  COL6A5 (1)                164          17/36                              3.89
9q33.2                  DENND1A (1)               480          34/61                              3.86
3p14.1                  none                      82           13/23                              3.84
22q12.2                 SFI1, PISD (3)            267          17/31                              3.83

Genes (Number) displays the number of genes in the region; when a region encompasses multiple genes, we arbitrarily choose two. Size denotes the length of the region under selection; consecutive regions are merged. Number of SNPs with |iHS| > 2.5 provides the number of SNPs with |iHS| > 2.5 / total number of SNPs. Max |iHS| is the maximum |iHS| score achieved in a region.
doi:10.1371/journal.pgen.1002410.t001

Table 2. Regions showing strongest evidence of recent positive selection in the Indigenous American component.

Cytological Position    Genes (Number)            Size (kb)    Number of SNPs with |iHS| > 2.5    Max |iHS|
6p21.33                 TRIM40, TRIM31 (4)        75           4/22                               4.33
6p12.3-2                IL17, PKHD1 (18)          3083         181/333                            4.26
19q13.11                SLC7A9, GPATC1 (6)        421          37/59                              4.23
12p12.1-23              SSPN, ITPR2 (2)           547          70/123                             4.15
1q24.1                  MAEL, GPA33 (3)           164          24/45                              4.14
10p12.32                PLXDC2 (1)                123          11/40                              3.94
14q12                   STXBP6 (1)                146          13/30                              3.92
16q23.2                 none                      197          17/48                              3.72
5q33.1                  SLC36A3, GM2A (3)         133          14/29                              3.54
2q24.2                  LY75, PLA2R1 (2)          244          18/53                              3.53

Genes (Number) displays the number of genes in the region; when a region encompasses multiple genes, we arbitrarily choose two. Size denotes the length of the region under selection; consecutive regions are merged. Number of SNPs with |iHS| > 2.5 provides the number of SNPs with |iHS| > 2.5 / total number of SNPs. Max |iHS| is the maximum |iHS| score achieved in a region.
doi:10.1371/journal.pgen.1002410.t002

Discussion

Previous approaches to analyzing the ancestry in admixed individuals have largely focused on estimating continental-level admixture proportions. Within-continent ancestry analyses have been performed at a population level but not at an individual level. Our approach is distinct in several ways from a recent study, which reports the affinity, at a population level, of admixed populations to various ancestral groups [20]. This latter approach requires a pre-defined notion of subpopulation, such as Mexicans versus Puerto Ricans, and produces a population level summary of genetic relationship. In contrast, our approach does not rely on such pre-defined ethnic groups, and thus has the ability to identify previously unrecognized substructure at an individual level, such as the detection of one individual with South American ancestry.

In what follows, we first discuss aspects of the approach that may be generally relevant, and then provide some insights into the evolutionary history of the Mexican population. In the context of genome-wide association studies of complex traits and diseases, variables representing both continental-level and within-continent population structure need to be adjusted to provide a more accurate correction for population stratification [21,22].

Two methodological innovations contributed to the hierarchical depiction of individual ancestry origin: an improved algorithm for locus-specific ancestry inference, which accommodates multiple ancestral populations, and a subspace PCA algorithm that permits varying degrees of missing data. SABER+ uses a graphical model to account for haplotype structure within an ancestral population, and is more accurate for analyzing high-density genotype data. The accuracy of the locus-specific ancestry is supported by two observations. First, in the continental-level PCA analysis (Figure 2C), all "virtual genomes" that are attributed to a single ancestral population cluster tightly with the reference individuals. Had there been substantial error in the local ancestry inference, some of these genomes would appear admixed and lie in between the vertices. Second, we included the HapMap CEU individuals in the analysis of the Indigenous American components of the genomes because some of the Mayan individuals have been shown to have European admixture [2]. Indeed, although Figure 3B clearly reveals European admixture in some Mayan and Quechua individuals, little European admixture is detected in the putative Indigenous American genomes of the Mexicans, MexAMR. We note that, although many methods for estimating local ancestry, including the method used in this study, are applicable to unphased data, the parent-offspring trio structure of the Mexican data (both the Mexico City cohort and the HapMap Mexican sample) allows accurate haplotype phasing for each individual, which likely improves the accuracy of inference of local ancestry along each haplotype. As a result, both phasing and ancestry inference are likely more accurate than those estimated based on unphased genotype data.

Typically, application of PCA for genetic structure analyses makes use of the program Eigenstrat, which is based on the eigen-decomposition of the covariance matrix, V = GG', where G' is the transpose of the centered and scaled genotype matrix, G [21]. In computing this covariance matrix, missing genotypes are set to the column means; thus V_ij is computed based on genotypes that are non-missing in both individuals i and j. While this approach is adequate for analyses based on high-density genotypes with very low levels of missing genotypes, it is not appropriate for analyzing the virtual genomes, which feature large and varying proportions of missing data. Consider two individuals each with 30% ancestry from the population of interest (e.g. Europe): within each individual, 9% of the genome is expected to be homozygous in European ancestry, and therefore <1% of the markers are expected to be non-missing in both genomes after excluding non-European genotypes. This leads to two problems: first, it has reduced power for detecting population substructure because it uses only a small fraction of informative genotypes for the continent of interest; more importantly, the sampling variability of the covariance estimates depends heavily on the proportion of missing genotypes, biasing the PCs such that individuals with high missing rate become outliers along each PC. The ssPCA we implement does not require the computation of the covariance matrix, and uses all informative markers in each genome; hence it is less sensitive to the missing data. Since the algorithm can compute the first k PCs without computing all PCs, it also has computational advantages over the current covariance-based implementation, especially when the number of individuals is large.

Pattern of ancestry origin

We find extensive variation with respect to continental-level ancestry proportions, both between geographic regions (shown by the much higher Indigenous American ancestry in the Mexico City cohort compared to the HapMap Mexican Americans from Los Angeles) and between individuals within each cohort. This study benefits from the ability to divide the genome of a single Mexican individual into its constituent ancestral components. The ability to trace chromosomal segments to their respective ancestral populations allows us to scrutinize the ancestry origin of each individual within a continent. Within the European component of the Mexican genomes (MexEUR), nearly all individuals, both from Mexico City and from Los Angeles, trace their European ancestries to a Southern European population, as represented in our study by the Portuguese. Within the Indigenous American component of the genomes (MexAMR), a majority of individuals trace their ancestries to groups from the southwest coastal regions of Mexico, consistent with a previous study, which found Zapotec individuals from the State of Oaxaca to best approximate the Indigenous American ancestral population for Mestizos [23]. Importantly, we find evidence of varying levels of Mayan admixture, as well as one individual with Indigenous American ancestry from Bolivia/Peru.

[...] African American individual to multiple West/Central West African groups, but the relative proportions are nearly constant across all individuals [24]. This difference can be reconciled by the distinct migratory histories: the African ancestry in African-American populations is largely derived from the trans-Atlantic slave trade, which forcibly removed African individuals from various geographic regions of Western Africa, ranging from Senegal to Nigeria to Angola [25]. In contrast, no evidence suggests massive relocation of the Indigenous Americans during the colonization in North America, and hence reproductive isolation likely has been maintained between geographically separated Indigenous American populations. One limitation of the current study is the incomplete sampling of the Indigenous American populations in our reference panel, which represents two distinct regions in Mexico: the Southwest coastal State of Guerrero and the Yucatan Peninsula. Thus, while most Mexicans trace their Indigenous American ancestries to the indigenous groups from the State of Guerrero (Guerr), it is possible that the true ancestors of the extant Mexicans are an un-sampled group that is genetically similar to Guerr. With the coming of whole genome sequencing data, it is possible that indigenous populations from neighboring states can be distinguished, and thus it may even be possible to detect admixture from closely related Indigenous American groups.

Selection

The EHH analyses of the Southern European and the Indigenous American components of the Mexican genomes separately revealed numerous intriguing putative targets of recent positive selection. We note that many other approaches have been developed to detect specific types of selective events, and are equally applicable [26,27]. We chose to use the iHS test because it has been applied to both the HapMap dataset and the HGDP dataset, thus facilitating comparison. The goal of this paper is not to conduct a comprehensive survey of the selective landscape in the ancestral populations of the present day Mexicans, but rather to illustrate the potential benefits of such endeavors. Given the difficulties in recruiting large samples of non-admixed indigenous individuals from each well-defined Indigenous American group, we argue that admixed populations will provide valuable insight in future endeavors in understanding the evolutionary histories of the Indigenous American populations, some of which may now be extinct. For example, individuals with full Taíno ancestry are rare, but approximately 15% of the contemporary gene pool of Puerto Ricans may have been derived from Taínos. Hence, admixed Puerto Rican genomes can be used to learn about those of the ancestral Taínos [28]. We note that this approach of assembling an ancestral population from a mixed population has also provided important insights in the Aboriginal Australian population in a recent study [29].

Distinguishing between selective events that occurred within the ancestral populations and those that occurred post-admixing requires careful consideration of the tests and associated assumptions. In the current setting, we reasoned that, since a novel adaptive allele is unlikely to be swept to a substantial frequency within a period of less than 500 years (since the arrival
Of note, individuals with high levels oftheEuropeansinMexico),andsincetheEHHmethoddoesnot of Mayan or South American ancestries do not stand out in the have appreciable power to detect low frequency adaptive alleles continental-level PCA, as their continental-level ancestry propor- [9],mostofthesignalsdetectedbytheEHHhadoccurredpriorto tionsare comparable totherestof theMexicans. admixing, and hence represent selection within the ancestral ThefindingthatmostMexicanindividualstracetheirEuropean populations. On the other hand, the preservation of a long and Indigenous American ancestry to well-defined geographic haplotype excludesthepossibilityofveryancientselectiveevents; regions contrasts sharply the lack of structure in the African this belief is also supported by the observation that there is little ancestry in the African Americans: not only did we trace each overlapbetweenthesignaturesdetectedintheSouthernEuropean PLoSGenetics | www.plosgenetics.org 8 December2011 | Volume 7 | Issue 12 | e1002410 AncestralComponentsofAdmixedMexicanGenomes andtheIndigenousAmericancomponents.Inpreviousstudiesof TableS1.WeusedBEAGLEtoconstructhaplotypesforMexican Puerto Ricans and African Americans, numerous genomic trios[34].Aschildrenprovidenoadditionalinformationregarding locations were found where locus-specific ancestry deviate from population structure or adaptation, they are not used in thegenome-wideaverage,andcouldrepresenttargetsofselection subsequent analyses. in the admixed populations [30,31]. In the current analyses, the onlylocusshowingdeviationfromthegenome-wideaverageisthe Genome-wide and locus-specific ancestry inference HLA region on chr 6, again supporting a population-specific Continental-level admixture proportions were estimated two pattern of selection. 
Therefore, the adaptive history of the ways:(1)amodel-basedclusteringalgorithmimplementedinfrappe Indigenous American groups may vary considerably, and should [35], and (2) average locus-specific ancestries across all markers. bestudiedseparatelyandnotasawholegroup.Suchanalysescan Locus-specificancestrywasestimatedwithSABER+,anextension beachieved,forexample,byexaminingtheIndigenousAmerican ofapreviouslydescribedapproach,SABER,thatusesaMarkov- components inMexicansversus that of Puerto Ricans. Hidden Markov Model [12]. SABER+ differs from SABER in Our results have important implications for the design of implementation of a new algorithm, an Autoregressive Hidden genome-wide association studies based on admixed populations. Markov Model (ARHMM), in which haplotype structure within Epidemiologicstudieshavefoundvaryingprevalenceofconditions theancestral populations isadaptively constructed using a binary such as asthma, diabetes and alcohol-related problems across decisiontreebasedonasmanyas15markers,andwhichtherefore Hispanic national groups [28,32,33]. Distinct population and does not require a priori knowledge of genome-wide ancestry adaptive history among Hispanics ethnic groups can give rise to proportions(Johnsonetal.,inpreparation).Insimulationstudies, heterogeneity in complex traits. Therefore, the importance of theARHMMachievesaccuracycomparabletoHapMix[36]but accounting for intra-continental genetic structure in disease is more flexible in modeling the three-way admixture in the mapping studies, in addition to adjusting inter-continental Mexican population and does not require information about the admixture proportions, needstobecarefully evaluated. recombination rate. HapMapCEUandYRIindividualswereusedasthereference Materials and Methods ancestral populations.Basedonfrappe andsupportedbyPCA, 50 individuals in MEX1 set have more than 95% Indigenous Populations American ancestry. 
These individuals were initially used to The Mexican individuals analyzed in this project come from approximate the Indigenous American ancestors in the locus- two sources: a panel of 492 Mexican parent-offspring trios specificancestryanalyses;aniterativeprocedureisusedtoidentify recruited from Mexico City as part of a previous genome-wide and correct for the non-Indigenous American segments in these associationstudy(MEX1)[14],and23HapMapPhase3Mexican individuals. Accuracy of the locus-specific ancestry is verified by trios recruited from Los Angeles, California (MEX2; http:// performing a PC analysis, treating each individual as three non- hapmap.org). For estimating locus-specific ancestry, we used the admixed genomes, MexEUR, MexAMR, and MexAFR (see section HapMap CEU (N=88) and YRI (N=100) individuals for the ‘‘subspace PCA’’ below). ancestralpopulations.ToanalyzetheEuropeancomponentofthe admixed genome, we augmented the Mexican datasets with individuals recruited from Dublin, Ireland (N=43), Rome, Italy Subspace PCA (ssPCA) (45), Warsaw, Poland (N=45) and Porto, Portugal (N=43). For We implemented this algorithm to accommodate the large the Indigenous American component analyses, we combined the amount of missing genotype data in partially masked virtual data generated in two previous studies [2,16]. Four Mayan genomes,andusedittoderiveallthePCAresultsreported here. individuals with substantial European admixture are removed. The statistical theory of the algorithm in a general data mining The combined set used for the subsequent analyses includes 14 context can be found in [13]; however, various modifications are individualsfromGuerrero,Mexico(twoNahua,sevenMixtecand required for the current setting, as described below. 
Let Gh fiveTlapanec),24MayanindividualsfromtheYucatanPeninsula, (h=1,2) be two N6M matrices, in which (cid:2)g1 ,g2 (cid:3) denote the nm nm 24 Quechua collected in Cerro de Pasco, Peru, 25 individuals of unordered pair of alleles at SNP m (m=1,…,M) in individual n largelyAymaraancestrycollectedinLaPaz,Bolivia,13Karitiana (n=1,…,N); the columns of Gh are standardized to have mean 0 and eight Surui from Brazil, seven Colombians, and 14 Pima. and variance 1. To compute the subspace spanned by the first k Because the sample sizes for Nahua, Mixtec and Tlapnec are principal components (PC), we begin by finding a matrix small, and all individuals were recruited from the same state, we decomposition, G(cid:2)&AST, which minimizes the reconstruction considered these individuals as one group. Table S1 summarizes error, R, defined as: theindividuals usedforeach analysis. Genotyping QC and haplotype construction R&jjG{G(cid:2)jj2F~XXX(gnhm{Xankskm)2, Genotypingandqualitycontrolprocedureshavebeendescribed h n m k intheprimarypublicationsforeachdataset,exceptforthedataset of 176 European individuals. Briefly, MEX1 and the HGDP subjecttotheconstraintsthatthecolumnvectorsofAareofunit individuals were genotyped on Illumina 550K and on 650K norm and mutually orthogonal and the row vectors of S are also Beadchip, respectively. The Indigenous American individuals mutuallyorthogonal.HereAisaN6dmatrix,SisaM6dmatrix, fromBighametal.(2009)weregenotypedonAffymetrix1MSNP andd,N#MrepresentsthedesirednumberofleadingPC’s.The arrays[16].Thesetof176Europeanindividualsweregenotyped algorithm we use is a generalized instance of the coordinate using Illumina HumanHap300 arrays; this dataset originally descent approach [37], which iteratively optimizes matrix A for included 180 individuals; four individuals were found with non- fixed Sandthenoptimizes Sfixing Aaccording totherules: negligiblenon-Europeanancestryandwereexcluded.SNPswitha call rate of less than 95% were excluded. 
The number of individuals and markers used for each analysis is summarized in Ar~Ar{1zl½(G{Ar{1Sr{1)(Sr{1)0(cid:3) and j j j PLoSGenetics | www.plosgenetics.org 9 December2011 | Volume 7 | Issue 12 | e1002410 AncestralComponentsofAdmixedMexicanGenomes Estimating time since admixing Sr~Sr{1zl½(Ar{1)0(G{ArSr{1)(cid:3), We used the number of ancestry blocks in an individual as j j j summary statistics. Tracing through a pedigree of T generations, the expected number of recombination events in a haploid where l is a learning rate, the superscripts, r, indicate iteration, genomeis0.016TL,whereListhetotalgenomelength(takento andthesubscripts,j,denotethej-thcolumnofamatrix.Itcanbe be3435cM[38]).Underahybrid-isolationmodelandassuminga shownthatthecolumnsofAandSspanthesubspaceofthefirstd genome-wide ancestry proportion of z, a fraction of 26z(12z) of PCs, and that the leading PCs can be computed by orthogonal- the recombination events occurs between two haplotypes of izingthecolumnsofAandS[13].Toevaluatetheaccuracyofour opposite ancestry and thus leads to transitions in ancestry. When modifiedssPCAapproach,weapplieditinparallelwithEigenstrat wecountthenumberofancestryblocksintherealdata,wedonot [21] to the intact Mexican genomes, and found the leading PCs observerecombination eventsthat occur between twohaplotypes produced by the two algorithms were virtually identical, up to a of the same ancestry. Hence the expected number of ancestry permutation of signs. switchesinadiploidgenomeisB=(26260.01)6TL6z(12z),and eachancestryswitchcreatesoneadditionalancestryblock.When Simulation study with known substructure there is no ancestry switch in a genome, the number of ancestry We carried out two simulation experiments to evaluate the blocks is defined to be the same as the number of chromosomes. 
impactofstatisticaluncertaintiesassociatedwithestimatinglocus- Therefore,foreachspecifictimeofadmixing,T,wecomputedthe specificancestry,andtoinvestigatetheperformanceofthessPCA expectednumberofancestryblocksasB+2622,withthegenome- approach. wide ancestry proportion, z, varying from 0 to 1 at 100 equally In the first set of simulations, we created 10 datasets in which spaced grid points. Each curve in Figure 2D shows the expected 400 admixed genomes were modeled to mimic a Latino numberofancestryblocksasafunctionofadmixtureproportions population: each individual draws chromosomal segments from foraspecificadmixingtime.TheestimatednumbersofEuropean EuropeanandIndigenousAmericanancestry,andtheproportion and non-European ancestry blocks from the Mexican individuals of Indigenous American ancestry in each individual matches were tallied and compared to the expected values. To assess the what we observed in MEX1. For 200 individuals, European- impact of uncertainty associated with estimating the number of derived segments were sampled from the HapMap CEU ancestry blocks, we note that errors in estimating locus-specific haplotypes, representing Northern and Western European ancestry often create very short ancestry blocks. Hence, we ancestry, while for the remaining 200 individuals, the Europe- simulated admixed genomes according to the hybrid-isolation an-derived segments were sampled from MexEUR inferred from model, but removed extremely short blocks (segments with ,10 the actual Mexican genotype data, representing Southern SNPs)frombothsimulatedgenomesandrealdata.Theestimated European ancestry. The chromosomal segments from CEU and admixing timeremainedthesameunderthisalternative analysis, MexEUR in the admixed individuals were treated as the true suggesting the estimated admixing time is relatively robust. The European virtual genomes. 
hybrid-isolation model was chosen because of the mathematical To evaluate the potential impact of statistical uncertainty, we simplicity;underamorerealisticcontinuousgene-flowmodel,the introducedrandomerrorsinwhichthetrueidentitiesofEuropean estimated times of admixing should be interpreted as an vs.IndigenousAmericansegmentswereswitchedwithprobability approximationofaverageadmixingtime,weightedbytherelative e.ThetopPCforeachsetofsimulatedvirtualgenomes(ateachof levelof gene-flow ineachgeneration. 8 error rates, e=0.01–0.20) was computed with ssPCA. We evaluate the effect of these errors by calculating a confusion Substructure within the European and Indigenous fraction, j, that quantifies the accuracy with whichthe estimated American component of the genome first PC separates individuals with Northern vs. Southern Europeanancestry,andisdefinedastheproportionofindividuals To assess the sub-continental population structure, each thatlieonthe‘‘wrong’’sideofathresholdthatbestseparatesthe Mexican genome was partitioned into three non-admixed twogroups.Thus,jcanrangefrom0(perfectseparation)tonearly genomes, by masking (i.e. setting to missing) alleles from all but 50% (complete confusion as would be observed for genetically one ancestral population. In other words, the European homogenous groups). Finally, we analyze each of the 10 datasets component of a Mexican’s genome (MexEUR) was derived by usingSABER+,exactlyaswasdoneforrealdata:applyssPCAto treatingasmissingallalleleswhoseoriginswereinferredasAfrican estimate substructure, and calculate a confusion fraction. The or Indigenous American. For within European analysis, we results for this simulation experiment are depicted in Figure S2, applied ssPCA on the dataset consisting of MexEUR (including andshowthattheconfusionfractionincreasessubstantially,froma bothMexicoCityandHapMapsamples),88HapMapCEUand mean of 2.14% to 17.5%, at error rates between 0.03 and 0.05. 
176 European individuals from four cities: Dublin (Ireland), UsingSABER+onthesesame10datasetsyieldsameanconfusion Warsaw (Poland), Rome (Italy) and Porto (Portugal). Because of fraction of 1.58%, which corresponds to an error rate ,0.02 the limited number of informative haplotype segments, Mexican (indicated bythearrowin Figure S2). individuals with less than 25% European ancestry were excluded In a second set of simulations to investigate the ability of the from this analysis. In an analogous fashion, we analyzed the ssPCAapproachtodealwithmissingdata,wecreatedfivedatasets Mexican component of the genome, MexAMR, along with 129 in which the proportion of genome-wide European ancestry in indigenous Indigenous American individuals representing 8 each of 400 admixed genomes was fixed at either 50% or 30%, populations (Table S1). Because the African ancestry is low in respectively.ApplyingSABER+andssPCAtothesedatasetsyields both Mexican cohorts (3% and 5%, respectively), we did not mean confusion fractions of 0 and 0.7%, respectively, indicating analyze thewithin-Africa population structure. that our approach performs well for situations such as the one described here, where mean genome-wide continental ancestry Detecting signature of selection proportions are above 30% for both the European and the or all SNPs with frequencies between .05 and .95 in the Indigenous American components. respectivepopulations,iHSwascalculatedfollowingVoightetal. PLoSGenetics | www.plosgenetics.org 10 December2011 | Volume 7 | Issue 12 | e1002410
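The block-count expectation given under "Estimating time since admixing" can be worked through numerically. What follows is a minimal sketch rather than the authors' code: the function names `expected_ancestry_blocks` and `implied_generations` are hypothetical, and the second helper merely rearranges the same formula for illustration (the text compares observed counts with the curves of Figure 2D; it does not describe an inversion).

```python
# Sketch of the hybrid-isolation expectation B + 2*22 quoted in the text.
GENOME_LENGTH_CM = 3435.0  # L, total genome length in cM, as given in the text
N_AUTOSOMES = 22           # a switch-free diploid genome has 2 * 22 blocks


def expected_ancestry_blocks(T, z, L=GENOME_LENGTH_CM):
    """Expected ancestry blocks in a diploid genome T generations after admixing."""
    switches = (2 * 2 * 0.01) * T * L * z * (1.0 - z)  # B in the text
    return switches + 2 * N_AUTOSOMES


def implied_generations(blocks, z, L=GENOME_LENGTH_CM):
    """Hypothetical helper: solve the same formula for T (needs 0 < z < 1)."""
    return (blocks - 2 * N_AUTOSOMES) / ((2 * 2 * 0.01) * L * z * (1.0 - z))


# One curve of the kind described in the text: z on 100 equally spaced
# grid points from 0 to 1, here for T = 10 generations.
grid = [i / 99 for i in range(100)]
curve_T10 = [expected_ancestry_blocks(10, z) for z in grid]
```

For example, with T = 10 and z = 0.5 the expected number of switches is 0.04 × 10 × 3435 × 0.25 = 343.5, giving 387.5 expected blocks once the 44 baseline blocks are added.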
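To make the idea behind the "Subspace PCA (ssPCA)" section concrete, here is a rough sketch of alternating gradient-style updates of A and S that use only the observed (unmasked) entries, followed by an orthogonalization of A. It is an illustration under stated assumptions, not the published implementation: the function name `masked_subspace_pca`, the whole-matrix (rather than per-column) updates, the random initialization, the default learning rate, and the use of a single matrix G instead of the two allele matrices are all choices made here for brevity.

```python
import numpy as np


def masked_subspace_pca(G, observed, d=2, lr=0.005, n_iter=300, seed=0):
    """Fit G ~ A @ S.T on observed entries only; return a basis for span(A).

    G        : (N, M) standardized genotype matrix (masked entries may be NaN)
    observed : (N, M) boolean array, True where the genotype is not masked
    """
    rng = np.random.default_rng(seed)
    N, M = G.shape
    A = 0.1 * rng.standard_normal((N, d))
    S = 0.1 * rng.standard_normal((M, d))
    G0 = np.where(observed, G, 0.0)  # masked entries never contribute

    def residual(A, S):
        # Reconstruction residual, zeroed wherever the genotype is missing.
        return np.where(observed, G0 - A @ S.T, 0.0)

    errors = [float(np.sum(residual(A, S) ** 2))]
    for _ in range(n_iter):
        A = A + lr * residual(A, S) @ S      # step on A with S held fixed
        S = S + lr * residual(A, S).T @ A    # step on S with the updated A
        errors.append(float(np.sum(residual(A, S) ** 2)))

    Q, _ = np.linalg.qr(A)                   # orthonormal basis for span(A)
    return Q, S, errors
```

Because the residual is zeroed at masked positions, no covariance matrix over pairwise non-missing markers is ever formed, which is the property the text emphasizes for virtual genomes with large and varying proportions of missing data.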
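The confusion fraction used in "Simulation study with known substructure" is defined in words only, so a small sketch may help. The function below is one straightforward reading of that definition (the share of individuals on the "wrong" side of the best single threshold on the first PC); the name `confusion_fraction` and the 0/1 group coding are assumptions made for this example, and tied scores are not treated specially.

```python
import numpy as np


def confusion_fraction(pc1_scores, groups):
    """Smallest share of individuals misplaced by any single threshold on PC1.

    pc1_scores : 1-D array of first-PC coordinates
    groups     : 1-D array of 0/1 labels (e.g. 0 = Northern, 1 = Southern)
    """
    order = np.argsort(pc1_scores)
    y = np.asarray(groups)[order]
    n = len(y)
    best = n
    for cut in range(n + 1):
        # Call everything left of the cut group 0 and everything right group 1.
        wrong = int(np.sum(y[:cut] == 1) + np.sum(y[cut:] == 0))
        # The sign of a PC is arbitrary, so also allow the flipped orientation.
        best = min(best, wrong, n - wrong)
    return best / n
```

With scores `[0, 1, 2, 3]`, labels `[0, 0, 1, 1]` separate perfectly (fraction 0), while labels `[0, 1, 0, 1]` leave one of four individuals on the wrong side of the best threshold (fraction 0.25).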
Description: