UCLA
UCLA Previously Published Works

Title: Selecting SNPs informative for African, American Indian and European Ancestry: application to the Family Investigation of Nephropathy and Diabetes (FIND).
Permalink: https://escholarship.org/uc/item/75f580zv
Journal: BMC Genomics, 17(1)
ISSN: 1471-2164
Authors: Williams, Robert C; Elston, Robert C; Kumar, Pankaj; et al.
Publication Date: 2016-05-01
DOI: 10.1186/s12864-016-2654-x
Peer reviewed

Williams et al. BMC Genomics (2016) 17:325
DOI 10.1186/s12864-016-2654-x

RESEARCH ARTICLE, Open Access

Robert C. Williams1*, Robert C. Elston2, Pankaj Kumar1, William C. Knowler1, Hanna E. Abboud3ˆ, Sharon Adler4, Donald W. Bowden5, Jasmin Divers5, Barry I. Freedman5, Robert P. Igo Jr.2, Eli Ipp4, Sudha K. Iyengar2, Paul L. Kimmel6, Michael J. Klag7, Orly Kohn8, Carl D. Langefeld5, David J. Leehey9, Robert G. Nelson1, Susanne B. Nicholas10, Madeleine V. Pahl11, Rulan S. Parekh12, Jerome I. Rotter13, Jeffrey R. Schelling14, John R. Sedor14, Vallabh O. Shah15, Michael W. Smith16, Kent D. Taylor13, Farook Thameem3,17, Denyse Thornley-Brown18, Cheryl A. Winkler19, Xiuqing Guo13, Phillip Zager15, Robert L. Hanson1 and the FIND Research Group

Abstract

Background: The presence of population structure in a sample may confound the search for important genetic loci associated with disease. Our four samples in the Family Investigation of Nephropathy and Diabetes (FIND), European Americans, Mexican Americans, African Americans, and American Indians, are part of a genome-wide association study in which population structure might be particularly important. We therefore decided to study in detail one component of this, individual genetic ancestry (IGA).
From SNPs present on the Affymetrix 6.0 Human SNP array, we identified 3 sets of ancestry informative markers (AIMs), each maximized for the information in one of the three contrasts among ancestral populations: Europeans (HAPMAP, CEU), Africans (HAPMAP, YRI and LWK), and Native Americans (full heritage Pima Indians). We estimate IGA and present an algorithm for their standard errors, compare IGA to principal components, emphasize the importance of balancing information in the AIMs, and test the association of IGA with diabetic nephropathy in the combined sample.

Results: A fixed parental allele maximum likelihood algorithm was applied to the FIND to estimate IGA in four samples: 869 American Indians; 1385 African Americans; 1451 Mexican Americans; and 826 European Americans. When the information in the AIMs is unbalanced, the estimates are incorrect with large error. Individual genetic admixture is highly correlated with principal components for capturing population structure. It takes ~700 SNPs to reduce the average standard error of individual admixture below 0.01. When the samples are combined, the resulting population structure creates associations between IGA and diabetic nephropathy.
Conclusions: The identified set of AIMs, which include American Indian parental allele frequencies, may be particularly useful for estimating genetic admixture in populations from the Americas. Failure to balance information in maximum likelihood, poly-ancestry models creates biased estimates of individual admixture with large error. This also occurs when estimating IGA using the Bayesian clustering method as implemented in the program STRUCTURE. Odds ratios for the associations of IGA with disease are consistent with what is known about the incidence and prevalence of diabetic nephropathy in these populations.

Keywords: Individual genetic ancestry, Population structure, SNP, Diabetic nephropathy

*Correspondence: [email protected]
ˆ Deceased
1 Phoenix Epidemiology and Clinical Research Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Phoenix, AZ 85014, USA
Full list of author information is available at the end of the article

© 2016 Williams et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The Family Investigation of Nephropathy and Diabetes (FIND) is a multicenter study that is designed to find genes that contribute to the onset of diabetic nephropathy in four target, self-reported, heritage groups: European Americans, Mexican Americans, American Indians, and African Americans [1–4]. Two strategies were employed to ascertain the role of specific genes, a family-based linkage study and a case–control genome-wide association study (GWAS). In the GWAS each person in the four groups was typed for 1 M single nucleotide polymorphisms (SNPs) on a common platform after which the genotype distributions in the cases and controls for each SNP were compared to identify risk alleles with genome-wide significance. A common practice in GWAS such as the FIND is to control for population stratification by adding principal components (PCs) or individual genetic ancestry (IGA) estimates as covariates to the statistical models [5].

While the assessment of IGA is potentially important for GWAS and for other genetic analyses, the evaluation of an American Indian heritage has been difficult because there has been little information on ancestry informative markers (AIMs) from a large sample of American Indians typed on a commercially available platform. The Pima Indians of the Gila River Indian Community in Arizona, who have a very high prevalence of type 2 diabetes, are one of the most intensively studied American Indian groups in the United States; genetic and heritage analyses have been performed in this native group for many years, involving research that includes GWAS with 100 K and 1 M SNP arrays [6–10]. Pima Indians also constituted a large proportion of the American Indian sample in the FIND. Therefore data from the Pima Indian GWAS, conducted with the Affymetrix Genome-Wide Human 6.0 SNP array [11], were used to isolate informative markers for IGA in American Indians, which were then combined with 3 populations from HapMap to create a panel of AIMs.

We use the AIMs and FIND samples to address an important methodological issue, that in poly-ancestry (>2) maximum likelihood models the accuracy of the estimates depends on balancing the information contrasts among the ancestral populations. [We define an information contrast (In) as the information in the difference (δ) in SNP allele frequency between a pair of ancestral populations (Fig. 1).] The problem was revealed in the course of this research when preliminary analyses involving 1390 SNPs returned apparently incorrect IGA estimates, and we suspected that this was due to insufficient information regarding one of the contrasts between ancestral populations. In order to explore the reason for this discrepancy we first defined three allele frequency contrasts in a 3 ancestry model with European (EU), American Indians (AI), and African (AF) as ancestral populations: |EU-AI|, |EU-AF|, and |AI-AF|. Then we chose subsets of 1300 AIMs that maximized the information for each contrast (450, 450, and 400 SNPs, respectively) and used these individually and together to estimate individual ancestry in the respective ancestral populations from HapMap and the Pima, in which one expects the mean ancestral component to approximate 1.0; e.g., the expectation for the Pima is that mean AI ancestry will approximate 1.0. The origin of the unstable estimates was traced to a set of SNPs in which the information is not balanced across the 3 contrasts. We define a balanced model as one that includes markers that provide suitable information for contrasting all pairs of parental populations and show that, when the model is not balanced, the IGA estimates are incorrect with large error. The method and results are presented below.

Fig. 1 Information contrasts. For a 3-ancestral population model there are three information contrasts that are represented by the absolute value of the difference of the respective allele frequencies for allele 1 of the SNP: $|P_1 - P_2|$, $|P_1 - P_3|$, and $|P_2 - P_3|$, a value that is usually given the symbol δ. The variable In is the information-for-assignment statistic. Accurate individual ancestry estimates depend upon balancing the information between these 3 contrasts

Methods

Study participants and phenotypes

The criteria for diabetes, nephropathy, and the overall study design for the FIND have been previously described [1–4]. The FIND is a multi-ethnic family study of severe kidney disease, where the index case had diabetic nephropathy and at least one sibling reported a diagnosis of either diabetic nephropathy or long-standing diabetes without nephropathy. Samples from four different ethnic FIND groups were collected: African American, American Indian, European American, and Mexican American. For the discovery GWAS unrelated cases and controls were genotyped, yielding one individual per pedigree, except that in American Indians and Mexican Americans, because the total available sample was small, some family members were also genotyped. Patients with severe DN based upon diabetes duration >5 years and urine albumin/creatinine ratio (UACR) ≥0.3 mg/g or with severe kidney disease (ESRD) were defined as cases. Controls had DM durations ≥9 years, UACR <30 mg/g, and serum creatinine <1.6 mg/dl (males) or <1.4 mg/dl (females) without first-degree relatives having kidney disease. Additional cases and controls that were not part of the original FIND study were included to increase the statistical power.

Genotyping

A total of 5156 discovery DNA samples, plus 244 blind duplicates, was submitted to Affymetrix, Inc. (Santa Clara, CA) for genotyping. Genotypes were generated with the Affymetrix Genome-Wide Human 6.0 SNP array [11] using the Affymetrix Commercial Service (Santa Clara, California), via a contract to Translational Genomics Research Institute (TGEN, Phoenix, AZ). Samples were submitted at a concentration of 100 ng/μl in Tris-EDTA buffer, then plated according to ethnic membership that included HapMap controls and blind duplicates on each plate. Samples were tested for DNA quality and quantity using PicoGreen prior to genotyping. All ethnic groups were genotyped with the Affymetrix 6.0 chip during the GWAS phase. Genotypes were called using the Birdseed version 2 algorithm [12] implemented in the Genotyping Console software (Affymetrix). Alleles 1 and 2 for each SNP are assigned in the order that they are found in the HapMap dataset.

Statistical methods

Estimates of IGA and their variances were calculated for each subject in a three parent population model by a fixed parental allele maximum likelihood method [13]. A likelihood function $L(\mu_1, \mu_2, \mu_3)$ is maximized with respect to the population parameters for European (EU, $\mu_1$), American Indian (AI, $\mu_2$), and African heritage (AF, $\mu_3$), giving respective statistics $m_1$, $m_2$, and $m_3$, in the interval [0, 1]. The likelihood algorithm maximizes $m_1$ and $m_2$ based on G SNPs. Let $p_{ijg}$ be the frequency of the jth allele for the gth SNP with codominant alleles A1 and A2 in the ith ancestral population for which Europeans are ancestral population i=1, American Indians i=2, and Africans i=3. Then let $\Delta_g$ be defined as the allele frequency difference for the gth SNP where:

$$\Delta1_g = p_{11g} - p_{31g}$$
$$\Delta2_g = p_{21g} - p_{31g}$$
$$\Delta3_g = p_{12g} - p_{32g}$$
$$\Delta4_g = p_{22g} - p_{32g}.$$

For the gth SNP and ancestral proportions $m_1$ and $m_2$ the allele frequencies $P_{hA1}$ and $P_{hA2}$ in the hybrid population can be estimated by

$$P_{hA1g} = p_{31g} + m_1 \Delta1_g + m_2 \Delta2_g$$
$$P_{hA2g} = p_{32g} + m_1 \Delta3_g + m_2 \Delta4_g.$$

Under Hardy-Weinberg equilibrium the likelihoods for the three genotypes at SNP g are:

$$L(A1A1)_g = \ln\left[\left(P_{hA1g}\right)^2\right]$$
$$L(A1A2)_g = \ln\left(2\, P_{hA1g}\, P_{hA2g}\right)$$
$$L(A2A2)_g = \ln\left[\left(P_{hA2g}\right)^2\right].$$

When calculating one likelihood $L_g$ for G genotypes, for each possible combination of $m_1$ and $m_2$ ancestral proportions (in increments of 0.001), the estimates ($\mathrm{mlm}_1$, $\mathrm{mlm}_2$) are those which maximize the likelihood, that is:

$$\max_{\substack{m_1 = 0 \to 1 \\ m_2 = 0 \to 1}} \sum_{g=1}^{G} L_g ;$$

and

$$\mathrm{mlm}_3 = 1.0 - (\mathrm{mlm}_1 + \mathrm{mlm}_2).$$

The variances and covariance are calculated as (see Appendix for derivation of the information matrix) [14]:

$$\begin{bmatrix} \sum_{g=1}^{G} I^{g}_{m_1 m_1} & \sum_{g=1}^{G} I^{g}_{m_1 m_2} \\ \sum_{g=1}^{G} I^{g}_{m_1 m_2} & \sum_{g=1}^{G} I^{g}_{m_2 m_2} \end{bmatrix}^{-1} = \begin{bmatrix} V(\mathrm{mlm}_1) & \mathrm{Cov}(\mathrm{mlm}_1, \mathrm{mlm}_2) \\ \mathrm{Cov}(\mathrm{mlm}_1, \mathrm{mlm}_2) & V(\mathrm{mlm}_2) \end{bmatrix}$$

$$V(\mathrm{mlm}_3) = V(\mathrm{mlm}_1) + V(\mathrm{mlm}_2) + 2\,\mathrm{Cov}(\mathrm{mlm}_1, \mathrm{mlm}_2)$$

with standard error

$$SE(\mathrm{mlm}_i) = \sqrt{V(\mathrm{mlm}_i)}.$$

Populations from the HapMap were chosen to represent European and African origins [15]; genotypes for SNPs represented on the Affymetrix array were obtained from the HapMap website for inclusion as parental AIMs. To represent Europe, CEPH "Centre d'Etude du Polymorphisme Humain", Utah residents (N=174) with ancestry from northern and western Europe (HapMap abbreviation: CEU) were used; allele frequencies were represented by $P_{EU}$ and ancestry by EU. To represent Africa, the allele frequencies for the Yoruba (N=209) in Ibadan, Nigeria (HapMap abbreviation: YRI) and the Luhya (N=110) in Webuye, Kenya (HapMap abbreviation: LWK) were averaged when the SNP was present in both populations, or used singly from either group when present in just one; allele frequency is represented as $P_{AF}$ and ancestry as AF. To represent American Indians (AI), Pima Indians in Arizona were chosen with genotypic data from individuals who had participated in a GWAS conducted with the Affymetrix 6.0 array (N=964) [7, 8]. Each person in this sample was a self-reported, full heritage Piman (Pima or Tohono O'odham or combination of the two tribes), allele frequency is represented as $P_{AI}$ and ancestry as AI.

Informative loci were identified across the 22 autosomes. Each of the three allele frequency differences, δ, contrasts for allele 1, $|P_{EU}-P_{AI}|$, $|P_{EU}-P_{AF}|$, $|P_{AI}-P_{AF}|$, is represented by a set of markers such that one contrast was maximized for information, δ≥0.5, while δ<0.3 for the other two (Fig. 1, Additional Files 1, 2, and 3). SNPs were selected such that within each set there is at least 500 kb distance between syntenic SNPs; thus, linkage disequilibrium among SNPs is expected to be minimal. SNPs with alleles A/T and C/G were not included because of the ambiguity in their interpretation. After the first selection of AIMs the 128 American Indians who were common to the FIND study and the Phoenix GWAS parental group were compared for each informative SNP. A replicate typing error threshold ≤0.032 was established for inclusion of a SNP. To help balance the information between the 3 sets of AIMs the information-for-assignment statistic [16]

$$In = \sum_{j=1}^{G} \left( -p_{lj} \log p_{lj} + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right)$$

was used, where $p_{ij}$ is the frequency of allele 1 at the jth SNP in the ith ancestral population, and $p_{lj}$ is the overall average frequency of allele 1 at SNP j. In addition, F-statistics were calculated by the method of Weir and Cockerham [17] to determine the utility of $F_{st}$ for balancing information in the contrasts.

Estimates of individual admixture were also calculated for the 4 parental samples with the STRUCTURE [18] program to compare this Bayesian clustering method with the fixed parental allele algorithm and to determine whether either or both were vulnerable to the unbalanced information in the choice of human ancestry SNPs.

Principal components (PCs) were computed using SNPs that passed quality control and were not in genomic regions with extended linkage disequilibrium (LD). Specifically, markers in the following regions were excluded: chromosomes 5 (44–51.5 Mb), 6 (25–33.5 Mb), 8 (8–12 Mb), 11 (45–57 Mb), and 17 (40–43 Mb). The PC analysis was computed on the combined ethnic samples for the GWAS.
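The exclusion of extended-LD regions before the PC analysis can be illustrated with a minimal Python sketch. The five regions are those listed in the text; everything else is an assumption made for illustration only: the function names, the `(snp_id, chromosome, position)` tuple layout, the example SNPs, and the treatment of region boundaries as inclusive. The genome build is not stated in the text, and this is not the authors' code.

```python
# Regions with extended LD excluded from the PC analysis (from the text).
# Coordinates are in megabases; boundaries are treated as inclusive (assumption).
EXCLUDED_REGIONS_MB = {
    5: (44.0, 51.5),
    6: (25.0, 33.5),
    8: (8.0, 12.0),
    11: (45.0, 57.0),
    17: (40.0, 43.0),
}


def in_excluded_region(chrom, pos_bp):
    """Return True if a SNP at (chrom, pos_bp) lies inside an excluded region."""
    region = EXCLUDED_REGIONS_MB.get(chrom)
    if region is None:
        return False  # no excluded region on this chromosome
    start_mb, end_mb = region
    return start_mb * 1e6 <= pos_bp <= end_mb * 1e6


def filter_snps(snps):
    """Keep (snp_id, chrom, pos_bp) tuples that fall outside the excluded regions."""
    return [s for s in snps if not in_excluded_region(s[1], s[2])]


# Hypothetical SNPs, used only to exercise the filter.
example = [
    ("snp_a", 6, 30_000_000),  # inside chr6 25-33.5 Mb, dropped
    ("snp_b", 6, 40_000_000),  # outside the chr6 region, kept
    ("snp_c", 1, 45_000_000),  # chr1 has no excluded region, kept
]
kept = filter_snps(example)
```

A filter of this kind would normally run after quality control and before the PCs are computed, so that long-range LD blocks do not dominate the leading components.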
The first two principal components were determined to account for a large proportion of the genetic variation in the multi-ethnic PC analysis and appropriately reduce the inflation factor in the ethnic-specific logistic regression models. Outlying individuals based on the first two PCs were excluded from the GWAS and, thus, are not included in the present analysis. A total of 33 individuals were omitted based on outlying PCA values.

Logistic regression was performed by standard methods with the disease, diabetic nephropathy, as the dependent variable and enrolment age, sex (women), enrolment center, and the respective heritage estimates as explanatory variables.

Results

Across the 22 autosomes 1300 SNPs were selected as informative for individual ancestry with δ≥0.5 (Table 1). The number of informative SNPs generally scaled with the size of the chromosome. Three distinct sets of SNPs were chosen such that each set was maximized for its information in one of the three contrasts, while the three sets together were balanced for information across the three contrasts. There were 450 SNPs in each of the contrasts $|P_{EU}-P_{AI}|$ and $|P_{EU}-P_{AF}|$, and 400 SNPs maximized for information in the contrast $|P_{AI}-P_{AF}|$.

Table 1 Descriptive statistics for 1300 ancestry informative SNP loci. The last three columns give the number of SNPs in each maximized contrast (δ≥0.5).

| Chromosome | # SNPs | Mean distance (bp) | $|P_{EU}-P_{AI}|$ (N=450) | $|P_{EU}-P_{AF}|$ (N=450) | $|P_{AI}-P_{AF}|$ (N=400) |
|---|---|---|---|---|---|
| 1 | 113 | 4,747,709 | 44 | 34 | 35 |
| 2 | 101 | 4,296,466 | 35 | 36 | 30 |
| 3 | 89 | 4,427,268 | 29 | 30 | 30 |
| 4 | 89 | 4,480,508 | 26 | 32 | 31 |
| 5 | 87 | 4,286,121 | 29 | 33 | 25 |
| 6 | 76 | 4,328,735 | 33 | 27 | 16 |
| 7 | 84 | 3,812,870 | 25 | 34 | 25 |
| 8 | 85 | 3,480,996 | 27 | 33 | 25 |
| 9 | 66 | 4,917,071 | 23 | 25 | 18 |
| 10 | 59 | 4,358,358 | 17 | 24 | 18 |
| 11 | 66 | 4,356,288 | 32 | 14 | 20 |
| 12 | 51 | 4,784,074 | 17 | 20 | 14 |
| 13 | 54 | 3,707,021 | 15 | 19 | 20 |
| 14 | 36 | 5,120,754 | 11 | 12 | 13 |
| 15 | 50 | 3,516,390 | 14 | 19 | 17 |
| 16 | 49 | 3,846,889 | 22 | 12 | 15 |
| 17 | 29 | 4,218,684 | 8 | 11 | 10 |
| 18 | 34 | 3,410,019 | 13 | 11 | 10 |
| 19 | 19 | 6,022,778 | 6 | 8 | 5 |
| 20 | 31 | 4,143,198 | 10 | 8 | 13 |
| 21 | 15 | 3,683,800 | 6 | 5 | 4 |
| 22 | 17 | 4,038,881 | 8 | 3 | 6 |

The power of each SNP to estimate IGA is proportional to the magnitude of the allele frequency difference between the two parental populations, or δ, in the three difference-contrasts for each marker, $|P_{EU}-P_{AI}|$, $|P_{EU}-P_{AF}|$, and $|P_{AI}-P_{AF}|$, and the information-for-assignment statistic In, which was also calculated for each contrast (Table 2). Within each information contrast, and its set of SNPs, the two statistics are closely matched with δ≅0.53 for each contrast and In≅37. While the information was balanced across the 3 sets of SNPs, when one considers these two measures across all 1300 SNPs for each contrast, they also represent a balanced design (Table 2).

Table 2 Measures for balancing information (standard deviation) in the three information contrasts

| Information | $|P_{EU}-P_{AI}|$ | $|P_{EU}-P_{AF}|$ | $|P_{AI}-P_{AF}|$ |
|---|---|---|---|
| Maximized sets: number of SNPs | N=450 | N=450 | N=400 |
| Maximized sets: Information-for-Assignment, In | 37.3 | 37.3 | 36.7 |
| Maximized sets: mean δ | 0.529 (0.022) | 0.528 (0.022) | 0.542 (0.027) |
| All SNPs (N=1300): Information-for-Assignment, In | 56.5 | 56.9 | 58.3 |
| All SNPs (N=1300): mean δ | 0.351 (0.132) | 0.364 (0.121) | 0.350 (0.130) |

The statistic $F_{st}$ was also calculated for each set of SNPs, two ancestral populations at a time, as well as for all SNPs, two ancestral populations at a time, for each information contrast (Table 3). While the information was balanced for mean δ and In, the mean $F_{st}$ were variable across contrasts.

Table 3 Mean $F_{st}$ (standard deviation) in individual and combined contrasts

| Information contrast | $|P_{EU}-P_{AI}|$ | $|P_{EU}-P_{AF}|$ | $|P_{AI}-P_{AF}|$ |
|---|---|---|---|
| $F_{st}$ by contrast | 0.516 (0.023), N=450 | 0.381 (0.015), N=450 | 0.437 (0.035), N=400 |
| $F_{st}$ over all contrasts (N=1300) | 0.502 (0.036) | 0.367 (0.018) | 0.421 (0.028) |

Individual ancestry estimates were first computed for the 4 ancestral populations to test the validity of the method and the stability of the estimates of individual ancestry in the maximum likelihood model (Table 4, Fig. 2). The expectation was that the vector of informative SNPs would return a mean value of 1.0 for the heritage of the respective ancestry group. For the CEPH sample the mean value for European ancestry is 0.998 for the $|P_{EU}-P_{AI}|$ set of SNPs, 0.986 for the contrast $|P_{EU}-P_{AF}|$, and 0.990 when estimated with all 1300 SNPs, with the AI and AF ancestry estimates being close to 0.0. However, when EU ancestry is estimated in the HapMap CEU sample using only the 400 SNPs maximized for the $|P_{AI}-P_{AF}|$ contrast, then Ex(EU)=1.0 is not met and the mean estimates become unstable: EU=0.656, AI=0.185, and AF=0.159. The same pattern holds for AF estimates in HapMap LWK and YRI and for AI estimates in the Pima; when the ancestral component is not part of the maximized contrast, then the maximum likelihood model becomes unstable and returns unreliable, incorrect estimates for individual ancestry. When the 3 sets of maximized markers are pooled, however, the model returns stable, correct estimates (Table 4, Fig. 2).

Table 4 Mean (standard deviation) of source samples for AIMs typed with the 3 sets of informative markers

| SNPs in estimates | Source sample | EU | AI | AF |
|---|---|---|---|---|
| $|P_{EU}-P_{AI}|$, N=450 | HapMap CEU, N=165 | 0.988 (0.022) | 0.004 (0.010) | 0.008 (0.021) |
| $|P_{EU}-P_{AI}|$, N=450 | HapMap LWK, N=110 | 0.163 (0.182) | 0.144 (0.180) | 0.693 (0.359) |
| $|P_{EU}-P_{AI}|$, N=450 | HapMap YRI, N=193 | 0.133 (0.175) | 0.152 (0.178) | 0.715 (0.349) |
| $|P_{EU}-P_{AI}|$, N=450 | Pima, N=964 | 0.004 (0.029) | 0.985 (0.065) | 0.011 (0.053) |
| $|P_{EU}-P_{AF}|$, N=450 | HapMap CEU, N=165 | 0.986 (0.023) | 0.010 (0.022) | 0.004 (0.010) |
| $|P_{EU}-P_{AF}|$, N=450 | HapMap LWK, N=110 | 0.010 (0.015) | 0.017 (0.030) | 0.972 (0.028) |
| $|P_{EU}-P_{AF}|$, N=450 | HapMap YRI, N=193 | 0.001 (0.003) | 0.004 (0.012) | 0.995 (0.012) |
| $|P_{EU}-P_{AF}|$, N=450 | Pima, N=964 | 0.143 (0.169) | 0.715 (0.325) | 0.142 (0.164) |
| $|P_{AI}-P_{AF}|$, N=400 | HapMap CEU, N=165 | 0.656 (0.403) | 0.185 (0.221) | 0.159 (0.186) |
| $|P_{AI}-P_{AF}|$, N=400 | HapMap LWK, N=110 | 0.012 (0.022) | 0.008 (0.015) | 0.980 (0.023) |
| $|P_{AI}-P_{AF}|$, N=400 | HapMap YRI, N=193 | 0.004 (0.016) | 0.002 (0.006) | 0.994 (0.017) |
| $|P_{AI}-P_{AF}|$, N=400 | Pima, N=964 | 0.009 (0.049) | 0.988 (0.055) | 0.003 (0.017) |
| All SNPs, N=1300 | HapMap CEU, N=165 | 0.990 (0.014) | 0.005 (0.010) | 0.005 (0.010) |
| All SNPs, N=1300 | HapMap LWK, N=110 | 0.015 (0.015) | 0.007 (0.011) | 0.978 (0.016) |
| All SNPs, N=1300 | HapMap YRI, N=193 | 0.001 (0.004) | 0.002 (0.007) | 0.996 (0.008) |
| All SNPs, N=1300 | Pima, N=964 | 0.007 (0.039) | 0.989 (0.048) | 0.003 (0.021) |

Each set is maximized for information in one contrast, and with all combined SNPs. EU European ancestry, AI American Indian ancestry, AF African ancestry

Fig. 2 Mean ancestry when estimated with three sets of SNPs, each set maximized for information in one contrast. Each of the ancestral populations was modeled by samples from HapMap or from the Pima Indian GWAS. Three sets of SNPs were each maximized for information in one of the three contrasts and then used to estimate the respective mean ancestry (CEU, European (EU); LWK and YRI, African (AF); Pima, American Indian (AI)) in each sample, with the expectation of a mean of 1.0. When the ancestry of the sample was not represented in the maximized contrast set, then the estimates of individual ancestry become unstable with large error

The above analysis was repeated for the 4 ancestral samples using the STRUCTURE Bayesian cluster method with 3 ancestral components, EU, AI, and AF (K=3) and gave very similar results to those presented in Table 4 and Fig. 2 (Additional file 4: Table S1, Additional file 5: Figures S1–S4). In two instances for the Pima Indians, for maximized contrasts $|P_{AI}-P_{AF}|$ and $|P_{EU}-P_{AI}|$, the Bayesian method did not return the expected value of AI even when the information in the contrast was maximized for this ancestral component. When the 1300 SNPs with balanced information were incorporated into the STRUCTURE program, it returned the expected mean values and proportions of ancestry in the four ancestral samples (Additional file 5: Figures S1 and S5).

The maximum likelihood individual ancestry algorithm was then applied to the four FIND samples using the pooled set of 1300 SNPs (Table 5, Fig. 3). For the FIND European Americans the EU component had a mean of 0.961 (mean standard error for individual ancestry, 0.008), with small mean proportions for AI and AF. American Indians in the FIND had a large AI mean estimate, 0.945 (0.007), with small components for EU and AF. The primary heritage in the FIND African Americans is AF, 0.830 (0.008), with the balance being primarily from European heritage, 0.149 (0.009). The FIND Mexican Americans represent their complex origin from the three heritage groups: EU 0.476 (0.012), AI 0.447 (0.011), and AF 0.077 (0.011).

The standard error for each individual ancestry estimate was calculated from the 2x2 information matrix for each person over the vector of non-missing SNPs for the estimate. Figure 4 illustrates the effect on the standard error of adding SNPs to the estimate. The cumulative standard error was calculated for SNPs 1 to 1300, in chromosome and position order, and then averaged at each point over the four FIND populations for the EU, AI, and AF ancestral components. To ensure a mean standard error <0.01, approximately 700 SNPs are necessary for the maximum likelihood model.

The performance of these markers in an admixed population using the Bayesian method was assessed by the STRUCTURE program for the FIND Mexican Americans with 3 ancestry components (K=3). When genotypic data representative of the three ancestral reference populations were included (CEU, YRI+LWK, Pima Indians), the overall admixture proportions were very similar to those obtained with the maximum likelihood method (Fig. 5, Panel a). Since individual level data may not be readily available for a suitable American Indian reference population, the analyses were repeated without including the Pima data as a reference. In this situation, the Amerindian component in the FIND Mexican American participants was modestly overestimated in comparison to the case when the Pimas were included, while the European component was underestimated (Fig. 5, Panel b).

An alternate method for controlling for population structure in GWAS is to calculate the PCs from the samples. To compare the PC and heritage estimates, a Pearson correlation coefficient was calculated for the 3 admixture components and the first 2 PCs for the combined sample (N=4391). The EU heritage component was highly correlated with PC2 [0.9954 (95 % C.I. 0.9951, 0.9957)], while the AF heritage had a more modest correlation [0.9111 (0.9060, 0.9160)] with PC1. American Indian heritage was negatively correlated with both PC1 and PC2 [−0.8600 (−0.8675, −0.8521) and −0.5067 (−0.5283, −0.4843)]. When the three admixture components were each used as a dependent variable in a linear regression with PC1 and PC2 as explanatory variables, the R-square values were close to 1.0: EU (0.995), AI (0.996), and AF (0.996).

To assess the potential role of ancestry in confounding associations with diabetic nephropathy, each of the three heritage estimates was first tested singly for association, with the covariates, for each of the 4 FIND populations (Additional file 4: Tables S2, S3 and S4). While the variables enrolment age, sex, and enrolment center were consistently associated with the disease, there was no significant association with any individual heritage variable when tested within each sample. However, combining the samples (N=4126) introduced population structure, and each heritage variable had a significant odds ratio when tested singly in the model: EU odds ratio 0.338, p<0.0001; AI 1.960, p=0.028; and AF 2.519, p<0.0001.

Logistic regressions were repeated in the combined sample with the same covariates and two individual heritage variables at a time (Table 6, Additional file 4: Tables S3 and S4). Given that the heritage values for each person sum to 1.0, the variable that is left out of the logistic model is the reference for the other two in the computation of the odds ratio. When EU and AI are included, with AF as a reference, the odds ratio for EU is significantly less than 1.0 while the 95 % confidence interval for AI includes 1.0. When EU and AF are in the same model, with AI as a reference, a similar pattern results. Finally, when EU is the reference for AI and AF, both covariates have odds ratios greater than 1.0: AI 3.762, p<0.0001 and AF 2.956, p<0.0001.

Discussion

A panel of SNPs informative for African, American Indian and European ancestry

A panel of 1300 SNPs was developed which can serve as informative markers for African, American Indian and European ancestry; these ancestry components are often of interest in genetic epidemiologic studies of populations from the Americas. Although other similar marker panels have been developed, the samples used as the American Indian ancestral group were few and
Table 5 Mean (standard deviation) and range for individual heritage and standard error estimates for FIND populations

| FIND population | N | Component | Heritage, mean (SD) | Heritage, range | Standard error, mean (SD) | Standard error, range |
|---|---|---|---|---|---|---|
| European American | 826 | EU | 0.961 (0.090) | 0.059–1.0 | 0.008 (0.002) | 0.002–0.014 |
| European American | 826 | AI | 0.014 (0.023) | 0–0.237 | 0.011 (0.002) | 0.003–0.015 |
| European American | 826 | AF | 0.025 (0.084) | 0–0.923 | 0.008 (0.002) | 0.001–0.014 |
| American Indian | 869 | EU | 0.045 (0.090) | 0–0.712 | 0.009 (0.003) | 0.002–0.017 |
| American Indian | 869 | AI | 0.945 (0.111) | 0–1.0 | 0.007 (0.003) | 0.001–0.016 |
| American Indian | 869 | AF | 0.010 (0.049) | 0–0.866 | 0.007 (0.002) | 0.001–0.015 |
| Mexican American | 1451 | EU | 0.476 (0.134) | 0–0.974 | 0.012 (0.001) | 0.006–0.015 |
| Mexican American | 1451 | AI | 0.447 (0.140) | 0–1.0 | 0.011 (0.001) | 0.004–0.015 |
| Mexican American | 1451 | AF | 0.077 (0.053) | 0–0.845 | 0.011 (0.001) | 0.005–0.014 |
| African American | 1385 | EU | 0.149 (0.104) | 0–0.638 | 0.009 (0.002) | 0.002–0.016 |
| African American | 1385 | AI | 0.021 (0.030) | 0–0.539 | 0.010 (0.002) | 0.002–0.016 |
| African American | 1385 | AF | 0.830 (0.111) | 0.058–1.0 | 0.008 (0.002) | 0.001–0.014 |

IGA estimates were computed with 1300 Ancestry Informative Markers and a 3 Ancestral components Model. EU European heritage, AI American Indian heritage, AF African heritage
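The heritage estimates and standard errors summarized in Table 5 come from the fixed parental allele maximum likelihood method described in the Statistical methods. The following Python sketch illustrates that kind of calculation for one person, under several assumptions that are not taken from the article: genotypes are coded as the count of allele A1 (0, 1, or 2); the A2 frequency is taken as one minus the A1 frequency; the grid step is 0.02 rather than the 0.001 used in the text, to keep the example fast; the 2x2 information matrix is the observed information (negative second derivatives of the log-likelihood), which may differ from the Appendix derivation; and the parental frequencies and genotypes are simulated. All function names are hypothetical.

```python
import math
import random


def log_likelihood(m1, m2, geno, p1, p2, p3, eps=1e-9):
    """Hardy-Weinberg log-likelihood of ancestry (m1, m2, 1 - m1 - m2).

    geno: count of allele A1 (0, 1, 2) per SNP; p1, p2, p3: fixed A1
    frequencies in the EU, AI and AF parental populations.
    """
    total = 0.0
    for g, a, b, c in zip(geno, p1, p2, p3):
        p_h = c + m1 * (a - c) + m2 * (b - c)   # hybrid A1 frequency
        p_h = min(max(p_h, eps), 1.0 - eps)     # guard against log(0)
        total += g * math.log(p_h) + (2 - g) * math.log(1.0 - p_h)
        if g == 1:
            total += math.log(2.0)              # ln 2 term for A1A2
    return total


def estimate_ancestry(geno, p1, p2, p3, step=0.02):
    """Grid search over (m1, m2) with m1 + m2 <= 1; returns (m1, m2, m3)."""
    n = int(round(1.0 / step))
    best_ll, best_i, best_j = -math.inf, 0, 0
    for i in range(n + 1):
        for j in range(n + 1 - i):
            ll = log_likelihood(i / n, j / n, geno, p1, p2, p3)
            if ll > best_ll:
                best_ll, best_i, best_j = ll, i, j
    return best_i / n, best_j / n, (n - best_i - best_j) / n


def standard_errors(m1, m2, geno, p1, p2, p3, eps=1e-9):
    """Standard errors from the inverse of the 2x2 observed information matrix."""
    i11 = i12 = i22 = 0.0
    for g, a, b, c in zip(geno, p1, p2, p3):
        d1, d2 = a - c, b - c
        p_h = min(max(c + m1 * d1 + m2 * d2, eps), 1.0 - eps)
        w = g / p_h ** 2 + (2 - g) / (1.0 - p_h) ** 2  # curvature weight
        i11 += w * d1 * d1
        i12 += w * d1 * d2
        i22 += w * d2 * d2
    det = i11 * i22 - i12 * i12
    v1, v2, cov = i22 / det, i11 / det, -i12 / det
    v3 = v1 + v2 + 2.0 * cov                    # V(m3) = V(m1) + V(m2) + 2 Cov
    return math.sqrt(v1), math.sqrt(v2), math.sqrt(v3)


# Simulated example: 1000 hypothetical SNPs, true ancestry (0.5, 0.4, 0.1).
rng = random.Random(0)
G = 1000
p1 = [rng.uniform(0.05, 0.95) for _ in range(G)]
p2 = [rng.uniform(0.05, 0.95) for _ in range(G)]
p3 = [rng.uniform(0.05, 0.95) for _ in range(G)]
geno = []
for a, b, c in zip(p1, p2, p3):
    p_mix = 0.5 * a + 0.4 * b + 0.1 * c
    geno.append(sum(rng.random() < p_mix for _ in range(2)))

m1, m2, m3 = estimate_ancestry(geno, p1, p2, p3)
se1, se2, se3 = standard_errors(m1, m2, geno, p1, p2, p3)
```

Because each SNP adds a non-negative term to the diagonal of the information matrix, the standard errors shrink as SNPs accumulate, which is the behaviour the article reports when it finds that roughly 700 AIMs are needed for a mean standard error below 0.01.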