Confidence Intervals for Population Allele Frequencies: The General Case of Sampling from a Finite Diploid Population of Any Size Tak Fung1,2*, Kevin Keenan3 1NationalUniversityofSingapore,DepartmentofBiologicalSciences,Singapore,Singapore,2Queen’sUniversityBelfast,SchoolofBiologicalSciences,Belfast,Northern Ireland,UnitedKingdom,3Queen’sUniversityBelfast,InstituteforGlobalFoodSecurity,SchoolofBiologicalSciences,Belfast,NorthernIreland,UnitedKingdom Abstract The estimation of population allele frequencies using sample data forms a central component of studies in population genetics. These estimates can be used to test hypotheses on the evolutionary processes governing changes in genetic variation among populations. However, existing studies frequently do not account for sampling uncertainty in these estimates,thuscompromisingtheirutility.Incorporationofthisuncertaintyhasbeenhinderedbythelackofamethodfor constructingconfidenceintervalscontainingthepopulationallelefrequencies,forthegeneralcaseofsamplingfromafinite diploid population of any size. In this study, we address this important knowledge gap by presenting a rigorous mathematicalmethodtoconstructsuchconfidenceintervals.Forarangeofscenarios,themethodisusedtodemonstrate that for a particular allele, in order to obtain accurate estimates within 0.05 of the population allele frequency with high probability(§95%),asamplesizeofw30isoftenrequired.Thisanalysisisaugmentedbyanapplicationofthemethodto empirical sample allele frequency data for two populations of the checkerspot butterfly (Melitaea cinxia L.), occupying meadows in Finland. For each population, the method is used to derive §98:3% confidence intervals for the population frequenciesofthreealleles.Theseintervalsarethenusedtoconstructtwojoint§95%confidenceregions,onefortheset ofthreefrequenciesforeachpopulation.Theseregionsarethenusedtoderivea§95%confidenceintervalforJost’sD,a measureofgeneticdifferentiationbetweenthetwopopulations.Overall,theresultsdemonstratethepracticalutilityofthe method with respect to informing sampling design and accounting for sampling uncertainty in studies of population genetics,important forscientific hypothesis-testingand also forrisk-based natural resource management. Citation:FungT,KeenanK(2014)ConfidenceIntervalsforPopulationAlleleFrequencies:TheGeneralCaseofSamplingfromaFiniteDiploidPopulationofAny Size.PLoSONE9(1):e85925.doi:10.1371/journal.pone.0085925 Editor:GuyBrock,UniversityofLouisville,UnitedStatesofAmerica ReceivedJune23,2013;AcceptedDecember4,2013;PublishedJanuary21,2014 Copyright:(cid:2)2014Fung,Keenan.Thisisanopen-accessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense,whichpermits unrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalauthorandsourcearecredited. Funding:TFisbeingsupportedbytheNationalUniversityofSingaporestart-upgrantWBSR-154-000-551-133.Inaddition,TFacknowledgespastfunding supportbyaPh.D.studentshipfromtheBeaufortMarineResearchAwardinEcosystemApproachtoFisheriesManagement,carriedoutundertheSeaChange StrategyandtheStrategyforScienceTechnologyandInnovation(2006–2013),withthesupportoftheMarineInstitute,fundedundertheMarineResearchSub- ProgrammeoftheIrishNationalDevelopmentPlan2007–2013(http://www.marine.ie/home/research/MIFunded/BeaufortAwards/).KKisbeingsupportedbya Ph.D.studentshipfromtheBeaufortMarineResearchAwardinFishPopulationGenetics,fundedbytheIrishGovernmentundertheSeaChangeStrategyas above.Thefundershadnoroleinstudydesign,datacollectionandanalysis,decisiontopublish,orpreparationofthemanuscript. CompetingInterests:Theauthorshavedeclaredthatnocompetinginterestsexist. *E-mail:[email protected] Introduction populationallelefrequencieshavetobeestimatedusingasubsetof sampledindividuals.Fordiploidorganismsinalargepopulation,it Spatiotemporalpatternsofgeneticvariationamongpopulations has been established that the frequency of an allele in a sample are often used to test hypotheses about processes underlying the provides an unbiased estimateof thefrequency in thepopulation patterns, such as selection, migration and genetic drift (e.g., as a whole [10]. Thus, if samples of a given size are repeatedly [1,2,3,4,5]). This genetic variation captures differences in genetic taken from a large population, then the mean frequency of an structure among populations, where the genetic structure of a allele ina sample converges to thepopulation allele frequency as population is determined by the distribution of alleles among thenumberofsamplesincreases.However,manystudiesonlytake individuals in the population. The allele distribution of a asinglesamplefromapopulationandpresentorusetheresulting populationistheresultofarangeofbiologicalandenvironmental frequencies of alleles in the sample, without accounting for processes acting on the population and also on surrounding sampling uncertainty (e.g., [11,12,13,14,15,16,17,18]). Therefore, populations within the same geographical region, which result in these studies implicitly assume that the sample allele frequencies non-random mixing of gametes among individuals in all popula- areclosetothepopulationallelefrequencies,whichisbynomeans tions. Patterns in allele distributions among populations are guaranteed. Interpretation of findings from these studies is generally assessed by consideration of variations in allele therefore complicated by the potential for large sampling frequencies (e.g.,[6,7,8,9]). uncertainty.Ideally,inthissinglesamplecase,uncertaintybounds Logistic and ethical constraints mean that in practice, a forthepopulationallelefrequencieswouldbequantified,basedon population is unlikely to be sampled in its entirety, such that the thesample allele frequencies. PLOSONE | www.plosone.org 1 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies ArecentstudybyHaleetal.[19]usedcomputersimulationsto a function of their population allele frequencies [9]. The data draw samples from four diploid populations, with allele frequen- comes from a study [13] where it is unclear that the populations ciesbasedonfourempiricaldatasets.Subsequently,theyusedthe can be assumed to be of infinite sizes or at HWE. This example sample data to derive the means, variances and ranges of allele illustrates how the mathematical theory underlying our method frequencies in samples of varying size, as well as the means, can be used to quantify sampling uncertainty not only in variances and ranges of indicators that are a function of allele population allele frequencies, but also in parameters that are frequencies (specifically, heterozygosity and F ). Using these functionsofthesefrequencies,importantforhypothesis-testingand ST results, Hale et al. [19] concluded that a sample size of 30 is also fornatural resource management. sufficienttoaccuratelyestimatepopulationallelefrequencieswhen Overall, this study provides a rigorous mathematical quantifi- using microsatellites. However, the authors did not quantify cation of sampling uncertainty in population allele frequencies, uncertainty bounds for the population allele frequencies using without the typically unrealistic constraints of assuming that a their sample data, such that their assessment of accuracy lacks a population has an infinite size and is at HWE. Thus, the results rigorous quantitative basis. Furthermore, for each sample size, can beappliedtoa wide range ofstudies inpopulation genetics. sample frequencies were calculated using only 100 replicate samples, which may not give a good approximation of the true Methods dispersioninthesamplefrequencies.Thishighlightsaweaknessof using a computational approach that lacks an underlying Inordertoconstructconfidenceintervalsforthegeneralcaseof mathematicaltheory.Moreover,Haleetal.[19]didnotconsider taking samples of any size from a diploid population of any size the situation identified earlier, where one sample is taken and (larger than or equal to the sample size) and with any degree of there is a need to quantify uncertainty of population allele deviation from HWE, the sampling distribution of the allele frequencies using just the sample allele frequencies. For this frequencies in this general case is first exactly specified. This situation,earlierstudieshavederivedconfidenceintervalsinorder distribution is then used to derive formulae specifying confidence to capture uncertainty in the population allele frequencies. If a intervals that contain the population allele frequencies with diploid population is of infinite size and is at Hardy-Weinberg probability above a known threshold. Using these formulae, equilibrium (HWE), then the frequency of an allele in a sample confidence intervals are derived for a range of archetypal followsabinomialdistribution[10].Thus,Gillespie[20]proposed scenarios, which are then used to calculate sample sizes that the use of the Wald confidence interval [21]. However, this permitaccurateestimatesofpopulationallelefrequenciesinthese intervalonlyperformswellforsufficientlylargesamplesizes–with scenarios.Inaddition,theformulaeareappliedtoarealscenario small sample sizes, it is too short [22]. For the binomial where samples are taken from two butterfly populations [13], to distribution, other confidence intervals have been derived that constructconfidenceintervalsforthepopulationallelefrequencies do perform well for small sample sizes [22,23], notably the of these two populations. The intervals are then used to derive a Clopper-Pearsonintervalthatwasderivedeightdecadesago[24]. corresponding confidence interval forJost’s D [9]. However, these only apply in the limited cases when the population is at HWE. Weir [10] proposed a confidence interval Derivation of sampling distribution of allele frequencies that allows for deviations from HWE, analogous to the Wald Consider a population of M diploid individuals, from which a confidence interval. However, like the latter, it is expected to sampleofNindividualsisrandomlydrawn(M§N).Atthelocus perform well only for sufficiently large sample sizes [10]. This is ofinterest,therearenw1alleles,denotedbyAi,i[f1,2, ...,ng. problematic because the accuracy at a particular sample size is Let the population allele frequencies be denoted by pi, with the unknown, so ‘‘sufficiently large’’ cannot be rigorously quantified. corresponding sample allele frequencies being denoted by pi,N. Moreover, all the confidence intervals considered only apply to Also,letPij bethefrequencyofindividualsinthepopulationwith caseswhenthepopulationcanbeassumedtobeofinfinitesize,i.e. alleles A and A (i,j[f1,2, ...,ng), such that MP is the i j ij whenthepopulationsizeismuchlargerthanthesamplesize.This numberofcorrespondingindividualsinthebasepopulation.p is i does not reflect the range of scenarios encountered in empirical related to P by the formula p~P zPn (cid:2)P =2(cid:3). Here, ij i ii j~1,j=i ij research (e.g., [13,25,26,27]). P is a measure of the homozygosity of individuals in the ii In this study, we build on previous work by constructing population with respect to allele A. P ~p{Pn (cid:2)P =2(cid:3), confidence intervals for population allele frequencies for the i ii i j~1,j=i ij andthusP ƒp.If0ƒpƒ0:5,thentheminimumvalueofP is generalcasewhere(i)thepopulationisdiploid,finiteandcanbeof ii i i ii 0, since all copies of allele A can be distributed among anysize;(ii)thesamplecantakeanysizelessthanorequaltothat i heterozygotes of allele A. However, if pw0:5, then this is not ofthepopulation;and(iii)thepopulationcandeviatefromHWE i i possible – there must be at least one homozygote of allele A. In to any extent. These confidence intervals are guaranteed to i this case, theminimum number of homozygotes is realized when containthepopulationallelefrequencywithaprobabilityabovea all heterozygotes have a copy of allele A, i.e. when known threshold. The method derived for constructing these P zPn P ~1. Rearranging this for Pni P and intervalsisthenusedtocalculatesamplesizesrequiredtoachieve ii j~1,j=i ij j~1,j=i ij accurateestimatesofpopulationallelefrequencies,underarange substituting into Pii~pi{Pnj~1; j=i(cid:2)Pij=2(cid:3) gives the minimum of scenarios. Here, accuracy is measured as the length of the value ofPii as2pi{1. Thus,overall, Maxf0,2pi{1gƒPiiƒpi. confidenceintervals.Thesamplesizesderivedserveasaguidefor The frequency of allele Ai in the sample of size N is given by determinationofsuitablesamplesizesinfuturepopulationgenetic pi,N~Yi,N=ð2NÞ,whereYi,N isthenumberofcopiesofalleleAi studies. In particular, we show that a sample size of 30 does not inthesample.Thus,theprobabilitydistributionofpi,N isthesame necessarilygiveaccurateestimates,thusrefiningtheconclusionof astheprobabilitydistributionofYi,N exceptwiththex-axisscaled Haleetal.[19].Lastly,weprovideanexampleofhowthemethod byafactor1=ð2NÞ.Denotetheprobabilitymassfunction(pmf)for can be applied to microsatellite data for two populations of the Yi,N by PðYi,N~yi,NÞ. The range of Yi,N is ½0,2N(cid:2), so checkerspotbutterfly(MelitaeacinxiaL.)[13],toderiveconfidence PðYi,N~yi,NÞ~0 for yi,N outside this range. Thus, for the intervals for population allele frequencies and also for Jost’s D, a following calculations, only yi,N [½0,2N(cid:2) are considered. Now, measureofgeneticdifferentiationbetweentwopopulationsthatis Yi,N~2XiizXi:,whereXi:~Pnj~1,j=iXij andXij isthenumber PLOSONE | www.plosone.org 2 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies of individuals in the sample with alleles Ai and Aj. Xii, Xi: and (inequality (6)) ensures that xiiƒyi,N=2ƒN, as required. Denote Zi~N{Xii{Xi: thusrepresentthenumberofindividualsinthe the upper and lower bounds in (8) by Lðyi,NÞ and Uðyi,NÞ sample with two copies, one copy and no copies of allele Ai, respectively. ThenPðYi,N~yi,NÞcan bewrittenas: respectively. Xii,Xi: andZi followa multivariatehypergeometric distribution with pmf givenby: PðY ~y Þ i,N i,N P~ðX iiM~xPxiiiiii,!X0B@i:~Mxji,:Px,jn=iZ:iiP~ijz1CAiÞ0B@M{MPii{ziMj,Pjn=iPij1CA, ð1Þ ~~xii~fflloocooerril½½iXXUUngðð½yyLiið,,NNyiÞÞ,N(cid:2)(cid:2)Þ(cid:2)PPð XXii~ii~xixi,iiX,Xi:~i:~xiy:,i,NZ{i~2zxiiÞi,!, ð9Þ M! Z~Nzx {y xii~ceiling½Lðyi,NÞ(cid:2) i ii i,N N where zi has been rewritten using zi~N{xii{xi: and where terms on the right-hand side in brackets are binomial xi:~yi,N{2xii, ceiling½a(cid:2) is the smallest integer larger than a, coefficients. Since the number of individuals in the sample must andfloor½a(cid:2)isthelargestintegersmallerthana.Theceilingand equal N, xiizxi:zzi~N. PðYi,N~yi,NÞ is the sum of floor functions are introduced to ensure that xii is an integer, PðXii~xii,Xi:~xi:,Zi~ziÞ for all those biologically feasible whichitmustbetohaveabiologicalinterpretation.PðYi,N~yi,NÞ combinations of xii, xi: and zi satisfying yi,N~2xiizxi:. It will isequaltoPðpi,N~yi,N=ð2NÞÞ,thepmfforpi,N,andthusdefines nowbeshownthatthissummationcanbesimplifiedtothesumof the probability distribution of pi,N. Equation (9) can be used to an expression that depends only on xii and yi,N, over all xii define the probability distribution of pi,N given only four between lower and upper bounds that only depend on yi,N. This parameters: M,pi,Pii and N. considerably eases calculation of PðY ~y Þ. Firstly, the i,N i,N number of individuals in the sample with two copies, one copy Derivation of confidence intervals for sample allele ornocopiesofalleleAicannotexceedthecorrespondingnumbers frequencies in thesampled population. Thisgives rise tothreeinequalities: For given population size (M), population allele frequency (p), i population frequency of homozygotes with allele A (P ) and i ii x ƒMP , ð2Þ sample size (N), the probability distribution for the sample allele ii ii frequency(p )canbecalculatedexactlyusingequation(9).The i,N mean value ofthisdistribution is: n X xi:ƒMj,j=iPij, ð3Þ E½pi,N(cid:2)~E(cid:4)Y2Ni,N(cid:5)~2E½Xii2(cid:2)zNE½Xi:(cid:2) 2NP zN Pn P ð10Þ N{xii{xi:ƒM{MPii{M Xn Pij: ð4Þ ~ ii 2Nj~1,j=i ij~Piiz Xn P2ij~pi, j,j=i j~1,j=i Secondly,thenumberofindividualswithtwocopies,onecopyor asinthecaseofsamplingfromaninfinitediploidpopulation[10]. no copies of allele Ai in the sample must be non-negative. This In equation (10), the expectation E(cid:6)Xij(cid:7)~NPij has been used, gives rise tothree moreinequalities: whichfollowsfromthefactthattheX ’s,withi,j[f1,2, ...,ng, ij followamultivariatehypergeometricdistributionwithparameters xii§0, ð5Þ M, N and MPij [28]. In addition, it can be proved that the varianceof p is i,N xi:§0, ð6Þ s2½p (cid:2)~ðM{NÞ(cid:2)pi{2p2izPii(cid:3) ð11Þ i,N 2ðM{1ÞN N{xii{xi:§0: ð7Þ (csoererecFtiiloenSfa1c).torThðMe {vaNriaÞn=cðeM{thu1sÞ tihnacltudteensdsatosta1ndasarMd ?fin?ite, SPinnj~ce1,j=pii~PiPj~iiz2ðpPi{nj~P1i,ijÞ=ain(cid:2)dPixj=i:2~(cid:3)yia,Nn{d 2xyiii,N.T~h2uxs,iizthexii:n,eqtuhaeln- sas2½pi,N(cid:2)?(cid:2)rpeiq{ui2repd2iz. Pii(cid:3)=ð2TNhÞe,retfhoerev,ariance ains the casMe o?f?an, infinitepopulationsize[10].Thecumulativedistributionfunction ities (2)–(7) canberearranged toobtain thedouble inequality: (cdf) for p isspecifiedby: i,N ny o Max i,N{MpzMP ,y {N,0 ƒx ƒ (cid:8) y (cid:9) 2 i ii i,N ii Pðp ƒwÞ~P p ƒ i,N ð8Þ i,N i,N 2N n y o Min MPii, i2,N,M{Nzyi,N{2MpizMPii : Xyi,N ð12Þ ~PðY ƒy Þ~ PðY ~kÞ, i,N i,N i,N It is noted that xi:~yi,N{2xii, which means that xi:§0 k~0 PLOSONE | www.plosone.org 3 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies where y is an integer in the interval ½0,2N(cid:2). This cdf can be ofcopiesofallotherallelesinthepopulationmustbeatleastthe i,N calculatedusingequation(9)andisafunctionofp.Toconstructa number found in the sample, 2N{^yy ; this constrains the i i,N CIforp givenM,NandP ,considertestingthenullhypothesis maximum value of p according to i ii i,0 H0 :pi~pi,0 against the alternative H1:pi=pi,0 at significance pi,0ƒ(cid:6)2M{(cid:2)2N{^yyi,N(cid:3)(cid:7)(cid:12)ð2MÞ~1{(cid:6)(cid:2)2N{^yyi,N(cid:3)(cid:12)ð2MÞ(cid:7). For level a for an observed value of Yi,N, denoted by ^yyi,N. The null agivenvalueofpi,0,Maxf0,2pi,0{1gƒPii,0ƒpi,0,asdetermined ahcycpeoptthaensciseisrengiootnredjeefcitneeddifbyusBinag~p(cid:6)iy~i,Np,ia,0=2,,^yyyii,,NN,1f{alðlas=2wÞ(cid:7)i,thwinhearne ienacrlluiedre. I^yyfi,PNi,Hfo’0r,aa~t1lea,stthoennethpeaiarccoefpvtaanluceesroegfiopni,0isanexdtePnidi,e0,dbtoy yi,N,a=2isthelargestintegerforwhichP(cid:2)Yi,Nvyi,N,a=2(cid:3)ƒa=2and decreasing a. yi,N,1{ða=2Þ is the smallest integer for which CI’sspecifiedbyequation(13)canbecalculatedgivenM,N,Pii P(cid:2)Yi,Nwyi,N,1{ða=2Þ(cid:3)ƒa=2. Ba can be calculated using equation anda,whereasthosespecifiedbyequations(14a)and(14b)canbe (12).Onemethodofconstructinga§100ð1{aÞ%CIforp isto calculated given just M, N and a. In this paper, computation of i determinethesetofvaluesofp forwhichthenullhypothesisis any CI’s using these equations, incorporating the underlying i,0 not rejected at significance level a, denoted by formulae specified by equations (1), (8), (9) and (12), was carried P ~(cid:10)p :^yy [B (cid:11), anddefiningtheCI as out using the software package Mathematica v5.0 [31]. Supporting i,H0,a i,0 i,N a Webpage1providesMathematicacodeforcomputationoftheCI’s and can be viewed at http://rpubs.com/kkeenan02/Fung- h (cid:8) (cid:9) (cid:8) (cid:9)i ½L ,U (cid:2)~ Min P ,Max P ð13Þ Keenan-Mathematica/. However, other software packages such a a i,H0,a i,H0,a asMATLAB[32]andR[33]couldalsobeusedtoimplementthe formulae; indeed, an R version of the code is provided on [29]. This hypothesis-testing approach corresponds to the ‘‘test- Supporting Webpage 2 and can be viewed at http://rpubs.com/ method’’ described by Talens [30], who applied it to construct kkeenan02/Fung-Keenan-R/.Allsourcecodeforthecomposition CI’sforparametersoftheunivariatehypergeometricdistribution. IfP ~1,theemptyset,thentheacceptanceregionneedsto of Supporting Webpages 1 and 2, including the raw Mathematica i,H0,a and R code used, can be accessed at https://github.com/ beextendedtoinclude^yy foratleastonevalueofp .Thus,in i,N i,0 kkeenan02/Fung-Keenan2013/. thiscase,aisdecreaseduntil^yy isintheacceptanceregionforat i,N least onep value. i,0 The CI specified by equation(13) isnot an exact 100ð1{aÞ% Determining minimum sample sizes for accurate CIbecause thedistribution forY isdiscreteandbecause there estimation of population allele frequencies i,N may be some p values between MinðP Þ and MaxðP Þ In studies of population genetics, it is desirable to obtain a i,0 i,H0,a i,H0,a forwhich^yy =[B ,sincePðY ƒy Þisnotguaranteedtobea sample allele frequency close to the population allele frequency i,N a i,N i,N monotonic function of p (see equations (9) and (12)). It is noted with high probability [19], i.e. a short CI of high probability. i that for the probability parameter in a binomial distribution, Thus, under three representative scenarios, we calculate the Clopper and Pearson [24] derived §100ð1{aÞ% CI’s using an minimumsamplesizesrequiredtoobtain§95%CI’swithlengths analogous method. Thus, for the case of a large population size ƒ0:2andƒ0:1acrossallpossiblevaluesoftheobservedsample relative to the sample size (MwwN) and HWE, where pi,N allele frequency, ^ppi,N~^yyi,N(cid:12)ð2NÞ. These minimum sample sizes approximately follows abinomial distribution, CI’sforpi derived aredenotedbyNƒ0:2andNƒ0:1respectively.Theyarecalculated usingourmethodwouldbevirtuallythesameasClopper-Pearson by starting with N~10, computing CI lengths for all possible CI’s.Thiscanbeverifiedbyexplicitlycalculatingandcomparing values of ^ppi,N and then taking the maximum value. This is CI’s using the two methods – for example, if repeatedforincreasingNinincrementsof10untilthemaximum M~1,000,000wwN~30and^yy ~10,thenbothmethodsgive CIlengthbecomesƒ0:1.TheNvaluewithmaximumCIlength i,N theCI ½0:083,0:285(cid:2) forp. closest to 0.2 is then chosen, and then increased or decreased as i In deriving equation (13), it was assumed that Pii is known. necessarytofindNƒ0:2.Nƒ0:1 isderivedanalogously. Maximum However,P isgenerallyunknown.Inthiscase,a§100ð1{aÞ% CIlengthsof0.2and0.1representsmallmaximumabsoluteerrors ii confidence region (CR) can be derived for p and P in an intheestimatedallelefrequencyof0.1and0.05respectively,ifthe i ii analogous way, by considering the null hypothesis estimate is taken as the mid-point of the CI. The first scenario H’ :p~p ,P ~P .The CR canbe derived bydetermining examined,Scenario1,issamplingfromapopulationofM~1,000 0 i i,0 ii ii,0 the set of vectors ðp ,P Þ for which the null hypothesis is not at HWE. M~1,000 is larger than or on the same order of i,0 ii,0 rejected at significance level a. This set is denoted by magnitude as the upper bound of the range of size estimates for P ~(cid:10)ðp ,P Þ:^yy [B (cid:11), where B is as defined before. 15/24(63%)speciespopulationscollatedbyFrankhametal.[25], Uis,Hin’0g,atheCRi,,0§1ii0,00ð1{i,NaÞ%CaI’sforp anda P canbedefinedas covering mammals, birds, insects and plants. Thus, M~1,000 is i ii takentorepresentalargepopulation.Later,Misvariedovertwo ordersofmagnitudetoconsiderpopulationsthatrangefromsmall (cid:6)L’pi,a,U’pi,a(cid:7)~hMin(cid:8)Pi,H’0,a,1(cid:9),Max(cid:8)Pi,H’0,a,1(cid:9)i, ð14aÞ toverylarge.SincethereisaHWE,Pii~p2i.Also,inthesampled population, thenumber of homozygotes with allele i,MP , must ii and beaninteger.Thisrestrictsthenumberofvaluesthatp cantake i to11equallyspacedvaluesintheinterval½0,1(cid:2)(startingfrom0). (cid:6)L’Pii,a,U’Pii,a(cid:7)~hMin(cid:8)Pi,H’0,a,2(cid:9),Max(cid:8)Pi,H’0,a,2(cid:9)i, ð14bÞ hoSldc,ensuacrhiosth2ataPnidi=3pr2ie.pIrnesSecnetnsairtuioat2io,nPsiiwihsearsesuHmWedEtodotaekseniotst minimumvalue,whichisMaxf0,2p{1g,whereasinScenario3, i whereP isthesetofvaluesconsistingofthejthelementsin i,H’0,a,j Pii is assumed to take its maximum value, which is pi. In the set of vectors Pi,H’0,a. In determining Pi,H’0,a, pi,0 and Pii,0 comparisonwithScenario 1,pi cannowtake alargernumber of values over the biologically feasible ranges are tested. Since the values–itcantakethe2,001equallyspacedvaluesintheinterval numberofcopiesofalleleAiinthepopulationmustbeatleastthe ½0,1(cid:2) in Scenario 2 and the 1,001 equally spaced values in the number foundinthesample,^yy (cid:12)ð2MÞƒp .Also,thenumber interval ½0,1(cid:2) in Scenario 3. Since the variance of p is an i,N i,0 i,N PLOSONE | www.plosone.org 4 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies increasing function of P (equation (11)), CI’s derived using p terminology is not used in our study for clarity. The subpopula- ii i,N areexpectedtoincreaseinlengthwithP aswell.Thismeansthat tions form part of a total of about 536 subpopulations on the ii CIlengthsandminimumsamplesizes(Nƒ0:2 andNƒ0:1)derived A˚landIslands,withanestimatedtotalsizerangingfrom35,000to forScenarios2and3areexpectedtoencompasstheentireranges at least 200,000 [13]. Thus, the size of each subpopulation is of possiblevalues. assumed to be ½(35,000z200,000)=2(cid:2)=536~219, such that the Inthethreescenariosexamined,P isspecifiedasafunctionof Pra¨sto¨ and Finstro¨m populations are assumed to have a size of ii p, such that equation (13) can be used to calculate a CI for p M ~2|219~438 and M ~7|219~1533 respectively. The i i 1 2 givenavalueof^ppi,N.CalculationofthisCIinvolvesconsideringall observedsampleallelefrequenciesinthetwopopulations,denoted possiblevaluesofp anddeterminingunderwhichvalues^pp falls by^pp and^qq respectively,aretakendirectlyfrom[13].These i i,N i,N1 i,N2 within the corresponding acceptance region (as described above). areusedtocalculatetheobservednumberofcopiesofalleleA in i An alternative scenario, Scenario 4, is that the relationship eachpopulation,denotedby^yy and^zz respectively,usingthe i,N1 i,N2 betweenPii andpi isunknown.Inthiscase,equation(14a)hasto formulae ^yyi,N1~2N1^ppi,N1 and ^zzi,N2~2N2^qqi,N2. Due to rounding be used to calculate a CI for pi given a value of ^ppi,N, which errorin^ppi,N1 and^qqi,N2 valuesfrom[13],^yyi,N1 and^zzi,N2 hadtobe involvesconsideringallpossiblevaluesofpiandPii.Thisresultsin rounded to the nearest integer. Since the true homozygosity of a considerable increase in computation time; for example, when eachalleleineachpopulationisunknown[13],conservativeCI’s sampling N~30 individuals from a population of M~1,000, are calculatedassuming P takesitsmaximumvalueof p (thisis ii i 2,001 values of pi need to be considered but 1,002,001 expectedtomaximizethelengthsoftheCI’s,asexplained inthe combinations of pi and Pii need to be considered, representing previous section). In addition, a~0:025=3 is chosen, such that a an increase in computational time by two orders of magnitude. §98:3% CI is derived for each of the three population allele However,foragivenvalueofpi,themaximumvalueofPii gives frequencies in each of the two populations. The reason for this thehighestvarianceforpi,N andisthusexpectedtomaximizethe choice of a is because for each population, using the Bonferroni lengthoftheacceptanceregionwithinwhich^ppi,N couldfallwithin. Inequality[34],thecubicregiondefinedbythethreeCI’scanbe Thus,CI’sforpi derivedunderScenario4areexpectedtoclosely taken as a §100ð1{2bÞ% confidence region (CR) for the three match those derived under the case of maximum homozygosity population allele frequencies, where 3a§b. The choice of (Scenario3).ThiscanbeverifiedbyexplicitcalculationoftheCI’s a~0:025=3 allows §95% CR’s to be derived for each set of – for example, given M~100, N~30 and ^ppi,N~15, the CI’s three population allelefrequencies inthetwopopulations. derived under Scenario 4 and Scenario 3 are ½0:14,0:4(cid:2) and Todemonstratehowthetheorydevelopedinthispapercanbe ½0:15,0:39(cid:2), representing a difference in length of only 0.02. usedtoderiveCI’sforgeneticindicatorsthatareafunctionofthe Similar results hold for ^ppi,N~10 and 5. Therefore, results from population allele frequencies, the §95% CR’s are used to Scenario3areusedtoapproximatethoseforScenario4,obviating calculate a §95% CI for Jost’s D for the CINX1 locus and the the need for long computational runtimes to explore all possible Pra¨sto¨ andFinstro¨mbutterflypopulations.Jost’sDisameasureof combinations of pi andPii. genetic distance between populations [9]. For the case of two Lastly,fromequation(11),thevarianceofpi,N isanincreasing populations, itisgivenby function of M. Thus, CI lengths for p are expected to increase i with M, which would result in increases in Nƒ0:2 and Nƒ0:1. To (cid:4) (cid:13)J (cid:14)(cid:5) test this explicitly, for a population at HWE and with N~30, D ~2 1{ T , ð15Þ maximumlengthsforthe§95%CIforp (acrossallvaluesof^pp ) Jost JS i i,N are calculated for M~100, 250, 500, 750, 1,000, 2,500, 5,000, where,using thenotation inthisexample, 7,500and10,000.Thisrangecorrespondstopopulationsthatare smalltovery large [25]. J ~X4 (cid:8)pizqi(cid:9)2~1ðp zq Þ2z1X3 ðpzqÞ2 Application to an empirical data set for checkerspot T 2 4 4 4 4 i i i~1 i~1 butterflies ð16Þ To demonstrate how the theory developed can be used in ~1 2{X3 p{X3 q!2z1X3 ðpzqÞ2 practice, it is applied to microsatellite data for samples from two 4 i i 4 i i populations of the checkerspot butterfly (Melitaea cinxia L.) i~1 i~1 i~1 occupying meadows on the A˚land Islands in Finland [13]. and Specifically, for the CINX1 locus, §100ð1{2aÞ% CI’s are calculated for the frequencies of alleles A, B and C for the Pra¨sto¨ 4 4 and Finstro¨m populations, using the corresponding sample allele Pp2zPq2 frequencies(Table1ofPaloetal.[13]).Foreachpopulation,aCI J ~i~1 i i~1 i ~1(cid:2)p2zq2(cid:3)z1X3 (cid:2)p2zq2(cid:3) is not calculated for the population frequency of the fourth and S 2 2 4 4 2 i i finalalleleD,becausethisisfixedbythepopulationfrequenciesof i~1 ð17Þ tuhseedfiirnstouthrrseteudayll,ehleesn.cTefoorathc,haiellveelescoAn,sBis,teCncayndwDithartehreefneorrtaedtiotno ~1224 1{X3 pi!2z 1{X3 qi!235z12X3 (cid:2)p2izq2i(cid:3): asallelesA ,A ,A andA respectively.Thesamplesizeforthe i~1 i~1 i~1 1 2 3 4 Pra¨sto¨ population is N ~53, whereas that for the Finstro¨m 1 ptoopduelnaotitoentihseNs2a~mp7l4e[a1l3le]l.epfi,rNe1quanendcqiei,sN2o,fiA[fi1n,2t,h3e,4Pg,ra¨asrteo¨uasnedd mAinloimwiezringlimthite tfuoncatio§n9in5%(15C) Igivfeonr tDheJosctoncsatnrainbtes tdheartivtehde sbixy i Finstro¨mpopulationsrespectively.Similarly,p andq areusedto population allele frequencies are contained within the two i i denote the population allele frequencies of A in the two corresponding §95% CR’s, and the two constraints i populations, respectively. The two populations consist of two 1{P3i~1pi~p4§0 and 1{P3i~1qi~q4§0. This lower limit and seven subpopulations respectively. Thus, technically, the two is denoted by L . Similarly, the upper limit to a §95% CI for D populations can be referred to as metapopulations, although this D canbederivedbymaximizingthefunctionin(15)underthe Jost PLOSONE | www.plosone.org 5 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies same constraints. This is denoted by U . In this study, the D ‘‘Minimize’’ and ‘‘Maximize’’ functions in Mathematica v5.0 [31] areusedtocomputeL andU ,butcorrespondingfunctionsmay D D beusedinothersoftwarepackages,suchasthe‘‘solnp’’functionin the ‘‘Rsolnp’’ R package. A §95% CI for D can then be Jost definedastheinterval½L ,U (cid:2).SupportingWebpage1provides D D Mathematica code that can be used to calculate ½L ,U (cid:2) for the D D butterfly case study examined (http://rpubs.com/kkeenan02/ Fung-Keenan-Mathematica/).CorrespondingRcodeispresented on Supporting Webpage 2 (http://rpubs.com/kkeenan02/Fung- Keenan-R/). Results Maximum length of §95%confidence interval with increasing sample size For Scenario 1, where the sampled diploid population of size M~1,000 was at HWE, the maximum length of the §95% CI for the population frequency of allele A, p, considering all i i possible observed values of the sample allele frequency, ^pp , i,N decreased non-linearly with sample size N (Figure 1). The Figure2.Changeinlengthof§95%confidenceintervalacross minimum N required to achieve a maximum length ƒ0:2, differentobservedvalues,underHardy-Weinbergequilibrium. Nlenƒg0t:h2,ƒwa0s:12,2N,ƒw0h:1e,rweaasst4h9e(Fmiginuirmeu1m). N required to achieve a cHFooanrrdfaiyd-esWanmeciepnbilneetreogrfvesaiqlzu(eCilINi)bfrtoiaurkmteh,negfprraoopmphulasahtpoioownpinufrgleaqttiuhoeennloceyfngositfzheanoMfalt~lheel1e,§0A090(5pa%t) ForgivenNand^pp ,theCIforp consistsofallpossiblevalues i i i,N i acrossallpossibleobservedvaluesofthesampleallelefrequency(p ). i,N ofpi forwhich^ppi,N liesinthecorrespondingacceptanceregion,as Thethreecurvescorrespond toN~10,30and50.Equation(13)was described in Methods. This acceptance region is expected to usedtocalculatethelengthofeachCI. doi:10.1371/journal.pone.0085925.g002 increasewiththevarianceofp ,s2½p (cid:2).AtHWE,thefrequency i,N i,N ofhomozygotesofalleleA,P ,isequaltop2;thus,accordingto i ii i equation (11), s2½p (cid:2) takes its highest values at intermediate i,N valuesofp.Asaresult,intermediatevaluesof^pp arelikelytofall i i,N within the acceptance regions of more values of p, such that the i corresponding CI lengths are generally longer. Simulation results for Scenario 1 were broadly in agreement with these theoretical expectations (Figure2). InScenario2,wherethepopulationwasnolongeratHWEbut attaineditslowestP ,themaximumCIlengthalsodecreasednon- ii linearlywithincreasingN(Figure1).ComparedwithScenario1, Nƒ0:2wasslightlylargerandNƒ0:1waslargerbyaboutafactorof two, taking values of 26 and 94 respectively (Figure 1). This is contrary to expectations that minimum homozygosity would give smallerNƒ0:2andNƒ0:1(seeMethods).Thereasonisthatalthough s2½p (cid:2) is smaller for given N and p compared with the case of i,N i HWE, which would be expected to result in fewer p values for i which a given ^pp falls within the corresponding acceptance i,N regionsandhenceashorterCI,therearemorepossiblevaluesof Figure 1. Change in maximum length of $95% confidence p (2,001 compared with 11 – see Methods), which increased the i interval with increasing sample size. Graph showing how the number ofp valuesforwhich ^pp falls withinthecorresponding maximum length of the §95% confidence interval (CI) for the i i,N acceptance regions. For Scenario 2, P ~Maxf0,2p{1g, such population frequency of an allele A (p) changes with increasing ii i i i sample size N, when sampling from a diploid population of size thats2½pi,N(cid:2)attainsitshighestvaluesatpi valuescloseto0.25and M~1,000. For a given N, the maximum CI length was derived by 0.75(equation(11)).Thus,forgivenN,valuesof^pp near0.25and i,N calculating CI lengths for all possible values of the observed sample 0.75 are expected to generally exhibit the longest CI lengths. allelefrequencyandthentakingthemaximumlength.Thethreecurves SimulationresultsforScenario2werebroadlyinagreementwith correspond to three scenarios where the population is (1) at Hardy- Weinbergequilibrium(HWE),(2)attainsitslowesthomozygosityvalue these theoretical expectations (Figure3). withrespecttoA,and(3)attainsitshighesthomozygosityvaluewith For the last scenario, Scenario 3, the population attained its i respect to Ai. For each curve, the two filled circles represent the highestPii,representingtheoppositeextremetoScenario2.Asfor minimumNvaluesrequiredforthemaximumCIlengthtoreachvalues the previous two scenarios, the maximum CI length decreased ofƒ0:2andƒ0:1.Forvisualguidance,thetwodashedhorizontallines markmaximumCIlengthsof0.2and0.1.TheexactvaluesofNtested non-linearlywithincreasingN(Figure1),butthistime,Nƒ0:2and aredescribedinMethods,andequation(13)wasusedtocalculatetheCI Nƒ0:1 took values of 94 and 285 respectively. These values were lengths. approximately four and six times as large as the corresponding doi:10.1371/journal.pone.0085925.g001 values in Scenario 1 (Figure 1). In Scenario 3, Pii~pi, and PLOSONE | www.plosone.org 6 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies Figure3.Changeinlengthof§95%confidenceintervalacross Figure4.Changeinlengthof§95%confidenceintervalacross differentobservedvalues,underminimumhomozygosity.Fora differentobservedvalues,undermaximumhomozygosity.For sample of size N taken from a population of size M~1,000with the asampleofsizeNtakenfromapopulationofsizeM~1,000withthe minimum homozygosity possible for an allele A, graph showing the maximum homozygosity possiblefor an allele A, graph showingthe i i length of the §95% confidence interval (CI) for the population length of the §95% confidence interval (CI) for the population frequencyofA (p)acrossallpossibleobservedvaluesofthesample frequencyofA (p)acrossallpossibleobservedvaluesofthesample i i i i allelefrequency(p ).ThethreecurvescorrespondtoN~10,30and allele frequency (p ). The four curves correspond to N~10, 30, 100 i,N i,N 100.Equation(13)wasusedtocalculatethelengthofeachCI. and200.Equation(13)wasusedtocalculatethelengthofeachCI. doi:10.1371/journal.pone.0085925.g003 doi:10.1371/journal.pone.0085925.g004 equation(11)showsthatintermediatevaluesofpi givethehighest ½0:598,0:872(cid:2) and ½0:114,0:381(cid:2) respectively. Using the same valuesofs2½pi,N(cid:2).Therefore,asinScenario1,intermediatevalues methodology, for the Finstro¨m population, the §98:3% CI’s for of^ppi,N areexpectedtogenerallyexhibitthelongestCIlengthsfor the population frequencies of A1, A2 and A3 were calculated given N. Again, simulation results were consistent with these as½0:003,0:109(cid:2),½0:714,0:914(cid:2)and½0:003,0:109(cid:2) respectively. expectations (Figure4). Maximum length of §95%confidence interval with increasing population size WhentakingasampleofsizeN~30fromadiploidpopulation atHWE,themaximumlengthofthe§95%CIforp,acrossall i ^pp , increased with population size M (Figure 5). As M was i,N increased from a small value of 100 to 1,000, the maximum CI length remained the same at 0.2 (this is possible because there is nothingtopreventdifferentvaluesofMgivingthesamemaximum CIlength,usingequation(13)).ThemaximumCIlengthincreased when M was increased from 1,000 to a large value of 2,500, but only by 0.04. Thereafter, the length remained constant up to a very large value of M~7,500, and only increased by a small amount of 0.02 with a further increase in M to 10,000. Thus, simulation results confirm the theoretical expectation that CI lengthincreaseswithM(asexplainedinMethods–thisexpectation arises because s2½p (cid:2) increases with M, according to equation i,N (11)).However,theincreaseintheCIlengthwasmodestasMwas increased overtwoorders ofmagnitude. Figure 5. Change in maximum length of §95% confidence §95%confidence interval forJost’sDbetween two intervalwithincreasingpopulationsize.Graphshowinghowthe checkerspot butterfly populations maximum length of the §95% confidence interval (CI) for the For the Pra¨sto¨ checkerspot butterfly population, the §98:3% population frequency of an allele Ai (pi) changes with increasing CI’s for the population frequencies forthree of thefour alleles at populationsizeM,whentakingsamplesofsizeN~30.Thepopulation is at Hardy-Weinberg equilibrium. For a given M, the maximum CI theCINX1locuswerecomputedusingequation(13)anddatafrom lengthwasderivedbycalculatingCIlengthsforallpossiblevaluesof Palo et al. [13], as described in Methods. The three alleles are the observed sample allele frequency and then taking the maximum denoted by A1, A2 and A3 respectively, with the population length. M values of 100, 250, 500, 750, 1,000, 2,500, 5,000, 7,500 and frequencies denoted by p1, p2 and p3 respectively. The three 10,000weretested,asindicatedbythefilledcircles. corresponding §98:3% CI’s derived were ½0:002,0:080(cid:2), doi:10.1371/journal.pone.0085925.g005 PLOSONE | www.plosone.org 7 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies Thethree§98:3%CI’sforthePra¨sto¨ populationwereusedto least100ð1{2aÞ,forthemoregeneralcasewherethepopulation formacubicregion,whichcorrespondstoa§95%CRforA ,A size cantake anyvalue largerthan or equaltothesample size. 1 2 and A3 [34]. In the same way, the three §98:3% CI’s for the The method we derived was used to show that the sampling Finstro¨m population were used to form a §95% CR for A1, A2 uncertaintyinpi,measuredasthemaximumlengthofthe§95% and A3. Within these two CR’s, the maximum and minimum CI for pi across all possible values of the observed sample allele valuesofDJost(DJostisgivenbyequation(15))werecalculatedand frequency^ppi,N,decreasednon-linearlywithNwhensamplingfrom usedtoderivea§95%CIforDJost,asdescribedinMethods.This a large population (M~1,000) under three archetypal scenarios. CI for DJost was derived as (cid:6)2:34|10{5,0:186(cid:7). If DJost was Thesethreescenariosrepresentthecaseswherethepopulation(1) calculatedsimplyusingthesampleallelefrequenciesandequation isatHardy-Weinbergequilibrium(HWE),(2)hasthelowestvalue (15),thenonly onevaluewould havebeen obtained: 0.043. ofPiiand(3)hasthehighestvalueofPii.Asexpectedfromtheory, for any given N, the maximum CI lengths for Scenario 3 were Discussion always greater than corresponding lengths in Scenarios 1 and 2. However,themaximumCIlengthsforScenario2wasunexpect- In scientific studies that use sample data to estimate unknown edlygreaterthanthatinScenario1forsomevaluesofN,reflecting population parameters, the sampling uncertainty in the estimates the greater possible number of values p can take in a finite i needs to be quantified in order to make reliable inferences on population with minimum homozygosity compared to one at population processes captured by the parameters. This forms an HWE.Thisillustrateshowthefinitesizeofapopulationcangive essential part of scientific hypothesis-testing. Therefore, in studies oppositetrendstothoseobtainedunderanassumptionofinfinite of population genetics, it is essential to quantify the sampling size. According to theory and simulations, Scenario 3 gives CI uncertainty of key population parameters used to infer past and lengths that closely approximate those in the case where P is ii present evolutionary processes. These include allele frequencies, unknown(seeMethods).Thus,ifP isunknown,CIlengthsderived ii which are often used to quantify genetic variation among underScenario 3should be used.This wastheapproach usedin populations,therebyallowinghypothesesonprocessesdrivingthis the application of our method to sample data for two butterfly variationtobetested(e.g.,[1,2,3,4,5]).However,manystudiesdo populations,discussedfurtherbelow.Ontheotherhand,ifthereis not include sampling uncertainty for allele frequencies, instead evidence that the population is at HWE, then the shorter CI’s presenting and/or using single point estimates based on one derived underScenario1 could beused. sampleperpopulation(e.g.,[11,12,13,14,15,16,17,18]).Thus,itis Underthethreescenariosexamined,thenon-lineardecreasesin not possible to assess the accuracy of any inferences from these samplinguncertaintywithincreasingNareconsistentwithresults studies.Thishindersnotonlytheadvanceofscientificknowledge fromthesimulationstudyofHaleetal.[19],whofoundthatthe but also decision-making based on this knowledge, such as for average difference between p and p also exhibited non-linear sustainablemanagementandconservationofnaturalresources.In i,N i decreases with N. However, Hale et al. [19] did not use their thiscontext,theworkpresentedinthispaperisvaluableinthatit simulationresultstoquantifysamplinguncertaintyfortherealistic provides a method of quantifying sampling uncertainty in allele situation where only one sample is taken from a population; this frequencies for diploid populations, in the form of confidence situationwasconsideredinourstudy.Furthermore,asmentioned intervals(CI’s)containingtruevalueswithprobabilityequaltoor in the Introduction, for given N, they only used 100 samples to greater than a desired threshold. numerically construct the distribution for p , resulting in an The method presented pertains to the general case of a locus i,N incomplete distribution that may not closely reflect the true withnalleles,withasampleofsizeNtakenfromapopulationof distribution. This highlights a weakness of a simulation-based size M and any degree of homozygosity with respect to the n approachwithoutarigorousmathematicalunderpinning,whichis alleles.Inthiscase,themethodallowsconstructionofaCIforthe presentinourapproach.Haleetal.[19]concludedthatN=25– populationfrequencyofeachallele,whichcanthenbecombined 30issufficienttogiveaccurateestimatesofp,butthisconclusion tocreateajointconfidenceregion(CR)forallthepopulationallele i has to be interpreted in light of the limitations identified. Our frequenciesatagivenlocus.Itisnotedthatifmorethanonelocus results refine this conclusion by showing that across the three is considered simultaneously, then a joint CR for all population scenariosexamined,N=49–285isrequiredtoensurethat,witha allelefrequenciesatalllocicanbecalculatedbycombiningCI’sin highprobabilityof§0:95,anestimateforp canbederivedfrom an analogous way. For the subcase of an infinite population size i (M??) and a large sample size of N§30, Weir [10] had any one sample that is within 0.05 of the true value; this proposed an approximate 100ð1{2aÞ% CI for the population correspondstoaCIoflengthƒ0:1.Toensurethattheestimateis within0.1ratherthan0.05,correspondingtoaCIoflengthƒ0:2, allele frequency of an allele A, p. This is ½pi,N{z1{ass^½pi,N(cid:2),pi,N{zass^½pi,N(cid:2)(cid:2),wherepi,iN isthiesampleallele our results show that N=22–94 is required. Thus, N~30 is not frequency; z satisfies Wðz Þ~a, with W being the cumulative guaranteed to give ‘‘accurate’’ estimates of pi under all or most a a scenarios, and N values up to 10 times larger could be required. distributionfunction(cdf)ofthestandardNormaldistribution;and ss^½p (cid:2)isanestimateofthestandarddeviationofp usingsample Decreasing the population size M from 1,000 would help to i,N i,N data.ss^½p (cid:2)isspecifiedbyequation(11)withp replacingp and decreasesamplinguncertainty,butresultsshowedthatdecreasing i,N i,N i P , the sample frequency of homozygotes with allele A, Movertwoordersofmagnitudefrom10,000to100onlyresulted reipi,Nlacing P , the corresponding population frequency of homoi- inmodestdecreasesinthemaximumCIlengthofƒ0:06,withno ii zygotes. However, the accuracy of this CI depends both on how decrease when M was decreased from 1,000 to 100. Thus, the close ss^½p (cid:2) istothetrue standard deviation s½p (cid:2) (specified by overallconclusionisthatN~30isofteninsufficienttoguarantee i,N i,N equation (11)) and how close the cdf of pi,N is to a Normal accurateestimatesofpi,inthesensethatpiiswithin0.05or0.1of distributionwiththesamemeanandvariance.Thisaccuracyhas the estimate. Considering that alleles at highly polymorphic loci, notbeenquantified[10]andthus,itisnotknownwhethertheCI such as microsatellites, often occur at population frequencies of actually contains pi with a probability of at least 100ð1{2aÞ, v0:05[35],itmightbedesirabletoderiveCI’sforpi thatareof renderingitsuseproblematic.Inthisstudy,wehaverectifiedthis lengthv0:1.Thus,samplesizesevenlargerthanthevaluesfound problembyconstructingaCIforp withprobabilitycoverageofat inour simulations mightberequired under somecircumstances. i PLOSONE | www.plosone.org 8 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies The application of our method to empirical data for two ofboththebinomialandunivariatehypergeometricdistributions, populations of the checkerspot butterfly [13] demonstrated how when compared with a hypothesis-testing approach analogous to the underlying theory can be applied to construct joint §95% the one used in this paper. If their method could be extended to CR’s for the population frequencies of multiple alleles at a single the population allele frequency parameter for the more complex locus. These CR’s were then used to construct a §95% CI for distribution used in this paper (equation (9)), based on a Jost’s D, which measures genetic differentiation between the two multivariate hypergeometric distribution (equation (1)), then this populations. This illustrates how our method can be used to mightresultintighterCI’sforpopulationallelefrequencieswhen quantify sampling uncertainty in genetic indicators that are a samplingfrom afinite diploid population. function of population allele frequencies, thus facilitating hypoth- In addition, the CI’s derived in our paper were designed to esis-testing and also risk-based natural resource management. In quantifytheuncertaintyinpopulationallelefrequenciesthatarises theexampleconsidered,thesinglepointestimateofJost’sDusing fromtakingarandomsampleofafinitediploidpopulation,which thesampleallelefrequencieswasaboutfourtimeslowerthanthe can exhibit any degree of relatedness. The ‘‘random’’ refers to upper bound of its §95% CI. Thus, use of the single point equalprobabilityofchoosingindividualsthatmaybeofanytype, estimatewithoutaccountingforsamplinguncertaintycouldleadto whichdoesnotimplythatonceanindividualofaparticulartype misleading conclusions. The effectiveness of any management hasbeensampled,thenextindividualsampledisequallylikelyto measures based on such conclusions would be compromised, belong to any of the types (i.e., does not imply individuals in the hinderingtheachievementofobjectivesrelatedtoconservationor population or sample are unrelated). Relatedness among individ- sustainable use. Our example therefore highlights the practical ualsinthepopulationisimplicitlyincludedwithinthepopulation utility of themethodthat wehavederived. frequencies of the different genotypes, and is thus accounted for In conclusion, we have presented a rigorous mathematical when calculating the sampling distribution for the population method for quantifying sampling uncertainty in estimates of frequency of an allele A, as specified by equation (1). However, i population allele frequencies, for a general case that has hitherto this representation does not give an explicit quantification of the notbeenanalyzed.Inaddition,wehavedemonstrateditspractical degree of relatedness among individuals in the population, for application in informing sampling design and determining exampleusingkinshipcoefficients.DeGiorgioandRosenberg[38] uncertainty in genetic indicators. Thus, the method derived used such coefficients to derive an unbiased estimator of advances both theory and practice, with broad implications for a heterozygosity (H~1{Pn p2, for a locus with n alleles) in rangeofdisciplines,including:conservationgenetics,evolutionary i~1 i thecaseofsamplingfromapopulationofdiploidindividualsthat genetics, genetic epidemiology, genome wide association studies couldberelated,andDeGiorgioetal.[39]extendedtheseresults (GWAS),forensicsandmedicalgenetics.Inparticular,themethod to the case of individuals with arbitrary ploidy. In their provides exact answers to the question of how many individuals calculations, the number of copies of allele A in each sampled i needtobesampledfromapopulationinordertoachieveagiven individualkwastreatedasarandomvariableandthecovariances levelofaccuracyinestimatesofpopulationallelefrequencies.This ofthesevariableswerethenrelatedtothekinshipcoefficients.This is a question that has rarely been studied before [19], despite its is different to calculations in our paper, where the number of important practical implications. Previous studies have derived sampledindividualswithallelesA andA wastreatedasarandom i j samplesizesrequiredtosample,withhighprobability,atleastone variable(seeMethods).Themethodinthispapermightberevised copy of all alleles at a locus above a given frequency [35,36,37], byconsideringthenumberofcopiesofalleleA foreachsampled but these do not correspond to sample sizes required to achieve i individual instead, following [38,39]. This could allow explicit accurate estimates of population allele frequencies. Derivation of quantificationofrelatednessinthecontextofderivingCI’sforthe the latter requires explicit quantification of sampling uncertainty, population allele frequencies. as wehavedone inthisstudy. Supporting Information Possible future extensions The CI’s and CR’s constructed using our method are File S1 Derivation of the variance of the sample allele conservative in the sense that they contain the true values with a frequency. probability equal to or greater than a desired threshold. This (PDF) conservative property is useful in hypothesis-testing if there is a needtodecreasetheprobabilityofobtainingafalsepositivebelow Acknowledgments a certain threshold. However, researchers would ideally like to construct CI’s and CR’s containing true values with a known Wewouldliketothanktwoanonymousreviewersforhelpfulandinsightful probability, not just with probability at or above a known comments,whichhaveledtogreatimprovementsinthismanuscript. threshold. Therefore, future research could attempt to tighten the intervals and regions that we have derived, ideally until they Author Contributions coveraknownprobability.Forexample,CaiandKrishnamoorthy Conceived and designed the experiments: TF KK. Performed the [23] devised a method (using their ‘‘Combined Test’’ approach) experiments: TF. Analyzed the data: TF KK. Contributed reagents/ thatwasshowntogiveshorterCI’sfortheprobabilityparameters materials/analysistools:TFKK.Wrotethepaper:TFKK. References 1. BowcockAM,KiddJR,MountainJL,HebertJM,CarotenutoL,etal.(1991) 3. LuikartG,EnglandPR,TallmonD,JordanS,TaberletP(2003)Thepowerand Drift, admixture, and selection in human evolution: A study with DNA promiseofpopulationgenomics:fromgenotypingtogenometyping.Nature4: polymorphisms.ProcNatlAcadSciUSA88:839–843. 981–994. 2. AkeyJM,ZhangG,ZhangK,JinL,ShriverMD(2002)Interrogatingahigh- 4. deKovelCGF(2006)Thepowerofallelefrequencycomparisonstodetectthe densitySNPmapforsignaturesofnaturalselection.GenomeRes12:1805– footprintofselectioninnaturalandexperimentalsituations.GenetSelEvol38: 1814. 3–23. PLOSONE | www.plosone.org 9 January2014 | Volume 9 | Issue 1 | e85925 ConfidenceIntervalsforAlleleFrequencies 5. FriedlanderSM,HerrmannAL,LowryDP,MephamER,LekM,etal.(2013) 21. LaplacePS(1812)The´orieanalytiquedesprobabilite´s.Paris,France,Courcier. ACTN3 allele frequency in humans covaries with global latitudinal gradient. 464p. PLoSONE8(1):e52282. 22. Agresti A, Coull BA (1998) Approximate is better than ‘‘exact’’ for interval 6. NeiM(1972)Geneticdistancebetweenpopulations.AmNat106:283–292. estimationofbinomialproportions.AmStat52:119–126. 7. NeiM,ChesserRK(1983)Estimationoffixationindicesandgenediversities. 23. Cai Y, Krishnamoorthy K (2005) A simple improved inferential method for AnnHumGenet,47:253–259. somediscretedistributions.ComputStatDataAnal48:605–621. 8. Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of 24. ClopperCJ,PearsonES(1934)Theuseofconfidenceorfiduciallimitsillustrated populationstructure.Evolution38:1358–1370. inthecaseofthebinomial.Biometrika26:404–413. 9. JostL(2008)GSTanditsrelativesdonotmeasuredifferentiation.MolEcol17: 25. Frankham R (1996) Relationship of genetic variation to population size in 4015–4026 wildlife.ConservBiol10:1500–1508. 10. WeirBS(1996)GeneticdataanalysisII.Sunderland,USA,SinauerAssociates. 26. GreenDM(2003)Theecologyofextinction:populationfluctuationanddecline 445p. inamphibians.BiolConserv111:331–343. 11. EanesWF,KoehnRK(1978)AnanalysisofgeneticstructureintheMonarch 27. Vredenberg VT,Knapp RA,TunstallTS, BriggsCJ(2010)Dynamicsofan butterfly,DanausplexippusL.Evolution32:784–797. emergingdiseasedrivelarge-scaleamphibianpopulationextinctions.ProcNatl 12. Forbes SH, Hogg JT, Buchanan FC, Crawford AM, Allendorf FW (1995) AcadSciUSA107:9689–9694. Microsatelliteevolutionincongenericmammals:domesticandbighornsheep. 28. JohnsonNL,KotzS,BalakrishnanN(1997)Discretemultivariatedistributions. MolBiolandEvol12:1106–1113. Hoboken,USA,Wiley-Blackwell.328p. 13. Palo J, Varvio S-L, Hanski I, Va¨ino¨la¨ R (1995) Developing microsatellite 29. CoxDR,HinkleyDV(1974)Theoreticalstatistics.London,UK,Chapmanand markers for insect population structure: complex variation in a checkerspot Hall.511p. butterfly.Hereditas123:295–300. 30. TalensE(2005)StatisticalauditingandtheAOQL-Method.Ridderkerk,The 14. LuikartG,AllendorfFW,CournuetJ-M,SherwinWB(1998)Distortionofallele Netherlands,LabyrintPublications.164p. frequencydistributionsprovidesatestforrecentpopulationbottlenecks.JHered 31. WolframResearchInc.(2003)MathematicaEdition:Version5.0.Champaign, 89:238–247. Illinois,,USA,WolframResearchInc. 15. Rank NE, Dahlhoff EP (2002) Allele frequency shifts in response to climate 32. The MathWorks Inc. (2012) MATLAB version 8.0. Natick, Massachusetts, , change and physiological consequences of allozyme variation in a montane USA,TheMathWorksInc. insect.Evolution56:2278–2289. 33. RDevelopmentCoreTeam(2010)R:alanguageandenvironmentforstatistical 16. Hugnet C, Bentjen SA, Mealey KL (2004) Frequency of the mutant MDR1 computing.Vienna,Austria,RFoundationforStatisticalComputing.1731p. alleleassociatedwithmultidrugsensitivityinasampleofcolliesfromFrance. 34. Rice JA (1995) Mathematical statistics and data analysis, second edition. JVetPharmacolTher27:227–229. Belmont,USA,DuxburyPress.651p. 17. SeiderT,FimmersR,BetzP,LedererT(2010)Allelefrequenciesofthefive 35. ChakrabortyR(1992)Samplesizerequirementsforaddressingthepopulation miniSTR loci D1S1656, D2S441, D10S1248, D12S391 and D22S1045 in a geneticissuesofforensicuseofDNAtyping.HumBiol64:141–159. Germanpopulationsample.ForensicSciIntGenet4:e159–e160. 36. GregoriusH-R(1980)Theprobabilityoflosinganallelewhendiploidgenotypes 18. Petrejcˇ´ıkova´E,Sota´kM,Bernasovska´J,Bernasovsky´I,Re˛bałaK,etal.(2011) aresampled.Biometrics36:643–652. Allelefrequenciesandpopulationdatafor11Y-chromosomeSTRsinsamples 37. Sjo¨grenP,Wyo¨niP-I(1994)Conservationgeneticsanddetectionofrarealleles fromEasternSlovakia.ForensicSciIntGenet5:e53–e62. infinitepopulations.ConservBiol8:267–270. 19. Hale ML, Burg TM, Steeves TE (2012) Sampling for microsatellite-based 38. DeGiorgioM,RosenbergNA(2009)Anunbiasedestimatorofgenediversityin population genetic studies: 25 to 30 individuals per population is enough to samplescontainingrelatedindividuals.MolBiolEvol26:501–512. accuratelyestimateallelefrequencies.PLoSONE7(9):e45170. 39. DeGiorgioM,JankovicI,RosenbergNA(2010)Unbiasedestimationofgene 20. Gillespie JH (2004) Population genetics: a concise guide, second edition. diversityinsamplescontainingrelatedindividuals:exactvarianceandarbitrary Baltimore,USA,TheJohnHopkinsUniversityPress.217p. ploidy.Genetics186:1367–1387. PLOSONE | www.plosone.org 10 January2014 | Volume 9 | Issue 1 | e85925