ebook img

Alternative Methods for H1 Simulations in Genome Wide Association Studies PDF

0.35 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Alternative Methods for H1 Simulations in Genome Wide Association Studies

Alternative Methods for H1 Simulations in Genome Wide Association Studies V. Perduca a,∗, C. Sinoquet b, R. Mourad b,c, G. Nuel a,* aMAP5-UMRCNRS8145Universit´eParisDescartes,bLINA-UMRCNRS6241Universit´edeNantes, cEcolePolytechniquedel’Universit´edeNantes 2 ToappearinHumanHeredity. the prevalence, the better the power. Conclusions: Our ap- 1 proachisavalidalternativetoHapgen-typemethods;itisnot 0 2KeyWords: onlydramaticallyfasterbutalsohastwomainadvantages: 1) there is no need for sophisticated genotype models (e.g. ha- nStatistical Power · Disease Model · Rejection · Markov Chain plotype frequencies, or recombination rates); 2) the choice aMonte Carlo (MCMC) · Backward Sampling · Receiver Oper- J of the disease model is completely unconstrained (number atingCharacteristic(ROC)·AreaUndertheCurve(AUC) 4 ofSNPsinvolved,Gene-Environmentinteractions,hybridge- 2 netic models, etc.). Our three algorithms are available in an Rpackagecalledwaffect(“double-uaffect”,forweightedaf- ]Abstract: P fectations). Objective: Assessing the statistical power to detect suscepti- A bility variants plays a critical role in GWA studies both from . Introduction ttheprospectiveandretrospectivepointsofview. Powerisem- a tpiricallyestimatedbysimulatingphenotypesunderadisease Genome-wide association (GWA) studies are a widely-used [smodelH1. Forthispurpose,the“gold”standardconsistsinsi- approach for localizing susceptibility variants responsible for mulatinggenotypesgiventhephenotypes(e.g. Hapgen). We common complex genetic diseases [1]. Such studies involve 1introducehereanalternativeapproachforsimulatingpheno- investigatingahugenumberofgeneticsmarkerssuchassin- v types under H1 that does not require generating new geno- glenucleotidepolymorphisms-SNPs(fromhundredsofthou- 6 4typesforeachsimulation. Methods: Inordertosimulatephe- sands to millions), for cohorts of cases and controls whose 0notypeswithafixedtotalnumberofcasesandunderagiven sizes range from thousands to tens of thousands of individu- 5disease model, we suggest three algorithms: i) a simple re- als. GWAstudieshavemetwithmanysuccesses,mostnotably 1.jectionalgorithm;ii)anumericalMarkovChainMonte-Carlo for type 1 and type 2 diabetes, inflammatory bowel disease, 0(MCMC) approach; iii) and an exact and efficient backward prostatecancerandbreastcancer[2]. 2sampling algorithm. In our study, we validated the three al- InGWAstudies, veryhighfalsepositiveratesareexpected 1gorithms both on a toy-dataset and by comparing them with when there is no correction for multiple testing. Symmetri- : vHapgen on a more realistic dataset. As an application, we cally, control for true negative rate - or power - is necessary. Xithenconductedasimulationstudyona1000GenomesProject PowerestimationisthekeytoevaluatetheefficiencyofGWA dataset consisting of 629 individuals (314 cases) and 8,048 methods [3]. The correct estimation of both rates must take r aSNPsfromChromosomeX.Wearbitrarilydefinedanadditive into account the existence of high-dependency patterns be- disease model with two susceptibility SNPs and an epistatic tweenSNPs,orlinkagedisequilibrium(LD).Theaccuratees- effect. Results: Thethreealgorithmsareconsistent,butback- timation of the family wise type I error risk in the presence wardsamplingisdramaticallyfasterthantheothertwo. Our of LD consists in sampling the H0 distribution through per- approachalsogivesconsistentresultswithHapgen. Usingour mutations of phenotypes [4]. Thus, any association between applicationdata,weshowedthatourlimiteddesignrequires loci and phenotypes is broken. This permutation strategy is a biological a priori to limit the investigated region. We also implementedasagoldstandardinnumerousdedicatedpack- proved that epistatic effects can play a significant role even ages,togetherwithsoftwaresuitesdesignedforGWAstudies when simple marker statistics (e.g. trend) are used. We fi- [5,6,7,8,9]. nally showed that the overall performance of a GWA study Powerisanevenmorecomplicatedfunctionofseveralfac- stronglydependsontheprevalenceofthedisease: thelarger tors: studydesign,correlationpatternsinthegenotypicdata, sizesofcohorts,frequencyofthesusceptibilityallele,relative ∗Corresponding authors: Vittorio Perduca, Gregory Nuel, MAP5 Uni- risk conferred by the causal factor, genetic model (additive, versit´e Paris Descartes, 45 Rue des Saints P`eres, 75006 Paris, France, {vittorio.perduca, gregory.nuel}@parisdescartes.fr dominant, recessive, multiplicative) [10]. As a consequence, 1 the analytical computation of power requires simplified as- thestatusofindividualiaccordingtoboththeprobabilityπ i sumptions, including the approximation of the test statistic that i is a case and to the constraint that the total number distribution under H1 through a probability law [11]. Most of cases is fixed. The probability π is computed from the i powercalculatorsbasedonanalyticalapproachesareusedfor disease model and is proportional to the relative risk for in- two-stage GWA design, e.g. [12, 13]. Recently, an analytical dividuali. Thismethodprovidesseveraladvantages. First,it approach has been proposed to account for LD, under either generalizes permutations, which is the gold standard for ge- H0orH1approximation[14]: afixed-sizeslidingwindowlo- nerating H0 distributions in order to control type I error in callyaccountsfortheinter-markercorrelation. Thisapproach multipletesting. Indeed,permutationsrepresentaparticular bringsanimprovementoverblock-wisestrategiesbyunifying case of our general approach which is obtained by using the H0 and H1 processings in the same framework [15, 16, 17]. same π for all individuals. Secondly, our method is faster i However, with regards to both accuracy and computational than previous ones [3, 21] because it does not require a re- burden, the optimal choice of window size depends on the jectionsamplingsteporthegenerationofnewgenotypesfor structure of the data. Moreover, LD blocks are often am- eachsimulation. Moreover,inthecaseofLDmodeling-based biguous. Thus,thepreviousslidingwindowapproachcannot mapping methods, LD pattern identification needs to be per- account for high-order dependencies between LD blocks. In formed only once. Thirdly, no assumption is made on the particular,thismethodcannotbeusedtoevaluatethepower genotypedistribution,becausethelatteristhesameforeach ofanynovelapproachdesignedtocopewithsuchhighorder simulation. The last advantage is that power can be directly dependencies in GWA studies. In this case, the only solution assessed using real GWA datasets, such as those provided by istousecomputationallyintensivesimulations. the 1000 Genomes Project [22], because disease status can Similar to sampling under H0, the simulation of the H1 be generated according to genotypes without loss due to re- distribution is an appealing way to keep the LD-structured jection sampling. This is most valuable when evaluating the genotypic data. These simulations consist in generating case performanceofnewmappingmethods. and control samples which mimic the LD structure in hu- The paper is organized as follows. We first describe the man genome, i.e. in the creation of, say, k datasets under methoditself,whichwecalledwaffect,thenwevalidateour theH1assumption(atleastoneSNPisasusceptibilitySNP). toolonatoy-exampleandcompareitsresultstoHapgen[20] Nonetheless,breakinganyassociationbetweenalocus(orse- simulations. Wefinallyillustratetheinterestofourapproach veral loci) and the phenotype is far easier than to introduce withapowerstudyconductedonrealGWAdata. suchanassociationinadataset. Two main strategies have been developed to simulate H1. MaterialsandMethods (i)Thefirstone[18]consistsinfirstgeneratingalargesam- ple of haplotypes conditional on reference haplotypes such as HapMap haplotypes [19]. The haplotypes are then paired Detailedaimsandnotations to build diplotypes. The disease status is affected depending WeconsideraGWAstudywithn=n +n individuals(n 0 1 0 onthepenetrancemodelwhichinvolvesasusceptibilitySNP controlsandn cases),andpSNPs. Theith phenotypeisY ∈ 1 i selected at random. (ii) The second strategy [3] consists in {0,1}(Y =1ifindividualiisacase,Y =0ifindividualiisa i i firstrandomlyselectingadiseaseSNPandgeneratingafixed control). Theith genotypeisX ∈{0,1,2}p (thiscorresponds i number of cases and controls. Then diplotypes are assigned tothenumberofrareallelesforeachSNP). at the disease SNP for cases and controls depending on the Ourdiseasemodel,thereafterdenotedH1,isdefinedbythe penetrancemodel. Finally,haplotypesarebuilt(twoforeach probabilityπ =P(Y =1|X )thatindividualiisacasecondi- i i i diplotype)forallremainingSNPsofthechromosome,condi- tionalonher/hisgenotype. Intheparticularcasewhenallπ i tional on reference haplotypes. Nevertheless, both strategies havethesamenon-negativevalue,theobservedgenotypehas entailproblemsofimplementationwhenappliedtopoweres- noeffectonthephenotype. Thiscasecorrespondstothenull timation in GWA studies. The drawback of the first strategy hypothesisH0. Intheparticularcasewhenweconsiderasin- is that it does not make it possible to control the number of gle SNP (p = 1), π takes only three possible values: π = f i i 0 casesandcontrols. Totacklethisissue,rejectionsamplingof ifX =0,π =f =f RR ifX =1,andπ =f =f RR if i i 1 0 1 i i 2 0 2 case-controlsamplescanbeused,butthisleadstoawasteof X =2,wheref ,f andf arethepenetrancesofthedisease i 0 1 2 time and data. An illustration of the first strategy is Fregene model,andRR andRR aretherelativerisks. 1 2 [18]. The second strategy makes it possible to control these It is obviously easy to simulate under H1 using the inde- numbers by first fixing them, but requires generating haplo- pendentBernoullidistributionB(π )foreachindividual. Un- i typesforeachsimulation. Thewidely-usedsimulatorHapgen fortunately, it is quite unlikely that when doing so the case- [20]implementsthesecondstrategy. control design constraint C = {(cid:80)n Y = n } will be ful- i=1 i 1 Tothisaim,weproposeanewmethodwhichrandomlyas- filled. signs n cases and n controls conditional on n = n +n For this reason, the standard strategy (i.e. Hapgen) con- 1 0 0 1 genotypes and a given disease model. Our idea is to affect sistsinsimulatinggenotypesconditionalonphenotypesusing 2 Bayes’formula: stationary (this stage is called burn-in). In order to obtain a set of independent configurations under H1, one can either P(Y |X )P(X ) π P(X ) P(X |Y )= i i i = i i . repeat the burn-in phase of the algorithm as many times as i i (cid:80) P(Y |X =x)P(X =x) P(Y ) x i i i i necessary, or use a sequence of pseudo-independent samples The advantage of this approach is that constraint C is au- by picking phenotype configurations after the burn-in once tomatically fulfilled, whereas the major drawback is that a everyfixednumberofiterations. genotype distribution P(X ) must be introduced. In order to The main advantage of the MCMC approach is that it will i berealistic,thisimposessophisticatedmodelingusinghaplo- eventually work regardless of the probability of event C. Its typedataandrecombinationrates. drawback is that the number of iterations for burn-in and Asanalternative,wesuggestseveralwaystosimulatephe- pseudo-independencenecessaryforgoodmixingincreasesat notypesunderH1conditionalonthegenotypes(whichthere- least linearly with the total number n of individuals. More- foreremainuntouched)whilerespectingconstraintC. Todo over,thecontrolforstationarityandpseudo-independenceis so, we introduce the notation Z = (cid:80)j Y (with Z = 0 by delicateandthepracticalchoiceofburn-inandindependence j i=1 i 0 convention)sothattheconstraintbecomesC ={Z =n }. thresholdsusuallyrequirestediouscalibrationwork. n 1 Reject Backwardsampling The first approach is straightforward: 1) draw (Y ) Our proposed alternative is to turn to exact sampling by i i=1...n using independent Bernoulli distributions; 2) if constraint C introducing the backward quantities Bi(m) = P(C|Zi = m) is fulfilled, retain (Y ) as a valid sample; if not, simply which can be computed recursively like the forward quanti- i i=1...n reject it and return to 1). The problem with this algorithm tiesdefinedabove. is that samples will be rejected at rate 1−P(C). The natural StartingfromBn(m)beingequalto0,exceptforBn(n1)= questionisthen: how(un-)likelyiseventC? 1,wegetthefollowingrecursiveformulafor1(cid:54)i(cid:54)n: P(C) can be computed recursively by introducing the for- B (m)=π B (m+1)+(1−π )B (m). (2) ward quantities: F (m) = P(Z = m). Starting with F (m) i−1 i i i i i i 0 being equal to 0, except for F0(0) = 1, we get the following OneshouldnotethatweobtainP(C)=B0(0)intheprocess. recursiveformulafor1(cid:54)i(cid:54)n: Then we can sample phenotypes Y sequentially with the i followingprobability: F (m)=F (m−1)π +F (m)(1−π ). (1) i i−1 i i−1 i It follows that the probability of the constraint is P(C) = P(Y =1|Z =m,C)= πiBi(m+1). (3) i i−1 B (m) P(Z =n )=F (n ). i−1 n 1 n 1 Thisapproachiseasytounderstandandimplement,howe- Since this expression depends on Z , the corresponding i−1 ver it usually has a very low rate of success. This is the case probability obviously depends on the number of cases seen when the design is large and the π s are low (for instance in i by i−1. In mathematical terms, the sequence (Y ,Z ) i i i=1...n the case of a small prevalence). In practice, its use will be isaheterogeneousMarkovchain. limitedtosmalltoy-examples. H0simulations MCMC H0 simulations are simply performed by permutating In order to overcome this drawback, an interesting idea is the phenotypes Y s. A simple way to do this is to uni- i to turn to Markov Chain Monte Carlo (MCMC) techniques formly choose a permutation σ of {1,...,n} by performing suchastheMetropolis-Hastingsclassofsamplingalgorithms a O(nlogn) quicksort algorithm on a sample of n indepen- [23]. Starting from a configuration of phenotypes fulfilling dent random variables with a uniform distribution on [0,1]. theconstraint(e.g. withtheobservedphenotypes),alternate For example, this can be achieved through the sample com- thefollowingtwosteps: 1)proposalmove: selecttwoindivid- mandinR[24]. Alternatively,onemayusewaffectwiththe uals i and j such that Y = 1 and Y = 0 and then exchange i j sameprobabilityπ forallindividuals. i their phenotypes (Y = 0 and Y = 1); 2) accept this move i j withrateαgivenby: Toy-exampledataset For validation purposes, we first considered a toy-example (1−π )π α= i j. dataset with p = 1 SNP. The n genotypes X have the follo- π (1−π ) i i j wing distribution: 80% have genotype 0, 15% genotype 1, It can be shown that the sequence of phenotype configura- and 5% genotype 2. With n being a multiple of 20, these tions generated by this algorithm is a Markov chain whose counts are always integers. For our disease model, we con- stationary distribution is our targeted constrained distribu- sideredthefollowingrelativerisks: RR =1.5andRR =2.0 1 2 tion H1. In practice, the first configurations which are ge- (additiveeffectof0.5). Wehenceobtainedthefollowingpen- nerated must be discarded since the Markov chain is not yet etrances: f , f = 1.5f and f = 2.0f . In order to explore 0 1 0 2 0 3 together with phenotypes generated as follows. In the sec- n f0 P(C) Rejection MCMC Backward ond dataset (Hapgen - H0 hypothesis), the phenotypes un- 20 0.2 4.5·10−3 0.4s 38s 0.05s dertheH0hypothesisweregeneratedthroughthemethodof 20 0.1 1.7·10−5 1.5min 38s 0.05s permutations. In the third dataset (waffect - H1 hypothe- 20 0.07 6.7·10−7 38.5min 38s 0.05s sis) the phenotypes were affected with waffect . Finally, in 20 0.05 2.9·10−8 11.2h 38s 0.1s the fourth dataset (waffect - H0 hypothesis), we obtained 40 0.2 8.2·10−5 17.4s 1.2min 0.1s the replicates by affecting the phenotypes after specifying in 100 0.2 8.7·10−10 NA 3.2min 0.2s waffectauniformprobabilityπi accrossallindividuals(this 100 0.1 5.8·10−22 NA 3.2min 0.2s is equivalent to permuting the phenotypes). Besides, 1,000 100 0.01 1.1·10−69 NA 3.2min 0.2s replicates were simulated for each of the four datasets and under each condition (MAF range and relative risks). In the end, 64,000 GWA studies were performed (2 MAF ranges × Table1: Runningtimeforgenerating100replicates. 8 (RR , RR ) pairs × 4,000 datasets). Moreover, the com- 1 2 parison of the AUCs obtained for Hapgen and waffect were performed twice, because we used alternatively two distinct the relevance and limits of each of our three approaches, we statisticsofassociationasthediscriminationcriterionforthe consideredawiderangeofvaluesfornandf0. two-classclassifier. Hapgensimulations 1000Genomesdataset In contrast with simulations under H1, simulations under Thedatasetconsistsofthegenotypesofn=629individuals H0areexpectedtoprovidefewstatisticallysignificantresults. (n0 = 315 controls, n1 = 314 cases) from the 1000 Genomes Thiscontrastcanbeobservedthroughabinary(ortwo-class) Project [22]. We focused on the first 100,000 SNPs on chro- classifiersystem. Regardingthisaspect,wecomparedthebe- mosome X. In the pretreatment stage, we filtered out all the havior of waffect with the one of the widely used simulator SNPs with Minor Allele Frequency (MAF) less than or equal Hapgen [25] under various conditions. The comparison re- to5%,endingupwithn=8,048SNPs. liedontheso-calledReceiverOperatingCharacteristic(ROC) We arbitrarily considered a disease model with two inter- curve, which is a graphical plot of the true positive classi- acting disease SNPs. The two SNPs respectively have the fication rate (sensitivity) as a function of the false positive base-pair positions 627,641 and 1,986,325, MAFs 0.26 and rate (specificity), for a continuum of values of a given dis- 0.23, and display no linkage disequilibrium (r2 = 0.02). We crimination criterion. The performance of each simulator is considered an additive effect β > 0 and an epistatic effect η summarized using the area under the ROC curve (AUC): the suchthat: larger the AUC, the more performant the simulator in gene-  1.0+βX1 ifX2 =0 rating H1 data contrasting with H0 data (more details about  i i π =f × 1.0+βX2 ifX1 =0 ROCcurvescanbefoundbelow). i 0 i i  1.0+η+β(X1+X2) ifX1X2 (cid:54)=0 i i i i ThreeinputparametersmustbespecifiedinHapgen: i)the range of the Minor Allele Frequency (MAF) of the simulated whereX1 andX2 correspondtothegenotypeofindividuali i i susceptibility SNP; here we considered [0.2,0.3] or [0.1,0.2]; atthetwosusceptibilitySNPs. ii) the disease prevalence, which we set to 0.01; iii) and the severity of the disease expressed through the relative risks Statisticsofassociation RR =f /f forindividualsheterozygousatthediseaseSNP, The association signal of each SNP (single marker analy- 1 1 0 and RR = f /f for homozygous individuals (rare allele). sis) is computed through the PLINK software [26] using the 2 2 0 We considered a total of n = 629 subjects (n = 315 con- trend statistic. We denote p the p-value of SNP j. In order 0 j trols, n = 314 cases) and p = 9,579 SNPs in the region to avoid the multi-testing issue, we considered the following 1 delimited by loci 558,390 and 13,930,000 on chromosome singlerealvaluedstatistic: 1 (Reference: HapMap files [19], CEU population reference, http://hapmap.ncbi.nlm.nih.gov/). Sρ =max−log10pj j∈Jρ For each condition, the penetrances f , f and f used 0 1 2 with waffect for comparison purposes were obtained per- where J denotes the subset of SNPs such that the distance ρ forming several simulations using Hapgen and then averag- betweentheirlociandthelocusofadiseaseSNPislessthan ingthecorrespondingpenetrances. Fourdatasetsweregene- aradiusρ rated. Eachwasthensubmittedtoagenome-wideassociation Forexample,S correspondstotakingthebestp-valuep ∞ j study. In the first dataset (Hapgen - H1 hypothesis), Hapgen in negative (decimal) log-scale on the whole dataset. With provided the genotypes together with the phenotypes. All ρ = 5 kb, we consider the best p-value of the SNPs within 5 the other datasets consisted in the aforesaid genotypic data kb of one of the two susceptibility SNPs. The radius part of 4 n f Rejection MCMC Backward 0 20 0.2 0.60 0.59 0.61 [0.53,0.67] [0.52,0.66] [0.54,0.68] 0 20 0.1 0.59 0.59 0.58 1. [0.52,0.66] [0.52,0.66] [0.51,0.65] 20 0.07 0.62 0.56 0.56 q q q q q [0.55,0.69] [0.49,0.63] [0.49,0.63] 9 0. q q 20 0.05 0.44 0.58 0.53 [0.37,0.51] [0.51,0.65] [0.47,0.60] 40 0.2 0.58 0.62 0.59 [0.50,0.65] [0.54,0.69] [0.52,0.67] AUC 0.8 100 0.2 NA 0.75 0.72 [0.68,0.81] [0.65,0.79] 100 0.1 NA 0.64 0.68 [0.57,0.72] [0.61,0.75] 0.7 q 100 0.01 NA 0.65 0.59 q Hapgen [0.58,0.73] [0.51,0.67] Waffect 6 Table2: AUCand95%confidenceintervals,100replicates. 0. 0kb 1kb 5kb 10kb 20kb 50kb 100kb inf radius the statistic is somewhat unusual since it takes advantage of informationwhichisrarelyavailableinpractice: thelocation Figure1: Comparisonoftheareaunderthecurves(AUCs)for of the disease SNPs. However, in our simulation framework, Hapgen and Waffect - 1,000 replicates under H1 and 1,000 this information helped us discuss the power of the design replicates under H0, respectively, for MAF in [0.2,0.3], ad- dependingontheapriorilocationofthesusceptibilitySNPs. ditive model, RR1 = 1.6, RR2 = 2.2. The 95% confidence Fromageneralpointofview,startingwithnoapriorileadsto intervalsareindicated. achoiceofρ=∞;afterasuccessfullinkageanalysis,ρ=100 kbmightbemorerelevant,andifacandidategenehasbeen identified,ρ=5kbcouldbeagoodchoice. Results ROCcurvesandAUC Validation: toy-exampledataset In order to measure the performance of a given statistic In Table 1 are depicted the execution times for generating for a given design and H1 model, it is natural to compare 100 replicates through the rejection, MCMC and backward the sensitivity and specificity of the testing framework for a sampling algorithms on the toy-example dataset, together giventhresholdofsignificance. Forexample,aglobalthresh- withthecorrespondingprobabilitiesP(C)ofrandomlydraw- old of 5% is commonly accepted as a reasonable choice. It ing a complete configuration of phenotypes with exactly n1 might be interesting to avoid doing this arbitrary choice by cases. Different values of n and f0 are considered. P(C) dra- studying instead the Receiver Operator Characteristic (ROC) maticallydecreaseswhenngrowsandf0becomessmaller. As curve which is nothing but a graphical representation of the aconsequence,therunningtimeoftherejectionalgorithmis specificities and sensitivities that we can obtain for all possi- already important (more than 10 hours) even for n as small blevaluesofthethresholdofsignificance[27]. as 40 and f0 as large as 0.2. The rejection sampling algo- rithmwasnotrunforgreatervaluesofnasthecorresponding TheROCcurveitselfcanthenbefurthersummarizedbythe probability of a successful finding is negligible. The MCMC AreaUndertheCurve(AUC)whichcanbequalitativelyinter- andbackwardsamplingalgorithmshaveshortrunningtimes pretedasfollows: AUC(cid:54)0.6means“fail”;0.6<AUC(cid:54)0.70 whichgrowwithn(verygentlyforthebackwardalgorithm), means “poor”; 0.7 < AUC (cid:54) 0.80 means “fair”; 0.8 < AUC (cid:54) and,mostimportantly,whichdonotdependonP(C). Forthe 0.9means“good”;0.9<AUC(cid:54)1.0means“excellent”. MCMCalgorithm,weconsideredaburn-inof105×n. The AUC can easily be estimated from two samples of the The Area Under the Curves and the corresponding 95% statistic: one sample under H0, and the other under H1. All confidence intervals computed on the toy-example dataset ROCcurvesandAUCcomputations(includingconfidencein- replicates are shown in Table 2. The values found with the tervals)wereperformedusingtheRpackagepROC[28]. threemethodsareconsistent. 5 β AUC[95% CI] f0 = 0.1, Additive effect = 0.3, Epistasis = 0.3 0.1 0.62[0.55,0.70] 0 1. 0.2 0.72[0.64,0.79] 0.3 0.80[0.74,0.86] 0.4 0.90[0.86,0.94] 8 0. Table 3: AUCs at different values β of additive effect. Fixed 0.6 values: f0 = 0.1, epistasis β = 0.3, radius of the candidate nsitivity regionρ=5kb. e S 4 η AUC[95% CI] 0. 0 0.74[0.67,0.81] Radius and AUC 0.3 0.80[0.74,0.86] 0.2 01kkbb,, AAUUCC == 00..9856 ((00..982−−00.9.918) ) 0.6 0.85[0.80,0.91] 5kb, AUC = 0.8 (0.74−0.86) 0.9 0.88[0.83,0.93] 20kb, AUC = 0.69 (0.61−0.76) 100kb, AUC = 0.57 (0.49−0.65) 0.0 infkb, AUC = 0.49 (0.41−0.57) Table4: AUCsatdifferentvaluesη ofepistasis. Fixedvalues: 1.0 0.8 0.6 0.4 0.2 0.0 f0 = 0.1, additive effect β = 0.3, radius of the candidate Specificity regionρ=5kb. Figure 2: ROC curves and AUCs at different values ρ of the increasesastheregionaroundthetwosusceptibilitySNPsbe- radiusoftheintervalscenteredinthetwodiseaseSNPs. ing investigated gets smaller. In particular, the performance is good for ρ (cid:54) 5 kb. This roughly corresponds to the size of a gene. These results show that given the design and the Validation: Hapgensimulations disease model under consideration, the statistic which con- First, it has to be noted that even if the number of SNPs sists in simply taking the best p-value is successful in detect- specified in Hapgen is constant across the replicates, the ob- ing a positive signal only in presence of a biological a pri- served number of SNPs varies across these replicates. More- ori,therebynarrowingtheinvestigationtothecandidategene over, the disease SNP’s MAF and the relative risks observed level. inthesimulateddatafluctuatefromonereplicatetoanother We fixed a candidate region corresponding to the radius (seetheSupplementaryMaterialforexplanation). ρ = 5kbandinvestigatedthedependencyoftheAUConthe Figure1illustratesthecasewhentheMAFrangeis[0.2,0.3] following parameters: epistasis η, additive effect β, probabi- and the genetic model is additive with RR and RR values 1 2 lityf ofbeingacaseinabsenceoftherarealleles,andtotal respectively set to 1.6 and 2.2. Despite the fact that Hapgen 0 numbernofindividuals. andourapproacharenotbasedonthesameexactmodel,we Notunexpectedly,theAUCgrowswiththeadditiveeffectβ canseeaverygoodcorrelationbetweenthetwoapproaches, (Table3). Moresurprisingly,italsogrowswhentheepistatic suggestingthattheygivesimilarresults. effectincreases(Table4)despitethefactthatourstatisticS In our study (629 subjects, 9,579 SNPs specified in Hap- ρ onlyusessimplemarkerstatistics. gen,PowerEdgeR900XEONX74602.6GHz,RAM128GB), thegenerationandprocessingof1,000replicatesrequired75 We then focused on the design of the GWA study itself minutesorsousingthebackwardsamplingalgorithm(under by studying the effect of the total number of individuals n H1 or H0) or Hapgen under H0. In contrast, under H1 with on the AUC. This was simply done by doubling (tripling, Hapgen,110hoursorsowerenecessarytogenerateandpro- quadrupling) the original dataset before applying our di- cessdatawiththesamedimension. sease model. In the case of our configuration of reference More comprehensive comparisons with Hapgen are avail- η =0.3,β =0.3,f0 =0.1andρ=5kb,weobservedthatthe ableintheSupplementaryMaterial. performanceisexcellentwhenthepopulationistwiceaslarge as the original one (AUC = 0.94 [0.90,0.97] for n = 1258, Application: Powerstudyonthe1000Genomesdataset AUC = 0.80 [0.74,0.86] for n = 629). This gain in power is The ROC curves of the statistics S on the 1000 Genomes evenmoredramaticwhenalltheSNPsaretakenintoaccount ρ Project dataset for different values of the radius ρ are dis- (ρ=∞): theAUCdoublesinvaluewhenpassingfrom629to played in Figure 2. The AUC is very low (95% CI = 1258individuals(seeTable5). [0.41,0.57]) when considering all SNPs (ρ = ∞). The AUC Finally, we investigated the effect of f , the probability of 0 6 n AUC[95% CI] our approach is a valid alternative to this software. It also 629 0.49[0.41,0.57] sheds light on the major differences between the two ap- 1258 0.78[0.71,0.84] proaches. First,inordertoyieldarealisticmodelforsimula- 1887 0.92[0.88,0.96] tinggenotypes,Hapgenrequiresadditionalinformationsuch 2516 0.93[0.90,0.97] asHapMapfrequenciesandrecombinationrateswhereasour approach only needs the original genotypes. Regarding our simulations performed with Hapgen and waffect, running Table 5: AUCs for different sample sizes and on the whole thenewestHapgenversionwouldhaveentailednochangein dataset. Fixed values: epistasis η = 0.3, additive effect β = theresults: thenewestHapgenversionisanextensionofthe 0.3,f0 =0.1. former; it is not an improvement (at least regarding single susceptibility SNP simulation). Indeed, in its initial version, f0 AUC[95% CI] Hapgen limits the disease model to only one susceptibility 0.001 0.73[0.66,0.80] SNP. The most recent version can now simulate multiple di- 0.01 0.71[0.64,0.78] seaseSNPs,undertheassumptionthateachdiseaseSNPacts 0.1 0.80[0.74,0.86] independently and that all such SNPs are in Hardy-Weinberg 0.25 0.94[0.90,0.97] equilibrium, which are two strong constraints. Besides, this novelsoftware,unfortunatelyonlyavailableinthe64bitver- sion,requirestheuseofanRpackagetosimulateinteractions Table 6: AUCs at different values of penetrance f0. Fixed betweenthediseaseSNPs. Incontrast,simulatinganydisease values: epistasisη =0.3,additiveeffectβ =0.3,radiusofthe model is straightforward with our approach, as long as it re- candidateregionρ=5kb. sultsinanindividualprobabilityforeachindividualtobeaf- fected. Forexample,wecanconsidertwoormoresusceptibil- ity SNPs, and easily add epistatic effects, Gene-Environment being affected without any exposure, in the design of the interactions,thecontributionofrarevariants,etc. Moreover, GWAstudy(notethatforsmallvalues,f isclosetothepreva- 0 duetothemultipleconstraintsitmusttakeintoaccount,Hap- lence). WecanseeinTable6thatsmallvaluesoff resultin 0 gen only respects the specified model on average. As a con- smaller AUCs (i.e. AUC = 0.73 [0.66,0.80] for f = 0.001), 0 sequence, the effective MAF and relative risks can vary from while large values of f result in much better results (e.g. 0 oneruntoanother. Besides, selectingforeachHapgenrepli- AUC=0.94[0.90,0.97]forf =0.25). cate a different disease SNP entails variations in the number ofSNPsavailablefortheassociationtest(seetheSupplemen- Discussion tary Material for more details). Finally, since our approach only generates a fraction of the data (the phenotypes), it is outstandingly faster than Hapgen. Furthermore, since with Validation our approach the genotype remains untouched, any numeri- The threealgorithms we suggestedgive similar resultsbut cally intensive analysis which is performed on the genotype their performances are quite different. The rejection algo- table(g.e. LDcomputations)needstobeperformedonlyonce rithm has a complexity of O(N/P(C)) to obtain N samples. whilewithHapgenithastobedoneforeachreplicate. This prevents its use in practical cases when P(C) can easily reach 10−50 or worse. The MCMC approach has the advan- Effects tage of removing this dependence, and is simple to imple- In our application study, we showed how our strategy can ment. The drawback is that the burn-in parameter requires some calibration work (we suggest to use burn-in = 105n). beusedtoinvestigatevariouseffectsontheperformanceofa GWA study. We first pointed out that with the suggested de- Moreover this heavy numerical method is rather slow. In sign, our modestdisease model(β = η = 0.3, f = 0.1), and ourstudy,thebackwardsamplingapproachprovedtobethe 0 the analysis we chose (simple marker using PLINK), the sig- fastest one, dramatically outperforming the MCMC alterna- nalisonlystatisticallydetectablebylimitingtheregiontobe tive (i.e. 500 times faster). From a theoretical point of view, investigated to a 5 kb radius around the susceptibility SNPs. the backward sampling algorithm has a space complexity of This clearly shows that the present design is practical only O(n ×n),andatimecomplexityofO(n ×n+N×n)where 1 1 if previous work (e.g. linkage analysis) suggests candidate N is the desired sample size. Our C implementation makes genes. it possible to generate a configuration of n = 20,000 phe- notypes with n = 10,000 cases in 0.2 seconds on a 1.86 Gz Then we proved that the AUC increases when either the 1 workstation(XeonE5320,8GoRAM,Linux2.6.35). additiveeffect(β)ortheepistaticeffect(η)increases. While theresultfortheadditiveeffectisnotsurprising,theresultfor ComparisonwithHapgen theepistaticeffectisworthcommentingupon. Indeed,inour Our comparison with Hapgen clearly demonstrates that application,wedeliberatelystucktoaclassicalsimplemarker 7 analysis (PLINK using the trend test) which is not designed (including several SNPs, epistasis, covariates, rare variants, to detect epistatic effects. However, our study showed that etc.). Our approaches are available in an R package called the corresponding marginal effects can improve the overall waffect (“double-u affect”, for weighted affectations). The statistical performance of the analysis even when no specific betaversioncanbedownloadedfrom effortforthedetectionofepistasisisperformed. Itwouldbe www.math-info.univ-paris5.fr/~vperduca. interesting to do further research to compare simple marker analysistomorecomplexanalysisespeciallydesignedtotake Authors’contributions intoaccountepistaticeffects. GN suggested the original idea and the backward sampling Design algorithm. GN and VP implemented the R package. RM and Wefinallybrieflyinvestigatedtheinfluenceofthedesignof CS wrote the state-of-the-art on the subject in the Introduc- theGWAstudyonitsperformance. Ascouldbeexpected,the tion. CS took in charge the comparison with Hapgen and total number n of individuals plays a critical role. It might theeditingofthecorrespondingsupplementarymaterial. VP be interesting however to refine this analysis by also consid- performedalltheothersimulationsandeditedtheAppendix. ering different ratios n /n . This is left for further investiga- TheeditingofthepaperwasdonebyGNandVP.Allauthors 1 0 tion. We also showed that for a fixed design (fixed number readandapprovedthefinalmanuscript. ofcasesandcontrols)theprevalenceofthedisease(through f0, which is closely related) has an unexpected effect on the Appendix performance: higher prevalence gives better results. More investigations might be necessary to understand the reason In this section we give the proofs of the mathematical re- behindthisobservation. sultsfromtheMaterialandMethodssection,namelyofequa- tions (1), (2) and (3). Recall that Z = (cid:80)j Y is the j i=1 i Extensions numberofcasesamongindividuals1,...,j andthatwedeal with constraint C = {Z = n }. The forward and back- One should note that our approach can be extended in se- n 1 ward quantities are the probabilities F (m) = P(Z = m) veral directions. First, we can easily consider more than two i i and B (m) = P(C|Z = m). Let V be the set of variables classes, thus complexifying constraint C. In this case, the re- i i {Y ,...,Y ,Z ,...,Z }. jection and MCMC algorithms remain exactly the same. For 1 n 1 n The key property is that the conditional dependencies backward sampling, one should first affect one class against amongallthevariablesdeterminethefollowingfactorization the others, and then perform the affectation recursively on ofthejointprobabilitydistribution the remaining unaffected classes. The resulting complexity is O(n×(K −1)) for the affectation of n individuals into K (cid:89) P(V)= P(Z |Z ,Y )P(Y ), (4) classes. As explained above, we can also consider sophisti- j j−1 j j cated genetic models that take into account additional cova- j∈I riates such as environmental exposure or rare variants. Fi- whereI ={1,...,n}isthesetofalltheindividualsandZ = 0 nally, the extension of our approach to other fields is quite 0byconvention. straightforward. For example, in the analysis of gene ex- pression,ourapproachmightbeusedtoaffectsamplestatus Theorem1. Foreachindividuali=1,...,n: (e.g. cancer or healthy) conditional on the level of expres- sion rather than generating expressions through parametric P(Zi =m,C)=Fi(m)Bi(m). (5) ornon-parametricquestionablemodelsasisusuallydone. Moreover Conclusions P(Y =1,Z =m+1,Z =m,C)= In this paper, we present an alternative to classical simu- i i i−1 lations under H1 in GWA studies. The idea is to generate F (m)π B (m+1), (6) i−1 i i thephenotypesconditionalontheexistinggenotypeswithre- specttothestudydesign(numberofcases). Forthatpurpose, wesuggestthreealgorithmsincludingthebackwardsampling P(Yi =0,Zi =m,Zi−1 =m,C)= which mimics Hidden Markov Models to provide a fast sam- F (m)(1−π )B (m). (7) i−1 i i pling conditional on the constraint. Our study shows that our algorithms are valid and that their results are consistent Proof. We will only prove Eq. (5): similar arguments hold with reference software such as Hapgen. Moreover, our ap- for equations (6) and (7). The marginal probability P(Z = i proach has many advantages: it is much faster; it does not m,C)isobtainedfromthejointprobabilitydistributionP(V) need any genotyping model (genotypes remain untouched); multiplyingitbytheindicatorfunction1 (bydefinition1 = C C it makes it possible to consider any complex genetic model 1ifandonlyifC istrue)andsummingoutallthevariablesin 8 V \{Z } = {Z ,...,Z ,Z ,...,Z ,Y ,...,Y }. Because References i 1 i−1 i+1 n 1 n ofEq. (4),wehave [1] A. P. Morris and L. R. Cardon. Handbook of statisti- (cid:88) (cid:89) P(Z =m,C)= 1 Q (Z ,Z ,Y ), (8) cal genetics, volume 2, chapter Whole genome associa- i C j j j−1 j V\{Zi} j∈I tion,pages1238–1263. WileyInterscience,3rdedition, where for convenience we denote Q (Z ,Z ,Y ) = 2007. j j j−1 j P(Z |Z ,Y )P(Y ). j j−1 j j [2] L.A.Hindorff,P.Sethupathy,H.A.Junkins,E.M.Ramos, Nowconsiderthesets J.P.Mehta,F.S.Collins,andT.A.Manolio. Potentialeti- V :={Z ,Y ,...,Z ,Y }, ologic and functional implications of genome-wide as- 1:i 1 1 i i sociationlociforhumandiseasesandtraits. Proceedings V :={Z ,Y ,...,Z ,Y }. (i+1):n i+1 i+1 n n of the National Academy of Sciences of the United States V1:i∪V(i+1):n isapartitionofV andV \{Zi}=V1:j−{Zi}∪ ofAmerica,106(23):9362–9367,June2009. V . (i+1):n NotethatEq. (8)isequalto [3] C.C. Spencer, Z. Su, P. Donnelly, and J. Marchini. De- signing genome-wide association studies: sample size,   power, imputation, and the choice of genotyping chip. (cid:88) (cid:89) Qj (cid:88) 1C (cid:89) Qj= PLoSGenetics,5(5):e1000477+,May2009. V1:i\{Zi}1≤j≤i V(i+1):n i+1≤j≤n [4] Q.ZhangandJ.Ott.HandbookonAnalyzingHumanGe-     netic Data, chapter Multiple comparison/testing issues, (cid:88) (cid:89) (cid:88) (cid:89)  Qj· 1C Qj, pages277–287. SpringerBerlinHeidelberg,2010. V1:i\{Zi}1≤j≤i V(i+1):n i+1≤j≤n [5] Y.S.Aulchenko,S.Ripke,A.Isaacs,andC.M.vanDuijn. wherethelastequalityholdsbecausetheonlyvariableshared Genabel: an R library for genome-wide association betweenthetwofactorsisZi whichissettobeZi =mfrom analysis. Bioinformatics,23(10):1294–1296,2007. thebeginning. Itisnoweasytoseethatthetwofactorsinthe productaboveareF (m)andB (m)respectively. [6] B.L. Browning. PRESTO: rapid calculation of order i i statistic distributions and multiple-testing adjusted p- Corollary 2. The equations (6) and (7) make it possible to values via permutation for one and two-stage genetic computetheforwardandbackwardquantitiesrecursively: association studies. BMC Bioinformatics, 13(9):309, F (m)=F (m−1)π +F (m)(1−π ), 2008. i i−1 i i−1 i B (m)=π B (m+1)+(1−π )B (m). i−1 i i i i [7] J.R. Gonzalez, L. Armengol, X. Sole, E. Guino, J.M. Proof. We will prove only the recursive formula for B , a si- Mercader, X. Estivill, and V. Moreno. SNPassoc: an R i milarargumentholdsforF . Notethat package to perform whole genome association studies. i Bioinformatics,23:644–5,2007. P(Z =m,C)=F (m)π B (m+1) i i i+1 i+1 [8] K.S.Pollard,S.Dudoit,andM.J.vanderLaan. Multiple +F (m)(1−π )B (m)=F (m)B (m) i i+1 i+1 i i testing procedures: the multtest package and applica- By dividing by F (m), we obtain the iterative formula for tions to genomics. In R. C. Gentleman, V. J. Carey, W. i thebackwardquantities. Huber, R. Irizarry, and S. Dudoit (eds.), Bioinformatics and Computational Biology Solutions Using R and Bio- Corollary3. WecannowproveEq. (3): conductor, Springer, New York, Chapter 15, 2005. Bio- π B (m+1) conductorRpackage. P(Y =1|Z =m,C)= i i , i i−1 B (m) i−1 [9] S.Purcell,B.Neale,K.Todd-Brown,L.Thomas,M.A.R. foreachindividuali=1,...,n. Ferreira,D.Bender,J.Maller,P.Sklar,P.I.W.deBakker, Proof. Apply the definition of conditional probability and di- M.J. Daly, and P.C. Sham. PLINK: a toolset for whole- videEq. (6)byEq. (5). genomeassociationandpopulation-basedlinkageanal- ysis. American Journal of Human Genetics, page 81, Corollary4. 2007. (cid:88) P(Y =1|C)∝ F (m)π B (m+1), i i i i [10] G.Lettre,C.Lange,andJ.N.Hirschhorn. Geneticmodel m testing and statistical power in population-based asso- (cid:88) P(Y =0|C)∝ F (m)(1−π )B (m). ciation studies of quantitative traits. Genetic Epidemiol- 1 i−1 i i m ogy,31(4):358–362,2007. 9 [11] R.J.Klein. Poweranalysisforgenome-wideassociation [25] J.Marchini,B.Howie,S.Myers,G.McVean,andP.Don- studies. BMCGenetics,8(58),2007. nelly. A new multipoint method for genome-wide as- sociation studies via imputation of genotypes. Nature [12] I.Menashe,P.S.Rosenberg,andB.E.Chen. PGA:power Genetics,39:906–913,2007. calculatorforcase-controlgeneticassociationanalyses. BMCGenetics,9(1):36,2008. [26] S.Purcell,B.Neale,K.Todd-Brown,L.Thomas,M.A.R. Ferreira,D.Bender,J.Maller,P.Sklar,P.I.W.deBakker, [13] J.P. Steibel and G.R. Abecasis. QpowR: Interactive M.J. Daly, and P.C. Sham. PLINK: a toolset for whole- powercalculatorfortwo-stagegeneticassociationstud- genomeassociationandpopulation-basedlinkageanal- iesofquantitativetraits,2008. ysis. AmericanJournalofHumanGenetics,81,2007. [14] B. Han, H.M. Kang, and E. Eskin. Rapid and ac- [27] C. E. Metz. Basic principles of roc analysis. Sem. Nuc. curate multiple testing correction and power estima- Med.,8,1978. tion for millions of correlated markers. PLoS Genetics, 5(4):e1000456,2009. [28] X.Robin,N.Turck,A.Hainard,N.Tiberti,F.Lisacek,J.- C.Sanchez,andM.Mu¨ller.pROC:anopen-sourcepack- [15] K.N. Conneely and M. Boehnke. So many correlated age for R and S+ to analyze and compare ROC curves. tests, so little time! rapid adjustment of p-values for BMCBioinformatics,12:77,2011. multiple correlated tests. American Journal of Human Genetics,81:1158–1168,2007. [16] D. Lin. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformat- ics,6:781–787,2005. [17] S.R. Seaman and B. Mu¨ller-Myhsok. Rapid simulation of p values for product methods and multiple-testing adjustmentsinassociationstudies. AmericanJournalof HumanGenetics,76:399–408,2005. [18] M.C. Hyam, C. Hoggart, P. O’Reilly, J. Whittaker, M. De Iorio, and D. Balding. Fregene: Simulation of realistic sequence-level data in populations and as- certained samples. BMC Bioinformatics, 9(1):364+, september2008. [19] The International HapMap Consortium. A second gen- erationhumanhaplotypemapofover3.1millionsnps. Nature,449(7164):851–861,October2007. [20] Z.Su,J.Marchini,andP.Donnelly. Hapgenv2,2010. [21] B.PengandC.I.Amos. Forward-timesimulationofreal- isticsamplesforgenome-wideassociationstudies. BMC bioinformatics,11(1):442+,September2010. [22] The 1000 Genomes Project Consortium. A map of hu- man genome variation from population-scale sequenc- ing. Nature,467:1061–1073,2010. [23] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. MarkovChainMonteCarloinPractice. Chapman&Hall, 1996. [24] RDevelopmentCoreTeam. R:ALanguageandEnviron- mentforStatisticalComputing.RFoundationforStatisti- calComputing,Vienna,Austria,2011. ISBN3-900051- 07-0. 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.