ebook img

Meta-analysis of haplotype-association studies: comparison of methods and empirical evaluation of the literature. PDF

0.45 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Meta-analysis of haplotype-association studies: comparison of methods and empirical evaluation of the literature.

BagosBMCGenetics2011,12:8 http://www.biomedcentral.com/1471-2156/12/8 METHODOLOGY ARTICLE Open Access Meta-analysis of haplotype-association studies: comparison of methods and empirical evaluation of the literature Pantelis G Bagos Abstract Background: Meta-analysis is a popular methodology in several fields of medical research, including genetic association studies. However, the methods used for meta-analysis of association studies that report haplotypes have not been studied in detail. In this work, methods for performing meta-analysis of haplotype association studies are summarized, compared and presented in a unified framework along with an empirical evaluation of the literature. Results: We present multivariate methods that use summary-based data as well as methods that use binary and count data in a generalized linear mixed model framework (logistic regression, multinomial regression and Poisson regression). The methods presented here avoid the inflation of the type I error rate that could be the result of the traditional approach of comparing a haplotype against the remaining ones, whereas, they can be fitted using standard software. Moreover, formal global tests are presented for assessing the statistical significance of the overall association. Although the methods presented here assume that the haplotypes are directly observed, they can be easily extended to allow for such an uncertainty by weighting the haplotypes by their probability. Conclusions: An empirical evaluation of the published literature and a comparison against the meta-analyses that use single nucleotide polymorphisms, suggests that the studies reporting meta-analysis of haplotypes contain approximately half of the included studies and produce significant results twice more often. We show that this excess of statistically significant results, stems from the sub-optimal method of analysis used and, in approximately half of the cases, the statistical significance is refuted if the data are properly re-analyzed. Illustrative examples of code are given in Stata and it is anticipated that the methods developed in this work will be widely applied in the meta-analysis of haplotype association studies. Background studies [10], as well as for genetic association studies for The continuously increasing number of published gene- which specialized methodology has been developed disease association studies made imperative the need of [5,11-18]. collecting and synthesizing the available data [1,2]. The Most of the genetic association studies (and hence the statistical procedure with which data from multiple stu- meta-analyses derived from them) are performed using dies are synthesized is known as meta-analysis [3-5]. In single markers, usually Single Nucleotide Polymorph- meta-analysis, a set of original studies is synthesized and isms (SNPs). However, the SNP that is under investiga- the potential heterogeneity is explored using formal sta- tion is not always the true susceptibility allele. Instead, it tistical methods [3,4,6,7]. In the medical literature, may be a polymorphism which is in Linkage Disequili- meta-analysis was initially applied in the field of rando- brium (LD) with the unknown disease-causing locus mized clinical trials [8,9], but nowadays it is considered [19]. In such cases, the single marker tests may be a valuable tool for the combination of observational underpowered, depending on the degree of LD and the allele frequencies [20]. Haplotypes, which are the combi- Correspondence:[email protected] nation of closely linked alleles on a chromosome, are DepartmentofComputerScienceandBiomedicalInformatics,Universityof therefore important in the study of the genetic basis of CentralGreece,Papasiopoulou2-4,Lamia35100,Greece ©2011Bagos;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommons AttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,andreproductionin anymedium,providedtheoriginalworkisproperlycited. BagosBMCGenetics2011,12:8 Page2of16 http://www.biomedcentral.com/1471-2156/12/8 diseases and thus, they are extensively used [21,22]. The findings are interesting. Even though the methods pre- importance of studying haplotypes ranges from elucidat- sented in this work could be derived in a straightfor- ing the exact biological role played by neighbouring ward manner from extending previous works on amino-acids on the protein structure, to providing infor- multivariate meta-analysis [33-37], the majority of the mation about ancient ancestral chromosome segments published meta-analyses did not use optimal methods that harbour alleles influencing human traits [23]. More- for analyzing the data. Moreover, in several circum- over, haplotype association methods are considered to stances the results of some studies are shown to be be more powerful compared to single marker analyses severely flawed. The manuscript is organized as follows: [24,25], even though this is questioned by some Initially, the commonly used methods for haplotype ana- researchers [26]. lysis for a single study are reviewed in order to establish A major problem in haplotype analyses is that in order notation. Afterwards, the methods of meta-analysis are for the analysis to be performed we need to reconstruct presented. In particular, we present the standard method or infer the haplotypes, usually with an approach based of univariate meta-analysis and its limitations, which on missing data imputation [27-29]. This uncertainty in leads to a more powerful multivariate approach based imputing the haplotypes poses some problems in the on summary-data. Accordingly, a general framework analysis [30] that are to be discussed later in this work. based on generalized linear mixed models (GLMMs) is Nevertheless, studies that investigate the association of presented and the approaches based on logistic regres- haplotypes with diseases are increasingly being pub- sion, multinomial logistic regression and Poisson regres- lished (Figure 1), with an even more increasing rate sion are discussed. We also discuss continuous traits after 2003, when the HapMap project was initiated [31]. and details of the implementation of the models. Finally, This exponential increase follows the general pattern of we present the results of the empirical evaluation of the gene-disease association studies [1,2,32] and naturally, literature and compare the results reported in these ana- the obvious extension would be to use meta-analysis in lyses with the ones obtained using the methods devel- order to increase the power of individual studies and to oped here. resolve the reasons of heterogeneity and inconsistency. This work has two primary goals. First, to perform a Methods detailed literature search and an empirical evaluation of Methods for haplotype association the published studies that report meta-analyses of haplo- Let’s assume we have n biallelic markers that form a type associations; and second, to present a concise over- haplotype. If the alleles in position m (m = 1, 2... n) are view of the statistical methods that could and should be denoted by Am and Bm the possible haplotypes would be used in such meta-analyses. These two important issues r = 2n. In a case-control study, a cross-tabulation of were not previously studied in the literature and the haplotypes by disease status, that ignores the individuals and counts only the total number of haplotypes observed in the analysis, would result in data arranged in the form of a 2 × r contingency table (Table 1). This cross-tabulation is somehow simplistic since it assumes a multiplicative (co-dominant) model of inheritance [38]. However, it is the most commonly reported form of haplotype data and thus, it is more suitable for meta- analysis of published studies as we will discuss later. Assuming a binomial sampling scheme where fixed numbers of cases and controls are sampled indepen- dently, we can model the structure of the table using logistic regression methods where the status (case/ Table 1Cross-tabulation ofhaplotypes by disease status Haplotype(z) Cases(y=1) Controls(y=0) Figure1Agraphicalrepresentationoftheincreasingnumber j ofpublishedhaplotype-associationstudies.Asearchwas 1 89 183 performedinPubmedusingtheterms“haplotype”and“association” 2 14 26 from1997to2009.Eventhoughthereferencelistmayinclude 3 24 22 reviewarticles,methodologicalpapersorevenirrelevantworks,the 4 3 3 trendisobvious,especiallyafter2003whentheHapMapproject waspresented.ThesearchwasconductedduringDecember2009 Thehaplotypedataobtainedinacase-controlstudyon182caucasianwomen andthusthecountfor2009maybeanunderestimate. concerningtheassociationofp53haplotypeswithbreastcancer[108].The dataarepresentedintheformdescribedbyWallensteinandcoworkers[38]. BagosBMCGenetics2011,12:8 Page3of16 http://www.biomedcentral.com/1471-2156/12/8 control) is the dependent variable and the haplotypes variable is the counts and thus, the studies, the type of are treated as covariates. This corresponds to the so- haplotypesandthecase/controlstatusaretreatedasinde- called “prospective likelihood”, the likelihood based on pendentvariables. Log-linear models are widely used for the probability of the disease given the exposure. Thus, haplotype analysis, for instance,for detecting LD [41,42] we denote by πj = P(yj = 1) the underlying risk (i.e. the andforhaplotype-diseaseassociation[43,44].Thismodel probability of being a case) of a person carrying a single canbeformulatedintermsofaPoissonregressionmodel copy of the jth haplotype. A reasonable choice would be inthecontextofgeneralizedlinearmodels,as: to consider the most common haplotype (i.e. h ) as the 1 reference category and create r-1 dummy variables tak- ( ) ∑r ∑r ing values zj = 1 for haplotype j and 0 otherwise. This log nj =0 +0yj + ajzj + jzjyj (4) model can be formulated as: j=2 j=2 This is the standard saturated model for describing ( ) ( ) ∑r logit j =logit⎡⎣P yj =1|j ⎤⎦ =0 + jzj (1) the 2 × r contingency table of haplotypes by disease. j=2 The bj’s are the coefficients that correspond to the hap- lotype by disease interaction and are equivalent to those This model was proposed initially by Wallenstein and obtained by fitting the models in Eq. (1) and (2). It is co-workers and as we already mentioned, assumes a easily verified that the coefficients a’s and b’s are identi- multiplicative genetic model of inheritance [38]. More- cal across the three models. The overall hypothesis for over, the haplotypes are assumed known quantities, association (b = 0) can be tested by performing a multi- which may not always be the case (see below). variate Wald test using the estimated covariance matrix, Alternatively, assuming a multinomial sampling cov(b). Then, the test statistic (score) U = b’cov(b)-1b, scheme where the total sample size is considered fixed, will have asymptotically a c2 distribution on r-1 degrees a multinomial logistic regression model would be appro- of freedom. Alternatively, a likelihood ratio test compar- priate, where the different haplotypes would be the ing the saturated model against the model with no inter- dependent variables. This corresponds to the well- action can be performed. Similar tests can be performed known “retrospective likelihood” (i.e. the likelihood for the models in Eq. (1) and (2). based on the probability of exposure given disease sta- Whatever the assumed sampling scheme that gave rise tus) applicable in case control studies. In this case, the to the data of Table 1 may be, it is well known that the haplotypes are treated as dependent variables and the results of fitting each one of the three models are nearly case/control status as the predictor in a multinomial identical [45]. For instance, it has been shown that max- (polytomous or polychotomous) logistic regression [39]: imum likelihood estimates obtained from the “retrospec- tive” likelihood are the same as those obtained from the ( ) ( ) exp  +y “prospective” likelihood [46,47]. The equivalence of pj = P j|yj = ∑r (j j j ) (2) logistic regression and Poisson modelling has been also exp  +y exploited in the past for deriving methods for detecting j j j j=1 gene-environment interactions [48]. The methods discussed above are simple applications By observing that the linear predictor becomes: of the generalized linear model extending the analysis of single markers to haplotypes and assume that, i) the U =log⎛⎜ pj ⎞⎟=a +y ,j =1,2,...,r (3) haplotype risk follows a multiplicative model of inheri- j ⎝ p ⎠ j j j tance, ii), the haplotype phase is known and, iii) the 1 population is in Hardy-Weinberg Equilibrium (HWE). it is easy to understand that the bj coefficients The genetic model of inheritance can be handled simply obtained by fitting the model are estimates of the log- by using in the analysis the so-called haplo-genotypes or Odds Ratios (i.e. for comparing hj vs. h1) in equivalence diplotypes, instead of the genotypes. This is easily per- to the respective coefficients of the model in Eq. (1). formed with all the previously presented methods by Obviously, b = 0 for identifiability since haplotype j = 1 using the pairwise combinations of haplotypes (h h , 1 1 1 (i.e. h ) is used as the reference category. The particular h h and so on). In case-control association studies, 1 1 2 model was first used for haplotype analysis by Chen and however, with the exception of some cases where direct Kao [40]. genotyping of the haplotypes is applicable (i.e. [38]), the Lastly, assuming that the observed counts are realiza- haplotypes (and the haplo-genotypes) are usually not tions of a Poisson random variable, one can fit log-linear known, but are inferred from the data using statistical models (Poisson regression), where the dependent methods for missing data, usually with an EM or EM- BagosBMCGenetics2011,12:8 Page4of16 http://www.biomedcentral.com/1471-2156/12/8 like algorithm [27-29]. Thus, treating them as known controls and cases of the ith study respectively, given quantities has been shown to be problematic [30]. More by: advanced methods have been developed in order to account for these limitations, for instance weighting the ∑r ∑r n = n −n = n haplotypes by their probability [49,50]. Score methods ic0 ij’0 ij0 ij’0 based on the prospective likelihood [51] or the retro- j’=1 j’=1|j’≠j (7) spective likelihood [52], have also been developed, as ∑r ∑r well as methods for allowing for gene-environment nic1 = nij’1−nij11 = nij’1 interaction [53]. A comparison of methods has shown j’=1 j’=1|j’≠j that the approaches are roughly comparable when the In a standard univariate random-effects model we haplotype effect on disease odds follows a multiplicative assume that the logarithm of the OR of each study i, is model. However, for dominant and recessive models, distributed normally as: the retrospective-likelihood method has increased effi- ciency with respect to the prospective methods [54]. ( ) Graphical models have been proposed by Thomas [55] yi −N ,si2 +2 (8) and log-linear models by Baker [56]. Lin and co-workers extended the previously presented methods by including Thus, the combined logarithm of the Odds Ratio various sampling schemes in a unified framework [57]. (logOR) would be given by: Even though a large body of the genetic epidemiology literature is dedicated to such methods, their application ∧ =∑k w ∑k w , with w =1 (s2 +2)(9) in meta-analysis is problematic since in most cases the i=1 i i i=1 i i i original data are not available to the analyst. Thus, in The between-studies variance (τ2), could be easily the following sections where the methods for meta- computed by the non-iterative method of moments pro- analysis are summarized we also assume that the posed by Dersimonian and Laird [58], even though haplotypes are known. An extension when the posterior there are several alternatives that use iterative proce- probabilities of haplotypes are given from the output of dures (i.e. Maximum Likelihood (ML) or Restricted the haplotype inference software would then be Maximum Likelihood (REML) [33]). Apparently, by set- straightforward. ting τ2 = 0 in Eq. (9) corresponds to the well known fixed-effects estimator with inverse variance weights. Methods for meta-analysis of haplotype association The particular approach is very easily implemented, In this section the methods for meta-analysis are pre- intuitive and it can be performed in a standard univari- sented. Initially we will discuss simple methods using ate meta-analysis framework. In the results section we summary data, whereas in the next sub-section more will see that several already published meta-analyses advanced methods that use generalized linear models on used this method. However, the method has some draw- grouped or Individual Patients Data (IPD) are presented. backs. The most important is that it is prone to an increased type I error rate due to multiple comparisons. Meta-analysis using summary-data Multiple comparisons constitute an important problem A commonly used approach that is based on traditional in haplotype analysis, especially as the number of haplo- methods and uses solely summary data is to consider types increases [59,60]. The model implied by Eqs. (5) - separately the effect of the jth haplotype against the j-1 (8), is conceptually similar to collapsing the genotypes remaining ones. That is, for each study i (i = 1,2,...,k) we in a single-marker analysis, an approach that has been will compute a log-Odds Ratio (logOR): shown to increase the power as well the type I error rate [61]. Thus, the particular approach can be justified, ⎛ ⎞ n n y =logOR =log⎜ ij1 ic0 ⎟ (5) only when there is strong prior knowledge concerning a i i ⎜ ⎟ ⎝ nij0nic1 ⎠ particular haplotype and this haplotype is the only one that is being tested. with an asymptotic variance given by: To overcome the multiple comparisons problem, a straightforward alternative would be to extend the var(y )= s2 = 1 + 1 + 1 + 1 (6) model in a multivariate framework modelling simulta- i i n n n n neously the logORs derived from comparing haplotypes ij1 ic0 ij0 ic1 j = 2,3,...,r against a reference haplotype (j = 1). Follow- In this notation, nic0 an nic1, are the counts of the ing the general framework for multivariate meta-analysis remaining haplotypes (excluding haplotype j) for [37,62], we denote by yi the vector containing the r-1 BagosBMCGenetics2011,12:8 Page5of16 http://www.biomedcentral.com/1471-2156/12/8 different estimates, and by b, the vector of the overall Weshouldmentionthatfromstandardnormaltheoryit means given by: is known that the multivariate test for b = 0, based on b’cov(b)-1b,couldyieldsignificantresultsevenifallther-1 ⎛y ⎞ ⎛ ⎞ univariateWaldtestsarenon-significant.Thus,themulti- ⎜ 2i ⎟ ⎜ 2 ⎟ yi =⎜⎜.y..3i ⎟⎟, and =⎜⎜...3 ⎟⎟ (10) vnaifriicaatenttersetsushltoiusldfobuendpewrfeorcmanedprinoicteiaeldlybayndcoolnlalpysiifnagstihge- ⎜⎜ ⎟⎟ ⎜⎜ ⎟⎟⎟ haplotypesandperformastandardunivariatemeta-analysis. ⎝y ⎠ ⎝ ⎠ ri r The model can be fittedin any statistical package cap- ableoffittingrandom-effectsweightedregressionmodels These logORs similarly to Eq. (5) will be given by: with an arbitrary covariance matrix, such as SAS (using PROCMIXEDorPROCNLMIXED),R(usinglme)orStata ⎛ ⎞ n n y =logOR =log⎜ ij1 i10 ⎟ (11) (using mvmeta). In this work, we used mvmeta which ji ji ⎜ ⎟ ⎝ nij0ni11 ⎠ performsinferencesbasedoneitherMaximumLikelihood (ML) or Restricted Maximum Likelihood (REML), by with an asymptotic variance given by: directmaximizationoftheapproximatelikelihoodusinga Newton-Raphson algorithm [63]. Alternatively, mvmeta ( ) 1 1 1 1 var y = s2 = + + + (12) canalsoimplementthemultivariateversionoftheDerSi- ji ji nij1 ni10 nij0 ni11 monian and Laird’s method of moments [64]. The last option, being non-iterative, is very attractive in case of In the multivariate random-effects meta-analysis, we large number of haplotypes and/or large number of stu- assume that yi is distributed following a multivariate dies. A major disadvantage of the methods proposed in normal distribution around the true means b, according this section is the assumptions of normality that are to the marginal model: employedandtheneedforcorrectionwhentherearerare haplotypes(i.e.addingapseudocountof0.5tothehaplo- y  MVN(,+C ) (13) i i types with zero counts). These limitations are surpassed byusingthemethodsdiscussedinthenextsection. In the above model, we denote by Ci the within-stu- dies covariance matrix: Meta-analysis using binary data In this section, methods that use directly the binary nat- ⎛ s2  s s ...  s s ⎞ ⎜ 2i W23 2i 3i W2r 2i ri ⎟ ure of the data, within a generalized linear mixed model ⎜ s s s2 ...  s s ⎟ (GLMM) are presented. These methods are usually C =⎜ W23 2i 3i 3i W3r 3i ri ⎟ (14) i termed IPD methods [33-37] although in many real-life ⎜ .... ... ... ... ⎟ ⎜ ⎟ applications, individual data may not be literally avail- ⎝ W2rs2isri W3rs3isri ... sr2i ⎠ able. Instead, extending the models described for a sin- gle study, only summary counts of individuals carrying andbyΣthebetween-studiescovariancematrix,givenby: the respective haplotypes will normally be used. Logistic regression ⎛ 2   ...   ⎞ ⎜ 2 B23 2 3 Br2 2 r ⎟ Usingtheprospectivelikelihoodwecanextendthelogis- ⎜  2 ...   ⎟ tic regression model of Eq. (1) in order to incorporate Σ=⎜ B23 2 3 3 Br3 3 r ⎟ (15) study specific effects and perform a stratified analysis ⎜ ... ... ... ... ⎟ ⎜ ⎟ (fixed effectsmeta-analysis).Todoso,weneed tointro- ⎝Br22r Br33r ... r2 ⎠ ducek-1dummyvariablesdi(takingvaluesequaltozero ThediagonalelementsofCiarethestudy-specificesti- sotrudoyn-es)pewciitfhiccfioxeefdfi-ceifefencttss.bT0hi uths,atthaeremionddeilciastoarsstroafigthhte- mates of the variance that are assumed known, whereas forwardextension tothe model describedpreviously for the off-diagonal elements correspond to the pairwise meta-analysis of genetic association studies for single within-studiescovariances,forinstancerw23s2is3i=cov(y2i, nucleotidepolymorphisms[16]andisformulatedas: y3i). Since the logORs derived for each haplotype are comparedagainstthesamereferencecategory,theirpair- logit( )=logit⎡P(y =1|j)⎤ wisecovarianceswillbegiven[12],by: ij ⎣ ij ⎦ ∑r (17) ( ) 1 1 = + d + z cov y ,y = + ,∀j,j’=2,3,...r,j≠ j’ (16) 0 0i i j jj ji j’i ni10 ni11 j=2 BagosBMCGenetics2011,12:8 Page6of16 http://www.biomedcentral.com/1471-2156/12/8 Here, the bj obtained by fitting the model are the to look at multiple indices of heterogeneity arising from overall estimates of the logORs (i.e. for comparing hj vs. multiple haplotype contrasts. h ). An overall test for the association of haplotypes In order to account for an additive component of het- 1 with disease can be performed if we denote by b the erogeneity and perform a random-effects logistic regres- vector of the estimated coefficients and by cov(b) its sion allowing the haplotype effects to vary between estimated variance-covariance matrix. Then, the test sta- studies, the most suitable way is to introduce a set of tistic U = b’cov(b)-1b will have asymptotically a c2 dis- study-specific random coefficients, representing the tribution (U~c2r-1) [65]. The particular model has been deviation of study’s true effect from the overall mean used in several meta-analyses of haplotype association effect for each haplotype. Thus, the model becomes: studies [66-69] (see in the results section, the empirical ( ) ( ) evaluation of the literature). This fixed effects model logit  =logit⎡P y =1|j ⎤ ij ⎣ ij ⎦ assumes homogeneity of ORs between studies. This assumption can be tested by adding the interaction ∑r ( ) (22) = + d +  + z between the study effect and the haplotypes into the 0 0i i j ji j model:: j=2 logit( )=logit⎡P(y =1|j)⎤ In this model, the random terms bi are distributed as: ij ⎣ ij ⎦ ( ) ∑r   MVN 0, (23) = + d + zz i 0 0i i j j (18) j=2 where ∑r ∑k + ijdizj ⎛ ⎞ j=2 i=2 ⎜ 2i ⎟  =⎜3i ⎟ This is the analogue to the Cochran’s test for hetero- i ⎜ ... ⎟ ⎜ ⎟ ⎜ ⎟ geneity in the univariate meta-analysis. The hypothesis ⎝  ⎠ ri can be tested by performing a multivariate Wald test, (24) where the null hypothesis is: ⎛⎜ 22 B2323 ... Br22r ⎞⎟ H : = 0 ( = 0,∀i = 2,3,...,k;j = 2,3,...,r) =⎜⎜B23223 32 ... Br33r ⎟⎟ 0 ij ⎜ ... ... ... ... ⎟ ⎜⎜ ⎟ The test statistic can be constructed analogously to ⎝Br22r Br33r ... r2 ⎠ the one used for b. If we denote byg the vector of the estimated coefficients, by V the estimated variance-cov- The between studies variances and covariances have ariance matrix and by Rg = r the vector of the (r-1)(k-1) the same interpretation as the ones obtained by the linear hypotheses, then the statistic: summary-data methods of Eq. (13) and (15). Multinomial logistic regression W =(R-r)′(RVR′)−1(R-r)=′cov()−1 (19) Alternatively, the model may be parameterized assuming a multinomial sampling scheme utilizing the retrospec- 2 tive likelihood. In this case, an extension of the model will have asymptotically a c distribution [65] of Eq. (2), which incorporates fixed-study effects, would W (2r−1)(k−1) (20) be: ( ) Moreover, the value of W could be used in order to p = P( j|y )= exp j +ijdi +jyij calculate a modified version of the overall inconsistency ij ij ∑r ( ) (25) index I2 [70]: exp  + d +y j ij i j ij j=1 ⎧⎪ W −(r −1)(k−1)⎫⎪ I2 =max⎨0, ⎬ (21) The linear predictor in the above model becomes: ⎩⎪ W ⎭⎪ ⎛ p ⎞ This measure is quite useful, since it enables us to U =log⎜ ij ⎟= + d +y ,j =1,2,...,r (26) ij j ij i j ij summarize the overall heterogeneity, instead of having ⎝ p ⎠ i1 BagosBMCGenetics2011,12:8 Page7of16 http://www.biomedcentral.com/1471-2156/12/8 Similar to the model based on prospective likelihood, ( ) log n = +d + y the variables di are indicators of the study-specific fixed- ij 0 i i 0 ij effects. An overall test for the association of haplotypes ∑r ∑∑r with disease (b = 0) can be performed similarly to the + ajzj + jzjyij logistic regression model (U). Introducing the study by j=2 j=2 (29) disease interaction terms can form a test for homogene- ∑r ∑k ∑k ity of ORs across the k studies: + a z d +  y d ij j i 0i ij i j=2 i=2 i=2 ⎛ p ⎞ ∑k ∑r Uij =log⎝⎜ pij ⎠⎟=j +ijdi +jyij + ijdiyij (27) In this model, the coefficients aj, aij, b0, b0j and bj i1 i=2 j=2 correspond to the ones obtained by fitting the models in j=1,22,...,r Eq. (17) and Eq. (15). The overall test for the association of the haplotypes with the disease (b = 0), is known in The statistics for heterogeneity (W) as well as the I2 the context of log-linear models as the test of “partial index derived from it are identical to the one presented association“ [72,73]. The model in Eq. (29), assumes in Eq. (19) - (21). homogeneity of ORs across studies. Thus, in order to A random-effects extension to the model can be for- test this assumption we need to include additional terms mulated if in the above model, we introduce a haplo- for the three-way interaction (study x disease x haplo- type-specific random coefficient bij (for haplotypes j = type). This is accomplished by fitting the saturated 2,3, ...,r), in which case the linear predictor becomes model: [71]: ( ) log n = +d + y ij 0 i i 0 ij ⎛ p ⎞ U =log⎜ ij ⎟= + d +y + y ∑r ∑∑r ij ⎝ p ⎠ j ij i j ij ij ij (28) + a z + z y i1 j j j j ij j =2,3,...,r j=2 j=2 ∑r ∑k ∑k (30) and the model is completely specified as a random + a z d +  y d ij j i 0i ij i effects multivariate meta-analysis, with random terms bi j=2 i=2 i=2 distributed similarly as bi~MVN(0,Σ). The interpretation ∑r ∑k of the variances and covariances of the random terms is +  z dy ij j i ij identical to the ones presented in Eq. (13). A version of j=2 i=2 this model has been used previously for meta-analysis of genetic association studies involving single nucleotide The test with the null hypothesis Η0: g = 0 (gij = 0,i = polymorphisms [12], but according to the author’s 2,3,...,k, j = 2,3,...,r) is identical to the ones obtained by knowledge it has never been used for meta-analysis of fitting the models in Eq. (18) and (27). The three-way haplotypes. interaction model and its interpretation in terms of test- Poisson regression ing the homogeneity of ORs has been discussed in detail Lastly, we can extend the log-linear model of Eq. (4) in in the past [45,74-76]. Log-linear models have been order to perform a fixed effects meta-analysis allowing employed in several meta-analyses of haplotype associa- for the study-specific effects. The major difference tion [77,78] (see in the results section). However, even compared to the previous approaches lies in the struc- though not described in detail, it is apparent from the ture of the log-linear model and the interpretation of results reported, that in these analyses the log-linear the main effects and interactions. Having in mind that model was not applied in an appropriate manner. we want to model a 2 × r × k contingency table, the Although the authors stated that they performed stratifi- appropriate choice would be to include in the model cation by study, they probably included only the main of Eq. (4) the study specific main effects as well as the effect of the study and not the interaction terms with two-way interactions (study x disease and study x hap- both haplotypes and disease. As we will see in the lotype). Thus, we would have a model containing all results section, when the correct model is applied, the the main effects as well as all the two-way interactions, originally drawn conclusions are compromised. a model known as the “no three-factor interaction In analogy to models in Eq. (22) and (28), a random model“ [45]: coefficient for the disease by haplotype interaction can BagosBMCGenetics2011,12:8 Page8of16 http://www.biomedcentral.com/1471-2156/12/8 be applied in order to perform a random-effects meta- Implementation analysis: The models presented in this section can be easily fitted in Stata using gllamm, or in SAS using PROC ( ) log n = +d + y NLMIXED. These models are expected to perform better ij 0 i i 0 ij compared to the models presented in the previous sec- ∑r ∑r ( ) + a z +  + z y tion, in case the normality assumption for logORs does j j j ji j iij (31) not hold. Furthermore, a major advantage of these mod- j=2 j=2 els is that they can directly be used for pooled meta- ∑r ∑k ∑k analyses performed under large collaborative projects. + a z d +  y d ij j i 0i ij i This is why these models are usually termed Individual j=2 i=2 i=2 Patients Data (IPD) methods [36]. However, a disadvan- with random terms bi distributed similarly as bi~MVN tage is that these methods are computational intensive, (0,Σ). Similarly to the multinomial logistic regression especially when the number of haplotypes is large. A sometimes useful simplification can be made in Eq. model, the interpretation of the variances and covar- (15) if we assume that the between-studies variances are iances of the random terms is identical to the ones pre- sented in Eq. (12). equal [34]. In such case by letting τ = τ2 = τ3 = ... = τr, Σ reduces to: Continuous traits The methods discussed so far assume we are dealing ⎛⎜ 2 2 ... 2 ⎞⎟ with a binary trait, usually in a case-control setting. ⎜2 2 ... 2 ⎟ =⎜ ⎟⎟ (35) However, continuous traits are not uncommon in ⎜ ... ... ... ... ⎟ genetic association studies and these should be easily ⎜ ⎟ accommodated using a linear model (linear regression). ⎝2 2 ... 2 ⎠ For instance, denoting by yij the continuous trait for a person carrying the jth haplotype in the ith study, the Another approximation would be to impose a single between studies correlation, but allow for different model would be: between-studies variances [79]: ∑r yij =0 +0idi + jzj (32) ⎛⎜ 22 23 ... 2r ⎟⎞ j=2 ⎜ 2 ...  ⎟ =⎜ 2 3 3 3 r ⎟ (36) The homogeneity of haplotype effects across studies ⎜ ... ... ... ... ⎟ can be subsequently checked using a model with a hap- ⎝⎜  .... 2 ⎠⎟ 2 r 3 r r lotype x study interaction term: In this work however, we chose to use a different ∑r ∑r ∑k y = + d + z +  d z (33) approximation that can be obtained if the number of ij 0 0i i j j ij i j random effects is reduced by decomposing the random j=2 j=2 i=2 terms using factor loadings such as: τ 2 = l 2τ2, τ 2 = 2 2 3 Finally, a random effects model could be formulated l32τ2, ..., τr2 = lr2τ2, and letting l2 = 1 for identification. Thus, the covariance matrix becomes now: using a liner mixed model: ∑r ( ) ⎛⎜ 2 32 ... r2 ⎞⎟ yij =0 +0idi + j +ji zj (34) ⎜2 22 ... 2 ⎟ j=2 =⎜ 3 3 3 r ⎟ (37) ⎜ ... ... ... ... ⎟ ⎜ ⎟ with random terms bi distributed similarly as ⎝2 2 ... 22 ⎠ bi~MVN(0,Σ). Similarly to the previously described r 3 r r models, the interpretation of the variances and covar- The particular approximation is conceptually similar iances of the random terms is identical to the ones to the one used previously for the so-called “genetic presented in Eq. (12). In case where individual data are model-free approach” for meta-analysis of genetic asso- not available, the above models could be easily fitted ciation studies [14,80], even though the motivation was using summary data (mean values and standard devia- different. The model imposes a single between studies tions) per haplotype. variance τ2 thus, it is much faster since the factor BagosBMCGenetics2011,12:8 Page9of16 http://www.biomedcentral.com/1471-2156/12/8 loadings lj with j = 3,4,...,r are treated as fixed-effects associations. The initial search in PUBMED using the parameters. By observing also the off-diagonal elements term “haplotype” combined with “meta-analysis” or “col- of the covariance matrix in Eq. (37), we can see that the laborative analysis” or “pooled analysis” yielded 282 stu- model restricts the between-studies correlations (rBjj’) to dies. Of these, 35 studies could have been identified be equal to ±1 (depending on the sign of ljlj’). Never- using solely the terms “collaborative analysis” or “pooled theless, the between-studies correlations are usually analysis” and “haplotype”. After careful screening, 207 poorly estimated especially when the number of studies studies were excluded as irrelevant ones (they were not is small (<20) and in such cases they are usually esti- meta-analyses of haplotypes), 36 studies were excluded mated to be equal to ±1 [81,82]. Thus, the particular for various reasons (family based-studies, meta-analyses approach seems to be a good compromise between of SNPs with the term “haplotype” appearing in the speed and precision and we expect to perform well. abstract or haplotype analyses in which the term “meta- Using this approach, the computational complexity as analysis” appeared in the abstract etc). Finally, we came well as the execution time is reduced drastically but the up with 39 published papers containing data for 43 obtained estimates agree up to the fourth decimal place associations. Some studies reported different sets of hap- in most of the experiments conducted. lotypes from the same gene (Auburn et al, 2008; Zint- Afinal commenthastobe madeconcerningthe iden- zaras et al, 2009), haplotypes from different genes tifiability of the models presented in the previous sec- (Thakkinstian et al, 2008), or distinct outcomes mea- tions, especially when it comes to the log-linear models sured on different subsets of patients (Kavvoura et al, which are the ones that contain the largest number of 2007) and thus, they were included twice, whereas from parameters. Concerning the fixed effects methods, the studies that reported different outcomes measured on numberofparametersofthesaturatedmodelofEq.(30) the same set of individuals we kept only one. There is equal to 2rk, a number that is equal to the number of were also some pairs of studies that evaluated the same observations[45].Forthemodel ofEq.(29),thenumber association and from these we kept only the largest one. of freely estimated parameters is equal to rk + r + k-1, 10 out of the 39 published papers could have been iden- which is obviously smaller than 2rk (since r > 1 and k > tified using solely the terms “collaborative analysis” or 1).TherandomeffectsmodelofEq.(31)hasatotalnum- “pooled analysis” coupled with the term “haplotype”. ber of parameters equal to rk + r + k-1 + r(r-1)/2 since The 43 studies and their characteristics are presented in weneed toestimateadditionallyr(r-1)/2elementsofthe Table 2. covariance matrix (the variances and the covariances of The average number of polymorphisms included in the random effects). Thus, in order for the model to be the haplotypes was 3.19 (SD = 1.37, median = 3, range identifiableweneedtoensurethatrk+r+k-1+r(r-1)/2 from 2 to 7), whereas the sample size was 5,017.81 (SD ≤2rkwhichisaccomplished ifk;≥1+r/2.Intuitively,we = 4,703.24, median = 3,004, range from 348 to 23,309). need a relatively larger number of studies compared to The average number of included studies was 5.14 (SD = the number of haplotypes. If on the other hand, we fit 3.06, median = 4, range from 2 to 13). Twenty seven the model of Eq. (31) using Eq. (35) for restricting the studies (62.79%) were conducted in a collaborative set- covariances, we only need rk + r + k parameters and ting, whereas sixteen (37.21%) were performed using whenweuseEq.(36)orEq.(37),weneedtoestimaterk data derived from the literature. Twenty seven of the +2r + k-1 parameters, numbers which both are smaller meta-analyses (62.79%) reported significant results and than2rk.Nevertheless,forpracticalapplications,wewill the majority (22 studies, 51.16%) were analysed under normally use the logistic regression model of Eq. (22) the “1 vs. others” approach using standard summary coupledwithparameterizationofEq.(37),andthusiden- based meta-analysis techniques (with fixed or random tifiabilityissueswillneverariseinpractice. effects), 11 studies (25.58%) were analysed by pooling In Additional file 1, Stata programs for fitting the the data inappropriately, 6 studies (13.95%) did not models developed in this section are presented. The report the method or did not perform pooling at all and models were fitted using the gllamm module for Stata 4 analyses (9.30%) were performed using a fixed effects [83,84]. gllamm uses numerical integration by adaptive logistic regression model. Only 13 studies (30.23%) quadrature in order to integrate out the latent variables reported the complete data that suffice for the analysis and obtain the marginal log-likelihood. Afterwards, the to be replicated (Table 2 and 3). log-likelihood is maximized by Newton-Raphson using There was only some weak evidence where collabora- numerical first and second derivatives. tive meta-analyses contained larger number of studies compared to literature-based ones (5.67 vs. 4.25), larger Results sample size (5,651 vs. 3,948) and produced significant We initially performed a literature search for identifying results more frequently (66.67% vs. 56.25%). However, studies that report meta-analyses of haplotype these differences did noreach statistical significance BagosBMCGenetics2011,12:8 Page10of16 http://www.biomedcentral.com/1471-2156/12/8 Table 2Listofthe 43meta-analyses thatwere used inthe empirical evaluation ID Reference Gene/ Disease/ SNPsin Numberof Sample Methodof Data Collaborative Significant Locus Outcome haplotype studies Size analysis availability analysis results 1 [109] DRD3 Schizophrenia 4 5 7551 1vs.others No No No 2 [98] ITGAV Rheumatoid 3 3 6851 N/A Yes Yes Yes Arthritis 3 [110] IL1A/IL1B/ Osteoarthritis 7 4 2908 1vs.others No Yes Yes IL1RN 4 [111] FRZB Osteoarthritis 2 10 12380 1vs.others No Yes No 5 [99] CX3CR1 CAD 2 6 2912 1vs.others Yes No Yes 6 [112] ALOX5AP Stroke 4 5 5765 1vs.others No No No 7 [112] ALOX5AP Stroke 4 3 3004 1vs.others No No No 8 [113] GNAS Malaria 3 7 8154 1vs.others No Yes Yes 9 [113] GNAS Malaria 7 6 7632 1vs.others No Yes Yes 10 [114] PDLIM5 Bipolar 2 3 1208 1vs.others No No No Disorder 11 [115] PDE4D Stroke 2 4 4961 1vs.others No No Yes 12 [116] TGFB1 Renal 2 4 438 pooled No No Yes Transplantation 13 [116] IL10 Renal 3 4 348 pooled No No No Transplantation 14 [117] 9p21.3 CAD 4 5 7838 1vs.others No Yes Yes 15 [118] HLA SLE 2 3 527 1vs.others No No Yes 16 [94] CTLA4 GravesDisease 2 10 2564 1vs.others Yes Yes Yes 17 [94] CTLA4 Hashimoto 2 5 1210 1vs.others Yes Yes Yes Thyroiditis 18 [119] ENPP1 T2DM 3 3 8676 1vs.others No No No 19 [77] MTHFR ALL 2 4 894 Log-linear No No Yes model 20 [97] CAPN10 T2DM 3 11 5862 1vs.others Yes Yes Yes 21 [93] ADAM33 Asthma 5 3 1899 pooled Yes No No 22 [120] NRG1 Schizophrenia 6 11 8722 1vs.others No No Yes 23 [121] RGS4 Schizophrenia 4 8 7243 1vs.others No Yes No 24 [122] ADRB2 Asthma 2 3 2060 N/A No No Yes 25 [123] ESR1 Fractures 3 8 14622 1vs.others No Yes Yes 26 [78] VDR Osteoporosis 3 4 2335 Log-linear Yes No Yes model 27 [95] ACE Alzheimer’s 3 4 1619 pooled Yes Yes Yes Disease 28 [124] IGF-I IGF-Ilevels 3 3 1929 1vs.others No Yes Yes 29 [125] TF Stroke 2 2 818 N/A No Yes No 30 [92] FcgammaR CelliacDisease 2 2 1057 N/A Yes Yes No 31 [69] VDR Fractures 3 9 23309 Logistic No Yes No regression 32 [96] G72/G30 Schizophrenia 2 2 1541 N/A Yes Yes Yes 33 [68] VEGF ALS 3 4 1912 Logistic Yes Yes Yes regression 34 [126] BANK1 Rheumatoid 3 4 4445 1vs.others No Yes Yes Arthritis 35 [67] CYP19A1 Endometrial 2 10 13283 Logistic No Yes Yes Cancer regression 36 [127] CRP T2DM 3 3 11876 N/A No No Yes 37 [66] 8q24 Colorectal 4 3 5385 Logistic No Yes Yes Adenoma regression 38 [128] CYP1A1 LungCancer 2 13 2151 Pooled No Yes Yes

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.