Statistical issues in Mendelian randomization: use of genetic instrumental variables for assessing causal associations

Stephen Burgess
MRC Biostatistics Unit
Emmanuel College, University of Cambridge

A thesis submitted for the degree of Doctor of Philosophy
22nd August 2011

“And furthermore, my son, be admonished: of making many books there is no end; and much study is a weariness of the flesh.”
Ecclesiastes 12:12.

Statement of Collaboration and Acknowledgements

I hereby declare that this thesis is the result of my own work, includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text and bibliography, is not substantially the same as any other work that I have submitted or will be submitting for a degree or diploma or other qualification at this or any other university, and does not exceed the prescribed word limit.

I would like to thank my supervisor, Simon G. Thompson, for guidance and support throughout this period of my life, and my advisors, Julian Higgins, Jack Bowden and Shaun Seaman, for helpful discussions. In particular, I acknowledge contributions from Shaun Seaman, Debbie Lawlor and Juan Pablo Casas, who are co-authors of the paper included as Appendix E, which forms the basis for Chapter 7. This paper was conceived, undertaken and written by me under the supervision of Simon Thompson, with editorial input from Shaun Seaman. Debbie Lawlor and Juan Pablo Casas provided data and comments on the manuscript.

I also acknowledge the contribution of the CRP CHD Genetics Collaboration (CCGC), specifically of Frances Wensley, who coordinated the collaboration, Mat Walker and Sarah Watson, who managed the data, and John Danesh, who oversaw the project. The papers included as Appendices B and C were conceived, undertaken and written by me under the supervision of Simon Thompson; several comments on penultimate versions of these manuscripts were provided by members of the collaboration.
These papers form the basis of work in Chapters 3 and 5. The preliminary analyses undertaken in Chapter 8 were conceived, undertaken and written by me independently of the parallel analyses in the paper included as Appendix F, for which I contributed the instrumental variable analysis.

I would also like to thank my wife, Nina, all those with whom I have shared an office (Aidan, Alex, Ben, Dennis, Emma, Graham, Verena) and those who have brightened up the journey thus far, my family and friends.

Stephen Burgess: Statistical issues in Mendelian randomization: use of genetic instrumental variables for assessing causal associations

Mendelian randomization is an epidemiological method for using genetic variation to estimate the causal effect of the change in a modifiable phenotype on an outcome from observational data. A genetic variant satisfying the assumptions of an instrumental variable for the phenotype of interest can be used to divide a population into subgroups which differ systematically only in the phenotype. This gives a causal estimate which is asymptotically free of bias from confounding and reverse causation. However, the variance of the causal estimate is large compared to traditional regression methods, requiring large amounts of data and necessitating methods for efficient data synthesis. Additionally, if the association between the genetic variant and the phenotype is not strong, then the causal estimates will be biased due to the “weak instrument” in finite samples in the direction of the observational association. This bias may convince a researcher that an observed association is causal. If the causal parameter estimated is an odds ratio, then the parameter of association will differ depending on whether viewed as a population-averaged causal effect or a personal causal effect conditional on covariates.
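As a rough illustration of the instrumental variable idea described above, the sketch below simulates a single genetic variant G, a phenotype X, an outcome Y and an unmeasured confounder U, then compares the confounded observational regression slope with a ratio of coefficients estimate (the G-Y association divided by the G-X association). The function names and all parameter values (sample size, minor allele frequency, α = 0.5, β1 = 0.3, unit variances) are assumptions chosen for illustration and are not taken from the analyses in this thesis.

```python
import random


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=50000, maf=0.3, alpha=0.5, beta1=0.3, seed=1):
    """Return (observational estimate, IV ratio estimate) of beta1.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        # G: number of minor alleles (0, 1 or 2) at a single variant.
        gi = (rng.random() < maf) + (rng.random() < maf)
        # U: unmeasured confounder affecting both X and Y.
        ui = rng.gauss(0.0, 1.0)
        xi = alpha * gi + ui + rng.gauss(0.0, 1.0)
        yi = beta1 * xi + ui + rng.gauss(0.0, 1.0)
        g.append(gi)
        x.append(xi)
        y.append(yi)
    observational = slope(x, y)        # confounded by U
    ratio = slope(g, y) / slope(g, x)  # G-Y association over G-X association
    return observational, ratio


if __name__ == "__main__":
    obs, iv = simulate()
    print(f"observational estimate: {obs:.3f}")
    print(f"IV ratio estimate:      {iv:.3f}")
```

Under these assumptions the observational slope is pulled away from β1 by the confounder, whereas the ratio estimate should lie close to β1, though with a larger sampling variance. Lowering `alpha` or `n` weakens the instrument and makes the ratio estimate less stable.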
We introduce a Bayesian framework for instrumental variable analysis, which is less susceptible to weak instrument bias than traditional two-stage methods, has correct coverage with weak instruments, and is able to efficiently combine gene–phenotype–outcome data from multiple heterogeneous sources. Methods for imputing missing genetic data are developed, allowing multiple genetic variants to be used without reduction in sample size. We focus on the question of a binary outcome, illustrating how the collapsing of the odds ratio over heterogeneous strata in the population means that the two-stage and the Bayesian methods estimate a population-averaged marginal causal effect similar to that estimated by a randomized trial, but which typically differs from the conditional effect estimated by standard regression methods. We show how these methods can be adjusted to give an estimate closer to the conditional effect. We apply the methods and techniques discussed to data on the causal effect of C-reactive protein on fibrinogen and coronary heart disease, concluding with an overall estimate of causal association based on the totality of available data from 42 studies.
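The difference between marginal and conditional odds ratios mentioned above can be seen in a small numerical sketch. It assumes a logistic model for Y with a conditional log odds ratio `b1` per unit of X and an independent normally distributed covariate V, then averages the risk over V to obtain the marginal log odds ratio. The model, the parameter values and the function names are assumptions made for illustration only; they are not the simulations reported in this thesis.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def marginal_risk(x, b0, b1, sd_v, steps=4001, width=8.0):
    """P(Y = 1 | X = x), averaged over V ~ N(0, sd_v^2).

    Uses trapezoidal integration over +/- `width` standard deviations.
    """
    lo = -width * sd_v
    h = 2.0 * width * sd_v / (steps - 1)
    total = 0.0
    for i in range(steps):
        v = lo + i * h
        density = math.exp(-0.5 * (v / sd_v) ** 2) / (sd_v * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * density * expit(b0 + b1 * x + v)
    return total * h


def marginal_log_odds_ratio(b0, b1, sd_v):
    """Log odds ratio for X = 1 versus X = 0 after averaging over V."""
    p1 = marginal_risk(1.0, b0, b1, sd_v)
    p0 = marginal_risk(0.0, b0, b1, sd_v)
    return logit(p1) - logit(p0)


if __name__ == "__main__":
    conditional = 1.0  # assumed conditional log odds ratio
    for sd_v in (0.5, 1.0, 2.0):
        marginal = marginal_log_odds_ratio(0.0, conditional, sd_v)
        print(f"sd(V) = {sd_v}: conditional {conditional:.3f}, marginal {marginal:.3f}")
```

In this sketch the marginal log odds ratio is closer to zero than the conditional one, and the gap widens as the spread of V increases, even though V is independent of X and so is not a confounder.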
Abbreviations

2SLS      two-stage least squares
2SPS      two-stage predictor substitution
2SRI      two-stage residual inclusion
ACE       average causal effect
BMI       body mass index
CCGC      CRP CHD Genetics Collaboration
CRP       C-reactive protein
CHD       coronary heart disease
CI / CrI  confidence / credible interval
COR       causal odds ratio (Chapter 2)
C(L)OR    conditional (log) odds ratio (Chapter 4)
CRR       causal risk ratio
DIC       deviance information criterion
FE / RE   fixed-effects / random-effects
GMM       generalized method of moments
GWAS      genome-wide association study (or studies)
HDL-C     high-density lipoprotein cholesterol
HR        hazard ratio
HWE       Hardy–Weinberg equilibrium
I(L)OR    individual (log) odds ratio
IL6       interleukin-6
IPD       individual participant data
IV        instrumental variable
LIML      limited information maximum likelihood
LD        linkage disequilibrium
LDL-C     low-density lipoprotein cholesterol
lp(a)     lipoprotein(a)
MAB       median absolute bias
MAF       minor allele frequency
MAR       missing at random
MCAR      missing completely at random
MCMC      Monte Carlo Markov chain
MCSE      Monte Carlo standard error
MI        myocardial infarction
MNAR      missing not at random
M(L)OR    marginal (log) odds ratio
P(L)OR    population (log) odds ratio
RCT       randomized controlled trial
SE        standard error
(G)SMM    (generalized) structural mean model
SNP       single nucleotide polymorphism

Abbreviations for the various studies in the CCGC are given in Appendix H.
Notation

Throughout this dissertation, we use the notation:

X    phenotype: the risk factor, or protective factor, or intermediate phenotype of interest
Y    outcome
U    confounder in the X-Y association
V    unmeasured confounder (Chapter 3); covariate for Y (Chapters 4 and 6)
G    instrumental variable
α    parameter of genetic association: regression parameter in the G-X regression
β    regression parameter in the X-Y regression
β1   causal effect of X on Y: the main parameter of interest
γ    parameter of genetic association for haplotypes: regression parameter in the G-X regression where G represents a haplotype or diplotype
ρ    correlation parameter
σ²   variance parameter
τ²   between-study heterogeneity variance parameter
ψ²   genetic between-study heterogeneity variance parameter
F    F statistic from regression of X on G
i    subscript indexing individuals
j    subscript indexing genotypic subgroups
J    total number of genotypic subgroups
k    subscript indexing genetic variants (SNPs)
K    total number of genetic variants
m    subscript indexing studies in a meta-analysis, or imputed datasets (Chapter 7)
M    total number of studies, or imputed datasets (Chapter 7)
N    total number of individuals
n    total number of cases (individuals with a disease event)
t    time-to-event
N    normal distribution
U    uniform distribution

We follow the usual convention of using upper-case letters for random variables and lower-case letters for data values (except for N and n).

Contents

1 Introduction to Mendelian randomization
  1.1 The rise of genetic epidemiology
    1.1.1 Historical background
    1.1.2 Shortcomings of classical epidemiology
    1.1.3 The need for an alternative
  1.2 What is Mendelian randomization?
    1.2.1 Motivation
    1.2.2 Instrumental variables
    1.2.3 Analogy with randomized controlled trials
    1.2.4 Confounding
    1.2.5 Reverse causation
  1.3 Genetic markers
  1.4 Examples of Mendelian randomization
  1.5 The CRP CHD Genetic Collaboration dataset
    1.5.1 Study design
    1.5.2 Phenotype data
    1.5.3 Genetic data
    1.5.4 Outcome data
    1.5.5 Covariate data
    1.5.6 The need for Mendelian randomization
    1.5.7 Statistical issues and difficulties in CCGC
  1.6 Overview of dissertation
    1.6.1 Chapter structure
    1.6.2 Novelty and publications

2 Existing statistical methods for Mendelian randomization
  2.1 Review strategy
  2.2 Finding a valid instrumental variable
    2.2.1 Parallel with non-compliance
    2.2.2 Violations of the IV assumptions
  2.3 Testing for a causal effect
  2.4 Estimating the causal effect
    2.4.1 Additional IV assumptions
    2.4.2 Causal parameters
    2.4.3 Collapsibility
  2.5 Ratio of coefficients method
    2.5.1 Confidence intervals
  2.6 Two-stage methods
    2.6.1 Continuous outcome - two-stage least squares
    2.6.2 Binary outcome
  2.7 Likelihood-based methods
    2.7.1 Limited information maximum likelihood method
    2.7.2 Bayesian methods
  2.8 Semi-parametric methods
    2.8.1 Generalized method of moments
    2.8.2 Structural mean models
  2.9 Method of Greenland and Longnecker
  2.10 Comparison of methods
  2.11 Efficiency and validity of instruments
    2.11.1 Use of measured covariates
    2.11.2 Overidentification tests
  2.12 Meta-analysis
  2.13 Weak instruments
  2.14 Computer implementation
  2.15 Mendelian randomization in practice
  2.16 Conclusion

3 Weak instrument bias for continuous outcomes
  3.1 Introduction
  3.2 Demonstrating the bias from IV estimators
    3.2.1 Bias of IV estimates in small studies
    3.2.2 Simulation with one IV
  3.3 Explaining the bias from IV estimators
    3.3.1 Correlation of associations
    3.3.2 Finite sample violation of IV assumptions
    3.3.3 Sampling variation within genotypic subgroups
  3.4 Quantifying the bias from IV estimators
    3.4.1 Simulation of 2SLS bias with different strengths of 1 and 3 IVs
    3.4.2 Comparison of bias using different IV methods
  3.5 Choosing a suitable IV estimator
    3.5.1 Multiple candidate IVs
    3.5.2 Overidentification
    3.5.3 Multiple instruments in the Framingham Heart Study
    3.5.4 Model of genetic association
  3.6 Minimizing the bias from IV estimators
    3.6.1 Increasing the F statistic
    3.6.2 Adjustment for measured covariates
    3.6.3 Borrowing information across studies
  3.7 Discussion
    3.7.1 Key points from chapter

4 Collapsibility for IV analyses of binary outcomes
  4.1 Introduction
  4.2 Collapsibility
    4.2.1 Collapsibility across a covariate
    4.2.2 Collapsibility across the risk factor distribution
  4.3 Exploring differences in odds ratios
    4.3.1 Individual and population odds ratios
    4.3.2 Marginal and conditional estimates
    4.3.3 Population and individual odds ratios in simulated data
    4.3.4 Population and individual odds ratios in five studies
    4.3.5 Summary
  4.4 Instrumental variables
    4.4.1 Relation of the two-stage IV estimator and population odds ratio
    4.4.2 IV estimation in simplistic simulated scenarios
    4.4.3 IV estimation in more realistic simulated scenarios
    4.4.4 Interpretation of the adjusted two-stage estimand
    4.4.5 IV estimation in five studies
  4.5 Discussion
    4.5.1 Connection to existing literature and novelty
    4.5.2 Choice of target effect estimate
    4.5.3 “Forbidden” regressions
    4.5.4 Different designs, different parameters
    4.5.5 Key points from chapter

5 A Bayesian framework for instrumental variable analysis
  5.1 Introduction
  5.2 Continuous outcome - A single genetic marker in one study
    5.2.1 Conventional methods
    5.2.2 A Bayesian method
  5.3 Continuous outcome - Multiple genetic markers in one study
    5.3.1 Methods
    5.3.2 Application to C-reactive protein and fibrinogen
  5.4 Continuous outcome - Multiple genetic markers in multiple studies
    5.4.1 Methods
    5.4.2 Application to C-reactive protein and fibrinogen
  5.5 Binary outcome - Genetic markers in one study
    5.5.1 Conventional methods
    5.5.2 A Bayesian method
  5.6 Dealing with issues of evidence synthesis in meta-analysis
    5.6.1 Cohort studies
    5.6.2 Common SNPs
    5.6.3 Common haplotypes
    5.6.4 Lack of phenotype data
    5.6.5 Tabular data
  5.7 Discussion
    5.7.1 Bayesian methods in IV analysis
    5.7.2 Bayesian analysis as a likelihood-based method
    5.7.3 Meta-analysis
    5.7.4 Conclusion
    5.7.5 Key points from chapter