Network-based approach for Post Genome-Wide Association Study Analysis in Admixed Populations

By Mamana Mbiyavanga ([email protected])
Supervision: Ass. Prof. Nicola Mulder ([email protected])
Computational Biology Group
Institute for Infectious Diseases and Molecular Medicine
University of Cape Town

A thesis submitted for the degree of Master’s in Medicine/Bioinformatics
April 2014

The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non-commercial research purposes only.

Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author.

Abstract

The rapid advances in genotyping technology have increased the power to identify loci associated with complex traits through genome-wide association (GWA) studies. GWA studies have enabled the identification of associations between common genetic variants and complex diseases, but they generally consider only the most significant SNPs/genes, and this single-marker-based approach has shown certain limitations. It may not possess adequate power to detect important genetic variants, most of which have relatively small effect sizes in complex diseases. As a consequence, much of the estimated heritability remains unexplained by the variants identified so far for most complex diseases covered by published GWA studies.
Extending GWA study findings and exploring related discoveries in a more mechanistic way requires taking into account the genetic architecture of the disease as well as the genetic mechanisms involved in its pathogenesis. Uncovering the biological pathways and networks relevant to a phenotype should involve combining weak signals from individual variants in GWA studies, in order to approximate the true disease process more closely and provide biological insights. This would facilitate a better understanding of the biological basis of disease susceptibility and the use of these genetic risk factors to predict who is at risk, in order to develop new prevention and treatment strategies such as new pharmacologic therapies and personalized medicine. Recently, several studies have explored the feasibility of pathway- or network-based analysis approaches for GWA studies, and have proposed several methods to summarize the significance of a set of genes or a biological pathway from a collection of SNPs and to adjust for multiple testing at both the gene and pathway levels.

In this project, we review some existing pathway-based approaches for GWA study analysis by exploring different implemented methods for combining the effects of multiple modest genetic variants at the gene and pathway levels. We then propose a graph-based method, ancGWAS, that incorporates the signal from a GWA study and the locus-specific ancestry into the human protein-protein interaction (PPI) network to identify significant sub-networks or pathways associated with the trait of interest. This network-based method applies centrality measures within linkage disequilibrium (LD) on the network to search for pathways and applies a scoring summary statistic to the resulting pathways to identify the most enriched pathways associated with complex diseases. In addition, the proposed method also tests for possible signals of unusual differences in excess/deficiency of ancestry at the gene and pathway levels in admixed populations.
Through simulations of well-characterized heterogeneous populations, we evaluated the developed method and compared it with some existing methods, and demonstrated that this approach may enable more efficient and more powerful analysis of GWA studies than most existing methods, as it incorporates topological properties of networks that provide more information on the relatedness and interconnectivity of genes. We applied the new approach to the Cancer Genetic Markers of Susceptibility (CGEMS) GWA study dataset for postmenopausal women of European ancestry with invasive breast cancer. Our analysis of the CGEMS breast cancer data revealed some previously targeted breast cancer pathways, and many others believed to be involved in breast carcinogenesis, including the Proteoglycan syndecan-1-mediated signaling pathway, the ErbB receptor signaling network, the Regulation of Androgen receptor activity pathway, and the Integrin family cell surface interactions pathway. The results suggest that genetic alterations in these pathways may contribute to breast cancer susceptibility.

Declaration

This thesis:
• is my own work and contains nothing which is the outcome of work done in collaboration with others, except where specified in the text;
• is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other university.

I empower the university to reproduce for the purpose of research either the whole or any portion of the contents in any manner whatsoever.

Signature: ................................. Date: ........................................
Mamana Mbiyavanga

Acknowledgements

Foremost, I would like to express my sincere gratitude to my advisor Prof. Nicola Mulder for the continuous support of my master’s study and research, for her patience, enthusiasm, motivation, and immense knowledge. Her guidance helped me throughout the research and writing of this thesis.
I could not have imagined having a better supervisor and mentor for my master’s study.

My sincere thanks also go to Prof. Darren Martin, Dr. Chimusa Emile, and Dr. Mazandu Gaston, for their support, useful comments, remarks and engagement throughout the learning process of this master’s thesis.

My master’s studies were supported by a UCT-AIMS joint grant from the African Institute for Mathematical Sciences and the University of Cape Town. Travel grants from the University of Cape Town allowed me to present some of this work at an international conference as well as to attend practical workshops related to my study and research.

I thank my fellow labmates in the Computational Biology Group of the University of Cape Town for the stimulating discussions, for the sleepless nights we spent working together before deadlines, and for all the fun we have had in the last two years.

I would also like to thank my friends and my loved ones, Nkosi Dimba, Kilandamoko Ephraim, Gracia Nginamau, Matondo Christelle and Luzolawu Prisca, who have supported me throughout the entire process, both by keeping me harmonious and by helping me put the pieces together. I will be grateful forever for your love.

Last but not least, I would like to thank my family: my parents Nginamau Filipe and Kisuvidi Lando, for giving birth to me in the first place and supporting me spiritually throughout my life.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
Motivation and outline
    1 Motivation and purpose of the study
    2 Proposed methodology and study outline
1 Introduction: Genetic association studies and Pathway-based approaches for GWAS analysis
    1.1 Population-based association studies
        1.1.1 Case-control design
        1.1.2 Measure of genetic risks in case-control studies
        1.1.3 Linkage disequilibrium and indirect association
    1.2 Population structure and impact of population stratification
        1.2.1 Population structure
        1.2.2 Correcting for population stratification and cryptic relatedness
    1.3 Genome-wide association studies: progress and limitations
        1.3.1 Overview of GWA studies
        1.3.2 Limitations and future directions of GWA studies
    1.4 Overview on biological networks and pathway databases
        1.4.1 Analysis of biological networks
        1.4.2 Protein-protein interaction databases
        1.4.3 Pathway annotation databases
    1.5 Pathway-based approaches for GWAS analysis
        1.5.1 Linking pathways to complex diseases
        1.5.2 Methods for pathway-based analysis of GWA studies
        1.5.3 Testing the null hypothesis: competitive and self-contained methods
        1.5.4 Accounting for gene-level association: one-step and two-step methods
        1.5.5 Impact of LD and adjustment of association significance for pathway size
    1.6 Challenges and considerations
2 ancGWAS: an improved method for gene-gene interaction analysis of genome-wide association study data for homogeneous and admixed populations
    2.1 Methods and implementation
        2.1.1 Combining p-values at the gene level
        2.1.2 Constructing the LD-weighted PPI network
        2.1.3 Searching for sub-networks using centrality measures
        2.1.4 Combining p-values at the subnetwork level
        2.1.5 Combining local ancestry at the gene and sub-network levels
        2.1.6 Testing case-control ancestry difference for gene or sub-network levels
        2.1.7 Characterization of enriched sub-networks
    2.2 Implementation and discussion
        2.2.1 Implementation
        2.2.2 Discussion
3 Evaluation of ancGWAS through simulation of disease in non-admixed and admixed populations
    3.1 Materials and methods
        3.1.1 Overview of the dmGWAS method
        3.1.2 Simulation of non-admixed pathway-based case-control population
        3.1.3 Simulation of admixed case-control population
    3.2 Results and Discussion
        3.2.1 Assessing ancGWAS on a simulated pathway-based association study
        3.2.2 Evaluating ancGWAS on a simulated disease in an admixed population
4 Application of ancGWAS: Identification of enriched pathways for sporadic postmenopausal breast cancer
    4.1 Materials and Methods
    4.2 Results and discussion
5 Conclusion
7 Bibliography
A Supplementary materials

