
Reducing the complexity of OMICS data analysis

220 Pages·2017·9.9 MB·English


Julius-Maximilians-Universität Würzburg

Reducing the complexity of OMICS data analysis

Dissertation for the award of the doctoral degree in natural sciences of the Julius-Maximilians-Universität Würzburg

Submitted by Beat Wolf from Fribourg, CH, 2017

Submitted on: 5 April 2017 to the Faculty of Mathematics and Computer Science
1st reviewer: Prof. Dr. Thomas Dandekar
2nd reviewer: Prof. Dr. Pierre Kuonen
Date of the oral examination: 31 August 2017

Summary

The field of genetics faces many challenges and opportunities in both research and diagnostics due to the rise of next generation sequencing (NGS), a technology that allows DNA to be sequenced increasingly quickly and cheaply. NGS is used to analyze not only DNA but also RNA, a very similar molecule also present in the cell, and in both cases it produces large amounts of data. This amount of data raises both infrastructure and usability problems: powerful computing infrastructures are required, and the data analysis involves many manual steps that are complicated to execute. Both problems limit the use of NGS in the clinic and in research by creating a bottleneck, both computationally and in terms of manpower, because geneticists lack the computing skills required for many analyses.

Over the course of this thesis we investigated how computer science can help to improve this situation by reducing the complexity of this type of analysis. We looked at how to make the analysis more accessible in order to increase the number of people who can perform OMICS data analysis (OMICS groups various genomics data sources). To approach this problem, we developed a graphical NGS data analysis pipeline in close collaboration with the Human Genetics Department at the University of Würzburg. The pipeline is aimed at a diagnostics environment while remaining useful in research.
The pipeline has been used in various research papers, including works with direct author participation in genomics, transcriptomics and epigenomics. To further validate the graphical pipeline, a user survey was carried out, which confirmed that it lowers the complexity of OMICS data analysis.

We also studied how the data analysis can be improved in terms of computing infrastructure by improving the performance of certain analysis steps. We did this both through speed improvements on a single computer (notably, variant calling became up to 18 times faster) and through distributed computing that makes better use of an existing infrastructure. The improvements were integrated into the previously described graphical pipeline, which itself was also designed for low resource usage.

As a major contribution, and to help with the future development of parallel and distributed applications in genetics or elsewhere, we also looked at how to make such applications easier to develop. Based on the parallel object programming model (POP), we created a Java language extension called POP-Java, which allows for easy and transparent distribution of objects. Through this development, we brought the POP model to the cloud and to Hadoop clusters, and we present a new collaborative distributed computing model called FriendComputing.

The advances made in the different domains of this thesis have been published in various works specified in this document.

Zusammenfassung (German summary)

The field of genetics faces many challenges in both research and diagnostics because of next generation sequencing (NGS), a technology that sequences DNA ever faster and more cheaply. NGS is used to analyze not only DNA but also RNA, a molecule very similar to DNA, and in both cases large amounts of data are produced. The large amount of data causes infrastructure and usability problems, because powerful computing infrastructures are required and the data analysis involves many manual steps that are complicated to carry out. Both problems limit the use of NGS in the clinic and in research, because there is a bottleneck in both computing power and personnel: for many analyses, geneticists lack the required computing skills.

In this work we investigated how computer science can help to improve this situation by reducing the complexity of this type of analysis. We looked at how the analysis can be made more accessible in order to increase the number of people who can perform OMICS data analyses (OMICS groups various genetic data sources). To address this question, a graphical NGS data analysis pipeline was created in close collaboration with the Institute of Human Genetics of the University of Würzburg. The graphical pipeline was developed for diagnostics without losing sight of research. The pipeline has therefore been used in various research areas, including publications with direct author participation in genomics, transcriptomics and epigenomics. The pipeline was also validated through a user survey, which confirms that our graphical pipeline reduces the complexity of OMICS data analysis.

We also investigated how the performance of the data analysis can be improved so that the necessary infrastructure becomes more accessible. This was approached both by optimizing the available methods (for example, variant analysis became up to 18 times faster) and with distributed computing, to make better use of an existing infrastructure. The improvements were integrated into the previously described graphical pipeline, with low resource usage being a general focus.

To support the future development of parallel and distributed applications, in genetics or elsewhere, we looked at how such applications could be made easier to develop. This led to an important computer science result: based on the model of "parallel object programming" (POP), we developed an extension of the Java language called POP-Java, which enables simple and transparent distribution of objects. Through this development we brought the POP model to the cloud and to Hadoop clusters, and we present a new model for collaborative distributed computing called FriendComputing.

The various published parts of this dissertation are specifically listed and discussed.

Acknowledgment

For this thesis to happen and be completed, I have to thank numerous people and institutions.

First and foremost I would like to thank Prof. Pierre Kuonen for not only giving me the opportunity to write this dissertation, but also encouraging me to do so and giving me the best environment possible. I would also like to thank Prof. Thomas Dandekar for supervising my thesis and giving me precious advice and guidance in the field of bioinformatics. A big thanks also goes to Dr. David Atlan, who gave me the opportunity to carry out this thesis with a very practice-oriented approach, making it possible for much of my work to be used in real laboratories across Europe. Having my work used on a daily basis in a diagnostics environment was a major motivational force throughout the thesis.

I would also like to thank Prof. Clemens Müller Reible and Prof. Simone Rost of the Institute of Human Genetics in Würzburg for following my thesis with so much interest, for giving me advice and, most importantly, for their trust in my work, introducing it in their laboratory to be used for regular data analysis.
I would like to thank the co-authors with whom I had the opportunity to write various papers, through which I could learn a lot and become familiar with many topics. Without them, much of my work would be theoretical with no practical implications.

I especially want to thank my girlfriend Gaëlle Kolly for supporting me throughout the thesis. A special thanks also goes to my parents, who made it possible for me to follow a research career.

Last but not least I would also like to thank the University of Würzburg and the University of Applied Sciences and Arts Western Switzerland for accepting me for my PhD. I'm grateful for having had the opportunity to do my PhD through a collaboration of two universities, one more focused on the academic side and the other on the practical side.

Contents

1. Introduction
    1.1. Motivation and scope
    1.2. Contributions
    1.3. Thesis outline

I. Foundations

2. Genetics
    2.1. Introduction
        2.1.1. Genetic code
        2.1.2. Next generation sequencing
    2.2. Summary
3. OMICs data analysis
    3.1. Genomics
        3.1.1. State of the art
    3.2. Transcriptomics
        3.2.1. State of the art
    3.3. Epigenomics
        3.3.1. State of the art
    3.4. File-formats
    3.5. Summary
4. Diagnostics
    4.1. Introduction
    4.2. Genetic disorders
    4.3. Software requirements
    4.4. Summary
5. Parallel & distributed computing
    5.1. Introduction
    5.2. History
    5.3. State of the art
        5.3.1. CPU
        5.3.2. GPGPU
        5.3.3. Distributed computing
    5.4. Summary

II. Methods

6. Graphical pipeline
    6.1. Introduction
    6.2. Prototype
    6.3. Methods
    6.4. User interface
    6.5. Project management
    6.6. Annotations
    6.7. Data analysis
        6.7.1. Quality control
        6.7.2. Sequence alignment
        6.7.3. Coverage analysis
        6.7.4. Variant analysis
        6.7.5. Variant comparator
        6.7.6. Copy number variations
        6.7.7. Distribution
    6.8. Discussion
7. Data analysis
    7.1. Sequence alignment
        7.1.1. Introduction
        7.1.2. State of the art
        7.1.3. Methods
        7.1.4. Results
        7.1.5. Summary
    7.2. Meta-Alignment
        7.2.1. Introduction
        7.2.2. Method
        7.2.3. Results
        7.2.4. Summary
    7.3. Variant calling
        7.3.1. Introduction
        7.3.2. State of the art
        7.3.3. Methods
        7.3.4. Results
        7.3.5. Summary
    7.4. RNA-seq
        7.4.1. Introduction
        7.4.2. Methods
        7.4.3. Summary
    7.5. Epigenetics
        7.5.1. Introduction
        7.5.2. State of the art
        7.5.3. Methods
        7.5.4. Summary
    7.6. Genome browser
        7.6.1. Introduction
        7.6.2. State of the art
        7.6.3. Methods
        7.6.4. Results
        7.6.5. Summary
    7.7. Discussion
8. POP-Java
    8.1. Introduction
    8.2. State of the art
        8.2.1. Parallel computing in Java
        8.2.2. Distributed computing
        8.2.3. Language extensions
    8.3. POP Model
    8.4. POP Java prototype
        8.4.1. Limitations
    8.5. Implementation
        8.5.1. Annotations
        8.5.2. Additional changes
    8.6. Usage examples
        8.6.1. Distributed matrix multiplications
        8.6.2. Distributed Mandelbrot
    8.7. Extensions
        8.7.1. Cloud integration
        8.7.2. Hadoop cluster integration
        8.7.3. TrustedFriendComputing
    8.8. Usage
        8.8.1. Distributed sequence alignment
    8.9. Limitations
    8.10. Discussion

III. Applications

9. Graphical pipeline applications
    9.1. Author participation
        9.1.1. Deep intronic variants in the factor VIII gene
        9.1.2. Myofibrillar myopathies
        9.1.3. Transcriptomics
        9.1.4. Epigenetics
    9.2. Indirect participation
    9.3. User survey
        9.3.1. Introduction
        9.3.2. Results
        9.3.3. Summary
    9.4. Discussion
10. Conclusion & Future works
11. Publications
    11.1. Publications
        11.1.1. Journal papers
        11.1.2. Conference proceedings
        11.1.3. Misc.
    11.2. Posters

Appendices

A. Artistic data visualization
B. Meta-alignment
C. Custom file formats
D. Polling
E. Downloads
F. Declaration of Authorship

1. Introduction

1.1. Motivation and scope

The field of genetics received a huge boost with the development of next generation sequencing (NGS) techniques, which allow sequencing DNA (deoxyribonucleic acid) at speeds never seen before. A distinct research field, called bioinformatics, is dedicated to supporting geneticists in coping with the analysis of this data. While bioinformatics has been a part of biology, and of genetics in particular, for a long time, it became an integral and indispensable part once the new sequencing techniques started to create increasingly large amounts of data.
Bioinformatics combines multiple fields, such as computer science, mathematics and statistics, to solve the issues faced in biology. In the case of NGS data analysis, bioinformatics tools make it possible to automate many analysis steps and to extract information from the data that would be impossible to obtain manually. This can happen in various domains using NGS data, such as genomics, transcriptomics and epigenomics, all grouped under the name of OMICS (alongside other -omics).

While many tools exist for the various types of OMICS data analysis, their usage remains complicated and is thus restricted to resourceful institutions. More often than not, the analysis of OMICS data requires the collaboration of bioinformaticians and geneticists, as each depends on the skill set of the other to analyze the data. This requirement for collaboration restricts the number of people who can work with NGS data, which slows down the scientific progress of the field and keeps the costs high. Developing tools that allow geneticists to work independently of bioinformaticians is becoming increasingly important, as this dependency is a serious bottleneck in today's data analysis.

While the acquisition of data continues to get cheaper (see Figure 1.1) and faster, the interpretation of the data has not yet followed at the same pace. This increase in the time and cost of analyzing the data comes not only from the complexity of the analysis itself, but also from the fact that the amount of data created increases faster than computing capacities are predicted to improve. In 1965, Gordon Moore predicted a regular doubling of the number of components per integrated circuit, popularly quoted as a doubling of the computing power of a single computer every 18 months. This is known as Moore's Law and has been remarkably accurate (although arguably a self-fulfilling prophecy) ever since. The evolution of Moore's Law and the reduction in sequencing cost are both shown in Figure 1.1, highlighting the problem of data analysis faced today.
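
As a rough illustration of the gap described above, the following Python sketch compares steady exponential growth under two doubling periods. The 18-month (1.5-year) compute doubling period is the figure quoted in the text; the function names and the data doubling period used in the example call are hypothetical placeholders chosen for easy arithmetic, not values taken from the thesis or from Figure 1.1.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return the multiplicative growth after `years` of steady doubling."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (years / doubling_period_years)


def analysis_gap(years: float, data_doubling_years: float,
                 compute_doubling_years: float = 1.5) -> float:
    """Ratio of data growth to compute growth after `years`.

    A value above 1 means the data volume outgrew the computing power.
    `data_doubling_years` is a hypothetical input, not a measured value.
    """
    return (growth_factor(years, data_doubling_years)
            / growth_factor(years, compute_doubling_years))


if __name__ == "__main__":
    # Compute power after 3 years at one doubling per 18 months: 2^2.
    print(growth_factor(3, 1.5))   # 4.0
    # Hypothetical case: data doubles every 9 months, so 2^4 / 2^2.
    print(analysis_gap(3, 0.75))   # 4.0
```

Under these assumed numbers, a single machine would need four times longer to process three years' worth of data growth, which is the kind of widening gap that motivates the faster methods and distributed computing discussed next.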
To solve the computing power challenges associated with this evolution, a big focus in research is put on speeding up existing methods. When this is not possible, parallel and distributed computing is increasingly used to handle the large amounts of data produced. With the arrival of cloud computing, this type of approach has been democratized, as it is no longer limited to big organizations with access to grid environments or clusters. But the setup and usage of these tools is often complicated and requires a considerable amount of expertise in the domain. This again limits the number of people able to analyze genetic data and creates delays for the analysis.

The problem of accessibility is accentuated as genetic data analysis becomes increasingly common and affordable. Today, even private individuals can get their genetic data
