Making sense of the human genome using machine learning

Fredrik Haaland

Master Thesis, Spring 2013
April 30, 2013

Abstract

Machine learning enables a computer to learn a relationship between two presumably related types of information. The learned relationship can then be used to predict missing information of one type from the other. Over the last decades, it has become cheaper to collect biological information, which has resulted in increasingly large amounts of data. Biological information such as DNA is currently analyzed by a variety of tools. Although machine learning has already been used in various projects, a flexible tool for analyzing generic biological challenges has not yet been made.

The challenges of representing biological data in a generic way that permits machine learning are discussed here. A flexible machine learning application is presented for working with currently available biological DNA data. It also targets biological challenges in an abstract manner, so that it may be useful for both current and future challenges. The application has been implemented in The Genomic HyperBrowser and is publicly available.

A use case inspired by a biological challenge demonstrates how the application is used. A machine-learned model is analyzed and used for making predictions. The results are discussed, and further actions to improve the model are proposed. The application offers a new way for researchers to investigate and analyze biological data using machine learning.

Contents

Contents
List of Figures
List of Tables

I Introduction

1 Background
  1.1 Machine Learning
    1.1.1 Supervised Learning
    1.1.2 Binary- and multi-class classification
    1.1.3 Unsupervised Learning
    1.1.4 Data mining
    1.1.5 Concept
    1.1.6 Features
    1.1.7 The learnable
    1.1.8 Data partitioning
    1.1.9 Learning curve
  1.2 Bioinformatics
    1.2.1 DNA and the human genome
    1.2.2 DNA sequencing
    1.2.3 Reference genomes
    1.2.4 Genomic annotation tracks
    1.2.5 Data formats
    1.2.6 Analyzing genomic data
  1.3 Challenges and solutions in related research
    1.3.1 Imbalanced data
    1.3.2 Over and under-sampling
    1.3.3 Over and under-fitting
    1.3.4 Synthetic sampling and data generation
    1.3.5 Cost-Sensitive Learning

II Work

2 Methods
  2.1 Representing genomic data for use with machine learning algorithms
    2.1.1 Abstracting and grouping genomic data challenges
    2.1.2 Strategy for representing samples and features uniformly
    2.1.3 Representing samples
    2.1.4 The track structure
    2.1.5 The track elements
  2.2 Creating measures
    2.2.1 Capturing properties
    2.2.2 Using properties to create measures
    2.2.3 Measurement utilization
    2.2.4 Transformations
    2.2.5 Transformation utilization
  2.3 Definitions

3 Implementation
  3.1 Machine learning components
    3.1.1 The machine learning track
    3.1.2 The machine learning track state
    3.1.3 The machine learning measure
    3.1.4 The machine learning transformation
    3.1.5 Situation dependence
    3.1.6 Combining transformations
    3.1.7 Feature measures
    3.1.8 Transformations
    3.1.9 Response measures
  3.2 Genomic machine learning data representation
    3.2.1 Example translation
    3.2.2 Post processing translation data
    3.2.3 Feature similarity
    3.2.4 Track combinations
  3.3 Adapting machine learning algorithms
    3.3.1 Decision Tree
    3.3.2 Multiple linear regression
    3.3.3 Logistic regression
    3.3.4 Artificial Neural Network
    3.3.5 K-nearest Neighbors
    3.3.6 Anomaly Detection
    3.3.7 Support Vector Machine
    3.3.8 Comparing probabilities
  3.4 Application design
    3.4.1 Design goals
    3.4.2 Framework
    3.4.3 Tools
    3.4.4 Usage and interaction
    3.4.5 Learning tool pipeline
  3.5 The machine learning language
    3.5.1 Structure and syntax
    3.5.2 Macros

4 Results
  4.1 Application
  4.2 Use case
    4.2.1 Data access
    4.2.2 Walk-through
    4.2.3 Improving the model

III Discussion

5 Challenges
  5.1 Translation and representation
  5.2 Implementation
    5.2.1 Overdesign
  5.3 Selection of programming language
    5.3.1 Working with large datasets
    5.3.2 The struggle of legacy code
    5.3.3 Testing

6 Conclusion

7 Future work

A Application implementation
  A.1 API
    A.1.1 MLTrackState
    A.1.2 MLMeasure
    A.1.3 MLFeature
    A.1.4 MLResponse
    A.1.5 MLTransformation
    A.1.6 MLAlgorithm
  A.2 Implementations
    A.2.1 Algorithms
    A.2.2 Measures
    A.2.3 Transformations