Data Analytics in Bioinformatics: A Machine Learning Perspective PDF

521 Pages·2021·16.644 MB·English

Preview Data Analytics in Bioinformatics: A Machine Learning Perspective

Data Analytics in Bioinformatics

Scrivener Publishing
100 Cummings Center, Suite 541J
Beverly, MA 01915-6106

Publishers at Scrivener
Martin Scrivener
Phillip Carmical

Data Analytics in Bioinformatics
A Machine Learning Perspective

Edited by
Rabinarayan Satpathy
Tanupriya Choudhury
Suneeta Satpathy
Sachi Nandan Mohanty
and
Xiaobo Zhang

This edition first published 2021 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2021 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data
ISBN 978-1-119-78553-8
Cover image: Pixabay.Com
Cover design by Russell Richardson
Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines
Printed in the USA
10 9 8 7 6 5 4 3 2 1

Contents

Preface
Acknowledgement

Part 1 The Commencement of Machine Learning Solicitation to Bioinformatics

1 Introduction to Supervised Learning
Rajat Verma, Vishal Nagar and Satyasundara Mahapatra
1.1 Introduction
1.2 Learning Process & its Methodologies
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.2.3 Reinforcement Learning
1.3 Classification and its Types
1.4 Regression
1.4.1 Logistic Regression
1.4.2 Difference between Linear & Logistic Regression
1.5 Random Forest
1.6 K-Nearest Neighbor
1.7 Decision Trees
1.8 Support Vector Machines
1.9 Neural Networks
1.10 Comparison of Numerical Interpretation
1.11 Conclusion & Future Scope
References

2 Introduction to Unsupervised Learning in Bioinformatics
Nancy Anurag Parasa, Jaya Vinay Namgiri, Sachi Nandan Mohanty and Jatindra Kumar Dash
2.1 Introduction
2.2 Clustering in Unsupervised Learning
2.3 Clustering in Bioinformatics—Genetic Data
2.3.1 Microarray Analysis
2.3.2 Clustering Algorithms
2.3.3 Partition Algorithms
2.3.3.1 k-Means Clustering
2.3.3.2 Cluster Center Initialization Algorithm (CCIA)
2.3.3.3 Intelligent Kernel k-Mean (IKKM)
2.3.3.4 Clustering Large Applications (CLARA)
2.3.4 Hierarchical Clustering Algorithms
2.3.4.1 AGNES (Agglomerative Nesting)
2.3.4.2 DIANA (Divisive Analysis)
2.3.4.3 CURE (Clustering Using Representatives)
2.3.4.4 CHAMELEON
2.3.4.5 BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
2.3.5 Density-Based Approach
2.3.5.1 DBSCAN
2.3.6 Model-Based Approach
2.3.6.1 SOM (Self-Organizing Maps)
2.3.7 Grid-Based Clustering
2.3.7.1 STING (Statistical Information Grid-Based Algorithm)
2.3.8 Soft Clustering
2.3.8.1 FCM (Fuzzy Class Membership)
2.4 Conclusion
References

3 A Critical Review on the Application of Artificial Neural Network in Bioinformatics
Vrs Jhalia and Tripti Swarnkar
3.1 Introduction
3.1.1 Different Areas of Application of Bioinformatics
3.1.2 Bioinformatics in Real World
3.1.3 Issues with Bioinformatics
3.1.3.1 Issues Related to Structure
3.1.3.2 Sequence Analysis
3.2 Biological Datasets
3.3 Building Computational Model
3.3.1 Data Pre-Processing and its Necessity
3.3.2 Biological Data Classification
3.3.3 ML in Bioinformatics
3.3.4 Introduction to ANN
3.3.5 Application of ANN in Bioinformatics
3.3.6 Broadly Used Supervised Machine Learning Techniques
3.4 Literature Review
3.4.1 Comparative Analysis of ANN With Broadly Used Traditional ML Algorithms
3.5 Critical Analysis
3.6 Conclusion
References

Part 2 Machine Learning and Genomic Technology, Feature Selection and Dimensionality Reduction

4 Dimensionality Reduction Techniques: Principles, Benefits, and Limitations
Hemanta Kumar Palo, Santanu Sahoo and Asit Kumar Subudhi
4.1 Introduction
4.2 The Benefits and Limitations of Dimension Reduction Methods
4.3 Components of Dimension Reduction
4.3.1 Feature Selection
4.3.2 Feature Reduction
4.4 Methods of Dimensionality Reduction
4.4.1 Principal Component Analysis (PCA)
4.4.2 Missing Values Ratio (MVR)
4.4.3 Linear Discriminant Analysis (LDA)
4.4.4 Backward Feature Elimination (BFE)
4.4.5 Forward Feature Construction (FFC)
4.4.6 Independent Component Analysis (ICA)
4.4.7 Low Variance Filter (LVF)
4.4.8 High Correlation Filter
4.4.9 Random Forests (RF)/Ensemble Trees
4.4.10 t-Distributed Stochastic Neighbor Embedding (t-SNE)
4.4.11 Autoencoder
4.4.12 Factor Analysis (FA)
4.4.13 Uniform Manifold Approximation and Projection (UMAP)
4.4.14 Information Gain (IG)
4.4.15 Vector Quantization (VQ)
4.5 Conclusion
References

5 Plant Disease Detection Using Machine Learning Tools With an Overview on Dimensionality Reduction
Saurav Roy, Ratula Ray, Satya Ranjan Dash and Mrunmay Kumar Giri
5.1 Introduction
5.2 Flowchart
5.3 Machine Learning (ML) in Rapid Stress Phenotyping
5.4 Dimensionality Reduction
5.4.1 Feature Extraction
5.4.1.1 PCA (Principal Component Analysis)
5.4.1.2 LDA (Linear Discriminant Analysis)
5.4.1.3 SIFT (Scale Invariant Feature Transform)
5.4.1.4 SURF (Speeded Up Robust Features)
5.4.1.5 ORB (Oriented FAST and Rotated BRIEF)
5.5 Literature Survey
5.6 Types of Plant Stress
5.6.1 Biotic Stress
5.6.1.1 Fungal Pathogen
5.6.1.2 Bacterial Pathogen
5.7 Implementation I: Numerical Dataset
5.7.1 Dataset Description
5.7.2 Results
5.7.3 Discussion
5.8 Implementation II: Image Dataset
5.8.1 Dataset Description
5.8.2 Method Used
5.8.3 Results
5.8.3.1 Results of ORB Feature Extraction and Brute Force Matching
5.8.3.2 Color Histogram Comparison: Using Correlation Method
5.8.4 Discussions
5.9 Conclusion
References

6 Gene Selection Using Integrative Analysis of Multi-Level Omics Data: A Systematic Review
S. Mahapatra and T. Swarnkar
6.1 Introduction
6.2 Approaches for Gene Selection
6.3 Multi-Level Omics Data Integration
6.3.1 Horizontal Integration
6.3.2 Vertical Integration
6.4 Machine Learning Approaches for Multi-Level Data Integration
6.4.1 Unsupervised Integration of Omics Data
6.4.2 Supervised Integration of Omics Data
6.5 Critical Observation
6.6 Conclusion
References

7 Random Forest Algorithm in Imbalance Genomics Classification
Sudhansu Shekhar Patra, Om Praksah Jena, Gaurav Kumar, Sreyashi Pramanik, Chinmaya Misra and Kamakhya Narain Singh
7.1 Introduction
7.2 Methodological Issues
7.2.1 Decision Tree (DT) Classifier
7.2.2 Ensemble Techniques
7.2.3 Mathematical Formulation of Ensemble Technique
7.2.4 Bagging
7.2.5 Bagging Pseudocode
7.2.6 Random Forest
7.3 Biological Terminologies
7.3.1 DNA
7.3.2 Genomics
7.3.3 Proteins
7.4 Proposed Model
7.4.1 Balancing the Data
7.4.2 Ensembling of Trees
7.5 Experimental Analysis
7.6 Current and Future Scope of ML in Genomics
7.6.1 Gene Sequencing
7.6.2 Services to Consumer
7.6.3 Gene Editing
7.6.4 Pharmacy Genomics
7.6.5 Newborn Genetic Screening
7.7 Conclusion
References
