
Bioinformatics Database Systems PDF

290 Pages·2016·6.416 MB·English

Preview Bioinformatics Database Systems

Bioinformatics Database Systems

Kevin Byron
New Jersey Institute of Technology, Newark, USA

Katherine G. Herbert
Montclair State University, New Jersey, USA

Jason T. L. Wang
New Jersey Institute of Technology, Newark, USA

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20161122

International Standard Book Number-13: 978-1-4398-1247-1 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Byron, Kevin, 1954- , author. | Herbert, Katherine G., author. | Wang, Jason T. L., author.
Title: Bioinformatics database systems / Kevin Byron, Katherine G. Herbert, Jason T. L. Wang.
Description: Boca Raton : Taylor & Francis, 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016027400 | ISBN 9781439812471 (hardback : alk. paper)
Subjects: LCSH: Bioinformatics. | Biology--Databases.
Classification: LCC QH324.2 .B97 2017 | DDC 570.285--dc23
LC record available at https://lccn.loc.gov/2016027400

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Contents

LIST OF FIGURES
LIST OF TABLES
PREFACE
ACKNOWLEDGMENTS

Chapter 1 Overview of Bioinformatics Databases
1.1 INTRODUCTION
1.2 SEQUENCE DATABASES
1.3 PHYLOGENETIC DATABASES
1.4 STRUCTURE AND PATHWAY DATABASES
1.5 MICROARRAY AND BOUTIQUE DATABASES

Chapter 2 Biological Data Cleaning
2.1 INTRODUCTION
2.2 GENERAL DATA CLEANING
2.3 CASE STUDY IN BIOLOGICAL DATA CLEANING

Chapter 3 Biological Data Integration
3.1 INTRODUCTION
3.2 GENERAL DATA INTEGRATION
3.3 TOPICS IN BIOLOGICAL DATA INTEGRATION

Chapter 4 Biological Data Searching
4.1 INTRODUCTION
4.2 BIOLOGICAL DATA SEARCHING USING BLAST
4.3 BIOLOGICAL DATA SEARCHING USING THE UCSC GENOME BROWSER
4.4 CASE STUDY IN PHYLOGENETIC TREE DATABASE SEARCH
4.5 CASE STUDY IN RNA PSEUDOKNOT DATABASE SEARCH

Chapter 5 Biological Data Mining
5.1 INTRODUCTION
5.2 GENERAL DATA MINING
5.3 BIOLOGICAL DATA MINING
5.4 CASE STUDY IN BIOLOGICAL MOTIF DISCOVERY
5.5 CASE STUDY IN BIOLOGICAL DATA MINING

Chapter 6 Biological Network Inference
6.1 INTRODUCTION
6.2 GENE REGULATORY NETWORK INFERENCE

Chapter 7 Cloud-Based Biological Data Processing
7.1 INTRODUCTION
7.2 DATA PROCESSING IN THE CLOUD

Bibliography
Index

List of Figures

2.1 A framework for biological data cleaning.
3.1 Scientific data integration.
3.2 Data integration in a single repository.
3.3 Data integration in multiple repositories.
4.1 UpDistance.
4.2 Up operations and down operations.
4.3 Arbitrary five-node tree and up and down matrices.
4.4 Arbitrary tree with eight nodes.
4.5 Arbitrary tree with seven nodes.
4.6 Example of a query tree P and a data tree D.
4.7 Further examples of a query tree and a data tree.
4.8 PDB crystal structures 1E8O and 2B57.
4.9 Histograms for the base mismatch ratios of the alignments produced by RKalign, CARNA, RNA Sampler, DAFS, R3D Align and RASS.
4.10 Boxplots for the base mismatch ratios of the alignments produced by RKalign, CARNA, RNA Sampler, DAFS, R3D Align and RASS.
4.11 Example showing base mismatches in an alignment produced by DAFS, R3D Align, and RKalign.
4.12 Histograms for the base mismatch ratios of the alignments produced by RKalign, CARNA, RNA Sampler and DAFS.
4.13 Boxplots for the base mismatch ratios of the alignments produced by RKalign, CARNA, RNA Sampler and DAFS.
4.14 Comparison of the stem mismatch ratios yielded by RKalign and CARNA.
4.15 Multiple alignment: PDB crystal structures 1E8O, 1XP7, 2F4X, 2OOM, 2D19 and 1BAU.
5.1 Three RNA topological families, A, B, and C, of three-way RNA junctions containing coaxial helical stacking.
5.2 Secondary structure plot produced by VARNA of nucleotides 5 through 49 of chain A from PDB molecule 3E5C.
5.3 Three-dimensional plot produced by Jmol of nucleotides 5 through 49 of chain A from PDB molecule 3E5C.
5.4 Hypothetical three-way RNA junction to illustrate features used by random forests classifier.
5.5 Stockholm format multiple sequence alignment of ncRNA molecules from six samples recorded in the PDB with identifiers 2GDI, 2CKY, 2AVY, 1S72, 2AW4 and 2J01.
5.6 CSminer's prediction result on the genome of D. radiodurans.
5.7 Junction Explorer Screenshot 1: home page.
5.8 Junction Explorer Screenshot 2: sample input in dot-bracket format.
5.9 Junction Explorer Screenshot 3: details of junction type, junction location, junction loop regions and coaxial helical stacking prediction.
5.10 Junction Explorer Screenshot 4: visualizations of predicted topology (family).
6.1 Hypothetical five-gene network.
6.2 Another hypothetical five-gene network.
6.3 Gold standard-directed graph for DREAM4 10-gene network 1.
6.4 Top 10 GRN edges as predicted by ARACNE algorithm.
6.5 Top 10 GRN edges as predicted by CLR algorithm.
6.6 Top 10 GRN edges as predicted by GENIE3 algorithm.
6.7 Top 10 GRN edges as predicted by MRNET algorithm.
6.8 GRN edges as predicted by Inferelator algorithm.
6.9 GRN edges as predicted by Jump3 algorithm.
6.10 GRN edges as predicted by NetworkBMA algorithm.
6.11 GRN edges as predicted by TimeDelay-ARACNE algorithm using DREAM4 10-gene dataset 1.
6.12 Diagram showing true regulatory relationships between E. coli transcription factor FNR and several target genes.
6.13 Algorithm of semi-supervised link prediction methods.
6.14 Performance comparison of the transductive and inductive learning approaches based on the E. coli transcription factor ARCA and dataset GSE21869 with the SVM algorithm and the RF algorithm.
6.15 Performance comparison of the transductive and inductive learning approaches based on the S. cerevisiae transcription factor REB1 and dataset GSE12222 with the SVM algorithm and the RF algorithm.
6.16 Performance comparison of the SVM and RF algorithms with the transductive learning approach on five gene expression datasets GSE10158, GSE12411, GSE33147, GSE21869 and GSE17505, and two transcription factors of E. coli.
6.17 Performance comparison of the SVM and RF algorithms with the transductive learning approach on five gene expression datasets and two other transcription factors of E. coli.
6.18 Performance comparison of the SVM and RF algorithms with the transductive learning approach on five gene expression datasets GSE30052, GSE12221, GSE12222, GSE40817 and GSE8799, and two transcription factors of S. cerevisiae.
6.19 Performance comparison of the SVM and RF algorithms with the transductive learning approach on five gene expression datasets and two other transcription factors of S. cerevisiae.
7.1 MapReduce example 1.
7.2 MapReduce example 2.
