
Scalable Big Data Analytics for Protein Bioinformatics: Efficient Computational Solutions for Protein Structures PDF

331 Pages·2018·11.69 MB·English

Preview Scalable Big Data Analytics for Protein Bioinformatics: Efficient Computational Solutions for Protein Structures

Computational Biology

Dariusz Mrozek

Scalable Big Data Analytics for Protein Bioinformatics
Efficient Computational Solutions for Protein Structures

Computational Biology, Volume 28

Editors-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany

Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA

Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MO, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA

The Computational Biology series publishes the very latest, high-quality research devoted to specific issues in computer-assisted analysis of biological data. The main emphasis is on current scientific developments and innovative techniques in computational biology (bioinformatics), bringing to light methods from mathematics, statistics and computer science that directly address biological problems currently
under investigation.

The series offers publications that present the state-of-the-art regarding the problems in question; show computational biology/bioinformatics methods at work; and finally discuss anticipated demands regarding developments in future methodology. Titles can range from focused monographs, to undergraduate and graduate textbooks, and professional text/reference works.

More information about this series at http://www.springer.com/series/5769

Dariusz Mrozek
Scalable Big Data Analytics for Protein Bioinformatics
Efficient Computational Solutions for Protein Structures

Dariusz Mrozek
Silesian University of Technology
Gliwice, Poland

ISSN 1568-2684
Computational Biology
ISBN 978-3-319-98838-2
ISBN 978-3-319-98839-9 (eBook)
https://doi.org/10.1007/978-3-319-98839-9
Library of Congress Control Number: 2018950968

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

For my always smiling and beloved wife Bożena, and my lively and infinitely active sons Paweł and Henryk, with all my love. To my parents, thank you for your support, concern and faith in me.

Foreword

High-performance computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business. Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

This timely book by Dariusz Mrozek gives you a quick introduction to the area of proteins and their structures, protein structure similarity searching carried out at main representation levels, and various techniques that can be used to accelerate similarity searches using high-performance Cloud computing and Big Data concepts. It presents introductory concepts of formal model of 3D protein structures for functional genomics, comparative bioinformatics, and molecular modeling and the use of multi-threading for the efficient approximate searching on protein secondary structures. In addition, there is a material on finding 3D protein structure similarities accelerated with high-performance computing techniques.

The book is required reading to help in understanding for anyone working with area of data analytics for structural bioinformatics and the use of high-performance computing. It explores area of proteins and their structures in depth and provides practical approaches to many problems that may be encountered. It is especially useful to applications developers, scientists, students, and teachers.
I have enjoyed and learned from this book and feel confident that you will as well.

Jack Dongarra
University of Tennessee
Knoxville, USA
June 2018

Preface

International efforts focused on understanding living organisms at various levels of molecular organization, including genomic, proteomic, metabolomic, and cell signaling levels, lead to huge proliferation of biological data collected in dedicated, and frequently, public repositories. The amount of data deposited in these repositories increases every year, and cumulated volume has grown to sizes that are difficult to handle with traditional analysis tools. This growth of biological data is stimulated by various international projects, such as 1000 Genomes. The project aims at sequencing genomes of at least one thousand anonymous participants from a number of different ethnic groups in order to establish a detailed catalog of human genetic variations. As a result, it generates terabytes of genetic data. Apart from international initiatives and projects, like the 1000 Genomes, the proliferation of biological data is further accelerated by newly developed technologies for DNA sequencing, like next-generation sequencing (NGS) methods. These methods are getting faster and less expensive every year. They produce huge amounts of genetic data that require fast analysis in various phases of molecular profiling, medical diagnostics, and treatment of patients that suffer from serious diseases.

Indeed, for the last three decades we have been witnesses of the continuous exponential growth of biological data in repositories, such as GenBank, Sequence Read Archive (SRA), RefSeq, Protein Data Bank, UniProt/SwissProt. The specificity of the data has inspired the scientific community to develop many algorithms that can be used to analyze the data and draw useful conclusions. A huge volume of the biological data caused that many of the existing algorithms became inefficient due to their computational complexity.
Fortunately, the rapid development of computer science in the last decade has brought many technological innovations that can be also used in the field of bioinformatics and life sciences. The algorithms demonstrating a significant utility value, which have recently been perceived as too time-consuming, can now be efficiently used by applying the latest technological achievements, like Hadoop and Spark for analyzing Big Data sets, multi-threading, graphics processing units (GPUs), or cloud computing.

Scope of the Book

The book focuses on proteins and their structures. It presents various scalable solutions for protein structure similarity searching carried out at main representation levels and for prediction of 3D structures of proteins. It specifically focuses on various techniques that can be used to accelerate similarity searches and protein structure modeling processes. But, why proteins? somebody can ask. I could answer the question by following Arthur M. Lesk in his book entitled Introduction to Protein Science. Architecture, Function, and Genomics. Because proteins are where the action is. Understanding proteins, their structures, functions, mutual interactions, activity in cellular reactions, interactions with drugs, and expression in body cells is a key to efficient medical diagnosis, drug production, and treatment of patients.

I have been fascinated with proteins and their structures for fifteen years. I have fallen in love with the beauty of protein structures at first sight inspired by the research conducted by R.I.P. Lech Znamirowski from the Silesian University of Technology, Gliwice, Poland. I decided to continue his research on proteins and development of new efficient tools for their analysis and exploration.

I believe this book will be interesting for scientists, researchers, and software developers working in the field of structural bioinformatics and biomedical databases. I hope that readers of the book will find it interesting and helpful in their everyday work.
Chapter Overview

The content of the book is divided into four parts. The first part provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of technologies used in the solutions presented in this book.

• Chapter 1: Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling
This chapter shows how proteins can be represented in computational processes performed in scientific fields, such as functional genomics, comparative bioinformatics, and molecular modeling. The chapter provides a general definition of protein spatial structure that is then referenced to four representation levels of protein structure: primary, secondary, tertiary, and quaternary structures.

• Chapter 2: Technological Roadmap
This chapter provides a technological roadmap for solutions presented in this book. It covers a brief introduction to the concept of Cloud computing, cloud service, and deployment models. It also defines the Big Data challenge and presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPUs) and CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.

The second part of the book is focused on Cloud services that are utilized in the development of scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction.

• Chapter 3: Azure Cloud Services
Microsoft Azure Cloud Services support development of scalable and reliable cloud applications that can be used to scientific computing. This chapter provides a brief introduction to Microsoft Azure cloud platform and its services. It focuses on Azure Cloud Services that allow building a cloud-based application with the use of Web roles and Worker roles.
Finally, it shows a sample application that can be quickly developed on the basis of these two types of roles and the role of queues in passing messages between components of the built system.

• Chapter 4: Scaling 3D Protein Structure Similarity Searching with Cloud Services
In this chapter, you will see how the Cloud computing architecture and Azure Cloud Services can be utilized to scale out and scale up protein similarity searches by utilizing the system, called Cloud4PSi, that was developed for the Microsoft Azure public cloud. The chapter presents the architecture of the system, its components, communication flow, and advantages of using a queue-based model over the direct communication between computing units. It also shows results of various experiments confirming that the similarity searching can be successfully scaled on cloud platforms by using computation units of different sizes and by adding more computation units.

• Chapter 5: Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures
In this chapter, you will see how Cloud Services may help to solve problems of protein structure prediction by scaling the computations in a role-based and queue-based Cloud4PSP system, deployed in the Microsoft Azure cloud. The chapter shows the system architecture, the Cloud4PSP processing model, and results of various scalability tests that speak in favor of the presented architecture.

The third part of the book shows the utilization of scalable Big Data computational frameworks, like Hadoop and Spark, in massive 3D protein structure alignments and identification of intrinsically disordered regions in protein structures.

• Chapter 6: Foundations of the Hadoop Ecosystem
At the moment, Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. This chapter briefly describes the Hadoop ecosystem and focuses on two elements of the ecosystem: the Apache Hadoop and the Apache Spark.

Description:
This book focuses on proteins and their structures. The text describes various scalable solutions for protein structure similarity searching, carried out at main representation levels, and for prediction of 3D structures of proteins. Emphasis is placed on techniques that can be used to accel
