Ka-Chun Wong
Editor

Big Data Analytics in Genomics

Ka-Chun Wong
Department of Computer Science
City University of Hong Kong
Kowloon Tong, Hong Kong

ISBN 978-3-319-41278-8
ISBN 978-3-319-41279-5 (eBook)
DOI 10.1007/978-3-319-41279-5
Library of Congress Control Number: 2016950204

© Springer International Publishing Switzerland (outside the USA) 2016
Chapter 12 completed within the capacity of an US governmental employment. US copyright protection does not apply.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

Preface

At the beginning of the 21st century, next-generation sequencing (NGS) and third-generation sequencing (TGS) technologies have enabled high-throughput sequencing data generation for genomics; international projects (e.g., the Encyclopedia of DNA Elements (ENCODE) Consortium, the 1000 Genomes Project, The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) program, and the Functional Annotation Of Mammalian genome (FANTOM) project) have been successfully launched, leading to massive genomic data accumulation at an unprecedentedly fast pace.

To reveal novel genomic insights from those big data within a reasonable time frame, traditional data analysis methods may not be sufficient and scalable. Therefore, big data analytics have to be developed for genomics.

As an attempt to summarize the current efforts in big data analytics for genomics, an open book chapter call was made at the end of 2015, resulting in 40 book chapter submissions, which went through a rigorous single-blind review process. After the initial screening and hundreds of reviewer invitations, the authors of each eligible book chapter submission received at least 2 anonymous expert reviews (at most, 6 reviews) for improvements, resulting in the current 13 book chapters.

Those book chapters are organized into three parts (“Statistical Analytics,” “Computational Analytics,” and “Cancer Analytics”) in the spirit that statistics form the basis for computation which leads to cancer genome analytics. In each part, the book chapters have been arranged from general introduction to advanced topics/specific applications/specific cancers sequentially, for the interests of readership.

In the first part on statistical analytics, four book chapters (Chaps.
1–4) have been contributed. In Chap. 1, Yang et al. have compiled a statistical introduction for the integrative analysis of genomic data. After that, we go deep into the statistical methodology of expression quantitative trait loci (eQTL) mapping in Chap. 2 written by Cheng et al. Given the genomic variants mapped, Ribeiro et al. have contributed a book chapter on how to integrate and organize those genomic variants into genotype-phenotype networks using causal inference and structure learning in Chap. 3. At the end of the first part, Li and Tong have given a refreshing statistical perspective on genomic applications of the Neyman-Pearson classification paradigm in Chap. 4.

In the second part on computational analytics, four book chapters (Chaps. 5–8) have been contributed. In Chap. 5, Gupta et al. have reviewed and improved the existing computational pipelines for re-annotating eukaryotic genomes. In Chap. 6, Rucci et al. have compiled a comprehensive survey on the computational acceleration of Smith-Waterman protein sequence database search, which is still central to genome research. Based on those sequence database search techniques, protein function prediction methods have been developed and shown to be promising. Therefore, the recent algorithmic developments, remaining challenges, and prospects for future research in protein function prediction are discussed in great detail by Shehu et al. in Chap. 7. At the end of the part, Nagarajan and Prabhu provide a review of the computational pipelines for epigenetics in Chap. 8.

In the third part on cancer analytics, five chapters (Chaps. 9–13) have been contributed. At the beginning, Prabahar and Swaminathan have written a reader-friendly perspective on machine learning techniques in cancer analytics in Chap. 9. To provide solid support for the perspective, Tong and Li summarize the existing resources, tools, and algorithms for therapeutic biomarker discovery for cancer analytics in Chap. 10.
The NGS analysis of somatic mutations in cancer genomes is then discussed by Prieto et al. in Chap. 11. To consolidate the cancer analytics part further, two computational pipelines for cancer analytics are described in the last two chapters, demonstrating concrete examples for reader interests. In Chap. 12, Leung et al. have proposed and described a novel pipeline for statistical analysis of exonic variants in cancer genomes. In Chap. 13, Yotsukura et al. have proposed and described a unique pipeline for understanding genotype-phenotype correlation in breast cancer genomes.

Kowloon Tong, Hong Kong
Ka-Chun Wong
April 2016

Contents

Part I Statistical Analytics

Introduction to Statistical Methods for Integrative Data Analysis in Genome-Wide Association Studies
Can Yang, Xiang Wan, Jin Liu, and Michael Ng

Robust Methods for Expression Quantitative Trait Loci Mapping
Wei Cheng, Xiang Zhang, and Wei Wang

Causal Inference and Structure Learning of Genotype–Phenotype Networks Using Genetic Variation
Adèle H. Ribeiro, Júlia M. P. Soler, Elias Chaibub Neto, and André Fujita

Genomic Applications of the Neyman–Pearson Classification Paradigm
Jingyi Jessica Li and Xin Tong

Part II Computational Analytics

Improving Re-annotation of Annotated Eukaryotic Genomes
Shishir K. Gupta, Elena Bencurova, Mugdha Srivastava, Pirasteh Pahlavan, Johannes Balkenhol, and Thomas Dandekar

State-of-the-Art in Smith–Waterman Protein Database Search on HPC Platforms
Enzo Rucci, Carlos García, Guillermo Botella, Armando De Giusti, Marcelo Naiouf, and Manuel Prieto-Matías

A Survey of Computational Methods for Protein Function Prediction
Amarda Shehu, Daniel Barbará, and Kevin Molloy

Genome-Wide Mapping of Nucleosome Position and Histone Code Polymorphisms in Yeast
Muniyandi Nagarajan and Vandana R. Prabhu

Part III Cancer Analytics

Perspectives of Machine Learning Techniques in Big Data Mining of Cancer
Archana Prabahar and Subashini Swaminathan

Mining Massive Genomic Data for Therapeutic Biomarker Discovery in Cancer: Resources, Tools, and Algorithms
Pan Tong and Hua Li

NGS Analysis of Somatic Mutations in Cancer Genomes
T. Prieto, J. M. Alves, and D. Posada

OncoMiner: A Pipeline for Bioinformatics Analysis of Exonic Sequence Variants in Cancer
Ming-Ying Leung, Joseph A. Knapka, Amy E. Wagler, Georgialina Rodriguez, and Robert A. Kirken

A Bioinformatics Approach for Understanding Genotype–Phenotype Correlation in Breast Cancer
Sohiya Yotsukura, Masayuki Karasuyama, Ichigaku Takigawa, and Hiroshi Mamitsuka

Part I Statistical Analytics

Introduction to Statistical Methods for Integrative Data Analysis in Genome-Wide Association Studies

Can Yang, Xiang Wan, Jin Liu, and Michael Ng

Abstract Scientists in the life science field have long been seeking genetic variants associated with complex phenotypes to advance our understanding of complex genetic disorders. In the past decade, genome-wide association studies (GWASs) have been used to identify many thousands of genetic variants, each associated with at least one complex phenotype. Despite these successes, there is one major challenge towards fully characterizing the biological mechanism of complex diseases. It has been long hypothesized that many complex diseases are driven by the combined effect of many genetic variants, each of which may only have a small effect, a phenomenon formally known as “polygenicity.” To identify these genetic variants, large sample sizes are required, but meeting such a requirement is usually beyond the capacity of a single GWAS.
As the era of big data is coming, many genomic consortia are generating an enormous amount of data to characterize the functional roles of genetic variants, and these data are widely available to the public. Integrating rich genomic data to deepen our understanding of genetic architecture calls for statistically rigorous methods in the big-genomic-data analysis. In this book chapter, we present a brief introduction to recent progress on the development of statistical methodology for integrating genomic data. Our introduction begins with the discovery of polygenic genetic architecture, and aims at providing a unified statistical framework of integrative analysis. In particular, we highlight the

C. Yang • M. Ng
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
e-mail: [email protected]; [email protected]

X. Wan
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
e-mail: [email protected]

J. Liu
Center of Quantitative Medicine, Duke-NUS Graduate Medical School, Singapore, Singapore
e-mail: [email protected]

© Springer International Publishing Switzerland 2016
K.-C. Wong (ed.), Big Data Analytics in Genomics, DOI 10.1007/978-3-319-41279-5_1