Translational Bioinformatics 11
Series Editor: Xiangdong Wang, MD, Ph.D., Prof

Xiangdong Wang, Christian Baumgartner, Denis C. Shields, Hong-Wen Deng, Jacques S. Beckmann (Editors)

Application of Clinical Bioinformatics

Translational Bioinformatics, Volume 11

Series editor
Xiangdong Wang, MD, Ph.D.
Professor of Medicine, Zhongshan Hospital, Fudan University Medical School, China
Director of Shanghai Institute of Clinical Bioinformatics (www.fuccb.org)

Aims and Scope

The Book Series in Translational Bioinformatics is a powerful and integrative resource for understanding and translating discoveries and advances of genomic, transcriptomic, proteomic and bioinformatic technologies into the study of human diseases. The Series represents leading global opinions on the translation of bioinformatics sciences into both the clinical setting and descriptions to medical informatics. It presents the critical evidence to further understand the molecular mechanisms underlying organ or cell dysfunctions in human diseases, the results of genomic, transcriptomic, proteomic and bioinformatic studies from human tissues dedicated to the discovery and validation of diagnostic and prognostic disease biomarkers, essential information on the identification and validation of novel drug targets and the application of tissue genomics, transcriptomics, proteomics and bioinformatics in drug efficacy and toxicity in clinical research.

The Book Series in Translational Bioinformatics focuses on outstanding articles/chapters presenting significant recent works in genomic, transcriptomic, proteomic and bioinformatic profiles related to human organ or cell dysfunctions and clinical findings. The Series includes bioinformatics-driven molecular and cellular disease mechanisms, the understanding of human diseases and the improvement of patient prognoses. Additionally, it provides practical and useful study insights into and protocols of design and methodology.
Series Description

Translational bioinformatics is defined as the development of storage-related, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data, and genomic data in particular, into proactive, predictive, preventive, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is the newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders including biomedical scientists, clinicians, and patients. Issues related to database management, administration, or policy will be coordinated through the clinical research informatics domain. Analytic, storage-related, and interpretive methods should be used to improve predictions, early diagnostics, severity monitoring, therapeutic effects, and the prognosis of human diseases.

Recently Published and Forthcoming Volumes

Computational and Statistical Epigenomics
Editor: Andrew E. Teschendorff
Volume 7

Allergy Bioinformatics
Editors: Ailin Tao, Eyal Raz
Volume 8

Transcriptomics and Gene Regulation
Editor: Jiaqian Wu
Volume 9

Pediatric Biomedical Informatics - Computer Applications in Pediatric Research (Edition 2)
Editor: John J. Hutton
Volume 10

More information about this series at http://www.springer.com/series/11057

Editors

Xiangdong Wang
Zhongshan Hospital, Fudan University
Shanghai Institute of Clinical Bioinformatics
Shanghai, China

Christian Baumgartner
Institute of Health Care Engineering with European Notified Body of Medical Devices
Graz University of Technology
Graz, Austria

Denis C. Shields
School of Medicine
University College Dublin
Dublin 4, Ireland

Hong-Wen Deng
Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics
Tulane University School of Public Health and Tropical Medicine
New Orleans, LA, USA

Jacques S. Beckmann
Section of Clinical Bioinformatics
Swiss Institute of Bioinformatics
Switzerland

ISSN 2213-2775
ISSN 2213-2783 (electronic)
ISBN 978-94-017-7541-0
ISBN 978-94-017-7543-4 (eBook)
DOI 10.1007/978-94-017-7543-4
Library of Congress Control Number: 2016933682

Springer Dordrecht Heidelberg New York London
© Springer Science+Business Media Dordrecht 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer Science+Business Media B.V. Dordrecht is part of Springer Science+Business Media (www.springer.com)

Contents

1 The Era of Big Data: From Data-Driven Research to Data-Driven Clinical Care
  Christian Baumgartner
2 Biostatistics, Data Mining and Computational Modeling
  Hao He, Dongdong Lin, Jigang Zhang, Yuping Wang, and Hong-Wen Deng
3 Gene Expression and Profiling
  Yu Zhou, Chao Xu, Jigang Zhang, and Hong-Wen Deng
4 The Next Generation Sequencing and Applications in Clinical Research
  Junbo Duan, Xiaoying Fu, Jigang Zhang, Yu-Ping Wang, and Hong-Wen Deng
5 Clinical Epigenetics and Epigenomics
  Nian Dong, Lin Shi, Chengshui Chen, Wenhuan Ma, and Xiangdong Wang
6 Proteomic Profiling: Data Mining and Analyses
  Lan Zhang, Wei Zhu, Yong Zeng, Jigang Zhang, and Hong-Wen Deng
7 Targeted Metabolomics: The Next Generation of Clinical Chemistry!
  Klaus M. Weinberger and Marc Breit
8 Clinical Bioinformatics for Biomarker Discovery in Targeted Metabolomics
  Marc Breit, Christian Baumgartner, Michael Netzer, and Klaus M. Weinberger
9 Metagenomic Profiling, Interaction of Genomics with Meta-genomics
  Ruifeng Wang, Yu Zhou, Shaolong Cao, Yuping Wang, Jigang Zhang, and Hong-Wen Deng
10 Clinical Epigenetics and Epigenomics
  Chuan Qiu, Fangtang Yu, Hong-Wen Deng, and Hui Shen
11 Integrative Biological Databases
  Jinzeng Wang and Haiyun Wang
12 Standards and Regulations for (Bio)Medical Software
  Jörg Schröttner, Robert Neubauer, and Christian Baumgartner
13 Clinical Applications and Systems Biomedicine
  Duojiao Wu, David E. Sanin, and Xiangdong Wang
14 Key Law and Policy Considerations for Clinical Bioinformaticians
  Mark Phillips
15 Challenges and Opportunities in Clinical Bioinformatics
  Denis C. Shields
16 Heterogeneity of Hepatocellular Carcinoma
  Tingting Fang, Li Feng, and Jinglin Xia

Chapter 1
The Era of Big Data: From Data-Driven Research to Data-Driven Clinical Care

Christian Baumgartner

Abstract When the era of big data arrived in the early 1990s, biomedical research gave rise to new innovations, procedures and methods that aid clinical care and patient management. This chapter provides an introduction to the basic concepts and strategies of data-driven biomedical research and application, an area described by terms such as computational biomedicine or clinical/medical bioinformatics. After a brief motivation, it surveys data sources and bioanalytic technologies for high-throughput data generation, a selection of experimental study designs and their applications, and procedures and recommendations on how to handle data quality and privacy, followed by a discussion of basic data warehouse concepts used for life science data integration, data mining and knowledge discovery. Finally, five application examples are briefly delineated, emphasizing the benefit and power of computational methods and tools in this field. The author trusts that this chapter will encourage the reader to handle and interpret the huge amounts of data usually generated in research projects or clinical routine, and to exploit mined bioinformation and medical knowledge for individualized health care.
Keywords Computational biomedicine • Data integration and management • Knowledge discovery • Data mining • Clinical applications

C. Baumgartner, Ph.D., Institute of Health Care Engineering with European Notified Body of Medical Devices, Graz University of Technology, Stremayrgasse 16, A-8010 Graz, Austria
© Springer Science+Business Media Dordrecht 2016. X. Wang et al. (eds.), Application of Clinical Bioinformatics, Translational Bioinformatics 11, DOI 10.1007/978-94-017-7543-4_1

1.1 Introduction

In the past two decades, the new era of “big data” in experimental and clinical biomedicine has arrived and grown as a direct consequence of the availability of large reservoirs of data. Data collection in digital form was already underway by the 1960s, allowing retrospective data management and analysis to be undertaken using computers for the first time. Relational databases arose in the 1980s along with the Structured Query Language (SQL), enabling dynamic, on-demand structural analysis and interpretation of data from complex research designs. The 1990s saw an explosion in the growth of data associated with the emerging use of new high-throughput, lab and imaging technologies in fundamental biomedical research and clinical application. Data warehouses began to be used for storing and integrating various types of data: different data sources are transformed into a common format and converted to a common vocabulary, which is needed to overcome the computational challenges of data-driven research and development. The new era of “computational biomedicine” or “clinical bioinformatics” was born as a multidisciplinary approach that brought together medical, natural and computer sciences, aiming at uncovering unknown and unexpected biomedical knowledge stored in these data sources, which had the potential to transform our current clinical practices (Chang 2005; Wang and Liotta 2011; Coveney et al. 2014).
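The idea of transforming different sources into a common format and a common vocabulary can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the two laboratory sources, their field names, the small vocabulary table and the `Observation` record are assumptions, not part of any system described in this chapter. The only external fact used is the approximate molar mass of glucose (about 180.16 g/mol), which gives the mg/dL to mmol/L conversion factor.

```python
from dataclasses import dataclass

# Hypothetical common record format of the warehouse (an assumption for illustration).
@dataclass(frozen=True)
class Observation:
    patient_id: str
    analyte: str            # term from the common vocabulary
    value: float            # always stored in mmol/L
    unit: str = "mmol/L"

# Illustrative mapping from source-specific labels to one common term.
VOCABULARY = {"GLU": "glucose", "Glucose, serum": "glucose"}

# 1 mmol/L of glucose corresponds to about 18.016 mg/dL (molar mass ~180.16 g/mol).
MG_DL_PER_MMOL_L_GLUCOSE = 18.016


def from_lab_a(row: dict) -> Observation:
    """Hypothetical source A: numeric patient id, code 'GLU', result in mg/dL."""
    return Observation(
        patient_id=str(row["pid"]),
        analyte=VOCABULARY[row["test"]],
        value=round(row["result_mg_dl"] / MG_DL_PER_MMOL_L_GLUCOSE, 2),
    )


def from_lab_b(row: dict) -> Observation:
    """Hypothetical source B: free-text test name, result already in mmol/L (as text)."""
    return Observation(
        patient_id=str(row["patient"]),
        analyte=VOCABULARY[row["name"]],
        value=round(float(row["mmol_l"]), 2),
    )


def load_warehouse(lab_a_rows, lab_b_rows):
    """Transform both sources into the common format and concatenate them."""
    return [from_lab_a(r) for r in lab_a_rows] + [from_lab_b(r) for r in lab_b_rows]


warehouse = load_warehouse(
    [{"pid": 17, "test": "GLU", "result_mg_dl": 90.0}],
    [{"patient": "P-2", "name": "Glucose, serum", "mmol_l": "5.4"}],
)
for obs in warehouse:
    print(obs)
```

After this step, records from both sources share one analyte term and one unit, so they can be queried and mined together. Real integration pipelines would typically rely on established terminologies rather than a hand-written dictionary.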
Research areas such as data warehousing and information retrieval, machine learning, data mining, and others thus arose as a response to challenges faced by the computer science and bioinformatics community in dealing with huge amounts of data, enabling a better quality of data-driven decision making. As data are any facts, numbers, images or texts that can be accessed and processed by stand-alone computers or computational networks, the patterns, associations or relationships among available data can provide information about historical patterns and future trends, so that undreamt-of opportunities emerge for biomedical research and application. This knowledge may help to create a new way of dealing with clinical care and patient management that was never previously possible. Clinical bioinformatics, which resulted from the big data era, is thus a crucial element of the medical knowledge discovery process, where relevant sources of medical information and bioinformation are combined and mined to allow for individualized health care.

1.2 The Revolution of High-Throughput and Imaging Technologies and the Flood of Generated Data

In the life sciences, huge amounts of data are generated using the wide spectrum of high-throughput and laboratory technologies and modern health care imaging systems such as MRI or CT. In biomolecular research, microarray-based expression profiling and, more recently, next-generation sequencing (NGS) technologies have become the methodology of choice, e.g. for whole transcriptome expression profiling, producing a flood of data that need to be computationally processed and analysed (Worthey 2013; Soon et al. 2013). The most widely used NGS devices, for example, are able to sequence up to 150 bases from both ends of RNA fragments and create a maximum output of up to 1000 GB per run. Most advanced protein profiling technologies are implemented with a broad panel of mass spectrometry-based techniques to separate, characterize and quantify analytes from complex biological samples (Chen and Pramanik 2009; Brewis and Brennan 2010; Woods et al. 2014).
Labs are typically equipped with diverse mass spectrometer (MS) systems including TOF-TOF, Quadrupole-TOF, FT-ICR, and LTQ-Orbitrap type analyzers. In this field, shotgun proteomics is a widely used tool for global analysis of protein modifications, where hundreds of thousands of tandem mass spectra are generated in a typical LC-MS/MS experiment. Sophisticated computational tools for MS spectra processing and database search strategies are used for the identification of peptide/protein modifications (Baumgartner et al. 2008; Cerqueira et al. 2010; Sjöström et al. 2015). In metabolomics, different fundamental approaches can be distinguished, i.e. untargeted and targeted metabolomics and metabolic fingerprinting (Baumgartner and Graber 2007; Putri et al. 2013; Naz et al. 2014; Zhang et al. 2015). In targeted metabolomics, a preselected set of known metabolites is quantitated by determining absolute analyte concentrations with the use of internal chemical standards, which allows for hypothesis-driven research and interpretation of data based on a priori knowledge. To provide a holistic picture of metabolism, untargeted metabolic profiling aims at measuring as many analytes as possible (up to several hundreds) to create a snapshot of the biochemical profile within the analysed sample. The established technologies in metabolomics include, analogous to proteomics, mass spectrometry-based approaches and nuclear magnetic resonance (NMR) spectroscopy, generating thousands to tens of thousands of data points per spectrum. Multiple processing steps are required to analyze this huge amount of spectral information, ranging from denoising, binning and aligning spectra to peak detection and high-level analysis, e.g. for biomarker identification and verification (Swan et al. 2013; Netzer et al. 2015).

Nowadays bioimaging devices with increasing resolution are widely used in biological and clinical laboratories, generating imaging data sets of hundreds of megabytes or several gigabytes (Eliceiri et al. 2012; Edelstein et al. 2014).
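To see how such data volumes arise, the uncompressed size of an image or image stack can be estimated from its dimensions. The following is a rough back-of-the-envelope sketch; the image dimensions, bit depths and frame counts are hypothetical values chosen for illustration and are not taken from the cited references.

```python
def raw_image_bytes(width_px, height_px, channels=3, bits_per_channel=8, frames=1):
    """Uncompressed size in bytes: pixels x channels x bit depth x frames.

    Integer division by 8 rounds down if the total bit count is not a
    multiple of 8 (e.g. for unpacked 12-bit data).
    """
    bits = width_px * height_px * channels * bits_per_channel * frames
    return bits // 8


def human_readable(n_bytes):
    """Format a byte count using decimal (SI) units, 1 MB = 10**6 bytes."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


# Hypothetical time-lapse stack: 2048 x 2048 pixels, one 16-bit channel, 50 frames.
stack = raw_image_bytes(2048, 2048, channels=1, bits_per_channel=16, frames=50)
print(human_readable(stack))   # 419.4 MB

# Hypothetical whole-slide scan: 80,000 x 60,000 pixels, 8-bit RGB.
slide = raw_image_bytes(80_000, 60_000)
print(human_readable(slide))   # 14.4 GB
```

Stored files are usually smaller than these raw figures because most image formats apply compression, so the numbers should be read as upper bounds for the assumed dimensions.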
Whole-slide bioimaging, for instance, combines light microscopy techniques with electronic scanning of slides and is able to collect quantitative data; it is currently regarded as one of the most promising avenues for the diagnosis or prediction of cancer and other diseases. Traditional health care imaging technologies such as CT, MRI, ultrasound, SPECT and PET make it possible to assess the current status and condition of organs or tissues and to monitor patients over time for diagnostic evaluation or for controlling therapeutic interventions (Smith and Webb 2010; Mikla and Mikla 2013). In particular, CPU-intensive image reconstruction and modeling techniques allow instant processing of 2D signals to create 3D/4D image stacks comprising enormous amounts of data, typically stored in the DICOM file format. The DICOM standard facilitates interoperability of medical imaging instruments by providing a standardized medical file format and directory structure, which enables access to the images and patient-related information for further processing, modeling and analysis.

1.3 Study Design and Data Privacy

Different epidemiological study designs such as case-control and (longitudinal) cohort studies, or more complex designs such as randomized controlled trials, are selected in biomedical research (Dawson and Trapp 2004; Porta 2014). Case-control studies,