In-Memory Data Management Research

Series Editor
Prof. Dr. Dr. h.c. Hasso Plattner
Hasso Plattner Institute
Potsdam, Germany

For further volumes: http://www.springer.com/series/11642

This book series presents selected research results in the context of In-Memory Data Management. The volumes in this series describe research results in in-memory database technology, logical and physical data management, software architectures, real-time analysis of enterprise data, innovative new business applications, and influenced business processes. In addition, programming models and software engineering techniques, tools, and benchmarks are elaborated on and discussed. All books are introduced by a member of the editorial board, who outlines the popular context and the social relevance of each work.

Globally, companies generate a steadily increasing amount of data, day after day. This data is obtained to optimize logistics, create knowledge, explore business relationships, and to improve management decisions. The trend towards acquiring more and more data, also known as “big data,” requires fundamental support in data analysis. In-Memory Technology – the common theme in all volumes of this series – has become a de facto standard for fulfilling new requirements that are stated towards enterprise applications.

Hasso Plattner • Matthieu-P. Schapranow
Editors

High-Performance In-Memory Genome Data Analysis
How In-Memory Database Technology Accelerates Personalized Medicine

Editors

Hasso Plattner
Hasso Plattner Institute
Enterprise Platform and Integration Concepts
Potsdam, Germany

Matthieu-P. Schapranow
Hasso Plattner Institute
Enterprise Platform and Integration Concepts
Potsdam, Germany

ISBN 978-3-319-03034-0
ISBN 978-3-319-03035-7 (eBook)
DOI 10.1007/978-3-319-03035-7
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2013954438

© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Quotes

“An increased utility of sequencing data should follow from the ability to process hundreds of gigabytes of raw sequence data informatically prior to subsequent downstream analysis. Plattner and Schapranow share concrete details on how to accelerate data processing with in-memory database technology, and also highlight how to accelerate the analysis of sequencing data by leveraging relevant information. With their work they eliminate time-consuming enquiries for relevant data (from disk storage) and enable instant interpretation of findings. This innovative approach should be of great value for applications ranging from research through to precision medicine.”
Scott Kahn, Illumina, CIO

“It will be essential to improve our understanding of the core functions of the human genome in order to develop stratified treatments for complex diseases and to provide a foundation for treatments to prevent or delay onset of diseases. By applying advanced in-memory technology to concrete problems of personalized medicine, Plattner and Schapranow demonstrate how interdisciplinary teams can develop innovative and appropriate solutions. Collaborative approaches of computational, scientific, and clinical teams have an enormous potential to improve the way we provide medical treatments in the future. Finally, the authors describe novel methods for flexible real-time analysis of medically relevant data that provide a powerful basis for timely decision making in personalized medical contexts.”
Prof. Dr. Peter N. Robinson, Charité, Head of the Computational Biology Group

“At Cytolon, we provide IT services to identify the most appropriate cord blood sample to limit patient’s immune response. For this service, we need to analyze thousands of samples and combine them with a variety of heterogeneous patient properties. Plattner and Schapranow present that in-memory technology provides a meaningful way to integrate heterogeneous data.
In addition, they show that real-time analysis of patient data is a paradigm shift in today’s medicine. Thus, we believe this technology can help us to speed up the performance of our matching service.”
Thomas Klein, Cytolon AG, Founder and CEO

“At LGC Genomics, we build on our long-standing experience in providing DNA sequencing and analysis services to our customers. Latest sequencing machines have sped up extraction of DNA reads, but analysis is still time-intensive due to the sheer amount of generated data. Plattner and Schapranow apply the innovative in-memory technology to challenging analyses with impressive results. Long-running analysis processing, e.g. cohort analysis, is reduced from taking hours to just a few seconds. We believe that this technology helps us to speed up our day-to-day business, allowing us to report back to our customer faster.”
Dr. Wolfgang Zimmermann, LGC Genomics, Business Unit Manager

“I am proud and thankful that HPI provides an environment that fosters teaching, research, and innovation in IT. Building on their former research results in database technology, Hasso Plattner and Matthieu Schapranow share insights of their high-performance in-memory genome platform that combines, among others, structured and unstructured medical data from various heterogeneous data sources to enable its real-time analyses in a single system. The platform is the outcome of a dedicated cooperation with various experts from biology, medicine, and computer science. As a result, it proves that interdisciplinary teams with actual knowledge from IT are able to considerably contribute to implementing the vision of great personalized medicine.”
Prof. Dr. Christoph Meinel, Hasso Plattner Institute, CEO

Preface

The Human Genome Project was officially launched in 1990 with research funding of more than three billion USD. However, it took more than a decade and thousands of research institutes worldwide to discover and decode the full human genome sequence.
Nowadays, so-called next-generation sequencing devices process whole DNA and RNA within hours at moderate costs. Latest devices generate raw DNA reads with more than 30-fold coverage in less than two days. However, interpretation and analysis of these raw data is still a time-consuming process, potentially taking weeks. Next-generation sequencing devices are increasingly used in research and clinical environments to support treatment of specific diseases, such as cancer. This example highlights how fast technological developments currently affect our daily lives.

Next-generation sequencing is also regarded as the foundation for individual treatment decisions and optimized therapies in the course of personalized medicine and systems biology. Personalized medicine aims at treating patients specifically based on individual dispositions, such as genetic or environmental factors. However, the increasing amount of gathered diagnostic data requires specific software tools to identify relevant portions of data, process them at high throughput, and provide ways to analyze them interactively.

We wrote this book to provide details about innovative approaches to process, combine, and analyze data required in the course of personalized treatment. It contains latest research results of applying in-memory database technology to process and analyze big genomic data. Furthermore, we share how to design and develop specific research tools that require real-time analysis of scientific data.

With this book, we contribute by bridging the gap between medical experts, such as physicians, clinicians, and biological researchers, and technology experts, such as software developers, database specialists, and statisticians. As a result, we designed a specific structure of the book to support the individual audiences.

The book is structured as follows.

• Part I addresses the data acquisition, the modeling of processing and analysis pipelines, and how to accelerate preprocessing of data. This part is designed for bioinformaticians and researchers who want to understand how to optimize the data preparation for their experiments.
• Part II gives examples of how to design and implement specific applications enabling real-time analysis of scientific data. Furthermore, it provides guidelines to operate and to exchange huge data at fast pace. This part is intended for researchers and medical experts who need to work with big data on a daily basis. It also provides guidelines for IT experts on how to operate on these data from a software engineering perspective.

Potsdam, Oct 20, 2013
Hasso Plattner and Matthieu-P. Schapranow

Contents

Preface
1 Innovations for Personalized Medicine
  Hasso Plattner, Matthieu-P. Schapranow and Franziska Häger
  1.1 Requirements for Personalized Medicine
    1.1.1 Researchers
    1.1.2 Clinicians
    1.1.3 Patients
  1.2 Interdisciplinary Teams
  1.3 Trends in Hardware
  1.4 In-memory Technology Building Blocks
    1.4.1 Combined Column and Row Store
    1.4.2 Complete History
    1.4.3 Lightweight Compression
    1.4.4 Partitioning
    1.4.5 Multi-core and Parallelization
    1.4.6 Active and Passive Data Store
    1.4.7 Reduction of Layers
  1.5 High-performance In-memory Genome Platform
    1.5.1 Application Layer with Micro Applications
    1.5.2 Platform Layer
    1.5.3 Data Layer
  1.6 Structure of the Work
  1.7 References