
Process Design for Natural Scientists: An Agile Model-Driven Approach PDF

263 pages · 2014 · 20.525 MB · English

Preview Process Design for Natural Scientists: An Agile Model-Driven Approach

Anna-Lena Lamprecht, Tiziana Margaria (Eds.)
Communications in Computer and Information Science 500
Process Design for Natural Scientists: An Agile Model-Driven Approach

Editorial Board
Simone Diniz Junqueira Barbosa, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
Phoebe Chen, La Trobe University, Melbourne, Australia
Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Cosenza, Italy
Xiaoyong Du, Renmin University of China, Beijing, China
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Orhun Kara, TÜBİTAK BİLGEM and Middle East Technical University, Ankara, Turkey
Igor Kotenko, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Krishna M. Sivalingam, Indian Institute of Technology Madras, Chennai, India
Dominik Ślęzak, University of Warsaw and Infobright, Warsaw, Poland
Takashi Washio, Osaka University, Osaka, Japan
Xiaokang Yang, Shanghai Jiao Tong University, Shanghai, China
More information about this series at http://www.springer.com/series/7899

Editors
Anna-Lena Lamprecht, Chair Service and Software Engineering, Institute of Computer Science, University of Potsdam, Potsdam, Germany
Tiziana Margaria, Chair Software Engineering, Computer Science and Information Systems Department, University of Limerick, and Lero, The Irish Software Research Center, Limerick, Ireland

ISSN 1865-0929; ISSN 1865-0937 (electronic)
ISBN 978-3-662-45005-5; ISBN 978-3-662-45006-2 (eBook)
DOI 10.1007/978-3-662-45006-2
Library of Congress Control Number: 2014950464
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2014
Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com)

Preface I

In contrast with our seemingly endless ability to generate more biological data, faster, and at lower cost, there are increasingly worrisome observations of human limitations with respect to managing and manipulating these massive and highly complex datasets.
With data at this scale, mistakes are easily made, and as Baggerly noted, "the most common errors are simple... the most simple errors are common" when it comes to biological data management. Many, possibly most, biological researchers lack the skills to programmatically manipulate large datasets, and therefore continue to use inappropriate tools to manage the "big data" that even a modestly resourced laboratory can now create. Serious errors introduced during data management and manipulation are difficult for the researcher to detect and, because they go unrecorded, are nearly impossible to trace during peer review.

Beyond data manipulation errors, the statistical expertise required to correctly analyze high-throughput data is rare, and biological researchers, even those who are extremely competent in rigorously executing the data-generating "omics" experiments, are seldom adequately trained in appropriate statistical analysis of the output. As such, inappropriate approaches, including trial and error, may be applied until a "sensible" answer is found. Finally, because manually driven analyses of high-throughput data can be extremely time-consuming and monotonous, researchers will sometimes inappropriately use a hypothesis-guided approach, examining only those possibilities that they already believe are likely based on their interpretation of prior biological knowledge, or on personal bias towards where they believe the answer should be. Thus, the scientific literature becomes contaminated with errors resulting from "fishing for significance," from research bias, and even from outright mistakes.

These problems are becoming pervasive in omics-scale science. The affordability and accessibility of high-throughput technologies are such that even small groups and individual laboratories can now generate datasets that far exceed their capacity, computationally and statistically, to adequately manage and correctly analyze. The end result is a glut of non-reproducible science making its way into the primary literature and databases. Recent "forensic audits" of the scientific literature, such as those executed by Begley and Ellis, have shown that a large proportion of bioinformatics research cannot be replicated independently, with numbers ranging from a low of 25% up to a staggering 89% of non-replicable research in a recent study of oncology publications from 2001 to 2011.

While an independent study by Baggerly into the accuracy and quality of published high-throughput analyses triggered retractions (and even a scientific misconduct investigation!), a study by Ioannidis revealed that, even in the prestigious Nature Genetics, more than half of the peer-reviewed, high-throughput studies cannot be replicated. Yet, ostensibly, these studies have all passed "rigorous peer review." The failure of peer review to detect non-reproducible research is, at least in part, because the analytical methodology is not adequately described, but perhaps equally because the reviewers find these analyses as challenging as the original investigators did: even if the original data were made available to them, together with a clear and comprehensive description of the analytical methodology (as we would hope to find in the Materials and Methods section of a manuscript), it still might not be reasonable to expect a reviewer to rigorously evaluate the study, because the analytical infrastructure is difficult to envision or replicate from a textual description alone.

In recognition of these limitations, the Institute of Medicine in 2012 published several recommendations relating to the proper conduct of high-throughput analyses. These include: rigorously described, annotated, and followed data management procedures; "locking down" the computational analysis pipeline once it has been selected; and publishing the workflow of this analytical pipeline in a formal manner, together with the full starting and result datasets. These recommendations help ensure that (a) errors are not introduced through manual data manipulation, (b) there can be no human intervention in the data as they pass through the analytical process, and (c) third parties can properly evaluate the data, the analytical methodology, and the result at the time of peer review, even to the point of re-running the analysis for validation purposes.

Formal workflow technologies have proved effective at resolving many of these issues and, in fact, beyond just transparency and reproducibility, they even provide opportunities for time saving and convenience. For example, the use of formal workflows provides an excellent opportunity to automate the collection of provenance information (purpose, source, data used, date, algorithms, interfaces, versions, name of the data publisher, etc.), far beyond what is commonly captured by even the most attentive biological researcher. It is perhaps surprising, therefore, that the integration of formal workflows into the scientific "culture" is lagging and ad hoc. Nevertheless, as regulatory and funding agencies become increasingly concerned and dubious about the quality of the research they are supporting, the pressure to improve the way we "do science" grows. It is clear that the only path forward involves mechanization of much of the scientific analysis process.

It is for all the reasons above that I am so pleased to see this book on scientific workflows, directly aimed at the bench researcher as its target audience. The broad scope of research questions covered in these chapters will speak to the full breadth of researchers who, until now, have been reluctant to consider adopting these powerful new approaches to data management and analysis. Having a few key exemplars such as these, formal workflows solving real scientific problems, will hopefully "blaze the trail" for other researchers to follow. At the same time, reading these use cases will no doubt spur those of us who study scientific workflow technologies to produce even better tools that make life easier for the biologists whose data we care so much about.

May 2014
Mark D. Wilkinson
Isaac Peral Distinguished Researcher
Center for Plant Biotechnology and Genomics, UPM, Madrid, Spain

Preface II

At present, 95% of all digital data are estimated to have a geospatial reference (Hamilton, in Perkins, 2010). A geospatial reference is the definition of the absolute or relative position of objects or geographical phenomena. It is determined, in the first case, by coordinates and, in the latter case, by the spatial relations or neighborhood of geospatial objects. For centuries, spatial data have been processed to generate maps or map-like cartographic representations. In doing so, cartographers made use of the fact that maps are the only visual representations of geographic space that simultaneously show both the absolute and the relative positions of spatial objects. In addition, maps have served as an effective data storage well beyond the advent of ICT in cartography in the late 1960s. Unlike remote sensing imagery, maps are not images but graphical models of geospatial reality.
Map models are generated in a systematic process by transforming non-graphic spatial data of the real world into spatially related graphic symbols. Selection and processing of the geospatial source data are subject to the specific thematic and/or geographical requirements of a given application. The data-to-symbol transformation is controlled by a sequence of processes implementing a set of cartographic representation methods and other cartographic regulations. This regulatory framework determines the thematic as well as the scale-related modeling and symbolization of the preprocessed geospatial data. It is, in fact, this process-based, application-oriented transformation of graphic-free geodata into graphic map data that accounts for the professional cartographic quality of the maps produced.

To generate quality cartographic representations or, in more modern terms, geovisualizations, expertise in both methods and techniques is required on how to process non-graphic geospatial data into proper graphic maps. This expertise has, for centuries, been passed on personally from master to apprentice. Hence, different "schools" of geodata processing into maps can easily be distinguished. Around the turn of the nineteenth to the twentieth century, the process of map-making was codified and formalized in early textbooks on cartography. Obviously, the written description of the transformation of geospatial data into maps required that the processing steps involved first be strictly formalized for interpersonal application. Only this puts a trained person into a position to produce quality maps by applying the mapping processes to a geospatial dataset. Depending on the data, map purpose, application, and scale, a finite number of processes is required to generate the respective map types. To make sure that applying the relevant processes to similar datasets yields identical map representations, the range of individual processes has been formalized in about a dozen so-called cartographic representation methods. These representation methods describe the processes involved in transforming specific geodata and their features into precisely matching map models or map types. Hence, cartographic representation methods can be considered formalized sets of processing rules applied to visualize geodata expressively and effectively.

Today, almost without exception, geodata are handled and processed with ICT systems more or less dedicated to this purpose. Thus, the methodical need for the formalization of geovisual processes is complemented by the technical requirement of formalization when it comes to implementing geovisualization processes in a software environment. Software engineering, in particular, has a distinctive history and a substantial record of modeling processes. Accordingly, it deals with developing as well as implementing software solutions, which can be dedicated or generic, proprietary or open source. Formalization requirements, from both a methodical and a technical perspective, can thus be considered a common ground of geoinformation processing and software engineering. It goes without saying that geoinformation processing benefits from software systems that facilitate the formalization and implementation of data-to-map transformations in an intuitive, easy-to-comprehend way. One such software environment is the open-source jABC framework used in the range of applications collected in this volume.
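
To make the idea of a formalized data-to-symbol rule more concrete, here is a minimal, purely illustrative Java sketch. It is not jABC code, and all class and method names are invented for this example; it simply shows one such rule, a choropleth-style classification that maps a non-graphic attribute value (say, population density) onto one of a fixed set of graphic fill symbols via predefined class breaks.

```java
import java.util.List;

// Hypothetical illustration (not part of jABC): one formalized
// data-to-symbol processing rule, as described in the preface.
public class ChoroplethRule {

    private final double[] classBreaks;   // upper class limits, e.g. {50, 200, 1000}
    private final List<String> fills;     // one fill colour per class (breaks + 1)

    public ChoroplethRule(double[] classBreaks, List<String> fills) {
        if (fills.size() != classBreaks.length + 1) {
            throw new IllegalArgumentException("need exactly one more fill than class breaks");
        }
        this.classBreaks = classBreaks;
        this.fills = fills;
    }

    // The data-to-symbol transformation: classify an attribute value
    // and return the graphic symbol (here simply a fill colour) for it.
    public String symbolFor(double attributeValue) {
        for (int i = 0; i < classBreaks.length; i++) {
            if (attributeValue <= classBreaks[i]) {
                return fills.get(i);
            }
        }
        return fills.get(fills.size() - 1);
    }

    public static void main(String[] args) {
        ChoroplethRule densityRule = new ChoroplethRule(
                new double[]{50, 200, 1000},
                List.of("#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"));
        System.out.println(densityRule.symbolFor(135.0));   // prints "#a1dab4"
    }
}
```

Because such a rule is fully formalized, applying it to comparable datasets deterministically yields identical symbolization, which is exactly the property that the cartographic representation methods described above are meant to guarantee.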

Systems like jABC help to encode such sets of processing rules in a formalized technical framework that can be implemented in ICT systems for geoinformation processing. The use cases presented here demonstrate both the potential of jABC and the benefits that implementing this framework holds for the applications in question. From a geoinformation perspective, they furthermore show the added value that a fruitful collaboration between the two disciplines holds for geoinformation applications as well as for geoinformation scientists.

June 2014
Hartmut Asche
Professor of Geoinformation Science
Department of Geography, University of Potsdam, Germany

Contents

Framework
Scientific Workflows and XMDD (Anna-Lena Lamprecht and Tiziana Margaria) ..... 1
Modeling and Execution of Scientific Workflows with the jABC Framework (Anna-Lena Lamprecht, Tiziana Margaria, and Bernhard Steffen) ..... 14
The Course's SIB Libraries (Anna-Lena Lamprecht and Alexander Wickert) ..... 30
Lessons Learned (Anna-Lena Lamprecht, Alexander Wickert, and Tiziana Margaria) ..... 45

Bioinformatics Applications
Protein Classification Workflow (Judith Reso) ..... 65
Data Mining for Unidentified Protein Sequences (Leif Blaese) ..... 73
Workflow for Rapid Metagenome Analysis (Gunnar Schulze) ..... 88
Constructing a Phylogenetic Tree (Monika Lis) ..... 101
Exploratory Data Analysis (Janine Vierheller) ..... 110
Identification of Differentially Expressed Genes (Christine Schütt) ..... 127

Geovisualization Applications
Visualization of Data Transfer Paths (Christian Kuntzsch) ..... 140
Spotlocator – Guess Where the Photo Was Taken! (Marcel Hibbe) ..... 149
