
Data and Text Processing for Health and Life Sciences PDF

107 Pages·2019·2.917 MB·English

Preview Data and Text Processing for Health and Life Sciences

Advances in Experimental Medicine and Biology
Volume 1137

Editorial Board
IRUN R. COHEN, The Weizmann Institute of Science, Rehovot, Israel
ABEL LAJTHA, N.S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA
JOHN D. LAMBRIS, University of Pennsylvania, Philadelphia, PA, USA
RODOLFO PAOLETTI, University of Milan, Milano, Italy
NIMA REZAEI, Tehran University of Medical Sciences, Children’s Medical Center Hospital, Tehran, Iran

More information about this series at http://www.springer.com/series/5584

Data and Text Processing for Health and Life Sciences

Francisco M. Couto
LASIGE, Department of Informatics
Faculdade de Ciências, Universidade de Lisboa
Lisbon, Portugal

ISSN 0065-2598    ISSN 2214-8019 (electronic)
Advances in Experimental Medicine and Biology
ISBN 978-3-030-13844-8    ISBN 978-3-030-13845-5 (eBook)
https://doi.org/10.1007/978-3-030-13845-5

© The Editor(s) (if applicable) and The Author(s) 2019. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this book are included in the book’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my parents, Francisco de Oliveira Couto and Maria Fernanda dos Santos Moreira Couto.

Preface

During the last decades, I witnessed the growing importance of computer science skills for career advancement in Health and Life Sciences. However, not everyone has the skill, inclination, or time to learn computer programming. The learning process is usually time-consuming and requires constant practice, since software frameworks and programming languages change substantially over time. This is the main motivation for writing this book about using shell scripting to address common biomedical data and text processing tasks. Shell scripting has the advantages of being: (i) nowadays available in almost all personal computers; (ii) almost immutable for more than four decades; (iii) relatively easy to learn as a sequence of independent commands; (iv) an incremental and direct way to solve many of the data problems that Health and Life professionals face.
During the last decades, I had the pleasure of teaching introductory computer science classes to Health and Life Sciences undergraduates. I used programming languages, such as Perl and Python, to address data and text processing tasks, but I always felt that I lost a substantial amount of time teaching the technicalities of these languages, which will probably change over time and are uninteresting for the majority of the students who do not intend to pursue advanced bioinformatics courses. Thus, the purpose of this book is to motivate and help specialists to automate common data and text processing tasks after a short learning period. If they become interested (and I hope some do), the book presents pointers to where they can acquire more advanced computer science skills.

This book does not intend to be a comprehensive compendium of shell scripting commands but instead an introductory guide for Health and Life specialists. This book introduces the commands as they are required to automate data and text processing tasks. The selected tasks have a strong focus on text mining and biomedical ontologies given my research experience and their growing relevance for Health and Life studies. Nevertheless, the same type of solutions presented in the book are also applicable to many other research fields and data sources.

Lisboa, Portugal
January 2019
Francisco M. Couto

Acknowledgments

I am grateful to all the people who helped and encouraged me along this journey, especially to Rita Ferreira for all the insightful discussions about shell scripting.

I am also grateful for all the suggestions and corrections given by my colleague Prof. José Baptista Coelho and by my college students: Alice Veiros, Ana Ferreira, Carlota Silva, Catarina Raimundo, Daniela Matias, Inês Justo, João Andrade, João Leitão, João Pedro Pais, Konil Solanki, Mariana Custódio, Marta Cunha, Manuel Fialho, Miguel Silva, Rafaela Marques, Raquel Chora and Sofia Morais.

This work was supported by FCT through funding of DeST: Deep Semantic Tagger project, ref. PTDC/CCI-BIO/28685/2017 (http://dest.rd.ciencias.ulisboa.pt/), and LASIGE Research Unit, ref. UID/CEC/00408/2019.

Contents

1 Introduction
  Biomedical Data Repositories
  Scientific Text
  Amount of Text
  Ambiguity and Contextualization
  Biomedical Ontologies
  Programming Skills
  Why This Book?
    Third-Party Solutions
    Simple Pipelines
  How This Book Helps Health and Life Specialists?
    Shell Scripting
    Text Files
    Relational Databases
  What Is in the Book?
    Command Line Tools
    Pipelines
    Regular Expressions
    Semantics

2 Resources
  Biomedical Text
    What?
    Where?
    How?
  Semantics
    What?
    Where?
    How?
  Further Reading

3 Data Retrieval
  Caffeine Example
  Unix Shell
    Current Directory
    Windows Directories
    Change Directory
    Useful Key Combinations
    Shell Version
    Data File
    File Contents
    Reverse File Contents
    My First Script
    Line Breaks
    Redirection Operator
    Installing Tools
    Permissions
    Debug
    Save Output
  Web Identifiers
    Single and Double Quotes
    Comments
  Data Retrieval
    Standard Error Output
  Data Extraction
    Single and Multiple Patterns
    Data Elements Selection
  Task Repetition
    Assembly Line
    File Header
    Variable
  XML Processing
    Human Proteins
    PubMed Identifiers
    PubMed Identifiers Extraction
    Duplicate Removal
    Complex Elements
    XPath
    Namespace Problems
    Only Local Names
    Queries
    Extracting XPath Results
  Text Retrieval
    Publication URL
    Title and Abstract
    Disease Recognition
  Further Reading

4 Text Processing
  Pattern Matching
    Case Insensitive Matching
    Number of Matches
    Invert Match
    File Differences
    Evaluation Metrics
    Word Matching
