Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information PDF

455 Pages·2018·7.397 MB·English

Principles and Practice of Big Data
Preparing, Sharing, and Analyzing Complex Information
Second Edition

Jules J. Berman

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2018 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-815609-4

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Mara Conner
Editorial Project Manager: Mariana L. Kuhl
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Matthew Limbert

Typeset by SPi Global, India

Other Books by Jules J. Berman

– Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases (2014) 9780124199880
– Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms (2012) 9780124158955
– Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information (2013) 9780124045767
– Repurposing Legacy Data: Innovative Case Studies (2015) 9780128028827
– Data Simplification: Taming Information with Open Source Tools (2016) 9780128037812
– Precision Medicine and The Reinvention of Human Disease (2018) 9780128143933

Dedication

To my wife, Irene, who reads every day, and who understands why books are important.

About the Author

Jules J. Berman received two baccalaureate degrees from MIT; in Mathematics, and in Earth and Planetary Sciences. He holds a PhD from Temple University, and an MD, from the University of Miami. He was a graduate student researcher in the Fels Cancer Research Institute, at Temple University, and at the American Health Foundation in Valhalla, New York.
His postdoctoral studies were completed at the US National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health, as a Medical Officer, and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past president of the Association for Pathology Informatics, and the 2011 recipient of the Association’s Lifetime Achievement Award. He has first-authored over 100 scientific publications and has written more than a dozen books in the areas of data science and disease biology. Several of his most recent titles, published by Elsevier, include:

– Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms (2012)
– Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information (2013)
– Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases (2014)
– Repurposing Legacy Data: Innovative Case Studies (2015)
– Data Simplification: Taming Information with Open Source Tools (2016)
– Precision Medicine and the Reinvention of Human Disease (2018)

Author’s Preface to Second Edition

Everything has been said before, but since nobody listens we have to keep going back and beginning all over again.
André Gide

Good science writers will always jump at the chance to write a second edition of an earlier work. No matter how hard they try, that first edition will contain inaccuracies and misleading remarks. Sentences that seemed brilliant when first conceived will, with the passage of time, transform into examples of intellectual overreaching. Points too trivial to include in the original manuscript may now seem like profundities that demand a full explanation.
A second edition provides rueful authors with an opportunity to correct the record.

When the first edition of Principles of Big Data was published in 2013 the field was very young and there were few scientists who knew what to do with Big Data. The data that kept pouring in was stored, like wheat in silos, throughout the planet. It was obvious to data managers that none of that stored data would have any scientific value unless it was properly annotated with metadata, identifiers, timestamps, and a set of basic descriptors. Under these conditions, the first edition of the Principles of Big Data stressed the proper and necessary methods for collecting, annotating, organizing, and curating Big Data. The process of preparing Big Data comes with its own unique set of challenges, and the First Edition was peppered with warnings and exhortations intended to steer readers clear of disaster.

It is now five years since the first edition was published and there have since been hundreds of books written on the subject of Big Data. As a scientist, it is disappointing to me that the bulk of Big Data, today, is focused on issues of marketing and predictive analytics (e.g., “Who is likely to buy product x, given that they bought product y two weeks previously?”); and machine learning (e.g., driverless cars, computer vision, speech recognition). Machine learning relies heavily on hyped up techniques such as neural networks and deep learning; neither of which are leading to fundamental laws and principles that simplify and broaden our understanding of the natural world and the physical universe. For the most part, these techniques use data that is relatively new (i.e., freshly collected), poorly annotated (i.e., provided with only the minimal information required for one particular analytic process), and not deposited for public evaluation or for re-use. In short, Big Data has followed the path of least resistance, avoiding most of the tough issues raised in the first edition of this book; such as the importance of sharing data with the public, the value of finding relationships (not similarities) among data objects, and the heavy, but inescapable, burden of creating robust, immortal, and well-annotated data.
It was certainly my hope that the greatest advances from Big Data would come as fundamental breakthroughs in the realms of medicine, biology, physics, engineering, and chemistry. Why has the focus of Big Data shifted from basic science over to machine learning? It may have something to do with the fact that no book, including the first edition of this book, has provided readers with the methods required to put the principles of Big Data into practice. In retrospect, it was not sufficient to describe a set of principles and then expect readers to invent their own methodologies.

Consequently, in this second edition, the publisher has changed the title of the book from “The Principles of Big Data,” to “The Principles AND PRACTICE of Big Data.” Henceforth and herein, recommendations are accompanied by the methods by which those recommendations can be implemented. The reader will find that all of the methods for implementing Big Data preparation and analysis are really quite simple. For the most part, computer methods require some basic familiarity with a programming language, and, despite misgivings, Python was chosen as the language of choice. The advantages of Python are:

– Python is a no-cost, open source, high-level programming language that is easy to acquire, install, learn, and use, and is available for every popular computer operating system.
– Python is extremely popular, at the present time, and its popularity seems to be increasing.
– Python distributions (such as Anaconda) come bundled with hundreds of highly useful modules (such as numpy, matplotlib, and scipy).
– Python has a large and active user group that has provided an extraordinary amount of documentation for Python methods and modules.
– Python supports some object-oriented techniques that will be discussed in this new edition.

As with everything in life, Python has its drawbacks:

– The most current versions of Python are not backward compatible with earlier versions. The scripts and code snippets included in this book should work for most versions of Python 3.x, but may not work with Python versions 2.x and earlier, unless the reader is prepared to devote some time to tweaking the code. Of course, these short scripts and snippets are intended as simplified demonstrations of concepts, and must not be construed as application-ready code.
– The built-in Python methods are sometimes optimized for speed by utilizing Random Access Memory (RAM) to hold data structures, including data structures built through iterative loops. Iterations through Big Data may exhaust available RAM, leading to the failure of Python scripts that functioned well with small data sets.
– Python’s implementation of object orientation allows multiclass inheritance (i.e., a class can be the subclass of more than one parent class). We will describe why this is problematic, and the compensatory measures that we must take, whenever we use our Python programming skills to understand large and complex sets of data objects.

The core of every algorithm described in the book can be implemented in a few lines of code, using just about any popular programming language, under any operating system, on any modern computer. Numerous Python snippets are provided, along with descriptions of free utilities that are widely available on every popular operating system.
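
The RAM drawback listed above can be illustrated with a minimal sketch. It is not taken from the book; the function names (`squares_list`, `squares_stream`, `running_mean`) are hypothetical and chosen only for illustration. The first function builds an entire list in memory, while the generator version hands over one value at a time, so a single-pass statistic can be computed without holding the data set in RAM.

```python
def squares_list(n):
    """Build the whole result in RAM before returning it."""
    return [i * i for i in range(n)]


def squares_stream(n):
    """Yield one value at a time, so nothing accumulates in memory."""
    for i in range(n):
        yield i * i


def running_mean(values):
    """Single-pass mean that never stores the values it has seen."""
    count = 0
    mean = 0.0
    for v in values:
        count += 1
        mean += (v - mean) / count  # incremental update of the mean
    return mean


# On a small input both approaches agree; only the list version
# needs memory proportional to n.
small = squares_list(5)
mean_small = sum(small) / len(small)
mean_streamed = running_mean(squares_stream(5))
```

The same pattern applies when reading a large file line by line instead of loading it whole; the trade-off is that a stream can only be traversed once.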
This book stresses the point that most data analyses conducted on large, complex data sets can be achieved with simple methods, bypassing specialized software systems (e.g., parallelization of computational processes) or hardware (e.g., supercomputers). Readers who are completely unacquainted with Python may find that they can read and understand Python code, if the snippets of code are brief, and accompanied by some explanation in the text. In any case, readers who are primarily concerned with mastering the principles of Big Data can skip the code snippets without losing the narrative thread of the book.

This second edition has been expanded to stress methodologies that have been overlooked by the authors of other books in the field of Big Data analysis. These would include:

– Data preparation
How to annotate data with metadata and how to create data objects composed of triples. The concept of the triple, as the fundamental conveyor of meaning in the computational sciences, is fully explained.
– Data structures of particular relevance to Big Data
Concepts such as triplestores, distributed ledgers, unique identifiers, timestamps, concordances, indexes, dictionary objects, data persistence, and the roles of one-way hashes and encryption protocols for data storage and distribution are covered.
– Classification of data objects
How to assign data objects to classes based on their shared relationships, and the computational roles filled by classifications in the analysis of Big Data will be discussed at length.
– Introspection
How to create data objects that are self-describing, permitting the data analyst to group objects belonging to the same class and to apply methods to class objects that have been inherited from their ancestral classes.
– Algorithms that have special utility in Big Data preparation and analysis
How to use one-way hashes, unique identifier generators, cryptographic techniques, timing methods, and timestamping protocols to create unique data objects that are immutable (never changing), immortal, and private; and to create data structures that facilitate a host of useful functions that will be described (e.g., blockchains and distributed ledgers, protocols for safely sharing confidential information, and methods for reconciling identifiers across data collections without violating privacy).
– Tips for Big Data analysis
How to overcome many of the analytic limitations imposed by scale and dimensionality, using a range of simple techniques (e.g., approximations, so-called back-of-the-envelope
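
Several of the building blocks named in the list above (triples, unique identifiers, timestamps, and one-way hashes) are available in the Python standard library. The following is a minimal sketch under stated assumptions, not the book's own code: the function names `one_way_hash` and `make_triples`, the tag names, and the choice of SHA-256 and version 4 UUIDs are all illustrative assumptions.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def one_way_hash(text):
    """Return a SHA-256 digest; the input cannot be recovered from it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_triples(metadata):
    """Describe one data object as (identifier, tag, value) triples.

    Assumption: a random UUID serves as the unique identifier and a UTC
    ISO 8601 string serves as the timestamp.
    """
    object_id = str(uuid.uuid4())
    created = datetime.now(timezone.utc).isoformat()
    triples = [(object_id, "created", created)]
    for tag, value in metadata.items():
        triples.append((object_id, tag, value))
    return triples


# The confidential name is stored only as its hash. Two collections that
# hash the same name the same way get the same digest, so records can be
# matched without exchanging the name itself.
triples = make_triples({
    "class": "patient_record",
    "name_hash": one_way_hash("Jane Doe"),
})
```

Every triple carries the same identifier, which is what binds the separate assertions to one data object. In practice a hashing scheme for confidential names would usually add a shared secret or salt, which this sketch omits.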
