
Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information

449 Pages·2018·3.24 MB·English


Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information
Second Edition
Jules J. Berman
© 2018 Elsevier Inc.
ISBN: 978-0-12-815609-4

Author’s Preface to Second Edition

Everything has been said before, but since nobody listens we have to keep going back and beginning all over again.
Andre Gide

Good science writers will always jump at the chance to write a second edition of an earlier work. No matter how hard they try, that first edition will contain inaccuracies and misleading remarks. Sentences that seemed brilliant when first conceived will, with the passage of time, transform into examples of intellectual overreaching. Points too trivial to include in the original manuscript may now seem like profundities that demand a full explanation. A second edition provides rueful authors with an opportunity to correct the record.

When the first edition of Principles of Big Data was published in 2013 the field was very young and there were few scientists who knew what to do with Big Data. The data that kept pouring in was stored, like wheat in silos, throughout the planet. It was obvious to data managers that none of that stored data would have any scientific value unless it was properly annotated with metadata, identifiers, timestamps, and a set of basic descriptors. Under these conditions, the first edition of the Principles of Big Data stressed the proper and necessary methods for collecting, annotating, organizing, and curating Big Data. The process of preparing Big Data comes with its own unique set of challenges, and the First Edition was peppered with warnings and exhortations intended to steer readers clear of disaster.

It is now five years since the first edition was published and there have since been hundreds of books written on the subject of Big Data. As a scientist, it is disappointing to me that the bulk of Big Data, today, is focused on issues of marketing and predictive analytics (e.g., “Who is likely to buy product x, given that they bought product y two weeks previously?”); and machine learning (e.g., driverless cars, computer vision, speech recognition).
Machine learning relies heavily on hyped up techniques such as neural networks and deep learning; neither of which are leading to fundamental laws and principles that simplify and broaden our understanding of the natural world and the physical universe. For the most part, these techniques use data that is relatively new (i.e., freshly collected), poorly annotated (i.e., provided with only the minimal information required for one particular analytic process), and not deposited for public evaluation or for re-use. In short, Big Data has followed the path of least resistance, avoiding most of the tough issues raised in the first edition of this book; such as the importance of sharing data with the public, the value of finding relationships (not similarities) among data objects, and the heavy, but inescapable, burden of creating robust, immortal, and well-annotated data.

It was certainly my hope that the greatest advances from Big Data would come as fundamental breakthroughs in the realms of medicine, biology, physics, engineering, and chemistry. Why has the focus of Big Data shifted from basic science over to machine learning? It may have something to do with the fact that no book, including the first edition of this book, has provided readers with the methods required to put the principles of Big Data into practice. In retrospect, it was not sufficient to describe a set of principles and then expect readers to invent their own methodologies.
Consequently, in this second edition, the publisher has changed the title of the book from “The Principles of Big Data,” to “The Principles AND PRACTICE of Big Data.” Henceforth and herein, recommendations are accompanied by the methods by which those recommendations can be implemented. The reader will find that all of the methods for implementing Big Data preparation and analysis are really quite simple. For the most part, computer methods require some basic familiarity with a programming language, and, despite misgivings, Python was chosen as the language of choice. The advantages of Python are:
– Python is a no-cost, open source, high-level programming language that is easy to acquire, install, learn, and use, and is available for every popular computer operating system.
– Python is extremely popular, at the present time, and its popularity seems to be increasing.
– Python distributions (such as Anaconda) come bundled with hundreds of highly useful modules (such as numpy, matplotlib, and scipy).
– Python has a large and active user group that has provided an extraordinary amount of documentation for Python methods and modules.
– Python supports some object-oriented techniques that will be discussed in this new edition.

As with everything in life, Python has its drawbacks:
– The most current versions of Python are not backwardly compatible with earlier versions. The scripts and code snippets included in this book should work for most versions of Python 3.x, but may not work with Python versions 2.x and earlier, unless the reader is prepared to devote some time to tweaking the code. Of course, these short scripts and snippets are intended as simplified demonstrations of concepts, and must not be construed as application-ready code.
– The built-in Python methods are sometimes maximized for speed by utilizing Random Access Memory (RAM) to hold data structures, including data structures built through iterative loops. Iterations through Big Data may exhaust available RAM, leading to the failure of Python scripts that functioned well with small data sets.
– Python’s implementation of object orientation allows multiclass inheritance (i.e., a class can be the subclass of more than one parent class). We will describe why this is problematic, and the compensatory measures that we must take, whenever we use our Python programming skills to understand large and complex sets of data objects.

The core of every algorithm described in the book can be implemented in a few lines of code, using just about any popular programming language, under any operating system, on any modern computer. Numerous Python snippets are provided, along with descriptions of free utilities that are widely available on every popular operating system. This book stresses the point that most data analyses conducted on large, complex data sets can be achieved with simple methods, bypassing specialized software systems (e.g., parallelization of computational processes) or hardware (e.g., supercomputers). Readers who are completely unacquainted with Python may find that they can read and understand Python code, if the snippets of code are brief, and accompanied by some explanation in the text. In any case, readers who are primarily concerned with mastering the principles of Big Data can skip the code snippets without losing the narrative thread of the book.

This second edition has been expanded to stress methodologies that have been overlooked by the authors of other books in the field of Big Data analysis. These would include:
– Data preparation
  How to annotate data with metadata and how to create data objects composed of triples. The concept of the triple, as the fundamental conveyor of meaning in the computational sciences, is fully explained.
– Data structures of particular relevance to Big Data
  Concepts such as triplestores, distributed ledgers, unique identifiers, timestamps, concordances, indexes, dictionary objects, data persistence, and the roles of one-way hashes and encryption protocols for data storage and distribution are covered.
– Classification of data objects
  How to assign data objects to classes based on their shared relationships, and the computational roles filled by classifications in the analysis of Big Data will be discussed at length.
– Introspection
  How to create data objects that are self-describing, permitting the data analyst to group objects belonging to the same class and to apply methods to class objects that have been inherited from their ancestral classes.
– Algorithms that have special utility in Big Data preparation and analysis
  How to use one-way hashes, unique identifier generators, cryptographic techniques, timing methods, and timestamping protocols to create unique data objects that are immutable (never changing), immortal, and private; and to create data structures that facilitate a host of useful functions that will be described (e.g., blockchains and distributed ledgers, protocols for safely sharing confidential information, and methods for reconciling identifiers across data collections without violating privacy).
– Tips for Big Data analysis
  How to overcome many of the analytic limitations imposed by scale and dimensionality, using a range of simple techniques (e.g., approximations, so-called back-of-the-envelope tricks, repeated sampling using a random number generator, Monte Carlo simulations, and data reduction methods).
– Data reanalysis, data repurposing, and data sharing
  Why the first analysis of Big Data is almost always incorrect, misleading, or woefully incomplete, and why data reanalysis has become a crucial skill that every serious Big Data analyst must acquire. The process of data reanalysis often inspires repurposing of Big Data resources. Neither data reanalysis nor data repurposing can be achieved unless and until the obstacles to data sharing are overcome. The topics of data reanalysis, data repurposing, and data sharing are explored at length.
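As a rough illustration of one idea from the list above (reconciling identifiers across data collections without exposing them), the following Python sketch replaces the confidential identifiers in two collections with one-way hashes and then matches records on the hashes alone. The collection names, record values, and the shared salt are hypothetical, and this is a simplified demonstration under those assumptions rather than a method taken from the book. In particular, hashing short, guessable identifiers offers only limited privacy protection.

```python
import hashlib


def one_way_hash(identifier: str, salt: str = "hypothetical-shared-salt") -> str:
    """Return the SHA-256 hex digest of a salted identifier."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


# Two hypothetical collections keyed by the same kind of confidential identifier.
collection_a = {"patient-001": "glucose 85", "patient-002": "glucose 110"}
collection_b = {"patient-002": "age 54", "patient-003": "age 31"}

# Each data holder publishes only hashed identifiers, never the originals.
hashed_a = {one_way_hash(key): value for key, value in collection_a.items()}
hashed_b = {one_way_hash(key): value for key, value in collection_b.items()}

# Records that share a hash belong to the same (undisclosed) identifier.
shared_hashes = sorted(set(hashed_a) & set(hashed_b))
merged = {h: (hashed_a[h], hashed_b[h]) for h in shared_hashes}

for hashed_id, values in merged.items():
    print(hashed_id[:12], values)
```

Both parties must apply the same hash function and the same salt, otherwise identical identifiers will not produce matching digests.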
Comprehensive texts, such as the second edition of the Principles and Practice of Big Data, are never quite as comprehensive as they might strive to be; there simply is no way to fully describe every concept and method that is relevant to a multi-disciplinary field, such as Big Data. To compensate for such deficiencies, there is an extensive Glossary section for every chapter, that defines the terms introduced in the text, providing some explanation of the relevance of the terms for Big Data scientists. In addition, when techniques and methods are discussed, a list of references that the reader may find useful, for further reading on the subject, is provided. Altogether, the second edition contains about 600 citations to outside references, most of which are available as free downloads. There are over 300 glossary items, many of which contain short Python snippets that readers may find useful.

As a final note, this second edition uses case studies to show readers how the principles of Big Data are put into practice. Although case studies are drawn from many fields of science, including physics, economics, and astronomy, readers will notice an overabundance of examples drawn from the biological sciences (particularly medicine and zoology). The reason for this is that the taxonomy of all living terrestrial organisms is the oldest and best Big Data classification in existence. All of the classic errors in data organization, and in data analysis, have been committed in the field of biology. More importantly, these errors have been documented in excruciating detail and most of the documented errors have been corrected and published for public consumption. If you want to understand how Big Data can be used as a tool for scientific advancement, then you must look at case examples taken from the world of biology, a well-documented field where everything that can happen has happened, is happening, and will happen. Every effort has been made to limit Case Studies to the simplest examples of their type, and to provide as much background explanation as non-biologists may require.
Principles and Practice of Big Data, Second Edition, is devoted to the intellectual conviction that the primary purpose of Big Data analysis is to permit us to ask and answer a wide range of questions that could not have been credibly approached with small sets of data. There is every reason to hope that the readers of this book will soon achieve scientific breakthroughs that were beyond the reach of prior generations of scientists. Good luck!

Author’s Preface to First Edition

We can’t solve problems by using the same kind of thinking we used when we created them.
Albert Einstein

Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that’s 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes or 1,900 billion gigabytes [1]. From this growing tangle of digital information, the next generation of data resources will emerge.

As we broaden our data reach (i.e., the different kinds of data objects included in the resource), and our data timeline (i.e., accruing data from the future and the deep past), we need to find ways to fully describe each piece of data, so that we do not confuse one data item with another, and so that we can search and retrieve data items when we need them. Astute informaticians understand that if we fully describe everything in our universe, we would need to have an ancillary universe to hold all the information, and the ancillary universe would need to be much larger than our physical universe.

In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation. If the data in our Big Data resources are not well organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval, and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.
Perhaps the greatest potential benefit of Big Data is its ability to link seemingly disparate disciplines, to develop and test hypotheses that cannot be approached within a single knowledge domain. Methods by which analysts can navigate through different Big Data resources to create new, merged data sets, will be reviewed.

What, exactly, is Big Data? Big Data is characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data) [2]. Those of us who have worked on Big Data projects might suggest throwing a few more v’s into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled).

Many of the fundamental principles of Big Data organization have been described in the “metadata” literature. This literature deals with the formalisms of data description (i.e., how to describe data); the syntax of data description (e.g., markup languages such as eXtensible Markup Language, XML); semantics (i.e., how to make computer-parsable statements that convey meaning); the syntax of semantics (e.g., framework specifications such as Resource Description Framework, RDF, and Web Ontology Language, OWL); the creation of data objects that hold data values and self-descriptive information; and the deployment of ontologies, hierarchical class systems whose members are data objects.

The field of metadata may seem like a complete waste of time to professionals who have succeeded very well, in data-intensive fields, without resorting to metadata formalisms. Many computer scientists, statisticians, database managers, and network specialists have no trouble handling large amounts of data, and they may not see the need to create a strange new data model for Big Data resources. They might feel that all they really need is greater storage capacity, distributed over more powerful computers that work in parallel with one another. With this kind of computational power, they can store, retrieve, and analyze larger and larger quantities of data.
These fantasies only apply to systems that use relatively simple data or data that can be represented in a uniform and standard format. When data is highly complex and diverse, as found in Big Data resources, the importance of metadata looms large. Metadata will be discussed, with a focus on those concepts that must be incorporated into the organization of Big Data resources. The emphasis will be on explaining the relevance and necessity of these concepts, without going into gritty details that are well covered in the metadata literature.

When data originates from many different sources, arrives in many different forms, grows in size, changes its values, and extends into the past and the future, the game shifts from data computation to data management. I hope that this book will persuade readers that faster, more powerful computers are nice to have, but these devices cannot compensate for deficiencies in data preparation. For the foreseeable future, universities, federal agencies, and corporations will pour money, time, and manpower into Big Data efforts. If they ignore the fundamentals, their projects are likely to fail. On the other hand, if they pay attention to Big Data fundamentals, they will discover that Big Data analyses can be performed on standard computers. The simple lesson, that data trumps computation, will be repeated throughout this book in examples drawn from well-documented events.

There are three crucial topics related to data preparation that are omitted from virtually every other Big Data book: identifiers, immutability, and introspection.

A thoughtful identifier system ensures that all of the data related to a particular data object will be attached to the correct object, through its identifier, and to no other object. It seems simple, and it is, but many Big Data resources assign identifiers promiscuously, with the end result that information related to a unique object is scattered throughout the resource, attached to other objects, and cannot be sensibly retrieved when needed. The concept of object identification is of such overriding importance that a Big Data resource can be usefully envisioned as a collection of unique identifiers to which complex data is attached.
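The picture of a resource as "a collection of unique identifiers to which complex data is attached" can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the `resource` dictionary, the helper names `new_object` and `attach`, and the example values are all hypothetical, and the identifiers come from the standard library's `uuid.uuid4()`.

```python
import uuid
from collections import defaultdict

# The resource maps each unique identifier to the data attached to it.
# Each attached item is a (metadata, value) pair, so the identifier plus
# the pair forms an identifier-metadata-value triple.
resource = defaultdict(list)


def new_object() -> str:
    """Mint a fresh identifier for a new data object."""
    return str(uuid.uuid4())


def attach(identifier: str, metadata: str, value: str) -> None:
    """Attach one (metadata, value) pair to exactly one identified object."""
    resource[identifier].append((metadata, value))


human = new_object()
attach(human, "name", "Homo sapiens")
attach(human, "rank", "species")

cat = new_object()
attach(cat, "name", "Felis catus")

# Everything known about an object is retrieved through its identifier.
print(resource[human])
```

Because data can only be attached through an identifier, information about one object is less likely to end up scattered across unrelated objects.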
Immutability is the principle that data collected in a Big Data resource is permanent, and can never be modified. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information change. This is all true, but the astute Big Data manager knows how to accrue information into data objects without changing the pre-existing data. Methods for achieving this seemingly impossible trick will be described in detail.

Introspection is a term borrowed from object-oriented programming, not often found in the Big Data literature. It refers to the ability of data objects to describe themselves when interrogated. With introspection, users of a Big Data resource can quickly determine the content of data objects and the hierarchical organization of data objects within the Big Data resource. Introspection allows users to see the types of data relationships that can be analyzed within the resource and clarifies how disparate resources can interact with one another.

Another subject covered in this book, and often omitted from the literature on Big Data, is data indexing. Though there are many books written on the art and science of so-called back-of-the-book indexes, scant attention has been paid to the process of preparing indexes for large and complex data resources. Consequently, most Big Data resources have nothing that could be called a serious index. They might have a Web page with a few links to explanatory documents, or they might have a short and crude "help" index, but it would be rare to find a Big Data resource with a comprehensive index containing a thoughtful and updated list of terms and links. Without a proper index, most Big Data resources have limited utility for any but a few cognoscenti. It seems odd to me that organizations willing to spend hundreds of millions of dollars on a Big Data resource will balk at investing a few thousand dollars more for a proper index.
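The immutability and introspection principles described above can be illustrated with a short Python sketch. It is one possible reading of those ideas, not the book's own implementation: the class names `DataObject` and `Specimen`, the method names, and the example values are hypothetical. New information is accrued as timestamped assertions appended to the object, so earlier values are never overwritten, and Python's built-in introspection (`type`, `__mro__`, `vars`) lets the object report its own class lineage and contents.

```python
import time


class DataObject:
    """A data object that accrues timestamped assertions and never edits old ones."""

    def __init__(self, identifier: str):
        self.identifier = identifier
        self._assertions = []  # append-only: (timestamp, metadata, value)

    def assert_value(self, metadata: str, value) -> None:
        """Record a new value; pre-existing assertions are left untouched."""
        self._assertions.append((time.time(), metadata, value))

    def history(self, metadata: str) -> list:
        """Return every value ever asserted for this metadata tag, oldest first."""
        return [v for (_, m, v) in self._assertions if m == metadata]

    def current(self, metadata: str):
        """Return the most recently asserted value, or None if there is none."""
        values = self.history(metadata)
        return values[-1] if values else None


class Specimen(DataObject):
    """A hypothetical subclass, used only to show class lineage."""


s = Specimen("obj-1")
s.assert_value("weight_kg", 4.1)
s.assert_value("weight_kg", 4.3)  # a correction is added, not written over

# Introspection: the object reports its class, its ancestors, and its contents.
class_lineage = [cls.__name__ for cls in type(s).__mro__]
print(class_lineage)
print(s.current("weight_kg"), s.history("weight_kg"))
print(sorted(vars(s)))
```

Nothing in Python prevents other code from modifying `_assertions` directly, so the append-only behavior here is a convention of the sketch rather than an enforced guarantee.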
Aside from these four topics, which readers would be hard-pressed to find in the existing Big Data literature, this book covers the usual topics relevant to Big Data design, construction, operation, and analysis. Some of these topics include data quality, providing structure to unstructured data, data deidentification, data standards and interoperability issues, legacy data, data reduction and transformation, data analysis, and software issues. For these topics, discussions focus on the underlying principles; programming code and mathematical equations are conspicuously inconspicuous. An extensive Glossary covers the technical or specialized terms and topics that appear throughout the text. As each Glossary term is "optional" reading, I took the liberty of expanding on technical or mathematical concepts that appeared in abbreviated form in the main text. The Glossary provides an explanation of the practical relevance of each term to Big Data, and some readers may enjoy browsing the Glossary as a stand-alone text.

The final four chapters are non-technical; all dealing in one way or another with the consequences of our exploitation of Big Data resources. These chapters will cover legal, social, and ethical issues. The book ends with my personal predictions for the future of Big Data, and its impending impact on our futures. When preparing this book, I debated whether these four chapters might best appear in the front of the book, to whet the reader’s appetite for the more technical chapters. I eventually decided that some readers would be unfamiliar with some of the technical language and concepts included in the final chapters, necessitating their placement near the end.

Readers may notice that many of the case examples described in this book come from the field of medical informatics.
The healthcare informatics field is particularly ripe for discussion because every reader is affected, on economic and personal levels, by the Big Data policies and actions emanating from the field of medicine. Aside from that, there is a rich literature on Big Data projects related to healthcare. As much of this literature is controversial, I thought it important to select examples that I could document from reliable sources. Consequently, the reference section is large, with over 200 articles from journals, newspaper articles, and books. Most of these cited articles are available for free Web download.

Who should read this book? This book is written for professionals who manage Big Data resources and for students in the fields of computer science and informatics. Data management professionals would include the leadership within corporations and funding agencies who must commit resources to the project, the project directors who must determine a feasible set of goals and who must assemble a team of individuals who, in aggregate, hold the requisite skills for the task: network managers, data domain specialists, metadata specialists, software programmers, standards experts, interoperability experts, statisticians, data analysts, and representatives from the intended user community. Students of informatics, the computer sciences, and statistics will discover that the special challenges attached to Big Data, seldom discussed in university classes, are often surprising; sometimes shocking.

By mastering the fundamentals of Big Data design, maintenance, growth, and validation, readers will learn how to simplify the endless tasks engendered by Big Data resources. Adept analysts can find relationships among data objects held in disparate Big Data resources if the data is prepared properly. Readers will discover how integrating Big Data resources can deliver benefits far beyond anything attained from stand-alone databases.

References
[1] Hilbert M, Lopez P. The world’s technological capacity to store, communicate, and compute information. Science 2011;332:60–5.
[2] Schmidt S. Data is exploding: the 3 V’s of Big Data. Business Computing World; 2012. May 15.
