PRINCIPLES OF BIG DATA
Preparing, Sharing, and Analyzing Complex Information

JULES J. BERMAN, Ph.D., M.D.

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS
SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

Acquiring Editor: Andrea Dierna
Editorial Project Manager: Heather Scherer
Project Manager: Punithavathy Govindaradjane
Designer: Russell Purdy

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2013 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods, they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data

Berman, Jules J.
  Principles of big data : preparing, sharing, and analyzing complex information / Jules J. Berman.
    pages cm
  ISBN 978-0-12-404576-7
  1. Big data. 2. Database management. I. Title.
  QA76.9.D32B47 2013
  005.74–dc23
  2013006421

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Printed and bound in the United States of America
13 14 15 16 17    10 9 8 7 6 5 4 3 2 1

For information on all MK publications visit our website at www.mkp.com

Dedication

To my father, Benjamin

Acknowledgments

I thank Roger Day and Paul Lewis, who resolutely pored through the entire manuscript, placing insightful and useful comments in every chapter. I thank Stuart Kramer, whose valuable suggestions for the content and organization of the text came when the project was in its formative stage. Special thanks go to Denise Penrose, who worked on her very last day at Elsevier to find this title a suitable home at Elsevier's Morgan Kaufmann imprint. I thank Andrea Dierna, Heather Scherer, and all the staff at Morgan Kaufmann who shepherded this book through the publication and marketing processes.

Author Biography

Jules Berman holds two Bachelor of Science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a Ph.D. from Temple University, and an M.D. from the University of Miami. He was a graduate researcher in the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he became the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the U.S. National Cancer Institute, where he worked and consulted on Big Data projects. In 2006, Dr. Berman was President of the Association for Pathology Informatics. In 2011, he received the Lifetime Achievement Award from the Association for Pathology Informatics. He is a coauthor on hundreds of scientific publications. Today, Dr. Berman is a freelance author, writing extensively in his three areas of expertise: informatics, computer programming, and pathology.

Preface

We can't solve problems by using the same kind of thinking we used when we created them.
Albert Einstein

Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that's 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes (1900 billion gigabytes; see Glossary item, Binary sizes).1 From this growing tangle of digital information, the next generation of data resources will emerge.

What exactly is Big Data? Big Data can be characterized by the three V's: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).2 Those of us who have worked on Big Data projects might suggest throwing a few more V's into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled; see Glossary item, Validation).

As the scope of our data (i.e., the different kinds of data objects included in the resource) and our data timeline (i.e., data accrued from the future and the deep past) are broadened, we need to find ways to fully describe each piece of data, so that we do not confuse one data item with another and so that we can search and retrieve data items when needed. Astute informaticians understand that if we fully described everything in our universe, we would need an ancillary universe to hold all the information, and the ancillary universe would need to be much, much larger than our physical universe.

In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation. If data in our Big Data resources (see Glossary item, Big Data resource) are not well organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval, and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.

Perhaps the greatest potential benefit of Big Data is the ability to link seemingly disparate disciplines, for the purpose of developing and testing hypotheses that cannot be approached within a single knowledge domain. Methods by which analysts can navigate through different Big Data resources to create new, merged data sets are reviewed.

Many of the fundamental principles of Big Data organization have been described in the "metadata" literature. This literature deals with the formalisms of data description (i.e., how to describe data), the syntax of data description (e.g., markup languages such as eXtensible Markup Language, XML), semantics (i.e., how to make computer-parsable statements that convey meaning), the syntax of semantics (e.g., framework specifications such as Resource Description Framework, RDF, and Web Ontology Language, OWL), the creation of data objects that hold data values and self-descriptive information, and the deployment of ontologies, hierarchical class systems whose members are data objects (see Glossary items, Specification, Semantics, Ontology, RDF, XML).

The field of metadata may seem like a complete waste of time to professionals who have succeeded very well in data-intensive fields without resorting to metadata formalisms. Many computer scientists, statisticians, database managers, and network specialists have no trouble handling large amounts of data and may not see the need to create a strange new data model for Big Data resources. They might feel that all they really need is greater storage capacity, distributed over more powerful computers that work in parallel with one another. With this kind of computational power, they can store, retrieve, and analyze larger and larger quantities of data. These fantasies only apply to systems that use relatively simple data or data that can be represented in a uniform and standard format. When data is highly complex and diverse, as found in Big Data resources, the importance of metadata looms large. Metadata will be discussed, with a focus on those concepts that must be incorporated into the organization of Big Data resources. The emphasis will be on explaining the relevance and necessity of these concepts, without going into gritty details that are well covered in the metadata literature.

When data originates from many different sources, arrives in many different forms, grows in size, changes its values, and extends into the past and the future, the game shifts from data computation to data management. It is hoped that this book will persuade readers that faster, more powerful computers are nice to have, but these devices cannot compensate for deficiencies in data preparation. For the foreseeable future, universities, federal agencies, and corporations will pour money, time, and manpower into Big Data efforts. If they ignore the fundamentals, their projects are likely to fail. However, if they pay attention to Big Data fundamentals, they will discover that Big Data analyses can be performed on standard computers. The simple lesson, that data trumps computation, is repeated throughout this book in examples drawn from well-documented events.

There are three crucial topics related to data preparation that are omitted from virtually every other Big Data book: identifiers, immutability, and introspection.

A thoughtful identifier system ensures that all of the data related to a particular data object will be attached to the correct object, through its identifier, and to no other object. It seems simple, and it is, but many Big Data resources assign identifiers promiscuously, with the end result that information related to a unique object is scattered throughout the resource, or attached to other objects, and cannot be sensibly retrieved when needed. The concept of object identification is of such overriding importance that a Big Data resource can be usefully envisioned as a collection of unique identifiers to which complex data is attached. Data identifiers are discussed in Chapter 2.

Immutability is the principle that data collected in a Big Data resource is permanent and can never be modified. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information change. This is all true, but the astute Big Data manager knows how to accrue information into data objects without changing the pre-existing data. Methods for achieving this seemingly impossible trick are described in detail in Chapter 6.

Introspection is a term borrowed from object-oriented programming, not often found in the Big Data literature. It refers to the ability of data objects to describe themselves when interrogated. With introspection, users of a Big Data resource can quickly determine the content of data objects and the hierarchical organization of data objects within the Big Data resource. Introspection allows users to see the types of data relationships that can be analyzed within the resource and clarifies how disparate resources can interact with one another. Introspection is described in detail in Chapter 4.

Another subject covered in this book, and often omitted from the literature on Big Data, is data indexing. Though there are many books written on the art and science of so-called back-of-the-book indexes, scant attention has been paid to the process of preparing indexes for large and complex data resources. Consequently, most Big Data resources have nothing that could be called a serious index. They might have a Web page with a few links to explanatory documents, or they might have a short and crude "help" index, but it would be rare to find a Big Data resource with a comprehensive index containing a thoughtful and updated list of terms and links. Without a proper index, most Big Data resources have utility for none but a few cognoscenti. It seems odd to me that organizations willing to spend hundreds of millions of dollars on a Big Data resource will balk at investing some thousands of dollars on a proper index.

Aside from these four topics, which readers would be hard-pressed to find in the existing Big Data literature, this book covers the usual topics relevant to Big Data design, construction, operation, and analysis. Some of these topics include data quality, providing structure to unstructured data, data deidentification, data standards and interoperability issues, legacy data, data reduction and transformation, data analysis, and software issues. For these topics, discussions focus on the underlying principles; programming code and mathematical equations are conspicuously inconspicuous. An extensive Glossary covers the technical or specialized terms and topics that appear throughout the text. As each Glossary term is "optional" reading, I took the liberty of expanding on technical or mathematical concepts that appeared in abbreviated form in the main text. The Glossary provides an explanation of the practical relevance of each term to Big Data, and some readers may enjoy browsing the Glossary as a stand-alone text.

The final four chapters are nontechnical, all dealing in one way or another with the consequences of our exploitation of Big Data resources. These chapters cover legal, social, and ethical issues. The book ends with my personal predictions for the future of Big Data and its impending impact on the world. When preparing this book, I debated whether these four chapters might best appear in the front of the book, to whet the reader's appetite for the more technical chapters. I eventually decided that some readers would be unfamiliar with the technical language and concepts included in the final chapters, necessitating their placement near the end. Readers with a strong informatics background may enjoy the book more if they start their reading at Chapter 12.

Readers may notice that many of the case examples described in this book come from the field of medical informatics. The health care informatics field is particularly ripe for discussion because every reader is affected, on economic and personal levels, by the Big Data policies and actions emanating from the field of medicine. Aside from that, there is a rich literature on Big Data projects related to health care. As much of this literature is controversial, I thought it important to select examples that I could document from reliable sources. Consequently, the reference section is large, with over 200 references to journal articles, newspaper articles, and books. Most of these cited articles are available for free Web download.

Who should read this book? This book is written for professionals who manage Big Data resources and for students in the fields of computer science and informatics. Data management professionals would include the leadership within corporations and funding agencies who must commit resources to the project, and the project directors who must determine a feasible set of goals and who must assemble a team of individuals who, in aggregate, hold the requisite skills for the task: network managers, data domain specialists, metadata specialists, software programmers, standards experts, interoperability experts, statisticians, data analysts, and representatives from the intended user community. Students of informatics, the computer sciences, and statistics will discover that the special challenges attached to Big Data, seldom discussed in university classes, are often surprising and sometimes shocking.

By mastering the fundamentals of Big Data design, maintenance, growth, and validation, readers will learn how to simplify the endless tasks engendered by Big Data resources. Adept analysts can find relationships among data objects held in disparate Big Data resources, if the data is prepared properly. Readers will discover how integrating Big Data resources can deliver benefits far beyond anything attained from stand-alone databases.