Studies in Big Data 26 S. Srinivasan E ditor Guide to Big Data Applications Studies in Big Data Volume 26 SeriesEditor JanuszKacprzyk,PolishAcademyofSciences,Warsaw,Poland e-mail:[email protected] AboutthisSeries Theseries“StudiesinBigData”(SBD)publishesnewdevelopmentsandadvances in the various areas of Big Data – quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and other. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neuralnetworks,evolutionarycomputation,softcomputing,fuzzysystems,aswell asartificialintelligence,datamining,modernstatisticsandOperationsresearch,as well as self-organizing systems. Of particular value to both the contributors and thereadershiparetheshortpublicationtimeframeandtheworld-widedistribution, whichenablebothwideandrapiddisseminationofresearchoutput. Moreinformationaboutthisseriesathttp://www.springer.com/series/11970 S. Srinivasan Editor Guide to Big Data Applications 123 Editor S.Srinivasan JesseH.JonesSchoolofBusiness TexasSouthernUniversity Houston,TX,USA ISSN2197-6503 ISSN2197-6511 (electronic) StudiesinBigData ISBN978-3-319-53816-7 ISBN978-3-319-53817-4 (eBook) DOI10.1007/978-3-319-53817-4 LibraryofCongressControlNumber:2017936371 ©SpringerInternationalPublishingAG2018 Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpartof thematerialisconcerned,specificallytherightsoftranslation,reprinting,reuseofillustrations,recitation, broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformation storageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodology nowknownorhereafterdeveloped. Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant protectivelawsandregulationsandthereforefreeforgeneraluse. Thepublisher,theauthorsandtheeditorsaresafetoassumethattheadviceandinformationinthisbook arebelievedtobetrueandaccurateatthedateofpublication.Neitherthepublishernortheauthorsor theeditorsgiveawarranty,expressorimplied,withrespecttothematerialcontainedhereinorforany errorsoromissionsthatmayhavebeenmade.Thepublisherremainsneutralwithregardtojurisdictional claimsinpublishedmapsandinstitutionalaffiliations. Printedonacid-freepaper ThisSpringerimprintispublishedbySpringerNature TheregisteredcompanyisSpringerInternationalPublishingAG Theregisteredcompanyaddressis:Gewerbestrasse11,6330Cham,Switzerland TomywifeLakshmiandgrandsonSahaas Foreword ItgivesmegreatpleasuretowritethisForewordforthistimelypublicationonthe topicoftheever-growinglistofBigDataapplications.Thepotentialforleveraging existing data from multiple sources has been articulated over and over, in an almostinfinitelandscape,yetitisimportanttorememberthatindoingso,domain knowledge is key to success. Naïve attempts to process data are bound to lead to errors such as accidentally regressing on noncausal variables. As Michael Jordan at Berkeley has pointed out, in Big Data applications the number of combinations of the features grows exponentially with the number of features, and so, for any particular database, you are likely to find some combination of columns that will predict perfectly any outcome, just by chance alone. It is therefore important that we do not process data in a hypothesis-free manner and skip sanity checks on our data. In this collection titled “Guide to Big Data Applications,” the editor has assembledasetofapplicationsinscience,medicine,andbusinesswheretheauthors have attempted to do just this—apply Big Data techniques together with a deep understanding of the source data. The applications covered give a flavor of the benefitsofBigDatainmanydisciplines.Thisbookhas19chaptersbroadlydivided into four parts. In Part I, there are four chapters that cover the basics of Big Data, aspects of privacy, and how one could use Big Data in natural language processing (a particular concern for privacy). Part II covers eight chapters that vii viii Foreword lookatvariousapplicationsofBigDatainenvironmentalscience,oilandgas,and civilinfrastructure,coveringtopicssuchasdeduplication,encryptedsearch,andthe friendshipparadox. Part III covers Big Data applications in medicine, covering topics ranging from “The Impact of Big Data on the Physician,” written from a purely clinical perspective, to the often discussed deep dives on electronic medical records. PerhapsmostexcitingintermsoffuturelandscapingistheapplicationofBigData applicationinhealthcarefromadevelopingcountryperspective.Thisisoneofthe most promising growth areas in healthcare, due to the current paucity of current services and the explosion of mobile phone usage. The tabula rasa that exists in manycountriesholdsthepotentialtoleapfrogmanyofthemistakeswehavemade in the west with stagnant silos of information, arbitrary barriers to entry, and the lackofanystandardizedschemaornondegenerateontologies. InPartIV,thebook covers BigDataapplications inbusiness,which isperhaps theunifyingsubjecthere,giventhatnoneoftheaboveapplicationareasarelikely to succeed without a good business model. The potential to leverage Big Data approachesinbusinessisenormous,frombankingpracticestotargetedadvertising. Theneedforinnovationinthisspaceisasimportantastheunderlyingtechnologies themselves. As Clayton Christensen points out in The Innovator’s Prescription, threerevolutionsareneededforasuccessfuldisruptiveinnovation: 1. Atechnologyenablerwhich“routinizes”previouslycomplicatedtask 2. Abusinessmodelinnovationwhichisaffordableandconvenient 3. A value network whereby companies with disruptive mutually reinforcing economicmodelssustaineachotherinastrongecosystem We see this happening with Big Data almost every week, and the future is exciting. In this book, the reader will encounter inspiration in each of the above topic areasandbeabletoacquireinsightsintoapplicationsthatprovidetheflavorofthis fast-growinganddynamicfield. Atlanta,GA,USA GariClifford December10,2016 Preface BigDataapplicationsaregrowingveryrapidlyaroundtheglobe.Thisnewapproach todecisionmakingtakesintoaccountdatagatheredfrommultiplesources.Heremy goalistoshowhowthesediversesourcesofdataareusefulinarrivingatactionable information. In this collection of articles the publisher and I are trying to bring in one place several diverse applications of Big Data. The goal is for users to see how a Big Data application in another field could be replicated in their discipline. With this in mind I have assembled in the “Guide to Big Data Applications” a collection of19chapters writtenbyacademics andindustrypractitionersglobally. These chapters reflect what Big Data is, how privacy can be protected with Big DataandsomeoftheimportantapplicationsofBigDatainscience,medicineand business. These applications are intended to be representative and not exhaustive. For nearly two years I spoke with major researchers around the world and the publisher. These discussions led to this project. The initial Call for Chapters was senttoseveralhundredresearchersgloballyviaemail.Approximately40proposals weresubmitted.Outofthesecamecommitmentsforcompletioninatimelymanner from 20 people. Most of these chapters are written by researchers while some are written by industry practitioners. One of the submissions was not included as it could not provide evidence of use of Big Data. This collection brings together in one place several important applications of Big Data. All chapters were reviewed using a double-blind process and comments provided to the authors. The chapters includedreflectthefinalversionsofthesechapters. Ihavearrangedthechaptersinfourparts.PartIincludesfourchaptersthatdeal with basic aspects of Big Data and how privacy is an integral component. In this part I include an introductory chapter that lays the foundation for using Big Data inavarietyofapplications.Thisisthenfollowedwithachapterontheimportance ofincludingprivacyaspectsatthedesignstageitself.Thischapterbytwoleading researchers in the field shows the importance of Big Data in dealing with privacy issues and how they could be better addressed by incorporating privacy aspects at thedesignstageitself.Theteamofresearchersfromamajorresearchuniversityin the USA addresses the importance of federated Big Data. They are looking at the use of distributed data in applications. This part is concluded with a chapter that ix x Preface shows the importance of word embedding and natural language processing using BigDataanalysis. In Part II, there are eight chapters on the applications of Big Data in science. Science is an important area where decision making could be enhanced on the way to approach a problem using data analysis. The applications selected here dealwithEnvironmentalScience,HighPerformanceComputing(HPC),friendship paradox in noting which friend’s influence will be significant, significance of using encrypted search with Big Data, importance of deduplication in Big Data especially when data is collected from multiple sources, applications in Oil & Gas and how decision making can be enhanced in identifying bridges that need to be replaced as part of meeting safety requirements. All these application areas selected for inclusion in this collection show the diversity of fields in which Big Data is used today. The Environmental Science application shows how the data published by the National Oceanic and Atmospheric Administration (NOAA) is usedtostudytheenvironment.Sincesuchdatasetsareverylarge,specializedtools are needed to benefit from them. In this chapter the authors show how Big Data tools help in this effort. The team of industry practitioners discuss how there is greatsimilarityinthewayHPCdealswithlow-latency,massivelyparallelsystems and distributed systems. These are all typical of how Big Data is used using tools suchasMapReduce,HadoopandSpark.Quoraisaleadingproviderofanswersto user queries and in this context one of their data scientists is addressing how the FriendshipparadoxisplayingasignificantpartinQuoraanswers.Thisisaclassic illustrationofaBigDataapplicationusingsocialmedia. Big Data applications in science exist in many branches and it is very heavily used in the Oil and Gas industry. Two chapters that address the Oil and Gas application are written by two sets of people with extensive industry experience. TwospecificchaptersaredevotedtohowBigDataisusedindeduplicationpractices involving multimedia data in the cloud and how privacy-aware searches are done over encrypted data. Today, people are very concerned about the security of data storedwithanapplicationprovider.Encryptionisthepreferredtooltoprotectsuch data and so having an efficient way to search such encrypted data is important. This chapter’s contribution in this regard will be of great benefit for many users. We conclude Part II with a chapter that shows how Big Data is used in noting the structuralsafetyofnation’sbridges.ThispracticalapplicationshowshowBigData isusedinmanydifferentways. Part III considers applications in medicine. A group of expert doctors from leading medical institutions in the Bay Area discuss how Big Data is used in the practiceofmedicine.Thisisoneareawheremanymoreapplicationsaboundandthe interestedreaderisencouragedtolookatsuchapplications.Anotherchapterlooks athowdatascientistsareimportantinanalyzingmedicaldata.Thischapterreflects a view from Asia and discusses the roadmap for data science use in medicine. Smokinghasbeennotedasoneoftheleadingcausesofhumansuffering.Thispart includes a chapter on comorbidity aspects related to smokers based on a Big Data analysis.Thedetailspresentedinthischapterwouldhelpthereadertofocusonother possibleapplicationsofBigDatainmedicine,especiallycancer.Finally,achapteris