Studies in Big Data 38

Usha Mujoo Munshi · Neeta Verma, Editors

Data Science Landscape: Towards Research Standards and Protocols

Studies in Big Data, Volume 38

Series editor: Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]

The series "Studies in Big Data" (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with high quality. The intent is to cover the theory, research, development, and applications of Big Data as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series address the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources such as sensors and other physical instruments, as well as simulations, crowdsourcing, social networks and other internet transactions, such as emails or video click streams. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence, including neural networks, evolutionary computation, soft computing and fuzzy systems, as well as artificial intelligence, data mining, modern statistics, operations research and self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution, which enable wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/11970

Usha Mujoo Munshi · Neeta Verma, Editors

Data Science Landscape: Towards Research Standards and Protocols

Editors
Usha Mujoo Munshi, Indian Institute of Public Administration, New Delhi, Delhi, India
Neeta Verma, National Informatics Centre, New Delhi, Delhi, India

ISSN 2197-6503          ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-981-10-7514-8          ISBN 978-981-10-7515-5 (eBook)
https://doi.org/10.1007/978-981-10-7515-5
Library of Congress Control Number: 2017961098

© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature. The registered company is Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore.

Foreword

Decades back, the modern futurologist Alvin Toffler stated that, tomorrow, knowledge will be power; that tomorrow is already with us. And Big Data is a significant source of knowledge generation.
There are, of course, issues, including those of security and privacy, in the generation of data and the opening of Big Data in some areas. We also need Data Science standards and citation practices.

Data can be structured or unstructured; it can be raw data or processed data. It can also be audio data or video images. We are already seeing an explosion in the generation of data, and this is only going to increase in future, with contributions also coming from the Internet of Things (IoT) and real-life applications driven by IoT devices. There is a need to develop algorithms for the storage, compilation, processing and analysis of such huge volumes of data. To understand this new domain, a new discipline of study has evolved as Data Science. It is felt that Data Science could be one of the most sought-after professions in the coming decades. There should, therefore, be a plan to introduce degree/diploma/certification courses in Data Science, which will foster skills and help employment generation in this field. It is also necessary to promote research in Data Science by emphasizing its importance in our R&D value system.

Big and Open Data are important for the growth of science. In fact, the scientists who create data have to be incentivized and supported through the needed infrastructure, and their work cited and attributed with fairness. This is particularly needed if the data generated and processed are important for science or for society. To build any sustainable Open Data infrastructure, there is also the need for a robust policy framework. Policy provisions have a strong influence on the strategy formulation and implementation of any Open Data programme. Therefore, a mechanism needs to be developed where these data can be stored, protected as well as archived for a longer time. Setting up Open Data infrastructure will make data a national asset, and researchers must share their knowledge in the "Open Data Repository" to create the collective knowledge base. The use of Open Data generated by others would speed up the process of data collection and analysis and would save money and time. It would also help in improving India's overall standing in this research domain.

With a robust infrastructure, and utilizing developments in data storage, cloud computing technologies, data warehousing and data mining, combined with AI, deep learning and machine learning techniques, data can be used for effective predictive modelling as well as for trending and for supporting government in decision-making and policy-making.

There is also the issue of the quality of data. When I was a member of the Executive Committee of the International Union of Crystallography a couple of decades back, we were told that the Cambridge Organic Crystal Structure Database often found errors, particularly in the vibration amplitudes derived from the anisotropic temperature factors, due to the unknowing mixing of software using different formulae. Therefore, emphasis on the importance of reliable high-quality data, and knowledge about the software used, is necessary.

In India, we have a National Knowledge Network (NKN), which provides a unified high-speed optical fibre network that forms the backbone for all knowledge institutions in the country. The purpose of such a knowledge network lies in the country's quest for building quality institutions with the necessary research facilities and creating a pool of highly trained professionals. NKN is designed as a smart, ultra-high-bandwidth network that interconnects the leading knowledge institutions. NKN can play an important role in Big Data Science.
The availability of supercomputers, large data storage systems and the Internet provides opportunities, as never before, for manipulating, storing and accessing Big Data in Science and Technology. Our office has initiated efforts to synergize the best practices in the Big Data experience of Astrophysics, Biology and Climate Science. The Data Centres would be located in the institutes with domain knowledge, with mirror sites located at the NKN Headquarters in New Delhi.

The present volume, titled "Data Science Landscape: Towards Research Standards and Protocols" and consisting of 24 chapters written by eminent scholars from different parts of the world, is a timely publication. These chapters provide a current perspective on different areas of research and development, emphasizing the major challenging issues. The volume concentrates on the important gaps in research data infrastructure, the absence of broad availability and use of data protocols and standards, and the like. The key S&T issues that need attention in relation to data management practices and protocols (institutional, financial, sustainability, legal, IPR, data protocols, community norms and others), along with coordinating activities across areas and promoting common practices and standards for the research community globally, are important, and some of these are delved into in this volume.

I congratulate the editors, Dr. Usha Mujoo Munshi and Dr. Neeta Verma, for the pioneering effort in bringing out this timely volume.

New Delhi, India
October 2017

R. Chidambaram
Principal Scientific Adviser to the Government of India

Preface

The unprecedented explosion in the human capacity to acquire, store and manipulate data, together with instant communication globally, has transformed research from an era of data scarcity to one of data deluge, which experts see as a "second revolution of discovery". Simultaneously, the growth of electronic publishing of the literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. Effective exploitation of "Big Data" basically depends on an international culture of "Open Data" involving data sharing and availability for reuse and repurposing. Therefore, there is a need to create the infrastructure and evolve the methodologies, practices and, most importantly, policies that enable and empower researchers to identify patterns and processes and subsequently analyse them to predict the behaviour of complex systems, which so far had been beyond their capacity.

In order to foster push and pull for Big Data applications in all segments of society and across disciplines, it is important to address multivariate issues that are critical for the sharing, searchability and accessibility of data resources. For instance, Data Citation being one of the ways of giving attribution, its standards and good practices can form the basis for increased incentives, recognition and rewards for research data activities, which in many cases are currently lacking. Furthermore, the rapidly expanding universe of online digital data holds the promise of allowing peer examination and review of conclusions or analyses based on experimental or observational data, as well as the ability for subsequent users to make new and unforeseen uses and analyses of the same data, either in isolation or in combination with other datasets. Accordingly, there is a need for strategy to be adapted to novel discoveries and approaches in line with the evolving needs of international research and the science community.
Consequently, there emerges a crying need for a framework of international agreements, practices or standards, and for national policies and practices for funding and incentivizing research. As questions of implications for stakeholders vary with disciplines, the diverse requirements of research communities in particular need to be addressed.

The idea of this work germinated from the two-day international workshop on "Big and Open Data: Evolving Data Science Standards and Citation Attribution Practices", which was attended by over 300 domain experts. The workshop focused on two priority areas: (i) Big and Open Data: Prioritizing, Addressing and Establishing Standards and Good Practices and (ii) Big and Open Data: Data Attribution and Citation Practices. This important international event was part of a world initiative led by ICSU and the CODATA Data Citation Task Group.

The present edited volume deals with different contours of Data Science with special reference to data management for the research innovation landscape. In all, there are 24 chapters written by eminent researchers in the field. The issues concentrate on the important gaps in research data infrastructure, the absence of broad availability and use of Data Citation protocols, and the like. Funding agencies for research have begun to include data management plans as part of their selection and approval processes, and such initiatives are already underway in different countries. The key S&T issues that need attention broadly include institutional, financial, sustainability, legal, IPR, data protocol and community-norm issues, among others, related to data management practices and protocols, and promoting common practices and standards globally for the research community is important.

The availability and application of data are of core importance in the data-centric world across disciplines, right from science and technology to the social sciences, arts and humanities, and policy-making. Accessibility refers to the availability of domain-specific information to the user. In terms of generating applications, accessibility denotes the ease with which the existence of information can be ascertained, as well as the suitability of the form or medium through which the information can be accessed. The cost of the information may also be an aspect of accessibility for some users. With the explosion of social media sites and the proliferation of digital computing devices and Internet access, massive amounts of public data are being generated on a daily basis. Efficient techniques and algorithms to analyse this massive amount of data can provide near real-time information about emerging trends and provide early warning in case of an imminent emergency. Careful mining of these data can reveal many useful indicators of socioeconomic and political events, which can help in establishing effective public policies for the purpose of human development. Thus, to decipher how we can harness Big Data technologies to transform and revolutionize the developing world, we will have to consider pertinent questions such as: How do we access and use all of the data that are present out there on the isolated servers of various agencies and organizations for development purposes? Where do we start, and how do we prioritize which particular areas of development can benefit from Big Data? Also, what are some of the well-known techniques for Big Data analytics that can be applied in multivariate contexts to benefit society at large?
In order to make the above happen, it becomes imperative to focus on setting up robust infrastructure for storing, processing and analysing the data. Big Data has gained much attention from academia and the IT industry alike. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds the capacity of conventional systems to handle it. As information is transferred and shared at light speed on optic fibre and wireless networks, the volume of data and the speed of market growth increase. However, the fast growth rate of such large data generates numerous challenges, such as the rapid growth of data, transfer speed, data diversity and security. A vast majority of organizations spanning industries are convinced of its usefulness, but the implementation focus is more application-oriented than infrastructure-oriented. However, the infrastructure architecture of any Big Data cluster is of critical importance because it affects the performance of the cluster. Modelling the infrastructure architecture for Big Data essentially requires balancing cost and efficiency to meet the specific needs of businesses. While designing the Big Data architecture for an enterprise set-up, it is necessary to take a comprehensive approach such that the application requirements can drive the overall cluster design activity, including cluster sizing, hardware architecture, network architecture, storage architecture and information security architecture. Therefore, the research directions need to facilitate the exploration of the domain and the development of optimal techniques, so that Big Data can be optimally exploited for sustainable development and robust policy formulation.

Having focused on the need for robust infrastructure, the other aspects for discussion and requisite action include different dimensions of Data Science, such as the right data quality, storage, metadata, accessibility, openness and interoperability. Users have different expectations of data: that they are accurate and timely, comprehensive and cost-effective, locally relevant and also comparable with other similar situations as per the requirement, in short, fit for purpose. Thus, it is of utmost importance to understand the definition of the so-called data quality, thereby outlining the various dimensions of quality and quality measurement and having them integrated into quality assessment frameworks. Typical examples of quality assessment frameworks include the European Statistical System (ESS) framework, which focuses on statistical outputs and defines quality with reference to six criteria; the IMF Data Quality Assessment Framework (DQAF), which portrays a holistic view of data quality, including the governance of the statistical system; and the OECD Quality Measurement Framework, which approaches quality from the user's side and uses seven dimensions.

Metadata is essential for the interpretation of statistical data. Therefore, attention to the various levels of metadata (e.g., structural, reference) is of prime importance. Together with metadata, issues like openness and interoperability should be at the top of the agenda to foster accessibility for reuse and repurposing of the data generated out of public funds.

While the promise of Big Data is real, there is currently a wide gap between its potential and its realization. A number of challenges lying in the pipeline are to be addressed in making effective use of data discoverability and subsequent reuse. Heterogeneity, scale, timeliness, complexity and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data.
The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably, with the right metadata. Much data today are not natively in a structured format. The value of data explodes when it can be linked with other data; thus, data integration is a major creator of value.
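To make that last point concrete, the minimal sketch below shows how two independently collected data sets, neither very informative on its own, gain value once linked on a shared key. The district identifiers, column names and figures are entirely hypothetical, and pandas is used only as one convenient illustration of a data-integration step; it is not a method prescribed by this volume.

    import pandas as pd

    # Hypothetical, illustrative figures only: rainfall readings and crop
    # yields collected by two different agencies, joined on a shared
    # district identifier to illustrate the value of data integration.
    rainfall = pd.DataFrame({
        "district_id": ["D01", "D02", "D03"],
        "annual_rainfall_mm": [820, 1140, 640],
    })
    crop_yield = pd.DataFrame({
        "district_id": ["D01", "D02", "D03"],
        "wheat_yield_t_per_ha": [3.1, 3.8, 2.4],
    })

    # Neither table alone relates rainfall to yield; the linked table can.
    linked = rainfall.merge(crop_yield, on="district_id", how="inner")
    print(linked)
    print("rainfall-yield correlation:",
          round(linked["annual_rainfall_mm"].corr(linked["wheat_yield_t_per_ha"]), 2))

In practice, of course, identifiers, formats and units rarely line up this neatly across sources, which is precisely the heterogeneity challenge described above.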