ebook img

Big Data Factories: Collaborative Approaches PDF

141 Pages·2017·2.03 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Big Data Factories: Collaborative Approaches

Computational Social Sciences Sorin Adam Matei Nicolas Jullien Sean P. Goggins Editors Big Data Factories Collaborative Approaches Computational Social Sciences Computational Social Sciences Aseriesofauthoredandeditedmonographsthatutilizequantitativeandcomputational methods to model, analyze and interpret large-scale social phenomena. Titles within the series contain methods and practices that test and develop theories of complex social processes through bottom-up modeling of social interactions. Of particular interest is the study of the co-evolution of modern communication technology and social behavior and norms, in connection with emerging issues such as trust, risk, securityandprivacyinnovelsocio-technicalenvironments. Computational Social Sciences is explicitly transdisciplinary: quantitative methods from fields such as dynamical systems, artificial intelligence, network theory, agent based modeling, and statistical mechanics are invoked and combined with state-of the-artminingandanalysisoflargedatasetstohelpusunderstandsocialagents,their interactionsonandoffline,andtheeffectoftheseinteractionsatthemacrolevel.Topics include,butarenotlimitedtosocialnetworksandmedia,dynamicsofopinions,cul- turesandconflicts,socio-technicalco-evolutionandsocialpsychology.Computational SocialScienceswillalsopublishmonographsandselectededitedcontributionsfrom specializedconferencesandworkshopsspecificallyaimedatcommunicatingnewfind- ingstoalargetransdisciplinaryaudience.Afundamentalgoaloftheseriesistoprovide asingleforumwithinwhichcommonalitiesanddifferencesintheworkingsofthisfield maybediscerned,henceleadingtodeeperinsightandunderstanding. SeriesEditors ElisaBertino LarryLiebovitch PurdueUniversity,WestLafayette, QueensCollege,CityUniversityof IN,USA NewYork,Flushing,NY,USA ClaudioCioffi-Revilla SorinA.Matei GeorgeMasonUniversity,Fairfax, PurdueUniversity,WestLafayette, VA,USA IN,USA JacobFoster AntonNijholt UniversityofCalifornia,LosAngeles, UniversityofTwente,Enschede, CA,USA TheNetherlands NigelGilbert AndrzejNowak UniversityofSurrey,Guildford,UK UniversityofWarsaw,Warsaw,Poland JenniferGolbeck RobertSavit UniversityofMaryland,CollegePark, UniversityofMichigan,AnnArbor, MD,USA MI,USA BrunoGoncalves FlaminioSquazzoni NewYorkUniversity,NewYork, UniversityofBrescia,Brescia,Italy NY,USA AlessandroVinciarelli JamesA.Kitts UniversityofGlasgow,Glasgow, ColumbiaUniversity,Amherst,MA, Scotland,UK USA Moreinformationaboutthisseriesathttp://www.springer.com/series/11784 Sorin Adam Matei • Nicolas Jullien Sean P. Goggins Editors Big Data Factories Collaborative Approaches 123 Editors SorinAdamMatei NicolasJullien PurdueUniversity TechnopôleBrest-Iroise WestLafayette IMTAtlantique(TelecomBretagne) IN,USA BrestCedex3,France SeanP.Goggins ComputerScience UniversityofMissouri Columbia,MO,USA ISSN2509-9574 ISSN2509-9582 (electronic) ComputationalSocialSciences ISBN978-3-319-59185-8 ISBN978-3-319-59186-5 (eBook) https://doi.org/10.1007/978-3-319-59186-5 LibraryofCongressControlNumber:2017958439 ©SpringerInternationalPublishingAG2017 OpenAccessChapter9islicensedunderthetermsoftheCreativeCommonsAttribution4.0Interna- tionalLicense(http://creativecommons.org/licenses/by/4.0/).Forfurtherdetailsseelicenseinformation inthechapter. Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant protectivelawsandregulationsandthereforefreeforgeneraluse. Thepublisher,theauthorsandtheeditorsaresafetoassumethattheadviceandinformationinthisbook arebelievedtobetrueandaccurateatthedateofpublication.Neitherthepublishernortheauthorsor theeditorsgiveawarranty,expressorimplied,withrespecttothematerialcontainedhereinorforany errorsoromissionsthatmayhavebeenmade.Thepublisherremainsneutralwithregardtojurisdictional claimsinpublishedmapsandinstitutionalaffiliations. Printedonacid-freepaper ThisSpringerimprintispublishedbySpringerNature TheregisteredcompanyisSpringerInternationalPublishingAG Theregisteredcompanyaddressis:Gewerbestrasse11,6330Cham,Switzerland Contents 1 Introduction .................................................................. 1 NicolasJullien,SorinAdamMatei,andSeanP.Goggins PartI TheoreticalPrinciplesandApproachestoDataFactories 2 AccessibilityandFlexibility:TwoOrganizingPrinciplesforBig DataCollaboration .......................................................... 9 LibbyHemphillandSusanT.Jackson 3 The Open Community Data Exchange: Advancing Data SharingandDiscoveryinOpenOnlineCommunityScience ........... 23 SeanP.Goggins,A.J.Million,GeorgJ.P.Link,MattGermonprez, andKristenSchuster PartII Theoretical Principles and Ideas for Designing and DeployingDataFactoryApproaches 4 LevelsofTraceDataforSocialandBehaviouralScienceResearch.... 39 KevinCrowston 5 The Ten Adoption Drivers of Open Source Software That Enablese-ResearchinDataFactoriesforOpenInnovations ........... 51 KerkF.Kee 6 Aligning Online Social Collaboration Data Around Social Order:TheoreticalConsiderationsandMeasures....................... 67 SorinAdamMateiandBrianC.Britt PartIII Approaches in Action Through Case Studies of Data BasedResearch,BestPracticeScenarios,orEducational Briefs 7 LessonsLearnedfromaDecadeofFLOSSDataCollection............ 79 KevinCrowstonandMeganSquire v vi Contents 8 TeachingStudentsHow(Not)toLie,Manipulate,andMislead withInformationVisualization............................................. 101 AthirMahmud,MélHogan,AndreaZeffiro,andLibbyHemphill 9 DemocratizingDataScience:TheCommunityDataScience WorkshopsandClasses...................................................... 115 BenjaminMakoHill,DharmaDailey,RichardT.Guy,BenLewis, MikaMatsuzaki,andJonathanT.Morgan Index............................................................................... 137 Chapter 1 Introduction NicolasJullien,SorinAdamMatei,andSeanP.Goggins Human interactions facilitated by social media, collaborative platforms, and the blogosphere generate an unprecedented volume of electronic trace data every day. These traces of human behavior online are a unique source for understanding contemporarylifebehaviors,beliefs,interactions,andknowledgeflows.Thesocial connections we make online, which reveal multiple types of human connection, arealsorecordedonascaleandtoalevelofgranularitypreviouslyunimaginable, except possibly by science fiction writers. To many in the data analytics world, these traces are a gold mine. New sub-domains of inquiry have emerged as a consequenceofthisrevolution:computationalsocialscience,bigdata,datascience, open innovation data analytics, network science, and undoubtedly new ones yet to appear in the near future. Massive amounts of data, each counting millions of data records and behaviors, are now available to the academic, governmental, or industryresearchandteachingcommunities.Theypromisefasteraccesstoreal-time social behavior and better understanding of how people behave and interact. Such “social” data include complete records of Wikipedia edits, interactions on social codingplatformslikeGitHub,andtheexpressionofaffiliationsandengagementof participationonsocialmedia(Twitter,Facebook,YouTube,etc.). Working with data of this kind and of this magnitude requires cleaning up andpreprocessingprodigiousamountsinformation,whichisnontrivialandcostly. N.Jullien((cid:2)) TechnopôleBrest-Iroise,IMTAtlantique(TelecomBretagne),BrestCedex3,France e-mail:[email protected] S.A.Matei PurdueUniversity,WestLafayette,IN,USA e-mail:[email protected] S.P.Goggins ComputerScience,UniversityofMissouri,Columbia,MO,USA e-mail:[email protected] ©SpringerInternationalPublishingAG2017 1 S.A.Mateietal.(eds.),BigDataFactories,ComputationalSocialSciences, https://doi.org/10.1007/978-3-319-59186-5_1 2 N.Jullienetal. Providing documentation and descriptors for the data is also costly. In addition to definingandcleaning,documentationisdevelopedseparatelyforeachdataset,asthe variables and the procedures are created for each individual dataset. Furthermore, at the end of the process, datasets end up in locked box repositories, not easily accessibletotheresearchcommunity.Asthestorageandbandwidthnecessaryfor savinganddisseminatingthedatatendstobecostly,reachingoutacrossprojectsis adifficultandonerousoperation. Inessence,bigsocialdatahascreatedaresearchlandscapeofisolatedprojects. Oneofthecostsofworkinginisolationisredundancy.Eachtimearesearchgroup aims to analyze a dataset, even if it is relatively well known and central (e.g., Wikipediaeditorialhistoryoropensourcesoftwarerepositories),workstartsanew. Furthermore,astheresearchproductsaredeliveredaspapersandfindings,thesteps thedatamovedthrough,fromraw,sourcedatathroughintermediatedataproducts and analysis products, are often lost to the idiosyncrasies of each lab’s process. Thismakescross-checking,secondarydataanalysisandmethodologicalvalidation difficult at best to realize. Concerns go beyond research because the systematic limitationsidentifiedbybigsocialdataresearchmirrorchallengesfacedingeneral datagovernance,civicaction,education,andevenbusinessintelligence. Anumberofspecificsolutionsmightaddresstheissuescommonlyexperienced bydata-centricresearchersandpractitioners.Forexample,commondataontologies, social scientific analysis protocols, documentation standards, and dissemination workflowscouldgeneraterepeatableprocesses.Asnewapproachesemerge,training and teaching materials need to be created from the common store of previously accomplished work. Yet, for this, data professionals need to be trained in data extraction, curation, and analysis with an eye to integrating data procedures, analysis,anddisseminationtechniques. Thecacophonyofcurrentprocessesandtheidealofa“datafactory”oran“open collaborationdataexchange”areattwoendsofaspectrum.Thefirstisthestateof practice;thesecond,aspirational.Thisvolumeaspirestosupportthesecondgoal. The “data factory approach” presented in this volume expresses several sys- tematic approaches to tackle the challenges of data-centric research and practice. Ourstrategicgoalistoopenandconsolidatetheconversationonhowtovertically integrate the process of data collection, analysis, and result dissemination by standardizing and unifying data workflows and scientific collaboration. One of our goals is to support those who work on projects to create repositories and documentation procedures for large datasets. At the same time, the ideal of a data factoryneedstoadvancecoremethodologiesforpreprocessing,documenting,and storingdatathatcanconnectinformationsetsacrossdomainsandresearchcontexts. Successful implementation of data factory methodologies promises to improve collaborativeresearch,validatemethodologies,andwidenthedisseminationofdata, procedures,andresults. Ourconceptualizationof“datafactories”straddlesmanyscholarlycommunities, including information studies, communication, sociology, computer science, data science,sociology,andpoliticalscience.Industrypractitionersfocusedonmarket- ing,customerrelations,businessanalytics,andbusinessintelligencegovernmental 1 Introduction 3 and policy analysts might also benefit from this vision. Scholars are piloting the assemblylineinourconceptualizationofdatafactories.Wedonotexpecta“genesis moment”whereaparticularfactorymightemergeintotheworldduetothegeniusof onescholarorherresearchgroup.Instead,weexpectdomain-andquestion-specific pathstodevelopfromeach“datafactoryassemblyline.”Fromtheseinitialassembly lines, practitioners, students, collaborators, and scholars in related disciplines will have amore solidstartingplace for theirwork or discourse.Atthesametime,we also hope that as new practices emerge, these will converge rather than diverge through the common methods discussed in this volume. The “data factory” vision aims to cross disciplinary boundaries. We hope that social computing researchers, computer scientists, social scientists, organizational scientists, and other scholars willintheenddevelopacommonlanguage. Thedatafactoryapproachcancoveravarietyofactivities,butinamoretangible way and as an overture to our volume, its core activities should at the very least include: 1. Creatingstandardworkflowsfordataprocessinganddocumentationandformat- ting inspired by a variety of projects and which cover several core dimensions: actors,behaviors,levelsofanalysis,artifacts,outcomes 2. Determining standard ontologies for categorizing in a standard manner records (observableunits)andvariables(fields) 3. Creatingtoolsforeasilyprocessingexistingandfuturedatasetsforstandardized processinganddocumentation 4. Creatingonlinestorageanddiscovery toolsthatcaneasilyidentifyrecordsand variablesacrossdatasets,disciplines,andscientificobservationdomains 5. Facilitatingdatarecombiningbymatchingonvariousvariablesofheterogeneous datasets 6. Creating methods for documenting and sharing statistical tools and procedures foranalyzingrecombineddatasets 7. Creatingplatformsandmethodsforresearchcollaborationthatrelyonexpertise, intellectual interest, skills, data ownership, and research goals for connecting individuals 8. Creating courses to teach these methods and to train graduate, undergraduate, andmid-careerprofessional Thechaptersofthisvolumeaddressalloftheseissues,proposing,wehope,an integratedstrategyfordatafactoring.Althoughsomeofthechapterscanbereadas “usecase,”“howto,”or“guideline”contributions,theirvalueshouldbeseeninthe contextoftheoverallgoal,whichistoproposeanintegratedvisiontowhata“data factory”researchandmethodologicalprogramshouldbe. The volume is divided into three large sections. The first is dedicated to the theoretical principles of big data analysis and the needs associated with a data factory approach. The second proposes some theoretical principles and ideas for designing and deploying data factory approaches. The third presents these approaches in action through case studies of data-based research, best practice scenarios,oreducationalbriefs.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.