Crowdsourced Data Management: Hybrid Machine-Human Computing PDF

169 Pages·2018·4.504 MB·English

Preview Crowdsourced Data Management: Hybrid Machine-Human Computing

Guoliang Li · Jiannan Wang · Yudian Zheng · Ju Fan · Michael J. Franklin

Crowdsourced Data Management
Hybrid Machine-Human Computing

Guoliang Li, Department of Computer Science and Technology, Tsinghua University, Beijing, Beijing, China
Jiannan Wang, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Ju Fan, DEKE Lab & School of Information, Renmin University of China, Beijing, Beijing, China
Yudian Zheng, Twitter Inc., San Francisco, CA, USA
Michael J. Franklin, Department of Computer Science, University of Chicago, Chicago, IL, USA

ISBN 978-981-10-7846-0
ISBN 978-981-10-7847-7 (eBook)
https://doi.org/10.1007/978-981-10-7847-7
Library of Congress Control Number: 2018953702
© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Many important data management and analytics tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms provide an effective way of harnessing the capabilities of people (i.e., the crowd) to process such tasks, and they encourage many real-world applications, such as reCaptcha, ImageNet, ESP game, Foldit, and Waze. Recently, crowdsourced data management has attracted increasing interest from both academia and industry.

This book provides a comprehensive review of crowdsourced data management, including motivation, applications, techniques, and existing systems. The book first introduces an overview of crowdsourcing, including crowdsourcing motivation, background, applications, workflows, and platforms. For example, consider the entity resolution problem, which, given a set of objects, finds the objects that refer to the same entity. Since machine algorithms cannot achieve high quality for this problem, crowdsourcing can be used to improve the quality. A user (called the “requester”) first generates some relevant tasks (e.g., asking whether two objects refer to the same entity), configures them (e.g., setting the price, latency), and then posts them onto a crowdsourcing platform. Users (called “workers”) that are interested in these tasks can accept and answer them. The requester pays the participating workers for their labor. Crowdsourcing platforms manage the tasks and assign them to the workers.
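The requester/worker/platform workflow described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the book's method or any platform's API: the names `Task`, `generate_tasks`, `run_round`, `careful_worker`, and `sloppy_worker` are hypothetical, the workers are simulated functions rather than people, the price of 0.02 per answer is made up, and simple majority voting stands in for whatever aggregation a real system would use.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Task:
    """A pairwise question: do these two records refer to the same entity?"""
    left: str
    right: str
    price_usd: float  # hypothetical per-answer reward chosen by the requester


def generate_tasks(records, price_usd=0.02):
    """Requester side: create one task per unordered pair of records."""
    return [Task(a, b, price_usd) for a, b in combinations(records, 2)]


def majority_vote(answers):
    """Aggregate yes/no answers for one task; a tie resolves to False."""
    counts = Counter(answers)
    return counts[True] > counts[False]


def run_round(tasks, workers):
    """Platform side (simplified): every worker answers every task."""
    results = {}
    cost = 0.0
    for task in tasks:
        answers = [worker(task) for worker in workers]
        cost += task.price_usd * len(answers)  # requester pays per answer
        results[(task.left, task.right)] = majority_vote(answers)
    return results, cost


# Simulated workers; real workers are people and may be noisy in other ways.
def careful_worker(task):
    def normalize(s):
        return s.lower().replace(" ", "")
    return normalize(task.left) == normalize(task.right)


def sloppy_worker(task):
    # Answers "same entity" whenever the first letters match.
    return task.left[0].lower() == task.right[0].lower()


records = ["iPhone 6", "iphone6", "iPad 2"]
tasks = generate_tasks(records)
results, cost = run_round(tasks, [careful_worker, careful_worker, sloppy_worker])

for pair, same in results.items():
    print(pair, "->", "same entity" if same else "different")
print(f"total cost: {cost:.2f} USD")
```

In this toy run the two careful workers outvote the sloppy one, which hints at why redundancy improves quality and also why it raises cost: every extra answer is paid for.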
Then, this book summarizes three important problems in crowdsourced data management: (1) quality control: workers may return noisy or incorrect results, so effective quality-control techniques are required to improve the quality; (2) cost control: the crowd is often not free, and cost control aims to reduce the monetary cost; (3) latency control: human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required to control the pace. There have been significant works addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators.

Next, this book synthesizes a wide spectrum of existing studies on crowdsourced data management, including crowdsourcing models, declarative languages, crowdsourced operators, and crowdsourced optimizations.

Finally, based on this analysis, this book outlines key factors that need to be considered to improve crowdsourced data management.

Beijing, China: Guoliang Li
Burnaby, BC, Canada: Jiannan Wang
San Francisco, CA, USA: Yudian Zheng
Beijing, China: Ju Fan
Chicago, IL, USA: Michael J. Franklin
December 2017

Acknowledgments

We express our deep gratitude to those who have helped us in writing this book. We thank Chengliang Chai, James Pan, and Jianhua Feng for discussing or proofreading the earlier versions of the book. Our thanks also go to Lanlan Chang and Jian Li at Springer for their kind help and patience during the preparation of this book.

We acknowledge the financial support by the 973 Program of China (2015CB358700), the NSF of China (61632016, 61373024, 61602488, 61422205, 61472198), TAL education, Tencent, Huawei, and FDCT/007/2016/AFJ.

Contents

1 Introduction
  1.1 Motivation
  1.2 Crowdsourcing Overview
  1.3 Crowdsourced Data Management
  References
2 Crowdsourcing Background
  2.1 Crowdsourcing Overview
  2.2 Crowdsourcing Workflow
    2.2.1 Workflow from Requester Side
    2.2.2 Workflow from Worker Side
    2.2.3 Workflow from Platform Side
  2.3 Crowdsourcing Platforms
    2.3.1 Amazon Mechanical Turk (AMT)
    2.3.2 CrowdFlower
    2.3.3 Other Platforms
  2.4 Existing Surveys, Tutorials, and Books
  2.5 Optimization Goal of Crowdsourced Data Management
  References
3 Quality Control
  3.1 Overview of Quality Control
  3.2 Truth Inference
    3.2.1 Truth Inference Problem
    3.2.2 Unified Solution Framework
    3.2.3 Comparisons of Existing Works
    3.2.4 Extensions of Truth Inference
  3.3 Task Assignment
    3.3.1 Task Assignment Setting
    3.3.2 Worker Selection Setting
  3.4 Summary of Quality Control
  References
4 Cost Control
  4.1 Overview of Cost Control
  4.2 Task Pruning
    4.2.1 Difficulty Measurement
    4.2.2 Threshold Selection
    4.2.3 Pros and Cons
  4.3 Answer Deduction
    4.3.1 Iterative Workflow
    4.3.2 Presentation Order
    4.3.3 Pros and Cons
  4.4 Task Selection
    4.4.1 Model-Driven
    4.4.2 Problem-Driven
    4.4.3 Pros and Cons
  4.5 Sampling
    4.5.1 Crowdsourced Aggregation
    4.5.2 Data Cleaning
    4.5.3 Pros and Cons
  4.6 Task Design
    4.6.1 User Interface Design
    4.6.2 Non-monetary Incentives
    4.6.3 Pros and Cons
  4.7 Summary of Cost Control
  References
5 Latency Control
  5.1 Overview of Latency Control
  5.2 Single-Task Latency Control
    5.2.1 Recruitment Time
    5.2.2 Qualification Test Time
    5.2.3 Work Time
  5.3 Single-Batch Latency Control
    5.3.1 Statistical Model
    5.3.2 Straggler Mitigation
  5.4 Multi-batch Latency Control
    5.4.1 Motivation of Multiple Batches
    5.4.2 Two Basic Ideas
  5.5 Summary of Latency Control
  References
6 Crowdsourcing Database Systems and Optimization
  6.1 Overview of Crowdsourcing Database Systems
  6.2 Crowdsourcing Query Language
    6.2.1 CrowdDB
    6.2.2 Qurk
    6.2.3 Deco
    6.2.4 CDAS
    6.2.5 CDB
  6.3 Crowdsourcing Query Optimization
    6.3.1 CrowdDB
    6.3.2 Qurk
    6.3.3 Deco
    6.3.4 CDAS
    6.3.5 CDB
  6.4 Summary of Crowdsourcing Database Systems
  References
7 Crowdsourced Operators
  7.1 Crowdsourced Selection
    7.1.1 Crowdsourced Filtering
    7.1.2 Crowdsourced Find
    7.1.3 Crowdsourced Search
  7.2 Crowdsourced Collection
    7.2.1 Crowdsourced Enumeration
    7.2.2 Crowdsourced Fill
  7.3 Crowdsourced Join (Crowdsourced Entity Resolution)
    7.3.1 Background
    7.3.2 Candidate Set Generation
    7.3.3 Candidate Set Verification
    7.3.4 Human Interface for Join
    7.3.5 Other Approaches
  7.4 Crowdsourced Sort, Top-k, and Max/Min
    7.4.1 Workflow
    7.4.2 Pairwise Comparisons
    7.4.3 Result Inference
    7.4.4 Task Selection
    7.4.5 Crowdsourced Max
  7.5 Crowdsourced Aggregation
    7.5.1 Crowdsourced Count
    7.5.2 Crowdsourced Median
    7.5.3 Crowdsourced Group By
  7.6 Crowdsourced Categorization
  7.7 Crowdsourced Skyline
    7.7.1 Crowdsourced Skyline on Incomplete Data
    7.7.2 Crowdsourced Skyline with Comparisons
  7.8 Crowdsourced Planning
    7.8.1 General Crowdsourced Planning Query
    7.8.2 An Application: Route Planning
  7.9 Crowdsourced Schema Matching
