ebook img

Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace PDF

3 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace

Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace [Extended Technical Report] Ayush Jain†, Akash Das Sarma(cid:63), Aditya Parameswaran†, Jennifer Widom(cid:63) †University of Illinois (cid:63)Stanford University 7 ABSTRACT this includes data ranging from worker answers to specific ques- 1 tionsandresponsetimes,allthewaytotheHTMLthatencodesthe Weconductanexperimentalanalysisofadatasetcomprisingover 0 userinterfaceforaspecificquestion. 2 27 million microtasks performed by over 70,000 workers issued Thisdataallowsustoanswersomeofthemostimportantopen toalargecrowdsourcingmarketplacebetween2012-2016. Using n thisdata—neverbeforeanalyzedinanacademiccontext—weshed questionsinmicrotaskcrowdsourcing: whatconstitutesan“effec- a tive”task,howcanweimprovemarketplaces,andhowcanween- light on three crucial aspects of crowdsourcing: (1) Task design J hanceworkers’interactions.Inthispaper,usingthisdata,westudy —helpingrequestersunderstandwhatconstitutesaneffectivetask, 2 andhowtogoaboutdesigningone;(2)Marketplacedynamics— thefollowingkeyquestions: 2 • Marketplacedynamics:helpingmarketplaceadministratorsun- helping marketplace administrators and designers understand the derstand the interaction between tasks and workers, and the interactionbetweentasksandworkers,andthecorrespondingmar- ] correspondingmarketplaceload; e.g., questionslike: (a)how C ketplace load; and (3) Worker behavior — understanding worker muchdoestheloadonthemarketplacevaryovertime,andis attention spans, lifetimes, and general behavior, for the improve- H thereamismatchbetweenthenumberofworkersandthenum- mentofthecrowdsourcingecosystemasawhole. beroftasksavailable,(b)whatisthetypicalfrequencyanddis- . s tributionoftasksthatarerepeatedlyissued, (c)whattypesof c 1. INTRODUCTION tasksarerequestersmostinterestedin, andwhatsortsofdata [ Despitetheexcitementsurroundingartificialintelligenceandthe dothesetasksoperateon? 1 ubiquitous need for large volumes of manually labeled training • Task design: helping requesters understand what constitutes v data,thepastfewyearshavebeenarelativelytumultuousperiodfor an effective task, and how to go about designing one; e.g., 7 thecrowdsourcingindustry.Therehasbeenarecentspateofmerg- questionslike: whatfactorsimpact(a)theaccuracyofthere- 0 ers, e.g., [27], rebrandings, e.g., [15, 17], slowdowns, e.g., [16], sponses;(b)thetimetakenforthetasktobecompleted;or(c) 2 andmovestowardsprivatecrowds[34]. Forthefutureofcrowd- thetimetakenforthetasktobepickedup? Doexamplesand 6 sourcingmarketplaces,itisthereforebothimportantandtimelyto imageshelp?Doesthelengthorcomplexityofthetaskhurt? 0 stepbackandstudyhowthesemarketplacesareperforming,how • Worker behavior: understanding worker attention spans, life- . 1 therequestersaremakingandcanmakebestuseofthesemarket- times, andgeneralbehavior; e.g., questions like (a)wheredo 0 places,andhowworkersareparticipatinginthesemarketplaces—in workerscomefrom,(b)doworkersfromdifferentsourcesshow 7 ordertodevelopmoreefficientmarketplaces,understandthework- differentcharacteristics,suchasaccuraciesandresponsetimes, 1 ers’viewpointandmaketheirexperiencelesstedious,anddesign (c) how engaged are the workers within the marketplace, and v: more effective tasks from the requester standpoint. Achieving all relativetoeachother,and(d)howdotheirworkloadsvary? i thesegoalswouldsupportandsustainthe“trifecta”ofkeypartici- Theonlypaperthathasperformedanextensiveanalysisofcrowd- X pantsthatkeepscrowdsourcingmarketplacesticking—themarket- sourcingmarketplacedataistherecentpaperbyDifallahetal.[14]. r placeadministrators,thecrowdworkers,andthetaskrequesters. Thispaperanalyzedthedataobtainedviacrawlingapubliccrowd- a Atthesametime,developingabetterunderstandingofhowcrowd- sourcing marketplace (in this case Mechanical Turk). Unfortu- sourcing marketplaces function can help us design crowdsourced nately,thispubliclyvisibledataprovidesarestrictedviewofhow dataprocessingalgorithmsandsystemsthataremoreefficient,in themarketplaceisfunctioning,sincetheworkerresponses,demo- termsoflatency,cost,andquality.Indeed,crowdsourceddatapro- graphicsandcharacteristicsoftheworkers,andthespeedatwhich cessingisperformedatscaleatmanytechcompanies,withtensof theseresponsesareprovidedareallunavailable. Asaresult, un- millionsofdollarsspenteveryyear[34],sotheefficiencyimprove- likethepresentpaper,thatpaperonlyconsidersarestrictedaspect mentscanleadtosubstantialsavingsforthesecompanies. Inthis ofcrowdsourcingmarketplaces,specifically,thepricedynamicsof vein,therehavebeenanumberofpapersonbothoptimizedalgo- the marketplace (indeed, their title reflects this as well)—for in- rithms,e.g.,[22,37,13,40,9],andsystems,e.g.,[33,19,32,10, stance,demandandsupplymodeling,developingmodelsforpre- 38],allfromthedatabasecommunity,andsuchfindingscanhave dictingthroughput,andanalyzingthetopicsandcountriespreferred animpactinthedesignofallofthesealgorithmsandsystems. byrequesters. Thatpaperdidnotanalyzethethefull“trifecta”of Unfortunately, due to the proprietary nature of crowdsourcing participantsthatconstituteacrowdsourcingmarketplace. Evenfor marketplacedata, itishardforacademicstoperformsuchanaly- marketplacedynamics,tofullydistinguishtheresultsofthepresent sesandidentifypainpointsandsolutions. Fortunatelyforus,one paperfromthatpaper,weexcludeanyexperimentsoranalysesthat ofthemoreforwardthinkingcrowdsourcingmarketplacesmadea overlapwiththeexperimentsperformedinthatpaper.Wedescribe substantial portion of its data from 2012 to date available to us: thisandotherrelatedworkinSection6. 1 Thisexperimentsandanalysispaperisorganizedasfollows: erousenoughtoprovideaccesstotheirdataforresearchpurposes. • Dataset description and enrichment. In Section 2, we de- Tooffsetthelackoftransparencyduetotheanonymity,wediscuss scribewhatourdatasetconsistsof,andthehigh-levelgoalsof someofthecrucialoperationalaspectsofthemarketplace,thatwill ouranalysis. InSection2.1through2.3weprovidemorede- allowustounderstandhowthemarketplacefunctions,andgener- tailsaboutthemarketplacemechanics,thescaleandtimespan alizefromtheseinsightstoothersimilarmarketplaces. ofthedataset,andtheattributesprovided. Wealsoenrichthe Themarketplaceweoperateonactsasanaggregatororanin- datasetbymanuallylabelingtasksourselvesonvariousfeatures termediaryformanydifferentsourcesofcrowdlabor. Forexam- ofinterest,describedinSection2.4,e.g.,whattypeofdatadoes ple,thismarketplaceusesMechanicalTurk[3],Clixsense[1],and thetaskoperateon,whatsortofinputmechanismdoesthetask NeoDev[4],allassourcesofworkers,aswellasaninternalworker usetogetopinionsfromworkers. pool.Fortaskassignment,i.e.,assigningtaskstoworkers,themar- • Marketplaceinsights. InSection3,weaddressquestionson ketplacemakesuseofbothpushandpullmechanisms. Thetypi- the(a)marketplaceload—taskarrivals(Section3.1),worker calsettingisviapull,wheretheworkerscanchooseandcomplete availability (Section 3.2), and task distribution (Section 3.3), tasksthatinterestthem. Inasomesourcesofworkersthatwewill with the aim of helping improve marketplace design, and (b) discuss lateron, tasks are pushed to workersby the marketplace. thetypesoftasks,goals,humanoperatorsanddatatypes,and Forexample,Clixsenseinjectspaidsurveysintowebpagessothat correlationsbetweenthem(Section3.4),withtheaimofchar- individuals browsing are attracted to and work on specific tasks. acterizingthespectrumofcrowdwork. Ineithercase,themarketplaceallowsrequesterstospecifyvarious • Task design improvements. In Section 4, we (a) character- parameters,suchasaminimumaccuracyforworkerswhoareal- izeandquantifymetricsgoverningthe“effectiveness”oftasks lowedtoworkonthegiventasks,anygeographicconstraints,any (Section4.1), (b)identifyfeaturesaffectingtaskeffectiveness constraintsonthesourcesofcrowdlabor,theminimumamountof anddetailhowtheyinfluencethedifferentmetrics(Sections4.3 timethataworkermustspendonthetask,themaximumnumberof through4.7),(c)performaclassificationanalysisinSection4.9 tasksinabatchagivenworkercanattempt,andananswerdistribu- whereinwelookattheproblemofpredictingthevarious“ef- tionthreshold(i.e.,thethresholdofskewontheanswersprovided fectiveness”metricvaluesofataskbasedonsimplefeatures, bytheworkersbelowwhichaworkerisnolongerallowedtowork and(d)providefinal,summarizedrecommendationsonhowre- ontasksfromtherequester). Themarketplacemonitorsthesepa- questerscanimprovetheirtasks’designstooptimizeforthese rameters and prevents workers from working on specific tasks if metrics(Section4.8). theynolongermeetthedesiredcriteria. Themarketplacealso • Worker understanding. In Section 5, we analyze and pro- provides optional templates for HITs for some common standard videinsightsintotheworkerbehavior. Wecomparecharacter- tasks,suchasSentimentAnalysis,SearchRelevance,andContent isticsofdifferentworkerdemographicsandsources—provided Moderation,aswellasfortoolssuchasoneforimageannotation. bydifferentcrowdsourcingmarketplaces; aswewillfind, the Usageofthesestandardtemplatesleadstosomeuniformityinin- specific marketplace whose data we work with solicits work- terfaces,andalsogivespotentialfortheimprovementoftaskdesign ersfrommanysources(Section5.1). Wealsoprovideinsights simultaneouslyacrossrequesters. into worker involvement and task loads taken on by workers This marketplace categorizes certain workers as “skilled” con- (Section5.2),andcharacterizeandanalyzeworkerengagement tributors,whoaregivenaccesstomoreadvancedtasks,higherpay- (Section5.3). ments,andarealsosometimesresponsibleformeta-taskssuchas generatingtestquestions,flaggingbrokentasks,andcheckingwork 2. DATASETDETAILSANDGOALS byothercontributors.Ourresearchhighlightsthathavingapoolof engagedandactiveworkersisjustas, ifnotmoreimportantthan We now introduce some terms that will help us operate in a havingaccesstoalargeworkforce. Itmightbeinterestingtoex- marketplace-agnostic manner. The unit of work undertaken by a ploreanincentiveprogramforthe“active”or“experienced”work- singleworkerisdefinedtobeatask.Ataskistypicallylistedinits ersaswell. entiretyononewebpage,andmaycontainmultipleshortquestions. Thismarketplacealsoprovidesanumberofadditionalfeatures. Forexample,ataskmayinvolveflaggingwhethereachimageina Notable among them is its module for machine learning and AI. set of tenimages is inappropriate; so thistask contains ten ques- This allows requesters without any machine learning background tions. Eachquestioninataskoperatesonanitem;inourexample, togenerateandevaluatemodelsontrainingdatawitheasycompu- this item is an image. These tasks are issued by requesters. Of- tationandvisualizationsofmetricssuchasaccuracy,andconfusion ten, requestersissuemultipletasksinparallelsothattheycanbe matrix. attempted by different workers at once. We call this set of tasks a batch. Requesters often use multiple batches to issue identical 2.2 OriginoftheDataset unitsofwork—forexample,arequestermayissueabatchof100 “image flagging” tasks one day, operating on a set of items, and Our dataset consists of tasks issued on the marketplace from thenanotherbatchof500“imageflagging”tasksafteraweek,on 2012–2016.Unfortunately,wedonothaveaccesstoalldataabout adifferentsetofitems. Weoverloadthetermtasktoalsoreferto all tasks. There are about 58,000 batches in total, of which we theseidenticalunitsofworkissuedacrosstimeandbatches,inde- haveaccesstocompletedataforasampleofabout12,000batches, pendentoftheindividualitemsbeingoperatedupon,inadditionto andminimaldataabouttheremaining,consistingonlyofthetitle asingleinstanceofwork. Theusageofthetermtaskwillbeclear of the task and the creation date. Almost 51,000, or 88% of the fromthecontext;ifitisnotclear,wewillrefertothelatterasatask 58,000 of batches have some representatives in our 12,000 batch instance. sample—thus,thesampleismissingabout10%ofthetasks.(That is,thereareidenticaltasksinthe12,000batchsample.) Fromthe 2.1 OperationalDetails taskperspective,thereareabout6600distincttasksintotal,spread Duetoconfidentialityandintellectualpropertyreasons,weare across 58,000 batches, of which our sample contains 5000, i.e., requiredtopreservetheanonymityofthecommercialcrowdsourc- 76%ofalldistincttasks.Thus,whilenotcomplete,oursampleisa ingmarketplaceweoperateon, whohaveneverthelessbeengen- significantandrepresentativeportionoftheentiredatasetoftasks. 2 Wewilllargelyoperateonthis12,000batchsample,consistingof text-boxesandimages—wediscussthesefurtherinSection4. 27Mtaskinstances,asubstantialnumber. Figure1comparesthe • Performancemetrics—wecomputeandstoredifferentmetrics numberofdistincttaskssampledversusthetotalnumberofdistinct to characterize the latency, cost and error of tasks to help us tasks that were issued to the marketplace across different weeks, performquantitativeanalysesonthe“effectiveness”ofatask’s Weobservethatingeneralwehaveasignificantfractionoftasks design,discussedfurtherinSection4.1. fromeachweek. 2.3 DatasetAttributes 2.5 WhataretheGoalsofOurAnalysis? Thedatasetisprovidedtousatthebatchlevel.Foreachbatchin Aspreviouslymentioned, ourmaingoals(atabroadlevel)are oursample,wehavemetadatacontainingashorttextualdescription toquantitativelyaddressthequestionsof(1)Marketplacedynam- ofthebatch(typicallyonesentence),aswellasthesourceHTML ics—helpingmarketplaceadministratorsandownersunderstand codetoonesampletaskinstanceinthebatch.Inaddition,themar- theinteractionbetweentasksandworkers,andthecorresponding ketplace also provides a comprehensive set of metadata for each marketplaceload;(2)Taskdesign—helpingrequestersunderstand taskinstancewithinthebatch,containing whatconstitutesaneffectivetask,andhowtogoaboutdesigning • Workerattributessuchasworker ID,location(country,region, one; and (3) Worker behavior — understanding worker attention city, IP), and source (recall that this marketplace recruits spans,lifetimes,andgeneralbehavior,fortheimprovementofthe workersfromdifferentsources); crowdsourcing ecosystem as a whole. While the first two goals • Itemattributessuchasitem ID;and directlyimpactrequestersandmarketplaceadministrators,webe- • Taskinstanceattributessuchastask ID,start time,end time, lievetheywillalsohelpindirectlyimprovethegeneralworkerex- trust score,andworker response. perienceintermsofavailabilityofdesirabletasks,andareduction Aswecanseeinthislist,themarketplaceassignsworkersatrust inthelaboriousnessofperformingtasks.Wenowdiscussourgoals scoreforeverytaskinstancethattheyworkon. Thistrustscore inalittlemoredetail,bybreakingeachofthemdownintosub-goals reflectstheaccuracyoftheseworkersontesttaskstheanswersto anddescribingtheexperimentsweperformtoanswerthem. whosequestionsisknown. Themarketplaceadministersthesetest Marketplacedynamics. Tounderstandthemarketplace,thefirst tasksbeforeworkersbeginworkingon“real”tasks.Unfortunately, goal of our analysis is to examine general statistics that help us we were not provided these test tasks, and only have the trust estimatethescaleofcrowdsourcingoperationswithinthemarket- scoreasaproxyforthetrueaccuracyoftheworkerforthatspe- place.Thisfirstorderanalysisisusefulformarketplaceadministra- cifictypeoftask.Inadditiontothetrust score,wealsohavein- tors, helpingthemestimatetheresourcesrequiredtomanagethis formationaboutworkeridentities,andotherattributesoftheitems scale of operations, and identify key limitations. Thereafter, we being operated on, and the start and end times for each task in- lookattheavailabilityandflowoftasksorworkersonthemarket- stance. place — specifically, we check if the marketplace witnesses sud- At the same time there are some important attributes that are den bursts or a steady stream of activity. This analysis gives us notvisibletousfromthisdataset.Forinstance,wedonothavere- concrete take-aways that can help future marketplace design bet- questerIDs,butwecanusethesampletaskHTMLtoinferwhether terload-balancingstrategies. Lastly,weanalyzethemanuallabels twoseparatebatcheshavethesametypeoftask,andthereforewere assigned to tasks to look at the types of tasks that have become probablyissuedbyasinglerequester.Nordowehavehave“ground popularwithinthemarketplaceasameanstowardsabettercharac- truth” answers for questions in the tasks performed by workers. terizationofthespectrumofcrowdworkasawhole. However,aswewilldescribesubsequently,wefindotherproxies TaskDesign.Tohelpimprovetaskdesign,wemustfirstbeableto tobeabletoestimatetheaccuraciesofworkersortasks.Finally,we characterizetheeffectivenessoftasksbothqualitativelyandquan- donothavedataregardingthepaymentsassociatedwithdifferent titatively. Thethreewellknownaspectsusedtotalkabouttheef- tasksandbatches. fectivenessofacrowdsourcedtaskare(a)latency,(b)accuracyand (c)cost. Consequently, thefirststepinouranalysisistoidentify 2.4 Howdidweenrichthedata? performancemetricsthatmeasuretheseaspects.Thenextstepisto The raw data available for each batch, as described above, is studytheimpactofvaryingdifferentdesignparametersoneachof byitselfquiteusefulinexploringhigh-levelmarketplacestatistics thesemetrics. Thisanalysis,whenperformedonadatasetaslarge suchasthenumberoftasksandworkersovertime,thegeographic asours,allowsustomakedata-informedrecommendationstore- distributionofworkers,typicaltaskdurations,andworkerlifetimes questerslookingtodesigntasksthatthatareansweredaccurately andattentionspans. Thatsaid, thisrawdataisinsufficienttoad- byworkerswithlowlatencyandatlowcost. dress many of the important issues we wish to study. For exam- Workeranalyses. Aworker-centricviewofthemarketplacecan ple, we cannot automatically identify whether a task operates on helpusinunderstandingtheworkers’experienceandmakechanges webdataoronimages,andwhetherornotitcontainsexamples,or appropriately to make it easierfor them to work. In this respect, free-formtextresponseboxes. Toaugmentthisdataevenfurther, wefirstlookatthevariouslaborsourcesthatprovideworkersthat weenrichthedatasetbyinferringorcollectingadditionaldata.We performworkforthemarketplace. Theload,resourceandquality generatethreeadditionaltypesoftaskattributedata: distributionacrossthesesourcescanpointresearchersindevising • Manuallabels—wealsomanuallyannotateeachbatchonthe appropriate load-balancing strategies, and can point practitioners basisoftheirtaskgoal,e.g.,entityresolution,sentimentanal- towardstheidealsource(s)forcrowdwork. Next,weexaminethe ysis,operatortype,e.g.,rating,sorting,labeling,andthedata geographical distribution of workers — this gives us vital infor- type in the task interface, e.g., text, image, social media, dis- mationabouttheactivetime-zonesoftheworkforceandcanhelp cussedfurtherinSection3.4. marketplaceadministratorsinensuringaconstantresponserateon • Designparameters—weextractandstorefeaturesfromthesam- themarketplace.Finally,wealsostudytheendlifetimesandatten- ple HTML source as well as other raw attributes of the tasks tionspansoftheworkerstofigureoutwhatfractionofthework- thatreflectdesigndecisionsmadebytherequesters.Forexam- forceareregularlyactiveonthemarketplace,andhowmuchtime ple, wecheckwhetherataskcontainsinstructions, examples, istypicallyspentonthemarketplacebyworkersonasingleday. 3 300 50 300 Number of distinct tasks11225050500000 all sampled Number of distinct tasks 11223344505050505 all sampled Number of distinct tasks 11225050500000 all sampled 0 0 0 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Jul’12 Oct’12Jan’13Apr’13 Jul’13 Oct’13Jan’14Apr’14 Jul’14 Oct’14Jan’15 Nov’14Jan’15Mar’15May’15Jul’15Sep’15Nov’15Jan’16Mar’16May’16Jul’16 Date (by week) Date (by week) Date (by week) (a)Entireduration (b)PreJan2015 (c)PostJan2015 Figure1:Numberoftaskssampled(byweek) 3. MARKETPLACEANALYSES freelanceworkforceimpliesthatforthemajorityoftaskstheydo Inthissection, weaimtogaininsightsintothehighlevel, ag- not have to rely on the more expensive skilled workers, and can gregateworkingsofthemarketplace. First,weexaminesomeba- thereforebeconservativeintheiruseofresourcesinmaintaininga sic statistics of the marketplace, to understand the worker supply pooloftheseinternal“super-workers”.Thishashugeimplications and task demand interactions. Specifically, we look at (a) task forindividualswhorelyoncrowdsourcingasasolesourceofin- instance arrival distribution (Section 3.1), (b) worker availability come:dependingontheweek,theymaynothaveenoughtasksthat (Section3.2),and,(c)marketplaceload,orcontributionfrom“heavy- suittheirinterestorexpertise. -hitter” tasks, occupying a bulk of the tasks in the marketplace. Striking a good balance between the two task routing mecha- Then,inSection3.4,weexplorethetypesoftasksobservedinour nisms and worker pools is crucial to ensuring that all three par- dataset,tobetterunderstandthequestionsanddatatypesofinter- tiesaresatisfied: (1)themarketplaceisabletoclearpendingtasks estforrequesters.Wealsolookforcorrelationsacrosstheselabels without a building backlog, while maintaining requisite levels of tounderstandwhattypesoftasksoccurtogether. Finally,welook accuracyandcost,(2)requestersreceivequalityresponsesforeasy at trends in the complexity of tasks over time to gain additional andhardtasks,anddonotseehighlatenciesinresponses,and(3) insightsintotheevolutionofthemarketplace(Section3.5). workershaveaccesstoasmuchworkastheycanhandle,aswellas tasksthatcancatertotheirvariedinterestsandexpertiselevels. 3.1 Aretasksuniformorburstyovertime? Besidestheburstynatureoftaskinstancearrivalsacrossweeks, themarketplacealsowitnessesperiodsoflowtaskarrivalsonthe We first study the rate at which task instances arrive into the weekends—thenumberofinstancespostedonaweekdayisupto marketplace, andtherateatwhichtheyarecompleted. Notethat 2× the number of instances posted on Saturdays or Sundays on theloadonthemarketplaceisgovernedbythenumberoftaskin- average.Further,theaveragenumberofinstancespostedatthestart stances,whichisthefundamentalunitofworkvisibletoworkers, oftheweekisthehighest,followingwhichthenumberdecreases ratherthanthenumberofbatches;batchescanbearbitrarilysmall overtheweek.WeplotthisinFigure3. orlarge. Weplotthenumberoftaskinstancesarrivingandbeing completedeachweekinFigure2ainblue. First,notethatthetask arrivalplotisrelativelysparseuntilJan2015,whichispresumably # of tasks issued whenthemarketplacestartedattractingmorerequesters. Second, 6.0M after June 2014, there are some very prominent peaks, on top of 5.0M regularactivityeachweek. Thissuggeststhatwhiletaskinstances 4.0M arrivefairlyregularly,thereareperiodsofburstiness. Considering 3.0M the period from Jan 2015 onwards, the median of the number of 2.0M taskinstancesissuedinadayonthemarketplaceisabout30,000. 1.0M Incomparison,onitsbusiestday,morethan900,000taskinstances 0.0 wereissued,a30×increaseovernormallevels.Similarly,thenum- Mon Tue Wed Thu Fri Sat Sun beroftaskinstancesissuedonthelightestdayis0.0004×smaller Figure3:Distributionoftasksoverdaysoftheweek than the median. This raises the question: where does the high variationinthenumberoftaskinstancesderivefrom—dothenum- berofbatchesoftasksissuedfluctuatealot,ordothenumberof Takeaway:Marketplaceswitnesswidevariationinthenumber distincttasksissuedthemselvesfluctuatealot? Forthisanalysis, oftasksissued,withdailynumberofissuedinstancesvarying weoverlaythenumberoftaskinstancesissuedonthemarketplace between0.0004×,toupto30×themedianload(30,000in- withthenumberofbatchesandthenumberofdistincttasksissued stances). fortheperiodpostJanuary2015inFigure2b.Forboththesemea- sures,wefindthatthefluctuationissimilartothefluctuationinthe number of task arrivals, indicating that both factors contribute to 3.2 How does the availability and participa- thehighvariationinthemarketload. tionofworkersvary? Notethatacommonexplanationforwhycrowdsourcingisused in companies is the ability to shrink or grow labor pools on de- Worker Availability. As described earlier, the specific market- mand [34]; this finding seems to suggest that even marketplaces place we work with attracts workers from a collection of labor needtobeabletoshrinkandgrowlaborpoolsbasedondemand. sources. In this manner, it is able to keep up with the spikes in Forthismarketplace,havingaccesstobothpushandpullmecha- demand.Weinvestigatethesourcesthemarketplacedrawsfromin nismsprovidesgreatflexibility.Notonlycantheyroutetheharder Section5. Inthissection,wefocusonstudyingthenumberofac- taskstotheirmoreskilled,“on-demand”workers,theycanalsouse tiveworkersacrossdifferentweeks:Figure4depictsthisstatistic. this push mechanism to reduce latencies for requesters and clear UnlikeFigure2athathadahugevariationinthenumberissued backlogged of tasks. On the other hand, the presence of a large taskinstances, especiallyafter2015, Figure4doesnotshowthis 4 1M 400k 1M 450 1M 140 1M # tasks issued Pickup Time 350k 1M # instances # batches 400 1M # instances # distinct tasks 120 Tasks Issued468000000kkk 112230505000000kkkkk # task instances468000000kkk 112233050505000000 # task instances468000000kkk 46810000 0 200k 50k 200k 50 200k 20 0 0 0 0 0 0 Jul’12Jan’13Jul’13Jan’14Jul’14Jan’15Jul’15Jan’16Jul’16 Nov’14Jan’15Mar’15May’15 Jul’15 Sep’15Nov’15Jan’16Mar’16 Nov’14Jan’15Mar’15May’15 Jul’15 Sep’15Nov’15Jan’16Mar’16 (a)TaskinstancearrivalvsMedianpickuptimes (b)Taskinstancearrival(postJan2015)vs(1)batcharrival,(2)distincttaskarrival Figure2:TaskArrivalsbyweek 7k 6k assignedtothemarketplace’sinternalorexternalworkers. Weob- 5k servedthattheinternalworkersaccountforaverysmallfractionof kers 4k tasks.taskarrivaloverlaywithinternalandexternal wor 3k 3.3 What is the distribution of work across # 2k differentdistincttasks? 1k Next, wewantedtostudywhetherthereareasmallnumberof 0 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 tasksthatdominatethemarketplace(e.g.,repeatedlylabelingdif- Figure4:Numberofworkersperformingtasks ferentitems, issuedbyasinglerequester). Tostudythis,wefirst levelofvariation. Thus,somewhatsurprisingly,eventhoughthere clusteredthebatchesinourdatasetbasedonmetadatafromtheex- arehugechangesinthenumberofavailabletaskinstances,roughly tractedHTMLsourcecorrespondingtothetasks(seeSection2.4), thesamenumberofworkersareableto“service”agreaternumber andtunedthethresholdofamatchtoensurethatthetasksthaton of requests. This indicates that there is a limitation more in the inspectionlookverysimilarandhavesimilarpurposesareactually supplyoftaskinstancesthanavailabilityofworkers. clustered together. We shall henceforth refer to these clusters of similarbatchescorrespondingtoadistincttask,assimplyclusters. Takeaway:Despitethehugevariationinthenumberofavail- Wedenotethenumberofbatchesinaclusterbyclustersize.Then, able tasks, roughly the same number of workers (with small inFigure6,weplotthedistributionofthenumberofclustersthat variations)areabletoserviceallofthesetasks. havedifferentclustersizes(bothonlogscale). Forexample,there were5clusterswithsizelargerthan100,indicatingthattherewere WorkerLatencies,Idleness,andTask-Distribution.Wenowat- 5distincttasks(eachlumpedintotheirownclusters)thatwereis- tempttoexplainhowroughlythesamenumberofworkersareable suedacrossatleast100batcheseach. Ascanbeseeninthechart, to accommodate for the variation in the number of tasks on the thereseemtobealargenumberoftasksthatare“one-off”witha platform. Our first observation is that the median latency in task smallnumber(< 10)ofbatches: thesetasks,beingone-off,can- instancesgettingpickedupbyworkers,notedaspickuptime(and notbenefitfrommuchfine-tuningoftheinterfacepriortoissuing definedformallylaterinSection4.1)inFigure2a,anddepictedin thetasktothemarketplace. Ontheotherhand, thereareasmall red,showsthatduringperiodsofhighload,themarketplacetends numberof“heavyhitters”: morethan10taskshadclustersizesof to move faster. We also zoom in to the high activity period af- over 100, indicating that these tasks had been issued across 100s terJanuary2015inFigure5atofurtherhighlightthistrend. One of batches. Notice that even within a batch the number of tasks possibleexplanationforthisobservationisthatwhenmoretaskin- maybelarge: westudythatinthenextplotinFigure7. Wesee stancesareavailable,alargernumberofworkersareattractedtothe a wide variation in the number of tasks issued - while 204 clus- marketplaceorrecruitedviaapush-mechanism—leadingtolower tershavelessthan10tasksissued,3clustershavemorethan1M latencies. Anotherpossibility(supportedbyourdiscussionbelow taskseach. Furthermore,these“bulky”clustershaveissuedclose for Figure 5b) is that with a higher availability of tasks, workers to 80k tasks/batch, so even slight improvements in the design of arespendingalotmoreactivetimeontheplatform,andhenceare thesebatchescanleadtorichdividendsfortherequester. Across morelikelytopickupnewtasksassoonastheyareavailable. thischart, themediannumberoftasksperclusteris400. Next, Next,welookintohowtheworkloadisbeingdistributedacross theworker-pool. InFigure5b, weplotthenumberoftaskscom- Distribution of cluster sizes pletedbythetop-10%(inredcolor)andthebottom-90%(ingreen 10k color)ofworkersineachweekandcompareittothetotalnumber oftasksissued.Weobservethatwhilethebottom-90%alsotakeon rs 1k e alotmoretasksduringperiodsofhighload,itisthetop-10%that st 100 u handlesmostoftheflux,andisconsistentlyperformingalotmore # cl 10 tasksthantheremaining90%. Similarly,examiningthesameplot 1 foraverageamountofactivetimespentbyworkersontheplatform 1 10 100 1k inFigure5balsoshowsthatthetop-10%areindeedspendingalot Cluster size moretimeonaverageperweektohandlethevaryingtaskloadas comparedtothebottom-90%.Thisobservationindicatesthatwhile Figure6: #ofbatchesinclusters havingalargeworkforcecertainlyhelps, itiscrucialtofocuson worker interest and engagement—attracting more “active” work- we drill down into the top 10 tasks which had over 100 batches, erscanallowmarketplacestohandlefluctuatingworkloadsbetter. theso-called“heavyhitters”. InFigure8weplotthecumulative We also examined the workload handled by workers from differ- numberoftasksissuedovertime, onelinecorrespondingtoeach entlaborsourcestoverifywhetherthemajorityofthisvariationis heavyhitterdistincttask. Ascanbeseeninthefigure,thesetasks 5 1M 60k 1M 1M 1M 600k # tasks issued Pickup Time # tasks issued # tasks issued 1M 50k 1M TTaasskkss CCoommpplleetteedd ((tboopt1900 wwoorrkkeerrss)) 1M 1M AAccttiivvee TTiimmee ((tboopt1900 wwoorrkkeerrss)) 500k Tasks Issued 468000000kkk 234000kkk Tasks Issued 468000000kkk 468000000kkk Tasks Issued 468000000kkk 234000000kkk 200k 10k 200k 200k 200k 100k 0 0 0 0 0 0 Nov’14Jan’15Mar’15May’15Jul’15Sep’15Nov’15Jan’16Mar’16May’16Jul’16 Nov’14Jan’15Mar’15May’15Jul’15Sep’15Nov’15Jan’16Mar’16May’16Jul’16 Nov’14Jan’15Mar’15May’15Jul’15Sep’15Nov’15Jan’16Mar’16May’16Jul’16 (a)TaskArrivalsvsMedianPickupTime (b)Engagementoftop-10%andbot-90%workers:(1)#Tasks(2)ActiveTime Figure5:TaskArrivalsbyWeek(PostJan2015) 10,000outofthetotal12,000batches(≈83%)and24millionout Distribution of tasks across clusters ofthetotal27milliontaskinstances(≈ 89%). Thesebatchesfall 1k intoabout∼3,200clusters. ers 100 • Task Goal: Here, we separate tasks based on their end goal. st Wefindthatmosttaskscanbecharacterizedashavingone(or u cl 10 more)ofthefollowing7goals2: (1)Entity Resolution (ER), # for instance, identifying if two webpages talk about the same 1 business,oriftwosocialmediaprofilescorrespondtoonesin- 1 10 100 1k 10k 100k 1M 10M # of tasks in cluster gleperson,(2)Human Behavior (HB),includingpsychology studies, surveys and demographics, and identifying political Figure7: #oftasksinclusters leanings,(3)Search Relevance Estimation (SR),(4)Quality Assurance (QA),includingspamidentification,contentmod- show both uniform and bursty availabilities. As an example, the eration, and data cleaning, (5) Sentiment Analysis (SA), (6) taskcorrespondingtothepurplelinehasonlybeenactiveinJuly Language Understanding (LU),includingparsing,NLP,and 2015whilethetaskcorrespondingtocyanlinehashadbatchesis- extracting grammatical elements, and (7) Transcription (T), suedregularlyover11monthsfromMay2015toApril2016. including captions for audio and video, and extracting struc- Takeaway: A huge fraction of tasks and batches come from turedinformationfromimages. afewclusters,sofine-tuningtowardsthoseclusterscanlead • Task Operator: In this category, we label tasks based on the to rich dividends. The heavy hitter task types have a rapid human-operators,orunderlyingdataprocessingbuildingblocks increasetoasteadystreamofactivityfollowedbyacomplete usedbyrequesterstoachievetasks’goals. Weobserveprimar- shutdown,afterwhichthattasktypeisneverissuedagain. ily10differentoperators:(1)Filter(Filt),i.e.,separatingitems intodifferentclassesoransweringbooleanquestions,(2)Rate (Rate),i.e.,ratinganitemonanordinalscale(3)Sort (Sort), 1M (4) Count, (5) Label or Tag (Tag), (6) Gather (Gat), i.e., provide information that isn’t directly present in the data, for 100k instancebysearchingtheweb,(7)Extract (Ext),i.e.,convert 10k implicit information already present in provided data into an- otherform, suchasextractingtextwithinanimage. (8)Gen- 1k erate (Gen),i.e.,generateadditionalinformationbyusingin- 100 ferencesdrawnfromgivendata, usingworkerjudgementand intelligence, such as writing captions or descriptions for im- 10 Jan’15Mar’15May’15 Jul’15 Sep’15Nov’15 Jan’16 Mar’16May’16 Jul’16 ages, (9) Localize (Loc), i.e., draw, mark, identify, or bound specific segments of given data and perform some action on Figure8:HeavyHitterDistribution individualsegments,e.g.,drawboundingboxestoidentifyhu- mansinimages,and(10)External Link (Exter),i.e.,visitan external webpage and perform an action there, e.g., fill out a 3.4 Whatkindsoftasksdowesee? surveyform,orplayagame. Wenowstudyourenrichedtask-labelsfromSection2.4inor- • DataType:Wealsoseparatetasksbasedonthetypeofdatathat dertocharacterizethespectrumofcrowdworkoninthemarket- isused.Thesamegoalsandoperatorscanbeappliedonmulti- place. Such an analysis can be very useful, for example, to de- pledatatypes.Alltaskscontainacombinationofthefollowing velop a workload of crowdsourcing, and to better understand the 7typesofdata: (1)Text,(2)Image,(3)Audio,(4)Video,(5) tasktypesthataremostimportantforfurtherresearch. Maps,(6)Social Media,and(7)Webpage. LabelCategories.Welabeleachtaskunderfourbroadcategories1. Label distribution. First, we analyze the distribution of labels Taskshaveoneormorelabelundereachcategory.Ourmechanism indifferentcategoriesacrosstasks. Figure9adepictsthepopular tolabeltasksistofirstclusterbatchestogetherbasedonsimilarity goals. We observe that complex unstructured data understanding ofconstituent tasks, and thenwelabel onetaskcorresponding to basedgoals—language understandingandtranscriptionarevery eachcluster,sincealltaskswithineachclusterhaveidenticalchar- common,comprisingofover4and3milliontasks,thatisaround acteristics. Thegoalofourclusteringistocapturetheseparation betweendistincttasks,whichisnotknowntous. Aslabelingisa labor-intensiveprocess,wecurrentlyhavelabelsavailableforabout 2 Tasksthathaduncommonoruncleargoalsanddidnotfallintooneoftheseclasses, 1Labelingwasperformedindependentlybytwooftheauthors,followingwhichthe wereautomaticallyclassifiedasOtherorUnsurerespectively. Thisholdsforthe differenceswereresolvedviadiscussion. othercategoriesbesidesgoalsaswell. 6 17%and13%respectively,despitenothavingseenextensiveopti- Takeaway: Weobservethatthemarketplaceexhibitsadiverse mizationresearch,asopposedtotraditional,simplergoalslikeen- rangeoftasksspanningacrossover7broadgoals,atleast10 tity resolutionandsentiment analysisthathavebeenextensively distinctoperationsand7datatypes. analyzed. Figure9bshowsthattextandimagearestillthemain • Text and image data are by far the most prevalent and typesofdataavailableandanalyzed—9.6million(40%)and6.3 utilized across most tasks. Web and social media data million (26%) tasks contained text and image data respectively. arealsoavailableandrelevanttoasmallsubsetoftasks Audioandvideodataarealsoused,andotherrichertypesofdata (inparticulartasksinvolvingdataintegrationandclean- like social media, web pages, and maps are gaining popularity. ing for web, and natural language processing for social Figure9cshowsthecommonoperatorsused.Whilethedistribution mediadata). Whiletextandimagedata(andtoalesser ofgoalsindicatesthatasignificantfractionoftaskshavecomplex extent,webdata)havebeenheavilystudiedusingseveral goals,theunderlyingoperatorsarestillpredominantlysimple.The different operators, there are still many exciting avenues marketplaceisdominatedbythefundamentalfilterandrateopera- waitingtobeexploredfortheothertypesofdata. tions—over8million(33%)tasksemploysomefilteringoperator, • Filterandrateareusedasbasicoperatorsforachieving andnearly3million(13%of)tasksmakeuseofratingoperators. mostgoalsandanalyzingalltypesofdata. Itiscrucialto Among more complex operators, we see that gathering, extrac- understandandoptimizetheusageoftheseoperators. tion,localization,andgenerationarefrequentlyapplied,together • Language understanding, and transcription seem to be beingusedinaround5.3million,i.e.,22%ofalltasks. very popular task goals constituting of a large number Goals,operatorsanddatatypesthatoccurfrequentlytogether. tasks. Consideringthefactthatthesetasksrequirecom- Next,welookatthecorrelationsbetweenthethreetypesoflabels plexhumanoperations(generationandextractionasop- for tasks. For example, one question we aim to answer is what posed to the simple filter and rate operations), it might kindsofoperatorsaretypicallyappliedtodifferenttypesofdata, beworthwhiletotrainandmaintainaspecializedworker orusedtoachieveparticulargoals? Lookingatsuchcorrelations poolforsuchtasks. across goals, operators, and data types provides fine-grained in- • For the popular goals of Language understanding, sightsintothestructureanddesignoftasksthatisnotimmediate and transcription, we expect the heavy percentage of fromouraggregatestatisticsalone. Thechartsdepictingthecor- text-baseddata. Itisinterestingtoobservethehighper- relationcanbefoundinFigure9and10. Forinstance,Figure10b centageofofsocial mediaandimagedataforthesetasks shows the breakdown for each goal by the percentage usage of aswell. different operators towards achieving that goal; Figure 9c serves as a legend for the stacked bars. (Figure 11b, legend Figure 9a, 3.5 Trendtowardsopen-endedtasks. yields similar insights, but from a slightly different perspective.) Weobservethatfilterandrateoperatorsareusedinmostkindsof Inthissection,ouraimistoexplorethetrendinthecomplexity tasks,aswellasformasignificantmajorityastheconstituentbuild- ofcrowdsourcedtasksovertime. Thatis,weintendtoanswerthe ingblockformostgoals. Onenotableexceptionistranscription followingquestions: (which, recall,constitutesover13%ofalltasksbyitself, making • Arerequestersmovingontomorecomplex,oropen-endedgoals? itasignificantexception),wheretheprimaryoperationemployed • Aretheylookingatmorechallengingdatasets? isextraction.Asanotherexample,Figure10ashowsthattextand • Aretheyusingmoresophisticatedtoolsorutilizinghumanin- imagesareimportantforalltypesoftaskgoals,forcertaintypes, telligencemoreeffectivelythaninthepast? e.g., ER, SA, SR, social media is also quite important. Lastly, We split each of our categories, goals, operators, and data into Figure10cshowsthatbeyondfilteringandratingbeingimportant, twoclasses:simpleandcomplex.Amongthesetofobservedgoals, extraction is used quite prominently on text and image data, of- weclassify{entity resolution,sentiment analysis,quality assur- tenrivalingthatoffiltering. Forlanguage understandingtasks, ance}assimple, andtheremaining7ascomplex. Foroperators, while filter and rate are the primary operations, generate is also weclassifyfilterandrateassimpleandtheremaining8ascom- usedfrequently(16%ofthetime). Also,fortaskslookingtoun- plex. Fordatatypes,weonlyconsidertextassimple,andthere- derstand human behavior, 13% of the tasks involve performing maining6typesascomplex.Whilethisclassificationissubjective, operations at external links (such as online surveys), and 9% of ourhigh-levelobservationsapplytomostreasonablemappingsof the tasks involve localization. As Figure 11c (legend Figure 9b) labelsto{simple,complex}. indicates,filterandrateoperatorshavebeenusedtoanalyzemost InFigure12, wecomparethetrendbetweennumberofsimple typesofdataaswell. tasks and the number of complex tasks on the marketplace over Figure10ashowsthebreakdownforeachgoalbythepercent- time. Onthex-axis, weplottimeinincrementsofoneweek. On ageofdifferentdatatypespresentintaskshavingthatgoal. Fig- they-axisweplotthecumulativenumberofclustersoftasks,that ure9bservesasalegendforthestackedbars. (Figure11a,legend is the number of unique tasks, issued so far – one line each for Figure9a,yieldssimilarinsights,butfromaslightlydifferentper- simple, versus complex tasks. Note that we deduplicate similar spective.) While for most goals, a large fraction of data used in batchesandcountthemasasinglepoint,sotheseplotsrepresent tasks seems to be text and image based, we observe that for en- theinterestsofallrequestersequally. FromFigures12aand 12c, tityresolutionandsearchrelevance,webdataisrelevant(serving weobservethatthenumberofclustersoftasksinvolvingcomplex 24%and37%ofentity resolutionandsearch relevancetasksre- goalsandnon-textualdatatypesismuchlarger,andgrowingfaster, spectively). Also,sentiment analysisandlanguage understand- thanthecorrespondingnumbersofsimpleclusters.Forinstance,as ingstyleofanalysesemploysocialmediaasasignificantfraction ofJanuary2016, therehadbeenaround(a)510clustersinvolving oftheirinputdata(13%and8%respectively). Whilesomeefforts non-textualversusabout240clustersinvolvingtextdata, and(b) arebeingmadetowardsanalyzingothertypesofdata(besidestext 620clusterswithcomplexgoals,andjust80clusterswithsimple andimage),theyarestilllargelyunexplored. goals.Bycontrast,Figure12bdemonstratesthattheusageofcom- plex operators is comparable to that of simple operators. Specif- ically, we observe a total of around 410 clusters using complex 7 4.5M 10.0M 9.0M 4.0M 9.0M 8.0M 3.5M 8.0M 7.0M 3.0M 7.0M 6.0M 2.5M 6.0M 5.0M 5.0M 2.0M 4.0M 4.0M 1.5M 3.0M 3.0M 1.0M 2.0M 2.0M 500.0k 1.0M 1.0M 0.0 0.0 0.0 ER HB SR QA SA LU T Social Web Image Map Video Text Audio ExterRateSort Gat TagCountFilt Ext Loc Gen (a)PopularTaskGoals (b)Populardatatypes (c)Popularoperators Figure9:Distributionofgoals,datatypes,andoperators 120.0 120.0 120.0 100.0 100.0 100.0 80.0 80.0 80.0 60.0 60.0 60.0 40.0 40.0 40.0 20.0 20.0 20.0 0.0 0.0 0.0 ER HB SR QA SA LU T ER HB SR QA SA LU T SocialWebImageMapVideoTextAudio (a)Datausedfordifferenttaskgoals (b)PopularOperatorsfordifferenttaskgoals (c)PopularOperatorsfordifferentdata Figure10:Correlationsacrossdataandgoal,operatorandgoal,andoperatoranddata 120.0 120.0 120.0 100.0 100.0 100.0 80.0 80.0 80.0 60.0 60.0 60.0 40.0 40.0 40.0 20.0 20.0 20.0 0.0 0.0 0.0 SocialWebImageMapVideoTextAudio ExterRateSortGatTagCountFilt ExtLocGen ExterRateSortGatTagCountFilt ExtLocGen (a)Populartaskgoalsfordifferentdata (b)Populartaskgoalsfordifferentoperators (c)Populardatafordifferentoperators Figure11:Correlationsacrossgoalanddata,goalandoperator,anddataandoperator 800 500 700 700 simple complex 450 simple complex 600 simple complex 400 600 350 500 Count 345000000 Count 223050000 Count 340000 200 150 200 100 100 50 100 0 0 0 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Date Date Date (a)Goals (b)Operator (c)Data Figure12:Simplevscomplextasksovertime 8 operators and 340 clusters involving simple operators issued cu- tasks,spreadoutacrossalargenumberoffeatures(suchasthose mulativelyasofJanuary2016. discussedinthesectionstofollow)andlabels(goals,operatorsand data types)—the remaining data is too sparse for any statistically Takeaway:Whilerequesters(andresearchers)aremoreinter- significantinferencestobemade. ested in achieving complex goals on complex data (and get- Forthesecondoptionofcomputingdisagreementonnon-textual tingmoresowithtime),thefundamentalhuman-operatorsof fields,wefaceaproblemwiththedistributionofdisagreementval- filter and rate, are by themselves still as widely used as all uesitself. First,forallthetasksthatonlyhavetextualresponses, otheroperatorscombined. Thisraisesboththeneedtoopti- itisnotpossibletodefineadisagreementscore;weareunableto mizetheexistingsimpleoperatorsasfaraspossible,aswellas computeadisagreementscoreinthismannerforover60%ofall theopportunityfortheexplorationandunderstandingofmore batches. Second,ignoringtextfieldsmisrepresentsthetruedistri- complexoperators. butionofdisagreementsfortheremainingdatasets. Itispossible thatwerepresenttaskswithhighdisagreementashavinglowdis- agreementsimplybecausetheyhaveasmallnumberofnon-textual 4. EFFECTIVETASKDESIGN fields. Inthissection,weaddressthequestionofeffectivetaskdesign. A third approach would be an edit-distance or partial scoring Specifically, we(1)characterizeandquantifywhatconstitutesan scheme;however,thisapproachisnotidealsinceinpracticecrowd- “effective” task, (2) make data-driven recommendations on how sourcingrequestersrequirehighexactagreement,notpartialagree- requesters can design effective tasks, and (3) predict the “effec- ment,sothattheanswerscanbeeasilyaggregatedviaconventional tiveness”oftasksbasedonourhypotheses. majorityvotetypeschemes. Furthermore,manytaskswithtextual responsesareobjective. Somecommonexamplesthatweseein- 4.1 Metricsforeffectivetasks cludetranscription,captcha,imagelabeling,andretrievingURLs. Thestandardthreemetricsthatareusedtomeasurecrowdsourc- Forsuchtextualbutobjectivetasks,itisnotclearifanedit-distance ing effectiveness are: error, cost, and latency. There are various basedagreementschemeistherightapproach. waysthesethreemetricscouldbemeasured; wedescribeourno- Cost:MedianTaskTime.Atypicalmeasureforhowmucheffort tionsbelow,givenwhatwecancalculate. aworkerhasputintoataskinstanceistheamountoftimetakento Error:DisagreementScore.Inourdataset,wehaveeveryworker completeit.Sincewedonothaveinformationabouttheactualpay- answerprovidedforeachquestionwithineachtaskinstance,oper- mentsmadetoworkers,weusethemedianamountoftimetaken atingononedistinctitem,butnotthecorrespondinggroundtruth (inseconds)byworkerstocompletetasksinabatchasaproxyfor answers. We use these answers to quantify how “confusing” or thecostofthebatch. Thiscanbecalculatedfromthedatathatis ambiguousataskis,overall. Thewaywequantifythisistocon- available,giventhatwehavethestartandendtimesforeachtask sidertheworkeranswersforagivenquestiononagivenitem. If inabatch. Weshallsubsequentlydenotethe“MedianTaskTime” theworkersdisagreeonaspecificquestiononaspecificitem,then bytask-time. thetaskislikelytobeambiguous—indicatingthatitispoorlyde- Latency: MedianPickupTime. Tocharacterizelatency,weuse signed,orhardtoanswer—eitherway,thisinformationisimportant pickuptime,i.e.,howquicklytasksarepickedupbyworkers,on todictatethetaskdesign(e.g.,clarifyinstructions)andthelevelof average.Pickuptimeforabatchiscomputedasfollows:pickup-time redundancy (e.g., more redundancy for confusing questions) that =median(<starttimeoftask −batchstarttime>)(inseconds). i shouldbeadoptedbyrequesters.Ourproxyforerroristheaverage Here,weusethestarttimeoftheearliestbatch,i.e.starttimeoftask , 1 disagreementintheanswersforquestionsonthesameitem,across asaproxyforthebatchstarttime.Wejustifythischoiceforthela- allquestionsanditemsinabatch.Weconsiderallpairsofworkers tencymetricquantitativelyinbelow. . Ourreasonsaretwofold. who have operated on the same item, and check if their answers First,weobservethatthepickup-timeoftasksistypicallyorders are the same or different, giving a score of one if they disagree, ofmagnitudehigherthanthatoftask-time,whichmightotherwise andzeroiftheyagree;wethencomputetheaveragedisagreement seemlikeareasonableproxyforlatency. Thismeansthattheac- scoreofanitembyaveragingallthesescores;andlastly,wecom- tualturnaroundtimeforataskisdominatedbywhenworkersstart putetheaveragedisagreementscoreforabatchbyaveragingthe itsinstances,ratherthanhowlongtheytaketocompletethemonce scoresacrossitemsandquestions. Weshallhenceforthrefertothe started. Figures 13a and 13b support this claim. For each of the “DisagreementScore”asdisagreement. figures, we compare the pickup-time against the task-time, both Thereishowever,onesmallwrinkle.Someoperators,andcorre- onthey-axiswithvaryingend-to-end-timealongthex-axis. Fig- spondingworkerresponsesmayinvolvetextualinput. Twotextual ure 13a shows this distribution at a batch level, with the median responses may be unequal even if they are only slightly different values for pickup-time and task-time being plotted against each fromeachother.Sincetextualresponsesoccupyalargefractionof batch’s end-to-end-time. Figure 13b shows this distribution at a ourdataset,itisnotpossibletoignorethemaltogether. Weinstead task instance level, with each task’s individual pickup-time and adopt a simple rule: we prune away all tasks with disagreement task-timebeingplottedagainstitsend-to-end-time,whichinthis >0.5soastoeliminatetaskswithveryhighvariationsinworker caseissimply(pickup-time+task-time)(toreducethenumber responses. Thiseliminatesthesubjectivetextualtasks,whilestill ofpointsintheplot,weonlyplotthemedianofpickup-timeand retainingthetextualtasksthatareobjective. task-time corresponding to a vertical splice, that is, we plot one Anotherwaytohandlethesubjectivityoftasksistosimplyig- medianpointforallinstanceshavingacommonend-to-end-time. noretext-boxes.Thiscouldbedoneintwoways:(1)onlyevaluate Weobservethatinbothplots,thepickup-timeisordersofmagni- disagreement for tasks with no text-box fields, and (2) for every tudehigherthanthetask-time. Secondly,mostmeasuresoftime task, computedisagreementonlyonitsnontextualfields. Inour thatwecanobtainfromouravailabledatastronglydependonfea- experiments,wetriedboththeseoptions,butrejectedthemforrea- tures,suchasthesizeanddifficultyofatask. Sincepickup-time sonswediscussbelow. only looks at the time taken for workers to start a task and not Itturnsoutthatalargemajorityoftasksinourdatasetcontain howlongtheyspendonit(which,aswehaveseen,isanywayan atleastonetext-boxfield. Eliminatingallofthemleavesveryfew insignificantfractionoftime), itisrelativelyindependentofsuch 9 valuebetterthanm.Thus,ahighervalueispreferable;andwe comparethetwobins(orlines)inthisplot. Below,inSections4.3-4.7,welookattheresultsforsomeofthe significantcorrelationswefound. 4.3 NumberofHTMLwords (a)Batch-level (b)Task-level Weexaminehowthelengthoftask—definedasthenumberof wordsintheHTMLpage,anddenotedas#words—impactstheef- Figure13:Latency fectiveness of the task. We show the effect of length of task on features. Thishelpsseparateouttheinfluenceoffeaturesthatre- our metrics in Figure 14a. We observe that the line for clusters questers often cannot control, and that we cannot quantify, from withhigher#wordsintheirHTMLinterfacedominates,orisabove our latency metric, making our subsequent quantitative analyses thelinefortheclusterswithfewer#words. Weseethattheme- morestatisticallymeaningful. Inshort,weobservethatingeneral dianvalueofdisagreementfortaskswith#words≤466is0.147, thepickup-timeforbatchesisordersofmagnitudehigherthanthe while that for tasks with #words > 466 is 0.108. This may be task-time, indicating that the latency or total turnaround time of becauselongertaskstendtobemoredescriptive,andthedetailed a task is in fact dictated by the rate at which workers accept and instructions help reduce ambiguity in tasks, train workers better, startthetaskinstances. Wedenotethe“MedianPickupTime”by andtherebyreducemutualdisagreementinanswers. Wealsonote pickup-time. that the length of the task does not significantly affect either the pickup-timeortask-timemetrics. Thus,workersareneitherdis- couragednorsloweddownbylongertextualdescriptionsoftasks. 4.2 CorrelationAnalysisMethodology WhileincreasingthenumberofwordsintheHTMLsourceof Inthenextsetofsubsections,weexaminesomeinfluentialfea- tasks helps reduce disagreement in general, this benefit may be tures or parameters that a requester can tune, to help improve a morepronouncedforparticulartypesoftasks. Intuitively,weex- task’serror(disagreement),cost(task-time)andlatency(pickup- pectdetailedinstructionstohelpmoreforhardertasks, andhave -time). For instance, features of a task include the length of the less impact on easier tasks. To test this hypothesis, we separate task, or the number of examples within it. For each feature, we tasksintobucketsbytheirlabels(recallgoal,operatoranddata), look at the correlation between the feature and each of the three andtesttheeffectofourfeature,#words.FromFigure25a,wesee metrics. Weperformaseriesof(correlation-investigating)experi- thatfor(relativelyhard)gathertasks,#wordshasapronouncedef- ments,eachofwhichcorrespondstoone{feature,metric}pair.All fectondisagreementwithhigher#wordsleadingtosignificantly ourexperimentsfollowthefollowingstructure: lowerdisagreement.Ontheotherhand,Figure25bseemstoindi- • Cluster: We first cluster batches based on the task in order catethatfor(relativelysimple)ratingtasks,#wordshasnosignif- to not have the “heavy-hitter” tasks that appear frequently in icantimpactondisagreement. multiple batches across the dataset to dominate and bias our Example. Todemonstratetheeffectofhavingmoredetailedde- findings.Sinceouranalysiswillalsoinvolvematching,orclus- scription,orhighernumberofwordsinatask’sHTMLinterfaceon teringtasksfurtherbasedonlabels,werestrictourfocustothe disagreement, wecomparetwoactualtaskswhicharebothfrom setofaround3,200labeledclusterscorrespondingto83%ofall the domain of Language Understanding, but differ in their de- batchesand89%ofalltaskinstances. Subsequently,foreach scriptivenessand#num-words.Welookattwodifferenttasksthat cluster,wetakethemedianofmetricvaluesacrossbatches,as requireworkerstofindurlsoremailIDsofbusinessesthroughba- wellasthemedianofthefeaturebeinginvestigated. sicwebsearches. Bothhaveextremelysimilarinterfaces,andask • Binning: Weseparatetheclustersintotwobinsbasedontheir similarquestions. Neitheremploysexamples(whichweshallsee featurevalue—allclusterswithfeaturevaluelowerthanthe later has a significant impact on disagreement). The main differ- globalmedianfeaturevaluegointoBin-1(say),whiletheones encebetweenthetwotasksisthatthefirst(having970instances), withfeaturevaluehigherthanthemediangointoBin-2.(Clus- hasmediannumberofwords=233,whilethesecond(having1254 terswithfeaturevalueexactlyequaltothemedianareallput instances)hasamedianof6072wordsinitsHTMLinterface.Fig- intoeitherBin-1orBin-2whilekeepingthebinsasbalanced ure15depictsthefirsttaskandFigure16depictsthesecond. The aspossible.) Foreachmetric,wethenexamineitsvaluedistri- firsttaskusestheseextrawordstogivedetailedinstructions(shown butioninthetwobins—inparticular,welookforsignificant inFigure16a)onhowtogoaboutthetask. Incontrast,thesecond differencesbetweentheaverage,median,ordistributionofmet- taskhasalmostnodescriptionatall. Itrequiresworkerstoenter ricvaluesinthetwobins. Asignificantdifferenceindicatesa the“synonymy”ofcorrectsentences,andtocorrectincorrectsen- correlationbetweenthefeaturewehavebinnedon,andthemet- tences,withoutgivinganyexamplesorinputforwhatthesetasks ricbeinglookedat. Wethenhypothesizeabouttheunderlying entail. Whilethefirsttaskhasamediandisagreementof0.26,the reason(s)behindthecorrelation. secondshowsamediandisagreementof0.08. Thisdemonstrates • Statisticalsignificance: Weperformat-testtocheckwhether thepowerofexamplesinreducingtaskambiguity. themetricvaluedistributioninourtwofeature-value-separated binsisstatisticallysignificant. Weuseathresholdp-valueof 0.01todeterminesignificance, thatis, weonlyrejectthenull Takeaway: Tasks with higher #words in their HTML sources hypothesis(thatbinshavesimilarmetricvalues)ifthep-value are typically the ones with more detailed instructions or ex- islessthan1%. amples.Weseethatthishastheeffectofdecreasingdisagree- • Visualization: Foreachfeature-metricpair,weplotacumula- mentamongstworkers,particularlyforcomplextasks. tivedistribution(CDF)plot,withthemetricvalueplottedalong the x-axis. Each of the two bins corresponds to one line in the plot. For x = m, the corresponding y value on each of 4.4 Presenceofinputtext-boxes thelinesrepresentstheprobabilitythatabatchwillhavemetric 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.