Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics PDF

21 Pages·2016·2.79 MB·English
Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace [Extended Technical Report]

Ayush Jain†, Akash Das Sarma*, Aditya Parameswaran†, Jennifer Widom*
†University of Illinois, *Stanford University

ABSTRACT

We conduct an experimental analysis of a dataset comprising over 27 million microtasks performed by over 70,000 workers issued to a large crowdsourcing marketplace between 2012-2016. Using this data, never before analyzed in an academic context, we shed light on three crucial aspects of crowdsourcing: (1) Task design: helping requesters understand what constitutes an effective task, and how to go about designing one; (2) Marketplace dynamics: helping marketplace administrators and designers understand the interaction between tasks and workers, and the corresponding marketplace load; and (3) Worker behavior: understanding worker attention spans, lifetimes, and general behavior, for the improvement of the crowdsourcing ecosystem as a whole.

1. INTRODUCTION

Despite the excitement surrounding artificial intelligence and the ubiquitous need for large volumes of manually labeled training data, the past few years have been a relatively tumultuous period for the crowdsourcing industry. There has been a recent spate of mergers, e.g., [27], rebrandings, e.g., [15, 17], slowdowns, e.g., [16], and moves towards private crowds [34]. For the future of crowdsourcing marketplaces, it is therefore both important and timely to step back and study how these marketplaces are performing, how the requesters are making and can make best use of these marketplaces, and how workers are participating in these marketplaces, in order to develop more efficient marketplaces, understand the workers' viewpoint and make their experience less tedious, and design more effective tasks from the requester standpoint. Achieving all these goals would support and sustain the "trifecta" of key participants that keeps crowdsourcing marketplaces ticking: the marketplace administrators, the crowd workers, and the task requesters.

At the same time, developing a better understanding of how crowdsourcing marketplaces function can help us design crowdsourced data processing algorithms and systems that are more efficient, in terms of latency, cost, and quality. Indeed, crowdsourced data processing is performed at scale at many tech companies, with tens of millions of dollars spent every year [34], so the efficiency improvements can lead to substantial savings for these companies. In this vein, there have been a number of papers on both optimized algorithms, e.g., [22, 37, 13, 40, 9], and systems, e.g., [33, 19, 32, 10, 38], all from the database community, and such findings can have an impact in the design of all of these algorithms and systems.

Unfortunately, due to the proprietary nature of crowdsourcing marketplace data, it is hard for academics to perform such analyses and identify pain points and solutions. Fortunately for us, one of the more forward thinking crowdsourcing marketplaces made a substantial portion of its data from 2012 to date available to us: this includes data ranging from worker answers to specific questions and response times, all the way to the HTML that encodes the user interface for a specific question.

This data allows us to answer some of the most important open questions in microtask crowdsourcing: what constitutes an "effective" task, how can we improve marketplaces, and how can we enhance workers' interactions. In this paper, using this data, we study the following key questions:
• Marketplace dynamics: helping marketplace administrators understand the interaction between tasks and workers, and the corresponding marketplace load; e.g., questions like: (a) how much does the load on the marketplace vary over time, and is there a mismatch between the number of workers and the number of tasks available, (b) what is the typical frequency and distribution of tasks that are repeatedly issued, (c) what types of tasks are requesters most interested in, and what sorts of data do these tasks operate on?
• Task design: helping requesters understand what constitutes an effective task, and how to go about designing one; e.g., questions like: what factors impact (a) the accuracy of the responses; (b) the time taken for the task to be completed; or (c) the time taken for the task to be picked up? Do examples and images help? Does the length or complexity of the task hurt?
• Worker behavior: understanding worker attention spans, lifetimes, and general behavior; e.g., questions like (a) where do workers come from, (b) do workers from different sources show different characteristics, such as accuracies and response times, (c) how engaged are the workers within the marketplace, and relative to each other, and (d) how do their workloads vary?

The only paper that has performed an extensive analysis of crowdsourcing marketplace data is the recent paper by Difallah et al. [14]. This paper analyzed the data obtained via crawling a public crowdsourcing marketplace (in this case Mechanical Turk). Unfortunately, this publicly visible data provides a restricted view of how the marketplace is functioning, since the worker responses, demographics and characteristics of the workers, and the speed at which these responses are provided are all unavailable. As a result, unlike the present paper, that paper only considers a restricted aspect of crowdsourcing marketplaces, specifically, the price dynamics of the marketplace (indeed, their title reflects this as well): for instance, demand and supply modeling, developing models for predicting throughput, and analyzing the topics and countries preferred by requesters. That paper did not analyze the full "trifecta" of participants that constitute a crowdsourcing marketplace. Even for marketplace dynamics, to fully distinguish the results of the present paper from that paper, we exclude any experiments or analyses that overlap with the experiments performed in that paper. We describe this and other related work in Section 6.

This experiments and analysis paper is organized as follows:
• Dataset description and enrichment. In Section 2, we describe what our dataset consists of, and the high-level goals of our analysis. In Sections 2.1 through 2.3 we provide more details about the marketplace mechanics, the scale and timespan of the dataset, and the attributes provided. We also enrich the dataset by manually labeling tasks ourselves on various features of interest, described in Section 2.4, e.g., what type of data does the task operate on, what sort of input mechanism does the task use to get opinions from workers.
• Marketplace insights. In Section 3, we address questions on (a) the marketplace load, namely task arrivals (Section 3.1), worker availability (Section 3.2), and task distribution (Section 3.3), with the aim of helping improve marketplace design, and (b) the types of tasks, goals, human operators and data types, and correlations between them (Section 3.4), with the aim of characterizing the spectrum of crowd work.
• Task design improvements. In Section 4, we (a) characterize and quantify metrics governing the "effectiveness" of tasks (Section 4.1), (b) identify features affecting task effectiveness and detail how they influence the different metrics (Sections 4.3 through 4.7), (c) perform a classification analysis in Section 4.9 wherein we look at the problem of predicting the various "effectiveness" metric values of a task based on simple features, and (d) provide final, summarized recommendations on how requesters can improve their tasks' designs to optimize for these metrics (Section 4.8).
• Worker understanding. In Section 5, we analyze and provide insights into the worker behavior. We compare characteristics of different worker demographics and sources, provided by different crowdsourcing marketplaces; as we will find, the specific marketplace whose data we work with solicits workers from many sources (Section 5.1). We also provide insights into worker involvement and task loads taken on by workers (Section 5.2), and characterize and analyze worker engagement (Section 5.3).

2. DATASET DETAILS AND GOALS

We now introduce some terms that will help us operate in a marketplace-agnostic manner. The unit of work undertaken by a single worker is defined to be a task. A task is typically listed in its entirety on one webpage, and may contain multiple short questions. For example, a task may involve flagging whether each image in a set of ten images is inappropriate; so this task contains ten questions. Each question in a task operates on an item; in our example, this item is an image. These tasks are issued by requesters. Often, requesters issue multiple tasks in parallel so that they can be attempted by different workers at once. We call this set of tasks a batch. Requesters often use multiple batches to issue identical units of work: for example, a requester may issue a batch of 100 "image flagging" tasks one day, operating on a set of items, and then another batch of 500 "image flagging" tasks after a week, on a different set of items. We overload the term task to also refer to these identical units of work issued across time and batches, independent of the individual items being operated upon, in addition to a single instance of work. The usage of the term task will be clear from the context; if it is not clear, we will refer to the latter as a task instance.

2.1 Operational Details

Due to confidentiality and intellectual property reasons, we are required to preserve the anonymity of the commercial crowdsourcing marketplace we operate on, who have nevertheless been generous enough to provide access to their data for research purposes. To offset the lack of transparency due to the anonymity, we discuss some of the crucial operational aspects of the marketplace that will allow us to understand how the marketplace functions, and generalize from these insights to other similar marketplaces.

The marketplace we operate on acts as an aggregator or an intermediary for many different sources of crowd labor. For example, this marketplace uses Mechanical Turk [3], Clixsense [1], and NeoDev [4], all as sources of workers, as well as an internal worker pool. For task assignment, i.e., assigning tasks to workers, the marketplace makes use of both push and pull mechanisms. The typical setting is via pull, where the workers can choose and complete tasks that interest them. In some sources of workers that we will discuss later on, tasks are pushed to workers by the marketplace. For example, Clixsense injects paid surveys into webpages so that individuals browsing are attracted to and work on specific tasks. In either case, the marketplace allows requesters to specify various parameters, such as a minimum accuracy for workers who are allowed to work on the given tasks, any geographic constraints, any constraints on the sources of crowd labor, the minimum amount of time that a worker must spend on the task, the maximum number of tasks in a batch a given worker can attempt, and an answer distribution threshold (i.e., the threshold of skew on the answers provided by the workers below which a worker is no longer allowed to work on tasks from the requester). The marketplace monitors these parameters and prevents workers from working on specific tasks if they no longer meet the desired criteria. The marketplace also provides optional templates for HITs for some common standard tasks, such as Sentiment Analysis, Search Relevance, and Content Moderation, as well as for tools such as one for image annotation. Usage of these standard templates leads to some uniformity in interfaces, and also gives potential for the improvement of task design simultaneously across requesters.

This marketplace categorizes certain workers as "skilled" contributors, who are given access to more advanced tasks and higher payments, and are also sometimes responsible for meta-tasks such as generating test questions, flagging broken tasks, and checking work by other contributors. Our research highlights that having a pool of engaged and active workers is just as important as, if not more important than, having access to a large workforce. It might be interesting to explore an incentive program for the "active" or "experienced" workers as well.

This marketplace also provides a number of additional features. Notable among them is its module for machine learning and AI. This allows requesters without any machine learning background to generate and evaluate models on training data with easy computation and visualizations of metrics such as accuracy and the confusion matrix.

2.2 Origin of the Dataset

Our dataset consists of tasks issued on the marketplace from 2012–2016. Unfortunately, we do not have access to all data about all tasks. There are about 58,000 batches in total, of which we have access to complete data for a sample of about 12,000 batches, and minimal data about the remaining, consisting only of the title of the task and the creation date. Almost 51,000, or 88%, of the 58,000 batches have some representatives in our 12,000 batch sample; thus, the sample is missing about 10% of the tasks. (That is, there are identical tasks in the 12,000 batch sample.) From the task perspective, there are about 6600 distinct tasks in total, spread across 58,000 batches, of which our sample contains 5000, i.e., 76% of all distinct tasks. Thus, while not complete, our sample is a significant and representative portion of the entire dataset of tasks.

We will largely operate on this 12,000 batch sample, consisting of 27M task instances, a substantial number. Figure 1 compares the number of distinct tasks sampled versus the total number of distinct tasks that were issued to the marketplace across different weeks. We observe that in general we have a significant fraction of tasks from each week.

[Figure 1: Number of tasks sampled (by week). Panels: (a) Entire duration, (b) Pre Jan 2015, (c) Post Jan 2015. x-axis: Date (by week); y-axis: Number of distinct tasks; series: all, sampled.]

2.3 Dataset Attributes

The dataset is provided to us at the batch level. For each batch in our sample, we have metadata containing a short textual description of the batch (typically one sentence), as well as the source HTML code to one sample task instance in the batch. In addition, the marketplace also provides a comprehensive set of metadata for each task instance within the batch, containing
• Worker attributes such as worker ID, location (country, region, city, IP), and source (recall that this marketplace recruits workers from different sources);
• Item attributes such as item ID; and
• Task instance attributes such as task ID, start time, end time, trust score, and worker response.

As we can see in this list, the marketplace assigns workers a trust score for every task instance that they work on. This trust score reflects the accuracy of these workers on test tasks, the answers to whose questions are known. The marketplace administers these test tasks before workers begin working on "real" tasks. Unfortunately, we were not provided these test tasks, and only have the trust score as a proxy for the true accuracy of the worker for that specific type of task. In addition to the trust score, we also have information about worker identities, and other attributes of the items being operated on, and the start and end times for each task instance.

At the same time there are some important attributes that are not visible to us from this dataset. For instance, we do not have requester IDs, but we can use the sample task HTML to infer whether two separate batches have the same type of task, and therefore were probably issued by a single requester. Nor do we have "ground truth" answers for questions in the tasks performed by workers. However, as we will describe subsequently, we find other proxies to be able to estimate the accuracies of workers or tasks. Finally, we do not have data regarding the payments associated with different tasks and batches.

2.4 How did we enrich the data?

The raw data available for each batch, as described above, is by itself quite useful in exploring high-level marketplace statistics such as the number of tasks and workers over time, the geographic distribution of workers, typical task durations, and worker lifetimes and attention spans. That said, this raw data is insufficient to address many of the important issues we wish to study. For example, we cannot automatically identify whether a task operates on web data or on images, and whether or not it contains examples, or free-form text response boxes. To augment this data even further, we enrich the dataset by inferring or collecting additional data. We generate three additional types of task attribute data:
• Manual labels: we manually annotate each batch on the basis of its task goal, e.g., entity resolution, sentiment analysis; its operator type, e.g., rating, sorting, labeling; and the data type in the task interface, e.g., text, image, social media. These are discussed further in Section 3.4.
• Design parameters: we extract and store features from the sample HTML source as well as other raw attributes of the tasks that reflect design decisions made by the requesters. For example, we check whether a task contains instructions, examples, text-boxes and images; we discuss these further in Section 4.
• Performance metrics: we compute and store different metrics to characterize the latency, cost and error of tasks to help us perform quantitative analyses on the "effectiveness" of a task's design, discussed further in Section 4.1.

2.5 What are the Goals of Our Analysis?

As previously mentioned, our main goals (at a broad level) are to quantitatively address the questions of (1) Marketplace dynamics: helping marketplace administrators and owners understand the interaction between tasks and workers, and the corresponding marketplace load; (2) Task design: helping requesters understand what constitutes an effective task, and how to go about designing one; and (3) Worker behavior: understanding worker attention spans, lifetimes, and general behavior, for the improvement of the crowdsourcing ecosystem as a whole. While the first two goals directly impact requesters and marketplace administrators, we believe they will also help indirectly improve the general worker experience in terms of availability of desirable tasks, and a reduction in the laboriousness of performing tasks. We now discuss our goals in a little more detail, by breaking each of them down into sub-goals and describing the experiments we perform to answer them.

Marketplace dynamics. To understand the marketplace, the first goal of our analysis is to examine general statistics that help us estimate the scale of crowdsourcing operations within the marketplace. This first order analysis is useful for marketplace administrators, helping them estimate the resources required to manage this scale of operations, and identify key limitations. Thereafter, we look at the availability and flow of tasks or workers on the marketplace; specifically, we check if the marketplace witnesses sudden bursts or a steady stream of activity. This analysis gives us concrete take-aways that can help future marketplaces design better load-balancing strategies. Lastly, we analyze the manual labels assigned to tasks to look at the types of tasks that have become popular within the marketplace as a means towards a better characterization of the spectrum of crowd work as a whole.

Task Design. To help improve task design, we must first be able to characterize the effectiveness of tasks both qualitatively and quantitatively. The three well known aspects used to talk about the effectiveness of a crowdsourced task are (a) latency, (b) accuracy and (c) cost. Consequently, the first step in our analysis is to identify performance metrics that measure these aspects. The next step is to study the impact of varying different design parameters on each of these metrics. This analysis, when performed on a dataset as large as ours, allows us to make data-informed recommendations to requesters looking to design tasks that are answered accurately by workers with low latency and at low cost.

Worker analyses. A worker-centric view of the marketplace can help us in understanding the workers' experience and make changes appropriately to make it easier for them to work. In this respect, we first look at the various labor sources that provide workers that perform work for the marketplace. The load, resource and quality distribution across these sources can guide researchers in devising appropriate load-balancing strategies, and can point practitioners towards the ideal source(s) for crowd work. Next, we examine the geographical distribution of workers; this gives us vital information about the active time-zones of the workforce and can help marketplace administrators in ensuring a constant response rate on the marketplace. Finally, we also study the lifetimes and attention spans of the workers to figure out what fraction of the workforce is regularly active on the marketplace, and how much time is typically spent on the marketplace by workers on a single day.

3. MARKETPLACE ANALYSES

In this section, we aim to gain insights into the high level, aggregate workings of the marketplace. First, we examine some basic statistics of the marketplace, to understand the worker supply and task demand interactions. Specifically, we look at (a) task instance arrival distribution (Section 3.1), (b) worker availability (Section 3.2), and (c) marketplace load, or contribution from "heavy-hitter" tasks, occupying a bulk of the tasks in the marketplace. Then, in Section 3.4, we explore the types of tasks observed in our dataset, to better understand the questions and data types of interest for requesters. We also look for correlations across these labels to understand what types of tasks occur together. Finally, we look at trends in the complexity of tasks over time to gain additional insights into the evolution of the marketplace (Section 3.5).

3.1 Are tasks uniform or bursty over time?

We first study the rate at which task instances arrive into the marketplace, and the rate at which they are completed. Note that the load on the marketplace is governed by the number of task instances, which is the fundamental unit of work visible to workers, rather than the number of batches; batches can be arbitrarily small or large. We plot the number of task instances arriving and being completed each week in Figure 2a in blue. First, note that the task arrival plot is relatively sparse until Jan 2015, which is presumably when the marketplace started attracting more requesters. Second, after June 2014, there are some very prominent peaks, on top of regular activity each week. This suggests that while task instances arrive fairly regularly, there are periods of burstiness. Considering the period from Jan 2015 onwards, the median of the number of task instances issued in a day on the marketplace is about 30,000. In comparison, on its busiest day, more than 900,000 task instances were issued, a 30× increase over normal levels. Similarly, the number of task instances issued on the lightest day is only 0.0004× the median. This raises the question: where does the high variation in the number of task instances derive from? Does the number of batches of tasks issued fluctuate a lot, or does the number of distinct tasks issued itself fluctuate a lot? For this analysis, we overlay the number of task instances issued on the marketplace with the number of batches and the number of distinct tasks issued for the period post January 2015 in Figure 2b. For both these measures, we find that the fluctuation is similar to the fluctuation in the number of task arrivals, indicating that both factors contribute to the high variation in the market load.

[Figure 2: Task Arrivals by week. Panels: (a) Task instance arrival vs Median pickup times, (b) Task instance arrival (post Jan 2015) vs (1) batch arrival, (2) distinct task arrival.]

Note that a common explanation for why crowdsourcing is used in companies is the ability to shrink or grow labor pools on demand [34]; this finding seems to suggest that even marketplaces need to be able to shrink and grow labor pools based on demand. For this marketplace, having access to both push and pull mechanisms provides great flexibility. Not only can they route the harder tasks to their more skilled, "on-demand" workers, they can also use this push mechanism to reduce latencies for requesters and clear backlogs of tasks. On the other hand, the presence of a large freelance workforce implies that for the majority of tasks they do not have to rely on the more expensive skilled workers, and can therefore be conservative in their use of resources in maintaining a pool of these internal "super-workers". This has huge implications for individuals who rely on crowdsourcing as a sole source of income: depending on the week, they may not have enough tasks that suit their interest or expertise.

Striking a good balance between the two task routing mechanisms and worker pools is crucial to ensuring that all three parties are satisfied: (1) the marketplace is able to clear pending tasks without building a backlog, while maintaining requisite levels of accuracy and cost, (2) requesters receive quality responses for easy and hard tasks, and do not see high latencies in responses, and (3) workers have access to as much work as they can handle, as well as tasks that can cater to their varied interests and expertise levels.

Besides the bursty nature of task instance arrivals across weeks, the marketplace also witnesses periods of low task arrivals on the weekends: the number of instances posted on a weekday is up to 2× the number of instances posted on Saturdays or Sundays on average. Further, the average number of instances posted at the start of the week is the highest, following which the number decreases over the week. We plot this in Figure 3.

[Figure 3: Distribution of tasks over days of the week. x-axis: day of the week (Mon to Sun); y-axis: # of tasks issued.]

Takeaway: Marketplaces witness wide variation in the number of tasks issued, with the daily number of issued instances varying between 0.0004× and 30× the median load (30,000 instances).

3.2 How does the availability and participation of workers vary?

Worker Availability. As described earlier, the specific marketplace we work with attracts workers from a collection of labor sources. In this manner, it is able to keep up with the spikes in demand. We investigate the sources the marketplace draws from in Section 5. In this section, we focus on studying the number of active workers across different weeks: Figure 4 depicts this statistic. Unlike Figure 2a, which had a huge variation in the number of issued task instances, especially after 2015, Figure 4 does not show this level of variation. Thus, somewhat surprisingly, even though there are huge changes in the number of available task instances, roughly the same number of workers are able to "service" a greater number of requests. This indicates that there is a limitation more in the supply of task instances than availability of workers.

[Figure 4: Number of workers performing tasks. x-axis: date (Jul'12 to Jul'16); y-axis: # workers.]

Takeaway: Despite the huge variation in the number of available tasks, roughly the same number of workers (with small variations) are able to service all of these tasks.

Worker Latencies, Idleness, and Task-Distribution. We now attempt to explain how roughly the same number of workers are able to accommodate the variation in the number of tasks on the platform. Our first observation comes from the median latency in task instances getting picked up by workers, noted as pickup time in Figure 2a and depicted in red: during periods of high load, the marketplace tends to move faster. We also zoom in to the high activity period after January 2015 in Figure 5a to further highlight this trend. One possible explanation for this observation is that when more task instances are available, a larger number of workers are attracted to the marketplace or recruited via a push-mechanism, leading to lower latencies. Another possibility (supported by our discussion below for Figure 5b) is that with a higher availability of tasks, workers are spending a lot more active time on the platform, and hence are more likely to pick up new tasks as soon as they are available.

Next, we look into how the workload is being distributed across the worker-pool. In Figure 5b, we plot the number of tasks completed by the top-10% (in red color) and the bottom-90% (in green color) of workers in each week and compare it to the total number of tasks issued. We observe that while the bottom-90% also take on a lot more tasks during periods of high load, it is the top-10% that handles most of the flux, and is consistently performing a lot more tasks than the remaining 90%. Similarly, examining the same plot for the average amount of active time spent by workers on the platform in Figure 5b also shows that the top-10% are indeed spending a lot more time on average per week to handle the varying task load as compared to the bottom-90%. This observation indicates that while having a large workforce certainly helps, it is crucial to focus on worker interest and engagement: attracting more "active" workers can allow marketplaces to handle fluctuating workloads better. We also examined the workload handled by workers from different labor sources to verify whether the majority of this variation is assigned to the marketplace's internal or external workers. We observed that the internal workers account for a very small fraction of tasks.

task arrival overlay with internal and external

[Figure 5: Task Arrivals by Week (Post Jan 2015). Panels: (a) Task Arrivals vs Median Pickup Time, (b) Engagement of top-10% and bot-90% workers: (1) # Tasks, (2) Active Time.]

3.3 What is the distribution of work across different distinct tasks?

Next, we wanted to study whether there are a small number of tasks that dominate the marketplace (e.g., repeatedly labeling different items, issued by a single requester). To study this, we first clustered the batches in our dataset based on metadata from the extracted HTML source corresponding to the tasks (see Section 2.4), and tuned the threshold of a match to ensure that the tasks that on inspection look very similar and have similar purposes are actually clustered together. We shall henceforth refer to these clusters of similar batches corresponding to a distinct task as simply clusters. We denote the number of batches in a cluster by cluster size. Then, in Figure 6, we plot the distribution of the number of clusters that have different cluster sizes (both on log scale). For example, there were 5 clusters with size larger than 100, indicating that there were 5 distinct tasks (each lumped into their own clusters) that were issued across at least 100 batches each. As can be seen in the chart, there seem to be a large number of tasks that are "one-off" with a small number (< 10) of batches: these tasks, being one-off, cannot benefit from much fine-tuning of the interface prior to issuing the task to the marketplace. On the other hand, there are a small number of "heavy hitters": more than 10 tasks had cluster sizes of over 100, indicating that these tasks had been issued across 100s of batches. Notice that even within a batch the number of tasks may be large: we study that in the next plot in Figure 7. We see a wide variation in the number of tasks issued: while 204 clusters have less than 10 tasks issued, 3 clusters have more than 1M tasks each. Furthermore, these "bulky" clusters have issued close to 80k tasks/batch, so even slight improvements in the design of these batches can lead to rich dividends for the requester. Across this chart, the median number of tasks per cluster is 400.

[Figure 6: # of batches in clusters (distribution of cluster sizes). x-axis: Cluster size; y-axis: # clusters; both on log scale.]

[Figure 7: # of tasks in clusters (distribution of tasks across clusters). x-axis: # of tasks in cluster; y-axis: # clusters.]

Next, we drill down into the top 10 tasks which had over 100 batches, the so-called "heavy hitters". In Figure 8 we plot the cumulative number of tasks issued over time, one line corresponding to each heavy hitter distinct task. As can be seen in the figure, these tasks show both uniform and bursty availabilities. As an example, the task corresponding to the purple line has only been active in July 2015, while the task corresponding to the cyan line has had batches issued regularly over 11 months from May 2015 to April 2016.

[Figure 8: Heavy Hitter Distribution. x-axis: date (Jan'15 to Jul'16); y-axis: cumulative number of tasks issued (log scale).]

Takeaway: A huge fraction of tasks and batches come from a few clusters, so fine-tuning towards those clusters can lead to rich dividends. The heavy hitter task types have a rapid increase to a steady stream of activity followed by a complete shutdown, after which that task type is never issued again.

3.4 What kinds of tasks do we see?

We now study our enriched task-labels from Section 2.4 in order to characterize the spectrum of crowd work in the marketplace. Such an analysis can be very useful, for example, to develop a workload of crowdsourcing, and to better understand the task types that are most important for further research.

Label Categories. We label each task under four broad categories¹. Tasks have one or more labels under each category. Our mechanism to label tasks is to first cluster batches together based on similarity of constituent tasks, and then we label one task corresponding to each cluster, since all tasks within each cluster have identical characteristics. The goal of our clustering is to capture the separation between distinct tasks, which is not known to us. As labeling is a labor-intensive process, we currently have labels available for about 10,000 out of the total 12,000 batches (≈83%) and 24 million out of the total 27 million task instances (≈89%). These batches fall into about 3,200 clusters.
• Task Goal: Here, we separate tasks based on their end goal. We find that most tasks can be characterized as having one (or more) of the following 7 goals²: (1) Entity Resolution (ER), for instance, identifying if two webpages talk about the same business, or if two social media profiles correspond to one single person, (2) Human Behavior (HB), including psychology studies, surveys and demographics, and identifying political leanings, (3) Search Relevance Estimation (SR), (4) Quality Assurance (QA), including spam identification, content moderation, and data cleaning, (5) Sentiment Analysis (SA), (6) Language Understanding (LU), including parsing, NLP, and extracting grammatical elements, and (7) Transcription (T), including captions for audio and video, and extracting structured information from images.
• Task Operator: In this category, we label tasks based on the human-operators, or underlying data processing building blocks used by requesters to achieve tasks' goals. We observe primarily 10 different operators: (1) Filter (Filt), i.e., separating items into different classes or answering boolean questions, (2) Rate (Rate), i.e., rating an item on an ordinal scale, (3) Sort (Sort), (4) Count, (5) Label or Tag (Tag), (6) Gather (Gat), i.e., provide information that isn't directly present in the data, for instance by searching the web, (7) Extract (Ext), i.e., convert implicit information already present in provided data into another form, such as extracting text within an image, (8) Generate (Gen), i.e., generate additional information by using inferences drawn from given data, using worker judgement and intelligence, such as writing captions or descriptions for images, (9) Localize (Loc), i.e., draw, mark, identify, or bound specific segments of given data and perform some action on individual segments, e.g., draw bounding boxes to identify humans in images, and (10) External Link (Exter), i.e., visit an external webpage and perform an action there, e.g., fill out a survey form, or play a game.
• Data Type: We also separate tasks based on the type of data that is used. The same goals and operators can be applied on multiple data types. All tasks contain a combination of the following 7 types of data: (1) Text, (2) Image, (3) Audio, (4) Video, (5) Maps, (6) Social Media, and (7) Webpage.

¹ Labeling was performed independently by two of the authors, following which the differences were resolved via discussion.
² Tasks that had uncommon or unclear goals and did not fall into one of these classes were automatically classified as Other or Unsure respectively. This holds for the other categories besides goals as well.

Label distribution. First, we analyze the distribution of labels in different categories across tasks. Figure 9a depicts the popular goals. We observe that goals based on complex unstructured data understanding, namely language understanding and transcription, are very common, comprising over 4 and 3 million tasks, that is around 17% and 13% respectively, despite not having seen extensive optimization research, as opposed to traditional, simpler goals like entity resolution and sentiment analysis that have been extensively analyzed. Figure 9b shows that text and image are still the main types of data available and analyzed: 9.6 million (40%) and 6.3 million (26%) tasks contained text and image data respectively. Audio and video data are also used, and other richer types of data like social media, web pages, and maps are gaining popularity. Figure 9c shows the common operators used. While the distribution of goals indicates that a significant fraction of tasks have complex goals, the underlying operators are still predominantly simple. The marketplace is dominated by the fundamental filter and rate operations: over 8 million (33%) tasks employ some filtering operator, and nearly 3 million (13% of) tasks make use of rating operators. Among more complex operators, we see that gathering, extraction, localization, and generation are frequently applied, together being used in around 5.3 million, i.e., 22% of all tasks.

Goals, operators and data types that occur frequently together. Next, we look at the correlations between the three types of labels for tasks. For example, one question we aim to answer is what kinds of operators are typically applied to different types of data, or used to achieve particular goals?

Takeaway: We observe that the marketplace exhibits a diverse range of tasks spanning across over 7 broad goals, at least 10 distinct operations and 7 data types.
• Text and image data are by far the most prevalent and utilized across most tasks. Web and social media data are also available and relevant to a small subset of tasks (in particular tasks involving data integration and cleaning for web, and natural language processing for social media data). While text and image data (and to a lesser extent, web data) have been heavily studied using several different operators, there are still many exciting avenues waiting to be explored for the other types of data.
• Filter and rate are used as basic operators for achieving most goals and analyzing all types of data. It is crucial to understand and optimize the usage of these operators.
• Language understanding and transcription seem to be very popular task goals constituting a large number of tasks. Considering the fact that these tasks require complex human operations (generation and extraction as opposed to the simple filter and rate operations), it might be worthwhile to train and maintain a specialized worker
Looking at such correlations across poolforsuchtasks. goals,operators,anddatatypesprovidesfine-grainedinsightsinto • For the popular goals of Language understanding, thestructureanddesignoftasksthatisnotimmediatefromourag- and transcription, we expect the heavy percentage of gregatestatisticsalone. Thechartsdepictingthecorrelationcan text-baseddata. Itisinterestingtoobservethehighper- befoundinFigure9and10. Forinstance,Figure10bshowsthe centageofofsocial mediaandimagedataforthesetasks breakdownforeachgoalbythepercentageusageofdifferentop- aswell. erators towards achieving that goal; Figure 9c serves as a legend forthestackedbars. (Figure11b,legendFigure9a,yieldssimilar 3.5 Trendtowardsopen-endedtasks. insights,butfromaslightlydifferentperspective.)Weobservethat filter and rate operators are used in most kinds of tasks, as well Inthissection,ouraimistoexplorethetrendinthecomplexity asformasignificantmajorityastheconstituentbuildingblockfor ofcrowdsourcedtasksovertime. Thatis,weintendtoanswerthe mostgoals. Onenotableexceptionistranscription(which,recall, followingquestions: constitutesover13%ofalltasksbyitself, makingitasignificant • Arerequestersmovingontomorecomplex,oropen-endedgoals? exception), where the primary operation employed is extraction. • Aretheylookingatmorechallengingdatasets? As another example, Figure 10a shows that text and images are • Aretheyusingmoresophisticatedtoolsorutilizinghumanin- important for all types of task goals, for certain types, e.g., ER, telligencemoreeffectivelythaninthepast? SA, SR, social mediais also quite important. Lastly, Figure 10c We split each of our categories, goals, operators, and data into showsthatbeyondfilteringandratingbeingimportant,extraction twoclasses:simpleandcomplex.Amongthesetofobservedgoals, isusedquiteprominentlyontextandimagedata,oftenrivalingthat weclassify{entity resolution,sentiment analysis,quality assur- offiltering. 
Forlanguage understandingtasks,whilefilterand ance}assimple, andtheremaining7ascomplex. Foroperators, rate are the primary operations, generate is also used frequently weclassifyfilterandrateassimpleandtheremaining8ascom- (16% of the time). Also, for tasks looking to understand human plex. Fordatatypes,weonlyconsidertextassimple,andthere- behavior, 13% of the tasks involve performing operations at ex- maining6typesascomplex.Whilethisclassificationissubjective, ternal links(suchasonlinesurveys),and9%ofthetasksinvolve ourhigh-levelobservationsapplytomostreasonablemappingsof localization.AsFigure11c(legendFigure9b)indicates,filterand labelsto{simple,complex}. rateoperatorshavebeenusedtoanalyzemosttypesofdataaswell. InFigure12, wecomparethetrendbetweennumberofsimple Figure10ashowsthebreakdownforeachgoalbythepercent- tasks and the number of complex tasks on the marketplace over ageofdifferentdatatypespresentintaskshavingthatgoal. Fig- time. Onthex-axis, weplottimeinincrementsofoneweek. On ure9bservesasalegendforthestackedbars. (Figure11a,legend they-axisweplotthecumulativenumberofclustersoftasks,that Figure9a,yieldssimilarinsights,butfromaslightlydifferentper- is the number of unique tasks, issued so far – one line each for spective.) While for most goals, a large fraction of data used in simple, versus complex tasks. Note that we deduplicate similar tasks seems to be text and image based, we observe that for en- batchesandcountthemasasinglepoint,sotheseplotsrepresent tityresolutionandsearchrelevance,webdataisrelevant(serving theinterestsofallrequestersequally. FromFigures12aand 12c, 24%and37%ofentity resolutionandsearch relevancetasksre- weobservethatthenumberofclustersoftasksinvolvingcomplex spectively). Also,sentiment analysisandlanguage understand- goalsandnon-textualdatatypesismuchlarger,andgrowingfaster, ingstyleofanalysesemploysocialmediaasasignificantfraction thanthecorrespondingnumbersofsimpleclusters.Forinstance,as oftheirinputdata(13%and8%respectively). 
Whilesomeefforts ofJanuary2016, therehadbeenaround(a)510clustersinvolving arebeingmadetowardsanalyzingothertypesofdata(besidestext non-textualversusabout240clustersinvolvingtextdata, and(b) andimage),theyarestilllargelyunexplored. 620clusterswithcomplexgoals,andjust80clusterswithsimple goals.Bycontrast,Figure12bdemonstratesthattheusageofcom- plex operators is comparable to that of simple operators. Specif- ically, we observe a total of around 410 clusters using complex 7 4.5M 10.0M 9.0M 4.0M 9.0M 8.0M 3.5M 8.0M 7.0M 3.0M 7.0M 6.0M 2.5M 6.0M 5.0M 5.0M 2.0M 4.0M 4.0M 1.5M 3.0M 3.0M 1.0M 2.0M 2.0M 500.0k 1.0M 1.0M 0.0 0.0 0.0 ER HB SR QA SA LU T Social Web Image Map Video Text Audio ExterRateSort Gat TagCountFilt Ext Loc Gen (a)PopularTaskGoals (b)Populardatatypes (c)Popularoperators Figure9:Distributionofgoals,datatypes,andoperators 120.0 120.0 120.0 100.0 100.0 100.0 80.0 80.0 80.0 60.0 60.0 60.0 40.0 40.0 40.0 20.0 20.0 20.0 0.0 0.0 0.0 ER HB SR QA SA LU T ER HB SR QA SA LU T SocialWebImageMapVideoTextAudio (a)Datausedfordifferenttaskgoals (b)PopularOperatorsfordifferenttaskgoals (c)PopularOperatorsfordifferentdata Figure10:Correlationsacrossdataandgoal,operatorandgoal,andoperatoranddata 120.0 120.0 120.0 100.0 100.0 100.0 80.0 80.0 80.0 60.0 60.0 60.0 40.0 40.0 40.0 20.0 20.0 20.0 0.0 0.0 0.0 SocialWebImageMapVideoTextAudio ExterRateSortGatTagCountFilt ExtLocGen ExterRateSortGatTagCountFilt ExtLocGen (a)Populartaskgoalsfordifferentdata (b)Populartaskgoalsfordifferentoperators (c)Populardatafordifferentoperators Figure11:Correlationsacrossgoalanddata,goalandoperator,anddataandoperator 800 500 700 700 simple complex 450 simple complex 600 simple complex 400 600 350 500 Count 345000000 Count 223050000 Count 340000 200 150 200 100 100 50 100 0 0 0 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Jul’12 Jan’13 Jul’13 Jan’14 Jul’14 Jan’15 Jul’15 Jan’16 Jul’16 Date Date Date (a)Goals (b)Operator 
(c)Data Figure12:Simplevscomplextasksovertime 8 operators and 340 clusters involving simple operators issued cu- tasks,spreadoutacrossalargenumberoffeatures(suchasthose mulativelyasofJanuary2016. discussedinthesectionstofollow)andlabels(goals,operatorsand data types)—the remaining data is too sparse for any statistically Takeaway:Whilerequesters(andresearchers)aremoreinter- significantinferencestobemade. ested in achieving complex goals on complex data (and get- Forthesecondoptionofcomputingdisagreementonnon-textual tingmoresowithtime),thefundamentalhuman-operatorsof fields,wefaceaproblemwiththedistributionofdisagreementval- filter and rate, are by themselves still as widely used as all uesitself. First,forallthetasksthatonlyhavetextualresponses, otheroperatorscombined. Thisraisesboththeneedtoopti- itisnotpossibletodefineadisagreementscore;weareunableto mizetheexistingsimpleoperatorsasfaraspossible,aswellas computeadisagreementscoreinthismannerforover60%ofall theopportunityfortheexplorationandunderstandingofmore batches. Second,ignoringtextfieldsmisrepresentsthetruedistri- complexoperators. butionofdisagreementsfortheremainingdatasets. Itispossible thatwerepresenttaskswithhighdisagreementashavinglowdis- agreementsimplybecausetheyhaveasmallnumberofnon-textual 4. EFFECTIVETASKDESIGN fields. Inthissection,weaddressthequestionofeffectivetaskdesign. A third approach would be an edit-distance or partial scoring Specifically, we(1)characterizeandquantifywhatconstitutesan scheme;however,thisapproachisnotidealsinceinpracticecrowd- “effective” task, (2) make data-driven recommendations on how sourcingrequestersrequirehighexactagreement,notpartialagree- requesters can design effective tasks, and (3) predict the “effec- ment,sothattheanswerscanbeeasilyaggregatedviaconventional tiveness”oftasksbasedonourhypotheses. majorityvotetypeschemes. Furthermore,manytaskswithtextual responsesareobjective. 
Somecommonexamplesthatweseein- 4.1 Metricsforeffectivetasks cludetranscription,captcha,imagelabeling,andretrievingURLs. Thestandardthreemetricsthatareusedtomeasurecrowdsourc- Forsuchtextualbutobjectivetasks,itisnotclearifanedit-distance ing effectiveness are: error, cost, and latency. There are various basedagreementschemeistherightapproach. waysthesethreemetricscouldbemeasured; wedescribeourno- Cost:MedianTaskTime.Atypicalmeasureforhowmucheffort tionsbelow,givenwhatwecancalculate. aworkerhasputintoataskinstanceistheamountoftimetakento Error:DisagreementScore.Inourdataset,wehaveeveryworker completeit.Sincewedonothaveinformationabouttheactualpay- answerprovidedforeachquestionwithineachtaskinstance,oper- mentsmadetoworkers,weusethemedianamountoftimetaken atingononedistinctitem,butnotthecorrespondinggroundtruth (inseconds)byworkerstocompletetasksinabatchasaproxyfor answers. We use these answers to quantify how “confusing” or thecostofthebatch. Thiscanbecalculatedfromthedatathatis ambiguousataskis,overall. Thewaywequantifythisistocon- available,giventhatwehavethestartandendtimesforeachtask sidertheworkeranswersforagivenquestiononagivenitem. If inabatch. Weshallsubsequentlydenotethe“MedianTaskTime” theworkersdisagreeonaspecificquestiononaspecificitem,then bytask-time. thetaskislikelytobeambiguous—indicatingthatitispoorlyde- Latency: MedianPickupTime. Tocharacterizelatency,weuse signed,orhardtoanswer—eitherway,thisinformationisimportant pickuptime,i.e.,howquicklytasksarepickedupbyworkers,on todictatethetaskdesign(e.g.,clarifyinstructions)andthelevelof average.Pickuptimeforabatchiscomputedasfollows:pickup-time redundancy (e.g., more redundancy for confusing questions) that =median(<starttimeoftask −batchstarttime>)(inseconds). 
i shouldbeadoptedbyrequesters.Ourproxyforerroristheaverage Here,weusethestarttimeoftheearliestbatch,i.e.starttimeoftask , 1 disagreementintheanswersforquestionsonthesameitem,across asaproxyforthebatchstarttime.Wejustifythischoiceforthela- allquestionsanditemsinabatch.Weconsiderallpairsofworkers tencymetricquantitativelyinbelow. . Ourreasonsaretwofold. who have operated on the same item, and check if their answers First,weobservethatthepickup-timeoftasksistypicallyorders are the same or different, giving a score of one if they disagree, ofmagnitudehigherthanthatoftask-time,whichmightotherwise andzeroiftheyagree;wethencomputetheaveragedisagreement seemlikeareasonableproxyforlatency. Thismeansthattheac- scoreofanitembyaveragingallthesescores;andlastly,wecom- tualturnaroundtimeforataskisdominatedbywhenworkersstart putetheaveragedisagreementscoreforabatchbyaveragingthe itsinstances,ratherthanhowlongtheytaketocompletethemonce scoresacrossitemsandquestions. Weshallhenceforthrefertothe started. Figures 13a and 13b support this claim. For each of the “DisagreementScore”asdisagreement. figures, we compare the pickup-time against the task-time, both Thereishowever,onesmallwrinkle.Someoperators,andcorre- onthey-axiswithvaryingend-to-end-timealongthex-axis. Fig- spondingworkerresponsesmayinvolvetextualinput. Twotextual ure 13a shows this distribution at a batch level, with the median responses may be unequal even if they are only slightly different values for pickup-time and task-time being plotted against each fromeachother.Sincetextualresponsesoccupyalargefractionof batch’s end-to-end-time. Figure 13b shows this distribution at a ourdataset,itisnotpossibletoignorethemaltogether. 
Weinstead task instance level, with each task’s individual pickup-time and adopt a simple rule: we prune away all tasks with disagreement task-timebeingplottedagainstitsend-to-end-time,whichinthis >0.5soastoeliminatetaskswithveryhighvariationsinworker caseissimply(pickup-time+task-time)(toreducethenumber responses. Thiseliminatesthesubjectivetextualtasks,whilestill ofpointsintheplot,weonlyplotthemedianofpickup-timeand retainingthetextualtasksthatareobjective. task-time corresponding to a vertical splice, that is, we plot one Anotherwaytohandlethesubjectivityoftasksistosimplyig- medianpointforallinstanceshavingacommonend-to-end-time. noretext-boxes.Thiscouldbedoneintwoways:(1)onlyevaluate Weobservethatinbothplots,thepickup-timeisordersofmagni- disagreement for tasks with no text-box fields, and (2) for every tudehigherthanthetask-time. Secondly,mostmeasuresoftime task, computedisagreementonlyonitsnontextualfields. Inour thatwecanobtainfromouravailabledatastronglydependonfea- experiments,wetriedboththeseoptions,butrejectedthemforrea- tures,suchasthesizeanddifficultyofatask. Sincepickup-time sonswediscussbelow. only looks at the time taken for workers to start a task and not Itturnsoutthatalargemajorityoftasksinourdatasetcontain howlongtheyspendonit(which,aswehaveseen,isanywayan atleastonetext-boxfield. Eliminatingallofthemleavesveryfew insignificantfractionoftime), itisrelativelyindependentofsuch 9 valuebetterthanm.Thus,ahighervalueispreferable;andwe comparethetwobins(orlines)inthisplot. Below,welookattheresultsforsomeofthesignificantcorrelations wefound. 4.3 NumberofHTMLwords (a)Batch-level (b)Task-level Weexaminehowthelengthoftask—definedasthenumberof wordsintheHTMLpage,anddenotedas#words—impactstheef- Figure13:Latency fectiveness of the task. We show the effect of length of task on features. Thishelpsseparateouttheinfluenceoffeaturesthatre- our metrics in Figure 14a. 
We observe that the line for clusters questers often cannot control, and that we cannot quantify, from withhigher#wordsintheirHTMLinterfacedominates,orisabove our latency metric, making our subsequent quantitative analyses thelinefortheclusterswithfewer#words. Weseethattheme- morestatisticallymeaningful. Inshort,weobservethatingeneral dianvalueofdisagreementfortaskswith#words≤466is0.147, thepickup-timeforbatchesisordersofmagnitudehigherthanthe while that for tasks with #words > 466 is 0.108. This may be task-time, indicating that the latency or total turnaround time of becauselongertaskstendtobemoredescriptive,andthedetailed a task is in fact dictated by the rate at which workers accept and instructions help reduce ambiguity in tasks, train workers better, startthetaskinstances. Wedenotethe“MedianPickupTime”by andtherebyreducemutualdisagreementinanswers. Wealsonote pickup-time. that the length of the task does not significantly affect either the pickup-time or task-time metrics.Thus, workers are neither dis- couragednorsloweddownbylongertextualdescriptionsoftasks. 4.2 CorrelationAnalysisMethodology WhileincreasingthenumberofwordsintheHTMLsourceof Inthenextsetofsubsections,weexaminesomeinfluentialfea- tasks helps reduce disagreement in general, this benefit may be tures or parameters that a requester can tune, to help improve a morepronouncedforparticulartypesoftasks. Intuitively,weex- task’serror(disagreement),cost(task-time)andlatency(pickup- pectdetailedinstructionstohelpmoreforhardertasks, andhave -time). For instance, features of a task include the length of the less impact on easier tasks. To test this hypothesis, we separate task, or the number of examples within it. For each feature, we tasksintobucketsbytheirlabels(recallgoal,operatoranddata), look at the correlation between the feature and each of the three andtesttheeffectofourfeature,#words.FromFigure25a,wesee metrics. 
Weperformaseriesof(correlation-investigating)experi- thatfor(relativelyhard)gathertasks,#wordshasapronouncedef- ments,eachofwhichcorrespondtoone{feature,metric}pair. All fectondisagreementwithhigher#wordsleadingtosignificantly ourexperimentsfollowthefollowingstructure: lowerdisagreement.Ontheotherhand,Figure25bseemstoindi- • Cluster: We first cluster batches based on the task in order catethatfor(relativelysimple)ratingtasks,#wordshasnosignif- to not have the “heavy-hitter” tasks that appear frequently in icantimpactondisagreement. multiple batches across the dataset to dominate and bias our Example. Todemonstratetheeffectofhavingmoredetailedde- findings.Sinceouranalysiswillalsoinvolvematching,orclus- scription,orhighernumberofwordsinatask’sHTMLinterfaceon teringtasksfurtherbasedonlabels,werestrictourfocustothe disagreement, wecomparetwoactualtaskswhicharebothfrom setofaround3,200labeledclusterscorrespondingto83%ofall the domain of Language Understanding, but differ in their de- batchesand89%ofalltaskinstances. Subsequently,foreach scriptivenessand#num-words.Welookattwodifferenttasksthat cluster,wetakethemedianofmetricvaluesacrossbatches,as requireworkerstofindurlsoremailIDsofbusinessesthroughba- wellasthemedianofthefeaturebeinginvestigated. sicwebsearches. Bothhaveextremelysimilarinterfaces,andask • Binning: Weseparatetheclustersintotwobinsbasedontheir similarquestions. Neitheremploysexamples(whichweshallsee featurevalue—allclusterswithfeaturevaluelowerthanthe later has a significant impact on disagreement). The main differ- globalmedianfeaturevaluegointoBin-1(say),whiletheones encebetweenthetwotasksisthatthefirst(having970instances), withfeaturevaluehigherthanthemediangointoBin-2.(Clus- hasmediannumberofwords=233,whilethesecond(having1254 terswithfeaturevalueexactlyequaltothemedianareallput instances)hasamedianof6072wordsinitsHTMLinterface.Fig- intoeitherBin-1orBin-2whilekeepingthebinsasbalanced ure15depictsthefirsttaskandFigure16depictsthesecond. 
The aspossible.) Foreachmetric,wethenexamineitsvaluedistri- firsttaskusestheseextrawordstogivedetailedinstructions(shown butioninthetwobins—inparticular,welookforsignificant inFigure16a)onhowtogoaboutthetask. Incontrast,thesecond differencesbetweentheaverage,median,ordistributionofmet- taskhasalmostnodescriptionatall. Itrequiresworkerstoenter ricvaluesinthetwobins. Asignificantdifferenceindicatesa the“synonymy”ofcorrectsentences,andtocorrectincorrectsen- correlationbetweenthefeaturewehavebinnedon,andthemet- tences,withoutgivinganyexamplesorinputforwhatthesetasks ricbeinglookedat. Wethenhypothesizeabouttheunderlying entail. Whilethefirsttaskhasamediandisagreementof0.26,the reason(s)behindthecorrelation. secondshowsamediandisagreementof0.08. Thisdemonstrates • Statisticalsignificance: Weperformat-testtocheckwhether thepowerofexamplesinreducingtaskambiguity. themetricvaluedistributioninourtwofeature-value-separated binsisstatisticallysignificant. Weuseathresholdp-valueof 0.01todeterminesignificance, thatis, weonlyrejectthenull Takeaway: Tasks with higher #words in their HTML sources hypothesis(thatbinshavesimilarmetricvalues)ifthep-value are typically the ones with more detailed instructions or ex- islessthan1%. amples.Weseethatthishastheeffectofdecreasingdisagree- • Visualization: Foreachfeature-metricpair,weplotacumula- mentamongstworkers,particularlyforcomplextasks. tivedistribution(CDF)plot,withthemetricvalueplottedalong the x-axis. Each of the two bins corresponds to one line in the plot. For x = m, the corresponding y value on each of 4.4 Presenceofinputtext-boxes thelinesrepresentstheprobabilitythatabatchwillhavemetric 10
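As a reference point for the analyses in this section, the disagreement metric defined in Section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input layout (a dictionary keyed by a hypothetical `(item_id, question_id)` pair) and the function names are invented for the example, and answers are compared by exact equality, as the text describes.

```python
from itertools import combinations
from statistics import mean


def item_disagreement(answers):
    """Fraction of worker pairs whose answers differ for one (item, question)."""
    pairs = list(combinations(answers, 2))
    if not pairs:  # fewer than two answers: no pair to compare
        return None
    # Score 1.0 for a disagreeing pair and 0.0 for an agreeing pair.
    return mean(1.0 if a != b else 0.0 for a, b in pairs)


def batch_disagreement(batch):
    """Average the per-(item, question) scores across a batch.

    `batch` maps a hypothetical (item_id, question_id) key to the list of
    worker answers collected for it (an assumed layout, not the dataset's).
    """
    scores = [s for s in map(item_disagreement, batch.values()) if s is not None]
    return mean(scores) if scores else None


# Hypothetical batch with two items and one question each.
batch = {
    ("item-1", "q1"): ["yes", "yes", "no"],   # 2 of 3 pairs disagree
    ("item-2", "q1"): ["cat", "cat", "cat"],  # full agreement
}
score = batch_disagreement(batch)
keep = score <= 0.5  # pruning rule from Section 4.1: drop if disagreement > 0.5
print(round(score, 3), keep)
```

Exact-match comparison is what makes slightly different textual responses count as disagreements, which is the wrinkle the 0.5 pruning threshold is meant to address.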
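The binning and visualization steps of the methodology in Section 4.2 can be illustrated with a short sketch. It assumes each cluster has already been reduced to a (feature value, metric value) pair of medians; the function names and the sample numbers are hypothetical, and the t-test step is deliberately left out because it would need a statistics library beyond the standard one.

```python
from statistics import median


def median_split(clusters):
    """Split (feature, metric) pairs into two bins around the global median feature."""
    cut = median(f for f, _ in clusters)
    low = [m for f, m in clusters if f < cut]    # Bin-1: feature below the median
    high = [m for f, m in clusters if f > cut]   # Bin-2: feature above the median
    # All ties on the median go into one bin, chosen to keep the bins balanced.
    ties = [m for f, m in clusters if f == cut]
    (low if len(low) <= len(high) else high).extend(ties)
    return low, high


def ecdf(values, x):
    """Empirical CDF: fraction of metric values that are <= x."""
    return sum(v <= x for v in values) / len(values)


# Hypothetical clusters: (#words, median disagreement).
clusters = [(120, 0.20), (300, 0.15), (466, 0.14), (900, 0.10), (6072, 0.08)]
low, high = median_split(clusters)
print(sorted(low), sorted(high))
print(ecdf(low, 0.12), ecdf(high, 0.12))
```

For a metric where lower is better, such as disagreement, the bin whose CDF value at a given x is higher has more clusters at or below that metric value, which is the comparison the CDF plots make visually.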

