A Novel Algorithm for Aggregating Crowdsourced Opinions Ethan Prihar Neil Heffernan WorcesterPolytechnicInstitute WorcesterPolytechnicInstitute 100InstituteRoad 100InstituteRoad Worcester,Massachusetts Worcester,Massachusetts [email protected] [email protected] ABSTRACT structionalinterventionsbycomparingastudent’sscoreson Similarcontenthastremendousutilityinclassroomandon- similar problems before and after the intervention. Similar line learning environments. For example, similar content instructional material can be used to offer students choices canbeusedtocombatcheating,trackstudents’learningover inwhichinstructionalmaterialtheyreceive,whichhasbeen time,andmodelstudents’latentknowledge. Thesedifferent showntoincreaseengagementandachievement[7]. Whileit use cases for similar content all rely on different notions of ispossibletoimplementthesemethodswithgeneralknowl- similarity,whichmakeitdifficulttodeterminecontents’sim- edge of content similarity, such as similarity in prerequisite ilarities. Crowdsourcing is an effective way to identify sim- knowledgeordifficulty,ifamoreinformeddefinitionofcon- ilar content in a variety of situations by providing workers tentsimilarityisused,thesuccessofthesemethodsislikely withguidelinesonhowtoidentifysimilarcontentforapar- to grow. ticularusecase. However,crowdsourcedopinionsarerarely homogeneousandthereforemustbeaggregatedintowhatis Although there is a lot of value in knowing what content is most likely the truth. This work presents the Dynamically similartoothercontent,whatcontentshouldbeconsidered Weighted Majority Vote method. A novel algorithm that similar is highly dependent on use case. This makes it a combines aggregating workers’ crowdsourced opinions with challenge for content creators to define the similarity in the estimating the reliability of each worker. This method was content, as they don’t necessarily know what their content compared to the traditional majority vote method in both will be used for. While some content is obviously similar, a simulation study and an empirical study, in which opin- for example, two mathematics problems that are identical ionsonseventhgrademathematicsproblems’similaritywere except for the numbers used in the problems, in other situ- crowdsourcedfrommiddleschoolmathteachersandcollege ations it is much more difficult, especially when content is students. Inboththesimulationandtheempiricalstudythe being aggregated from multiple sources that may not even DynamicallyWeightedMajorityVotemethodoutperformed usethesamemetricsforprerequisiteknowledgeordifficulty. the traditional majority vote method, suggesting that this method should be used instead of majority vote in future Crowdsourcingoffersawaytoderivewhichcontentissimilar crowdsourcing endeavors. to other content for specific use cases. Crowdsourced opin- ionsonsimilarcontentcanbegatheredeachtimeanewuse Keywords case for similar content arises. By informing the workers, whose opinions are being crowdsourced, of the specific use Crowdsourcing, Similarity, Community Detection, Hierar- case and requirements for similarity, the methods that rely chical Clustering on content being similar are more likely to be successful. However, crowdsourcing opinions on similar content poses 1. INTRODUCTION some challenges as well. Before an online learning platform Within online learning platforms and intelligent tutoring or intelligent tutoring system uses crowdsourced assertions systemsthereisatremendousopportunitytoutilizeknowl- of similarity, steps must be taken to assess the trustworthi- edgeofcontentsimilarity. Similarproblemscanhelpprevent ness of workers whose opinions are being crowdsourced and cheating during exams by randomly selecting from multi- ensure the truthfulness of the final assertions of similarity. ple similar problems when students receive the exam, mea- surestudents’learninggainsbyspreadingoutsimilarprob- In this work we present a novel algorithm that both mea- lems between assignments, and measure the effects of in- sures the reliability of the workers whose opinions are be- ing crowdsourced, and determines, from these individual’s opinions, what content is most likely to be similar to other content. Toevaluatethismethod,wefirstsimulatedawide Ethan Prihar and Neil Heffernan “A Novel Algorithm for Ag- range of conditions in which assertions of similarity were gregating Crowdsourced Opinions”. 2021. In: Proceedings made, and compared the performance of our algorithm to of The 14th International Conference on Educational Data Mining thetraditionalalternative. Wethenperformedacasestudy (EDM21). International Educational Data Mining Society, 547-552. where teachers and college students were told to identify https://educationaldatamining.org/edm2021/ middle-schoolmathematicsproblemsthatevaluatedasimi- EDM’21June29-July022021,Paris,France Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 547 lar skill set. The assertions of similarity collected from the all be represented as networks of interconnected items [3]. case study were used to identify groups of similar problems Findingsimilareducationalcontentcanbeframedasacom- and measure the reliability of each worker’s assertions. munity detection problem by representing educational con- tent as a network in which items are connected by topic, Ultimately, this work seeks to answer the following three difficulty, language, prerequisite knowledge, or, in the case research questions: of this work, opinions on similarity. Structuring the task of identifying similar educational content as a community de- tectionproblemallowsfortheuseofvariouswell-established 1. Can we exploit properties of community detection to community detection algorithms, such as hierarchical clus- more accurately form groups of content from crowd- tering. In hierarchical clustering, each item begins in it’s sourced opinions? own cluster. Then, clusters are merged based on the merge strategyanddistancebetweenclusters[5]. Hierarchicalclus- 2. Howdoestheresultingalgorithmperforminasimula- teringwasusedinboththesimulationandempiricalstudy. tion study compared to the more traditional method? 3. How does the resulting algorithm perform in a case 3. METHODOLOGY study using workers of various expertise to determine whichmathematicsproblemsaresimilartoeachother? 3.1 DynamicallyWeightedMajorityVote TheDynamicallyWeightedMajorityVote(DWMV)method 2. BACKGROUND isouralternativetothetraditionalmajorityvotemethodfor 2.1 EnsemblingCrowdsourcedOpinions combiningmultiplecrowdsourcedopinionsontaskswithbi- nary outputs. The DWMV method calculates the weighted Identifying the truth from crowdsourced opinions is not a majority opinion for each task, then determines the weight new problem. Most of the techniques employed to ensure ofeachworkerbyhowcloselytheiropinionagreedwiththe theaccuracyofcrowdsourcedopinionsrelyonensuringthat majority opinion. The closeness of a worker’s opinion to workers have sufficient knowledge of the subject matter. the majority opinion can be determined with any function Thiscanbedonethroughtestingworkersbeforegivingthem for comparing two vectors that results in a value greater tasks, tailoring tasks specific to their skill sets, recruiting than or equal to zero. For example, accuracy or Dice co- high quality workers, and educating workers before assign- efficient[2]. DWMV initializes all workers’ weights to be ing them tasks. This can also be done through encourage- equal at the beginning of the algorithm, and iteratively up- ment with extrinsic motivators like money, promotions, or dates these weights until the weighted majority vote does prizes, or intrinsic motivators like a sense of purpose, or by not change between iterations. Once the weighted majority gamifying the crowdsourcing tasks [1]. vote remains constant from one iteration to the next, the weights of the workers can be interpreted as a measure of While there are many methods to encourage individuals confidence in each worker, and the final weighted majority whose opinions are being crowdsourced to be accurate, this votecanbeuseddownstreaminthesamewaythetraditional workisfocusedonhowtovalidatethequalityofindividuals’ majority vote would have been used. Algorithm 1 formally opinions after their task is complete. Current methods for defines the DWMV algorithm. In Algorithm 1, the func- accomplishingthisplacetheburdenofvalidationbackonto tion s(x,y) determines the closeness of worker i’s opinion, theworkers. Havingworkersrankthequalityofotherwork- (B [A =1])t ,tothemajorityopinion,(u [A =1])t . ersassertionsisonemethodofvalidation. Anothercommon ij ij j=1 j ij j=1 ThealgorithmrequiresamatrixAofresponseindicators,in method for validation is to have multiple workers perform which a = 1 if worker i completed task j, and a = 0 the same task and merge the output of each worker, either ij ij otherwise,andamatrixB ofworker’sresponsestotasks,in as an average or as a majority vote [1]. whichb containsthebinaryresponseofworkeritotaskj. ij InAlgorithm1,vectorucontainsthefinalweightedmajority There are also more advanced ways of algorithmically val- voteforeachtask,andvectorccontainsthefinalmeasureof idating crowdsourced opinions. Item response theory and confidence for eachworker, based on the similarity between latentfactoranalysisbasedmodelshaveout-performedma- the weighted majority votes and the individual worker’s re- jority voting based validation methods on tasks related to sponses. identifyingfacialexpressionsandansweringquestionsabout geography [6, 10]. These models also determine the quality ofindividualswhoseopinionsarebeingcrowdsourced,which 3.2 SimulationStudy can be used to refine the pool of individuals usedfor future To determine if DWMV had a positive impact on forming crowdsourcing tasks [6, 10]. The novel algorithm in this groups from crowdsourced opinion, a simulation study was work also aggregates crowdsourced opinions while evaluat- performedtocomparetheDWMVmethodtothetraditional ing the quality of each worker. majority vote method in a variety of conditions. Figure 1 illustrates the simulation process. In the simulation study, 2.2 CommunityDetection hierarchical clustering was used to form groups from simu- The field of community detection is focused around deter- lated workers’ opinions of item similarity aggregated using mininggroupsofsimilaritemsfromanetworkofconnected both the majority vote method and the DWMV method. items. This has many applications throughout mathemat- Table1liststhedifferentinitialparametersandtheirvalues ics, physics, biology, computer science, and social sciences. used in the simulation. Five trials of every possible combi- Many things can represented as a network, for example, in- nationofthevaluesinTable1weresimulatedforatotalof terstellarobjects,neurons,citystreets,andsocialmediacan 37,500 simulation runs. 548 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) The simulation began by randomly placing i items into g Algorithm1 Dynamically Weighted Majority Vote groups, where i and g are initial parameters of the simula- Require: s(x,y) : function for the similarity of two vectors tion. Then the simulation crated ten workers. Each worker Require: w : number of workers had a false positive rate and a false negative rate. These Require: t : number of tasks values were calculated separately to make the simulation Require: A=(a ) : matrix of response indicators more true to real life. In real life, it is not often that a wt Require: B =(b ) : matrix of response values worker would have an equal chance of incorrectly asserting wt v←(0)t (cid:46) initialize with values different from u thattwoitemsareorarenotsimilar. Themorelikelycaseis j=1 u←(–1)t (cid:46) initialize with values different from v that some workers think there is more similarity and other j=1 c←(1)w (cid:46) start with equal confidence in all workers workersthinkthereislesssimilaritybetweenitemsthanthe i=1 while u(cid:54)=v do actual similarity of items. The false positive and false neg- v←u ative rates of the workers were sampled separately for each (cid:32)(cid:40)1, if (cid:80)wi=1(c(cid:12)B(cid:12)A)ij ≥ 1(cid:33)t workerfromauniformdistributionintherange[0,wfp]and u← (cid:80)wi=1(c(cid:12)A)ij 2 [0,wfn]respectively,wherewfp andwfn areinitialparame- 0, otherwise tersofthesimulation. Oncetheitemswererandomlyplaced j=1 (cid:32) (cid:16) (cid:17)(cid:33)w ingroups,andtheerrorratesoftheworkerswererandomly c← s (uj[Aij =1])tj=1,(Bij[Aij =1])tj=1 determined, a random p percent of all pairs of items were i=1 given to each worker, where p is an initial parameter of the end while simulation. Each worker then determined whether or not the items in each pair they received were similar to each other, taking into account their error rates. Onceallworkersassertedwhetherornoteachitempairthey were given contained similar items, the majority vote and DWMV for the similarity of each item pair was calculated. ThemajorityvotesandDWMVsofitemsimilaritywerethen used to form a network of item similarity, where each item is connected to every other item it was voted to be similar to. The majority vote network and DWMV network were both used to form groups through hierarchical clustering with Jaccard Index as the distance metric. Jaccard Index wasusedasthedistancemetricbecauseJaccardIndexdoes nottakeintoaccounttruenegatives[8]. Mostitemsarenot similar to each other, so a metric that takes into account truenegativeswouldbeover-inflatedandnotasinformative inthiscontext. Afterforminggroupsfromthemajorityvote and DWMV similarity networks, the difference in accuracy, precision, and recall between the groups formed from the majorityvoteandDWMVsimilaritynetworkswereusedto determine if the DWMV method improved upon tradition majority vote. 3.3 EmpiricalStudy: SimilarProblems In addition to a simulation, an empirical study was per- formedtocompareDWMVtomajorityvoteonarealcrowd- sorcingtask. Inthisstudy,middleschoolmathematicsteach- ers and college students were given 50 seventh grade math- Figure1: Aflowchartofthesimulationprocess,DWMVand ematics problems from the Engage New York1, Illustrative majorityvotewerecomparedtoeachotherthroughtheiruse Mathematics2,andUtahMiddleSchoolMathProject3 cur- incommunitydetectionthroughhierarchicalclustering. ricula. Eachworkerwastoldtoidentifyproblemsthateval- uatesimilarmathematicsskills. Theworkers’crowdsourced opinions of similarity were aggregated using both DWMV andmajorityvote,andthengroupedusinghierarchicalclus- Table1: SimulationParametersandSimulatedValues tering, with Jaccard Index as the distance metric with a Parameter Values thresholdof0.75. Theresultinggroupswerethencompared i 50, 100, 150, 200 to a ground truth, provided by ASSISTments, an online g 5, 10, 15, 20, 25 learning platform [4], in the form of Common Core State w 0.1, 0.2, 0.3, 0.4, 0.5 Standards Mathematics Skill Codes4, which each problem fp w 0.1, 0.2, 0.3, 0.4, 0.5 fn 1https://www.engageny.org/ p 20, 40, 60, 80, 100 2https://illustrativemathematics.org/ d 0.25, 0.5, 0.75 3http://utahmiddleschoolmath.org/ 4http://www.corestandards.org/ Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 549 was tagged with. These ground truth skill tags were de- termined by trained experts and the designers of the above stated curricula. The difference in accuracy, precision, and recall between groups formed with hierarchical clustering fromDWMVandmajorityvotewereagainusedtoevaluate the quality of the DWMV algorithm. 4. RESULTS 4.1 SimulationStudy TocomparetheDWMVmethodtothetraditionalmajority votemethod,thedifferenceinaccuracy,precision,andrecall asafunctionofw ,w ,i,g,andp,asdescribedinSection fp fn 3.2, were calculated. The first positive takeaway from the simulationisthatDWMVwasalmostalwaysmoreaccurate thanmajorityvote,regardlessofthesimulationparameters. Only when the simulation had more than twenty groups or themaximumfalsenegativerateofworkerswas20%orless did DWMV not reliably out perform majority vote, but it didnotsignificantlyunderperformeither. Atmost,DWMV was slightly less accurate than majority vote when workers hadverylowfalsenegativerates. Interestinglythisincrease in performance was not shared by both precision and re- Figure 2: The average and 95% confidence interval of the call. Whilerecallfollowedthetrendofaccuracyandshowed DWMVweightsofworkersasafunctionoftheworkers’false almost entirely positive improvements from using DWMV positiveandfalsenegativerates. over majority vote, precision did not. Another interesting finding is that all three performance opportunities for them to make a mistake compared to a metrics increased as both the maximum false negative rate worker with a large false negative rate. Additionally, the andfractionoflinksseenbyworkersincreased. Thisimplies large number of dissimilar problem pairs compared to the that as workers answer more problems, and become worse number of similar problem pairs caused workers with very at correctly identifying when items are similar, the benefit low false positive rates to have higher weights than work- of using DWMV over majority vote increases. ers with equally low false negative rates, because workers with low false positive rates, regardless of their false nega- Overall, t-tests [9] showed that using DWMV led to a sta- tiverates,hadmuchfeweropportunitiestomakeamistake. tistically reliable (p < 0.001) 0.18% increase in accuracy, These findings suggest that the distribution of correct re- a statistically reliable 1.78% (p < 0.001) increase in recall, sponsesincrowdsourcingtasksaffectswhichtypeofworker but no statistically reliable (p = 0.28) change in precision. errorhasalargerimpactonworkers’weightsintheDWMV While small, these reliable improvements in accuracy and method. recall over the traditional majority vote method are an in- dication of the potential positive effects of transitioning to using DWMV instead of majority vote when aggregating 4.2 EmpiricalStudy: SimilarProblems crowdsourced opinion. Intotal,sixteachersandfourstudentscompletedthecrowd- sourcing task of grouping 50 seventh grade mathematics Therewerealsosomeinterestingdifferencesinhowdifferent problems. Using each worker’s assertions of similarity, the typesoferroraffectedtheweightsofworkersasdetermined DWMVmethodandtraditional majority vote were used to by the DWMV method. Figure 2 shows the average and aggregate the opinions of the workers into a final network 95% confidence interval of the DWMV weights of workers of similarity, which was then used to create groups of simi- as a function of the workers’ false positive and false nega- larproblemsusinghierarchicalclustering. Thisistheexact tive rates. The false positive rate of the workers seems to same process that was used to form groups in the simu- decrease their weight in the final weighted majority vote of lation study. Figure 3 shows the progressive iterations of theDWMVmethodmuchmorequicklythantheirfalseneg- DWMV. Iteration 1 shows the unweighted average of each ativerate. Apotentialcauseofthisisthat,inthesimulated workersassertions. TheDWMVmethod’sprocessofiterat- groups of similar items, there were far more pairs of items ingbetweencalculatingaweightforeachworkerandcalcu- thatwerenotsimilartoeachotherthantherewerepairsof latingtheweightedmajorityvoteshiftedtheweightedaver- itemsthatweresimilar. Forexample,tohaveanequalnum- ageofworkersassertionstowardthegroundtruthsimilarity ber of items that are similar and not similar to each other, ofproblems. Thisconvergencewaspresentinthesimulated each item would have to be similar to half the items. The example in Section 3.1 as well. The benefit of the DWMV only way to facilitate that in the context of this simulation method over traditional majority vote lies in this ability to wouldbetohaveonlytwoequallysizedgroupsofitems. In converge towards ground truth. Figure 4 shows the weight the simulation there were always at least five groups, and of each worker as a function of their error rate. The cohort up to 25 groups of similar items, which caused most prob- ofmiddleschoolmathematicsteachersperformedmuchbet- lems to not being similar to each other. Therefore, when ter overall than the cohort of college students. The average a worker had a large false positive rate, there were more accuracyoftheteacherswasabout97%whiletheaverageac- 550 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) Figure3: ProgressiveiterationsofDWMVconvergingonem- Figure 4: DWMV’s confidence in each worker after the piricaldata. DWMVmethodconverged. curacyofthecollegestudentswasonlyabout81%. Basedon Josiah and Tillery have new jobs at YumYum’s theseweights,itisclearthattheDWMVmethodvaluedthe IceCreamParlor. JosiahisTillery’smanager. In opinions of middle school mathematics teachers more than theirfirstyear,Josiahwillbepaid$14perhour, theopinionsofcollegestudents,whichisexpectedgiventhe and Tillery will be paid $7 per hour. They have contextandtask. While,inthisscenario,itmighthavebeen been told that after every year with the com- easyforahumaninthelooptorecognizethattheteachers’ pany, they will each be given a raise of $2 per opinions should be valued more, it will not always be the hour. Is the relationship between Josiah’s pay casethatonegroupofworkersisclearlymorequalifiedthan and Tillery’s pay rate proportional? another group, and thus the DWMV method can help elu- cidate which workers are the most reliable. To make a punch, Anna adds 8 ounces of apple Table 2 shows the difference in accuracy, precision, and re- juice for every 4 ounces of orange juice. If she call between groups formed through hierarchical clustering uses 32 ounces of apple juice, which proportion from the assertions of similarity aggregated using DWMV can she use to find the number of ounces of or- and traditional majority vote. Similar to the simulation re- ange juice x she should add to make the punch? sults, DWMVhadthelargestpositiveimpactonrecall, the second largest positive impact on accuracy, but no impact Arecentstudyclaimedthatinanygivenmonth, on precision. In this empirical study, both the traditional forevery5textmessagesaboysentorreceived, majorityvotemethodandtheDWMVmethodledtoperfect agirlsentorreceived7textmessages. Isthere- precision, meaning all problems that were placed in groups lationship between the number of text messages together were similar to each other. However, traditional sentorreceivedbyboysproportionaltothenum- majorityvoteledtoworserecallthanDWMV.Whentradi- ber of text messages sent or received by girls? tion majority vote was used, three of the 50 problems were notplacedinagroupwithanyotherproblems,whichiswhy therecallwassolow. However,whenDWMVwasused,only Although all these problems are related to ratios and pro- one problem was not placed in a group of similar problems. portions, the other problems in the group with the outlier Thisoutlierproblem,thatneithertraditionalmajorityvote problem are longer word problems that do not explicitly norDWMVwasabletocorrectlyidentifyassimilartoother use percentages. The teachers and students whose opin- problems in its group, had the following text: ions were crowdsourced could have missed the connection due to the different wording in the problems, or they could believe that calculating percentages is a different skill than 22% of 65 is 14.3. What is 22.6% of 65? Round calculating proportions from word problems. Based on the your answer to the nearest hundredths (second) differencesbetweenthissingleoutlierproblemandtheother decimal place. problemsinitsgroup,itispossiblethattheoutlierproblem was consciously excluded from its group and not simply an Below are examples of problems in the same group as this oversight. problem,whichwereallcorrectlyidentifiedassimilartoeach other. The impact of using DWMV was larger in this empirical Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 551 6. ACKNOWLEDGMENTS Table 2: A comparison of majority vote to DWMV used to WewouldliketothankmultipleNSFgrants(e.g.,1917808, formgroupsofsimilarproblemsfromcrowdsourcedassertions 1931523,1940236,1917713,1903304,1822830,1759229,172- ofsimilarity. 4889, 1636782, 1535428, 1440753, 1316736, 1252297, 11094- Metric MajorityVote DWMV %Increase 83, & DRL-1031398), as well as the US Department of Ed- Accuracy 0.987 0.997 1.054 ucation for three different funding lines; a) the Institute for Precision 1.000 1.000 0.000 Education Sciences (e.g., IES R305A170137, R305A170243, Recall 0.903 0.977 8.228 R305A180401,R305A120125,R305A180401,&R305C1000- 24), b) the Graduate Assistance in Areas of National Need program (e.g., P200A180088 & P200A150306), and c) the studythanitwasinthesimulation. Inthesimulationthere EIR. We also thank the Office of Naval Research (N00014- was a larger than average improvement in accuracy and 18-1-2768),SchmidtFutures,andananonymousphilanthro- recall when the workers had very low false positive rates. pic foundation. Given that in this empirical study both sets of groups of similar problems had perfect precision, it is likely that the 7. REFERENCES workersinthisstudyhadverylowfalsepositiverates,which [1] F.Daniel,P.Kucherbaev,C.Cappiello,B.Benatallah, likelycontributedtowhythepositiveimpactofusingDWMV and M. Allahbakhsh. Quality control in instead of majority vote was larger in this empirical study crowdsourcing: A survey of quality attributes, than in the simulation as a whole. The results of this em- assessment techniques, and assurance actions. ACM piricalstudysuggestthatnotonlycanDWMVout-perform Computing Surveys (CSUR), 51(1):1–40, 2018. traditional majority vote in simulations, but can also im- [2] L. R. Dice. Measures of the amount of ecologic prove the recall and accuracy of groups of similar problems association between species. Ecology, 26(3):297–302, formedfromcrowdsourcedopinionsoncontentsimilarityin 1945. real-life scenarios as well. [3] S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010. [4] N. T. Heffernan and C. L. Heffernan. The assistments 5. CONCLUSION ecosystem: Building a platform that brings scientists Withinonlinelearningplatformsandintelligenttutors,there and teachers together for minimally invasive research is tremendous utility to knowing what content is similar to on human learning and teaching. International other content within the platform, but each application of Journal of Artificial Intelligence in Education, similar content is likely to have different criteria for what 24(4):470–497, 2014. is considered similar. Crowdsourcing opinions on the sim- [5] S. C. Johnson. Hierarchical clustering schemes. ilarity of content is an accessible way for new applications Psychometrika, 32(3):241–254, 1967. to recognize similar content. However, crowdsourcing poses [6] P. Ruvolo, J. Whitehill, and J. R. Movellan. some difficulties, namely, how to identify reliable workers Exploiting commonality and interaction effects in andproperlyaggregateopinionsfrommultipleworkers. This crowdsourcing tasks using latent factor models. In workhasdemonstratedtheabilityoftheDynamicallyWeigh- Neural Information Processing Systems. Workshop on tedMajorityVotemethod,anovelalgorithmforaggregating Crowdsourcing: Theory, Algorithms and Applications. crowdsourced opinion while rating workers, to accomplish Citeseer, 2013. those goals. DWMV has been shown, in both a simula- [7] D. M. Stenhoff, B. J. Davey, B. Lignugaris, et al. The tion study and an empirical study, to lead to higher accu- effects of choice on assignment completion and percent racy and recall that the traditional majority vote method correct by a high school student with a learning on crowdsourcing tasks related to identifying similar con- disability. Education and treatment of Children, tent. In the simulation study, using DWMV before identi- 31(2):203–211, 2008. fyinggroupsofsimilaritemsthroughhierarchicalclustering [8] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction resultedinastatisticallysignificant0.18%increaseinaccu- to data mining. Pearson Education India, 2016. racyanda1.78%increaseinrecalloverusingmajorityvote. [9] B. L. Welch. The generalization ofstudent’s’ problem The simulation study also revealed how the distribution of when several different population variances are correctresponsesinthecrowdsourcingtaskseffectshowthe involved. Biometrika, 34(1/2):28–35, 1947. falsepositiveandfalsenegativeratesofworkerseffectstheir weight in the DWMV method. In the empirical study, us- [10] J. Whitehill, T.-f. Wu, J. Bergsma, J. Movellan, and ing DWMV before identifying groups of similar problems P. Ruvolo. Whose vote should count more: Optimal through hierarchical clustering resulted in about a 1% in- integration of labels from labelers of unknown crease in accuracy and an 8% increase in recall over using expertise. Advances in neural information processing majority vote, and provided perspective on the differences systems, 22:2035–2043, 2009. in accuracy between the expert middle school math teach- ers and the novice college students. Moving forward, when facedwiththeneedtoaggregatecrowdsourcedopinions,the learningsciencecommunitycanlooktotheDWMVmethod as an alternative to the traditional majority vote method. The DWMV method is a promising tool for increasing the reliability of crowdsourced opinion and, when paired with hierarchicalclustering,identifyinggroupsofsimilarcontent. 552 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)