ebook img

Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data PDF

8.4 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data

IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 1 Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data Jianbo Ye, Jia Li, Michelle G. Newman, Reginald B. Adams, Jr. and James Z. Wang (cid:70) 7Abstract—We proposed a probabilistic approach to joint modeling of is the availability of large-scale data collected from human 1participants’ reliability and humans’ regularity in crowdsourced affective subjects, remedying the high complexity and heterogeneity 0studies. Reliability measures how likely a subject will respond to a ques- that the real-world has to offer. With the growing attention 2tion seriously; and regularity measures how often a human will agree on affective computing (initiated from the seminal discussion with other seriously-entered responses coming from a targeted popula- n [3] to recent communications [4]), multiple data-driven ap- tion. Crowdsourcing-based studies or experiments, which rely on human a proaches have been developed to understand what particular self-reported affect, pose additional challenges as compared with typical J environmental factors drive the feelings of humans [5, 6], and crowdsourcingstudiesthatattempttoacquireconcretenon-affectivelabelsof 6objects.Thereliabilityofparticipantshasbeenmassivelypursuedfortypical how those effects differ among various sociological structures non-affectivecrowdsourcing studies,whereas the regularityof humansin andbetweenhumangroups. ] Lanaffectiveexperimentinitsownrighthasnotbeenthoroughlyconsidered. One crucial hurdle for those affective computing ap- MIthasbeenoftenobservedthatdifferentindividualsexhibitdifferentfeelings proaches is the lack of full-spectrum annotated stimuli data on the same test question, which does not have a sole correct response at a large scale. To address this bottleneck, crowdsourcing- .in the first place. High reliability of responses from one individual thus t basedapproachesarehighlyhelpfulforcollectinguncontrolled acannot conclusively result in high consensus across individuals. Instead, humandatafromanonymousparticipants[7].Inarecentstudy tgloballytestingconsensusofapopulationisofinteresttoinvestigators.Built s reported in [8], anonymous subjects from the Internet were [upontheagreementmultigraphamongtasksandworkers,ourprobabilistic modeldifferentiatessubjectregularityfrompopulationreliability.Wedemon- recruited to annotate a set of visual stimuli (images): at each 2stratethemethod’seffectivenessforin-depthrobustanalysisoflarge-scale time point, after being presented with an image stimulus, vcrowdsourcedaffectivedata,includingemotionandaestheticassessments participants were asked to assess their personal psychologi- 6collectedbypresentingvisualstimulitohumansubjects. cal experiences using ordinal scales for each of the affective 9 dimensions: valence, arousal, dominance and likeness (which 0Index Terms—Emotions, human subjects, crowdsourcing, probabilistic 1graphicalmodel,visualstimuli means the degree of appreciation in our context). This study 0 also collected demographics data to analyze individual dif- . ference predictors of affective responses. Because labeling a 11 INTRODUCTION large number of visual stimuli can become tedious, even with 0 7Humans’ sensitivity to affective stimuli intrinsically varies crowdsourcing, each image stimulus was examined by only a 1from one person to another. Differences in gender, age, soci- few subjects. This study allowed tens of thousands of images :ety,culture,personality,socialstatus,andpersonalexperience to obtain at least one label from a participant, which created v ican contribute to its high variability between people. Further, a large data set for environmental psychology and automated X inconsistencies may also exist for the same individual across emotionanalysisofimages. renvironmental contexts and current mood or affective state. Oneinterestingquestiontoinvestigate,however,iswhether a The causal effects and factors for such affective experiences the affective labels provided by subjects are reliable. A related have been extensively investigated, as evident in the litera- question is how to separate spammers from reliable subjects, ture on psychological and human studies, where controlled or at least to narrow the scope of data to a highly reliable experiments are commonly conducted within a small group subgroup. Here, spammers are defined as those participants ofhumansubjects—toensurethereliabilityofcollecteddata. who provide answers without serious consideration of the To complement the shortcomings of those controlled experi- presented questions. No answer from a statistical perspective ments, ecological psychology aims to understand how objects isknownyetforcrowdsourcedaffectivedata. and things in our surrounding environments effect human A great difficulty in analyzing affective data is caused by behaviorsandaffectiveexperiences,inwhichreal-worldstudies the absence of ground truth in the first place, that is, there is are favored over those within artificial laboratory environ- no correct answer for evoked emotion. It is generally accepted ments[1,2].Thekeyingredientofthoseecologicalapproaches that even the most reliable subjects can naturally have varied emotions.Indeed,withvariabilityamonghumanresponsesan- Manuscriptreceived ;revised . J.YeandJ.Z.WangarewiththeCollegeofInformationSciencesandTechnology, ticipated,psychologicalstudiesoftencareaboutquestionssuch The Pennsylvania State University, University Park, PA 16802, USA. J. Li is as where humans are emotionally consistent and where they withtheDepartmentofStatistics,ThePennsylvaniaStateUniversity,University are not, and which subgroups of humans are more consistent Park,PA16802,USA.M.NewmanandR.B.Adams,Jr.arewiththeDepartment thananother.Givenapopulation,many,ifnotthevastmajority ofPsychology,ThePennsylvaniaStateUniversity,UniversityPark,PA16802, USA.(e-mails:{jxy198,jiali,mgn1,radams,jwang}@psu.edu) of stimuli may not have a consensus emotion at all. Majority 2 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 framework. With this method, investigators running affective AnnotatorID Valence Reliability humansubjectstudiescansubstantiallyreduceoreliminatethe 3474 5.1/8 0.08 contaminationcausedbyspammers,henceimprovethequality 2500 0.0/8 0.56 andusefulnessofcollecteddata(Fig.2). 3475 0.0/8 0.34 1.1 RelatedWork 2540 8.0/8 0.04 Estimating the reliability of subjects is necessary in ImageConfidence:75%( 90%) ≤ crowdsourcing-baseddatacollectionbecausetheincentivesof Figure 1. An example illustrating one may need to acquire more reliable participantsandtheinterestofresearchersdiverge.Therewere labels,ensuringtheimageconfidenceismorethan0.9. twolevelsofassumptionsexploredforthecrowdsourceddata, which we name as the first-order assumption (A1) and the second-order assumption (A2). Let a task be the provision votingor(weighted)averagingtoforcean”objectivetruth”of of emotion responses for one image. Consider a task or test theemotionalresponseorprobablyforthesakeofconvenience, conductedbyanumberofparticipants.Theirresponseswithin asisroutinelydoneinaffectivecomputingsothatclassification thistaskformasubgroupofdata. on a single quantity can be carried out, is a crude treatment bound to erase or disregard information essential for many A1 There exists a true label of practical interest for each task. interesting psychological studies, e.g., to discover connections The dependencies between collected labels are mediated betweenvariedaffectiveresponsesandvarieddemographics. by this unobserved true label, of which noisy labels are The involvement of spammers as participating subjects otherwiseconditionallyindependent. introduces an extra source of variation to the emotional re- A2 The uncertainty model for a subgroup of data does not sponses, which unfortunately is tangled with the ”appropri- depend on its actual specified task. The performance of a ate”variation.Ifresponsesassociatedwithanimagestimulus participant is consistent across subgroups of data subject containanswersbyspammers,theinter-annotatorvariationfor toasinglefixedeffect. the specific question could be as large as the variation across Existing approaches that model the complexities of tasks different questions, reducing the robustness of any analysis. or reliability of participants often require one or both of these An example is shown in Fig. 1. Most annotators labeling this twoassumptions.UndertheumbrellaofassumptionA1,most image are deemed unreliable, and two of them are highly probabilisticapproachesusingtheobservermodels [9,10,11, susceptibleasspammersaccordingtoourmodel.Investigators 12] focus on estimating the ground truth from multiple noisy mayberecommendedtoeliminatethisimageoracquiremore labels. For example, the modeling of one reliability parameter reliablelabelsforitsuse.Yet,oneshouldnotbeswayedbythis persubjectisanestablishedpracticeforestimatingtheground example into the practice of discarding images that solicited truth label [12]. For the case of categorical labels, modeling responsesofalargerange.Certainimagesarecontroversialin of one free parameter per class per subject is a more general nature and will stimulate quite different emotions to different approach [9, 13]. Our approach does not model the ground viewers. Our system acquired the reliability scores shown in truthof labels, hence it isnot viableto compareour approach Fig.1byexaminingtheentiredataset;thedataonthisimage with other methods in this regard. Instead, we sidestep this alonewouldnotbeconclusive,infact,farfromso. issue to tackle whether the labels from one subject can agree Facing the intertwined ”appropriate” and ”inappropriate” withlabelsfromanotheronasingletask.Agreementisjudged variationsinthesubjectsaswellasthevariationsintheimages, subjecttoapreselectedcriterion.Suchtreatmentmaybemore we are motivated to unravel the sources of uncertainties by realistic as a means to process sparse ordinal labels for each taking a global approach. The judgment on the reliability of task. a subject cannot be a per-image decision, and has to leverage Assumption A2 is also widely exploited among methods, the whole data. Our model was constructed to integrate these often conditioned on A1. It assumes that all of the tasks have uncertainties, attempting to discern them with the help of the same level of difficulty [14, 15]. Modeling one difficulty big data. In addition, due to the lack of ground truth labels, parameter per task has been explored in [16] for categorical we model the relational data that code whether two subjects’ labels.However,inourapproach,taskdifficultyismodeledas emotion responses on an image agree, bypassing the thorny arandomeffectwithoutsubscribingatask-specificparameter. questionsofwhatthetruelabelsareandiftheyexistatall. Wisely choosing the modeling complexity and assumptions Forthesakeofautomatedemotionanalysisofimages,one should be based on availability and purity of data. As sug- alsoneedstonarrowthescopetopartsofdata,eachofwhich gested in [17], more complexity in a model could challenge havesufficientnumberofqualifiedlabels.Ourworkcomputes the statistical estimation subject to the constraint of real data. imageconfidences,whichcansupportoff-linedatafilteringor Choices with respect to our model attempted to properly guideon-linebudgetedcrowdsourcingpractices. analyzetheaffectivedataweobtained. Insummary,systematicanalysisofcrowdsourcedaffective Ifthemutualagreementratebetweentwoparticipantsdoes data is of great importance to human subject studies and not depend on the actual specified task (i.e., when A2 holds), affective computing, while remains an open question. To sub- we can essentially convert the resulting problem to a graph stantially address the aforementioned challenges and expand mining problem, where subjects are vertices, agreements are the evidential space for psychological studies, we propose a edges, and the proximity between subjects is modeled by probabilistic approach, called Gated Latent Beta Allocation how likely they agree with each other in a general sense. (GLBA). This method computes maximum a posteriori prob- Probabilisticmodelsforsuchrelationaldatacanbetracedback ability (MAP) estimates of each subject’s reliability and reg- toearlystochasticblockmodels[18,19],latentspacemodel[20], ularity based on a variational expectation-maximization (EM) andtheirlaterextensionswithmixedmembership[21,22]and YEetal.:PROBABILISTICMULTIGRAPHMODELING... 3 rawavg.:4.06outof8 4.1 2.94 4.78 3.33 4.25 1.9 4.54 2.75 4.53 3 → → → → → new:2.51outof8 → 4.06→3.03 4.05 2.87 4.7 2.06 5.08 3.94 5.24 3.93 → → → → 5.02→3.58 5.7 3.87 5.6 3 5.17 3.19 5.32 2.98 5.38 3.76 → → → → → Figure 2. Images shown are considered of lower valence than their average valence ratings (i.e., evoking a higher degree of negative emotions) after processingthedatasetusingourproposedmethod.Ourmethodeliminatesthecontaminationintroducedbyspammers.Therangeofvalenceratingsis between0and8. 2.63 3.77 2.8 4.14 3.0 4.7 4.4 6.21 4.7 6.26 → → → → → Figure3.Imagesshownareconsideredofhighervalencethan theiraveragevalenceratings(i.e.,evoking ahigherdegreeof positiveemotions)after processing the data set using our proposed method. Our method again eliminates the contamination introduced by spammers. The range of valence ratingsisbetween0and8. nonparametric Bayes [23]. We adopt the idea of mixed mem- 1.2 OurContributions berships wherein two particular modes of memberships are To our knowledge, this is the first attempt to connect prob- modeledforeachsubject,onebeingthereliablemodeandthe abilistic observer models with probabilistic graphs, and to other the random mode. For the random mode, the behavior exploremodelingatthiscomplexityfromthejointperspective. is assumed to be shared across different subjects, whereas the Wesummarizeourcontributionsasfollows: regularbehaviorsofsubjectsinthereliablemodeareassumed tobedifferent.Therefore,wecanextendthisframeworkfrom • Wedevelopedaprobabilisticmultigraphmodeltoanalyze graphtomultigraphintheinterestofcrowdsourceddataanal- crowdsourced data and its approximate variational EM ysis.Specifically,dataarecollectedassubgroups,eachofwhich algorithm for estimation. The new method, accepting the iscomposedofasmallagreementgraphsforasingletask,such intrinsicvariationinsubjectiveresponses,doesnotassume thatthecovariatewithinasubgroupismodeled.Ourapproach the existence of ground truth labels, in stark contrast does not rely on A2. Instead, it models the random effects to previous work having devoted much effort to obtain addedtosubjects’performanceineachtaskviathemultigraph objectivetruelabels. approach. Assumption A1 and A2 implies a bipartite graph • Ourmethodexploitstherelationaldataintheconstruction structurebetweentasksandsubjects.Incontrast,ourapproach and application of the statistical model. Specifically, in- starts from the multigraph structure among subjects that is steadofthedirectlabels,thepair-wisestatusofagreement coordinatedbytasks.Findingtheproperandflexiblestructure between labels given by different subjects is used. As thatdatapossessiscrucialformodeling[24]. a result, the multigraph agreement model is naturally applicabletomoreflexibletypesofresponses,easilygoing 4 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 beyond binary and categorical labels. Our work serves as well as that between ai + 1 and aj + 1 in order to reduce aproofofconceptforthisnewrelationalperspective. the effect of imposing discrete values on the answers that are • Our experiments have validated the effectiveness of our by nature continuous. If the condition does not hold, they approachonreal-worldaffectivedata.Becauseourexper- disagree and I(k) = 0. Here we assume that if two scores for i,j imental setup was of a larger scale and more challenging the same image are within a 20% percentile interval, they are than settings addressed by existing methods, we believe considered to reach an agreement. Compared with setting a ourmethodcanfillsomegapsfordemandsinthepractical threshold on their absolute difference, such rule adapts to the world,forinstance,whengoldstandardsarenotavailable. non-uniformity of score distribution. Two subjects can agree with each other by chance or they indeed experience similar 2 THE METHOD emotionsinresponsetothesamevisualstimulus. While the choice of the percentile threshold δ is inevitably In this section, we describe our proposed method. Let us subjective,theselectioninourexperimentswasguidedbythe present the mathematical notations first. A symbol with sub- scriptomittedalwaysindicatesanarray,e.g.,x=(...,xi,...). desire to trade-off the preservation of the original continuous scaleofthescores(favoringsmallvalues)andasufficientlevel Thearithmeticoperationsperformoverarraysintheelement- wisemanner,e.g.,x+y =(...,xi+yi,...).Randomvariables of error tolerance (favoring large values). This threshold con- trols the sparsity level of the multi-graph, and influences the are denoted as capital English letters. The tilde sign indicates thevalueofparametersinthelastiterationofEM,e.g.,θ˜.Given marginal distribution of estimated parameters. Alternatively, afunctionfθ,wedenotefθ˜byf˜θ orsimplyf˜,iftheparameter one may assess different values of the threshold and make a θ˜is implied. Additional notations, as summarized in Table 1, selection based on some other criteria of preference (if exist) appliedtothefinalresults. willbeexplainedinmoredetailslater. Table1 Symbolsanddescriptionsofparameters,randomvariables,andstatistics. Symbols Descriptions 2.2 GatedLatentBetaAllocation Oi subjecti τi rateofsubjectreliability This subsection describes the basic probabilistic graphical αi,βi shapeofsubjectregularity model we used to jointly model subject reliability, which is γ rateofagreementbychance independent from the supplied questions, and regularity. We Θ unionofparameters refrain from carrying out a full Bayesian inference because it Tj(k) whetherOj reliablyresponse isimpracticaltoendusers.Instead,weusethemode(s)ofthe Ji(k) rateofOiagreeingwithotherreliableresponses posterioraspointestimates. (k) Ii,j whetherOiagreeswiththeresponsesfromOj We assume each subject i has a reliability parameter τi ωi(k)(·) cumulativedegreeofresponsesagreedbyOi [0,1]andregularityparametersαi,βi >0characterizinghiso∈r ψ(k)(·) cumulativedegreeofresponses heragreementbehaviorwiththepopulation,fori = 1,...,m. i rj(k)(·) aratioamplifiesordiscountsthereliabilityofOj We also use parameter γ for the rate of agreement between τ˜i(k) sufficientstatisticsofposteriorTi(k),givenΘ˜ subjects out of pure chance. Let Θ = ({τi,αi,βi}mi=1,γ) be α˜(k),β˜(k) sufficientstatisticsofposteriorJ(k),givenΘ˜ the set of parameters. Let Ωk be the a random sub-sample i i i from subjects 1,...,m who labeled the stimulus k, where { } k = 1,...,n. We also assume sets Ωk’s are created inde- pendently from each other. For each image k, every subject 2.1 AgreementMultigraph pair from Ω2, i.e., (i,j) with i = j, has a binary indicator Werepresentthedataasadirectedmultigraph,whichdoesnot I(k) 0,1k coding whether t(cid:54)heir opinions agree on the assume a particular type of crowdsourced response. Suppose rei,sjpec∈tiv{e sti}mulus. We assume I(k) are generated from the we have prepared m questions in the study, the answers can i,j following probabilistic process with two latent variables. The be binary, categorical, ordinal, and multidimensional. Given a subject pair (i,j) who are asked to look at the k-th question, firstlatentvariableTj(k)indicateswhethersubjectOj isreliable or not. Given that it is binary, a natural choice of model is the one designs an agreement protocol that determines whether the answer from subject i agrees with that from subject j. If Bernoulli distribution. The second latent variable Ji(k), lying subjecti’sagreeswithsubjectj’sontaskk,thenwesetI(k) = between 0 and 1, measures the extent subject Oi agrees with i,j 1.OInthoeruwricsaes,eI,i(,wkj)e=ar0e.givenordinaldatafrommultiplechan- tthereizoetdhebryreαliiaabnledrβeisptoonmseosd.eWlJei(uks)ebBeceatausdeisittriisbuatwioindeplayraumseed- nels, we define I(k) = 1 if (sum of) the percentile difference parametricdistributionforquantitiesoninterval[0,1]andthe i,j shape of the distribution is relatively flexible. In a nutshell, betweentwoanswersai,aj ∈{1,...,A}satisfies T(k)isalatentswitch(aka,gate)thatcontrolswhetherI(k)can j i,j 1(cid:12)(cid:12)P (cid:104)a(k)(cid:105) P (cid:104)a(k)(cid:105)(cid:12)(cid:12)+ 1(cid:12)(cid:12)P (cid:104)a(k)+1(cid:105) P (cid:104)a(k)+1(cid:105)(cid:12)(cid:12) δ, be used for the posterior inference of the latent variable J(k). 2(cid:12) i − j (cid:12) 2(cid:12) i − j (cid:12)≤ Hence, we call our model Gated Latent Beta Allocation (GLBiA). (1) ThepercentileP[]iscalculatedfromthewholepoolofanswers AgraphicalillustrationofthemodelisshowninFig.4. · for each discrete value, and δ = 0.2. In the above equation, We now present the mathematical formulation of the we measure the percentile difference between ai and aj as model.Fork =1,...,n,wegenerateasetofrandomvariables YEetal.:PROBABILISTICMULTIGRAPHMODELING... 5 independentlyvia Θ=({τi,αi,βi}mi=1,γ)[26].Theparameterγ isassumedtobe pre-selectedbytheuseranddoesnotneedtobeestimated.To Tj(k) i.i.d. ∼ Bernoulli(τj), j ∈Ωk , (2) regularize the other parameters in estimation, we use the em- Ji(k) i.i.d. ∼ Beta(αi,βi),(cid:16) i∈(cid:17)Ωk , (3) ppirriiocraslBayesapproachtochoosepriors.Assumethefollowing (cid:12)  Bernoulli J(k) ifT(k) =1 Ii(,kj)(cid:12)(cid:12)Tj(k),Ji(k) ∼  Bernoulli(γ)i ifTj(k) =0 (4) τi ∼ Beta(τ0,1−τ0), (6) j αi+βi Gamma(2,s0). (7) wihearnedthielaΩsktrwanitdhokm=p1ro,.ce..ss,nh,oalnddsfγoirsatnhyerjat∈eoΩf¬kaig:r=eeΩmken−t By empirical Bayes, τ0, s∼0 are adjusted. For the ease of nota- b{y}chance∈if one of i,j turns out to be unreliable. Here {Ii(,kj)} tions,wedefinetwoauxiliaryfunctionsωi(k)(·)andψi(k)(·): areobserveddata. ωi(k)(x):= (cid:88) xjIi(,kj), ψi(k)(x):= (cid:88) xj . (8) j∈Ω¬ki j∈Ωk τ m α ,β m { j}j=1 γ { i i}i=1 Similarly,wedefinetheirsiblings ω¯(k)(x)=ω(k)(1 x), ψ¯(k)(x)=ψ(k)(1 x). (9) i i − i i − Wealsodefinetheauxiliaryfunctionrj()as T(k) I(k) J(k) · j j Ω i,j i i Ωk rj(k)(x)= (cid:89) (cid:18)xγi(cid:19)Ii(,kj)(cid:18)11−xγi(cid:19)1−Ii(,kj). (10) ∈ k ∈ i∈Ω¬j − k k = 1,...,n Nowwedefinethefulllikelihoodfunction: Figure4.ProbabilisticgraphicalmodeloftheproposedGatedLatentBeta Allocation. Lk(Θ;T(k),J(k),I(k)):= (cid:89) (cid:16)(τj)Tj(k)(1 τj)1−Tj(k)(cid:17) − j∈Ωk If a spammer is in the subject pool, his or her reliability (cid:16) (cid:17)α(k)(cid:16) (cid:17)β(k) parameter τi is zero, though others can still agree with his J(k) i 1 J(k) i φ(k) or her answers by chance at rate γ. On the other hand, if (cid:89) i − i i , (11) · B(α ,β ) one is very reliable yet often provides controversial answers, i∈Ωk i i his reliability τi can be one, while he typically disagrees with others,indicatedbyhishighirregularityE[J(k)] = αi 0. whereauxiliaryvariablessimplifyingtheequationsare i αi+βi ≈ (cid:16) (cid:17) We are interested in finding both types of subjects. However, α(k) = α +ω(k) T(k) , i i i mostofsubjectslieinbetweenthesetwoextremes. (cid:16) (cid:17) Asaninterestingnote,Eq.(4)isasymmetric,meaningthat β(k) = β +ψ(k) ω(k) T(k) , Idi(e,kjfi)n(cid:54)=itioInj(,ksi)oifstphoestswiboleq,uaasncteintiaersi.oWtheaptrsohpoousledtoneavcehrieovcecusrymby- φi(ik) = γiω¯i(k)(Ti(k))−(1−i γ)ψ¯i(k)(T(k))−ω¯i(k)(T(k)) , metry in the final model by using the conditional distribution andB(, )istheBetafunction.Consequently,assumetheprior of Ii(,kj) and Ij(,ki) given that Ii(,kj) = Ij(,ki), and call this model likeliho·od· isLΘ(Θ),theMAPestimateofΘistominimize the symmetrized model. With details omitted, we state that conditioned on Ti(k), Tj(k), Ji(k), and Jj(k), the symmetrized L(Θ;T,J,I):=LΘ(Θ)(cid:89)n Lk(Θ;T(k),J(k),I(k)). (12) modelisstillaBernoullidistribution: k=1 I(k) i,j ∼ We solve the estimation using variational EM method with a (cid:32) (cid:32) (cid:33)(cid:33) Bernoulli H (cid:16)Ji(k)(cid:17)Ti(k)γ1−Ti(k),(cid:16)Jj(k)(cid:17)Tj(k)γ1−Tj(k) , fitoxeadpp(rτo0x,sim0)aatendthveaproysintegriγo.rTbhyeaidfaecatoorfivzaabrilaetitoenmapllmateet,hwohdossies probabilitydistributionminimizesitsKLdivergencetothetrue (5) posterior. Once the approximate posterior is solved, it is then where pq usedintheE-stepintheEMalgorithmasthealternativetothe H(p,q)= . trueposterior.TheusualM-stepisunchanged.EachtimeΘis pq+(1 p)(1 q) − − estimated, we adjust prior (τ0,s0) to match the mean of the Wmoedtealcfkolresitmhepliincifteyr.ence and estimation of the asymmetric MAP estimates of {τi} and (cid:26)αi+2 βi(cid:27) respective until they aresufficientlyclose. E-step.Weuse thefactorizedQ-approximationwith varia- 2.3 VariationalEM tionalprinciple: Variational inference is an optimization based strategy for approximating posterior distribution in complex distribu- p (cid:16)T(k),J(k)(cid:12)(cid:12)I(k)(cid:17) (cid:89) q∗ (cid:16)T(k)(cid:17) (cid:89) q∗ (cid:16)J(k)(cid:17). tions [25]. Since the full posterior is highly intractable, we Θ (cid:12) ≈ Tj,Θ j Ji,Θ i j∈Ωk i∈Ωk consider to use variational EM to estimate the parameters (13) 6 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 Let subject i. We set ∂L/∂αi = 0 and ∂L/∂βi = 0 for each i, • whichreads (cid:16) (cid:17) q∗ T(k)   eTxjp,Θ(cid:16)EJj,T¬j(cid:104)∝logLk(cid:16)Θ;T(k),J(k),I(k)(cid:17)(cid:105)(cid:17) , (14) (cid:18)αis+0βi −log(αi+βi)(cid:19)· 11  whosedistributioncanbewrittenas = (cid:88) ∇B(α˜i(k),β˜i(k)) ∇B(αi,βi) B(α˜(k),β˜(k)) − B(αi,βi) Bernoulli(cid:32)τjRj(τkj)R+j(k1)−τj(cid:33), =kk(cid:88)∈∈∆∆ii ΨΨ((iαβ˜˜i(i(kk))))i−−ΨΨ((αα˜˜i(i(kk))++ββ˜˜ii((kk))))    (cid:104) (cid:16) (cid:17)(cid:105) gwehseterde lboygJRohj(kn)so=naEnJd(cid:80)Koi∈tzΩ[¬k2j7l]o,gtheri(gke)(oJm(ke)tr)icm. Aeasnscuagn- −|∆i|· ΨΨ((αβi))−ΨΨ((ααi++ββi))  , (19) i i i − benumericallyapproximatedby where Ψ(x) [log(x 1),logx] is the Digamma function. ∈ − TheabovetwoequationscanbepracticallysolvedbyNewton- R(k) (cid:89) 1 (cid:32)αi(k)(cid:33)Ii(,kj)(cid:32) βi(k) (cid:33)1−Ii(,kj), Raphsonmethodwithaprojectedmodification(ensuringα,β j ≈ α(k)+β(k) γ 1 γ alwaysaregreaterthanzero). i∈Ω¬kj i i − Compute the derivatives of L with respect to τi and set ifbothαi(k) andβi(k) aresufficientlylargerthan1. (15) ∂L/∂τi =0,whichreads   • Let τi = ∆ 1+1τ0+ (cid:88) τ˜i(k) . (20) q∗ (J(k)) | i| k∈∆i Ji,Θ(cid:16) i ∝(cid:104) (cid:16) (cid:17)(cid:105)(cid:17) (16) Compute the derivatives of L w.r.t. γ and set to zero, which exp E logL Θ;T(k),J(k),I(k) , T,J¬i k reads (cid:80) ω¯(k)(τ˜(k)) γ = i∈Ωk i i . (21) whosedistributionis (cid:80) ψ¯(k)(τ˜(k)) i∈Ωk i i Beta(α +ω(k)(τ),β +ψ(k)(τ) ω(k)(τ)). Inpractice,theupdateformulaforγ needsnottobeusedifγ i i i i − i ispre-fixed.SeeAlgorithm1fordetails. Given parameter Ω˜ = τ˜i,α˜i,β˜i i=1, we can compute the { } 2.4 TheAlgorithm approximateposteriorexpectationoftheloglikelihood,which reads We present our final algorithm to estimate all parameters by knowing the multigraph data I(k) . Our algorithm is de- { i,j } E logL (Θ;T(k),J(k),I(k)) signedbasedonEqs.(19),(20),and(21).IneachEMiteration, T,J|Θ˜,I k ≈ there are two loops: one for collecting relevant statistics for const.+logL (Θ)+ Θ (cid:88) (cid:16)τ˜i(k)logτj +(1−τ˜i(k))log(1−τj)(cid:17)+ eeastcihmsautebsgrfaoprhe,aacnhdstuhebjeoctht.erPlfeoarsree-rceofmerputotinAglgthoeritphamram1eftoerr j∈Ωk details. (cid:88) (cid:42) αi ,∇B(α˜i(k),β˜i(k))(cid:43) Algorithm1VariationalEMalgorithmofGLBA i(cid:88)∈ΩklogBβ(αi i,βi)B+(lα˜ogi(kγ),β(cid:88)˜i(k))ω¯i(k)−(cid:16)τ˜i(k)(cid:17)+ OInuptuIpntu:ittiA:alsimsuabutijleotnci-tg:prτaa0rpa=hm{0eI.t5ike,,jrαs∈iΘ={0=β,1i(}{=}(τiτ,ij,i∈α=Ωik,1,β.0i0),<}imi==γ1,<1γ,).0..5.,m i∈Ωk i∈Ωk 1: repeat log(1 γ) (cid:88) (cid:16)ψ¯(k)(cid:16)τ˜(k)(cid:17) ω¯(k)(cid:16)τ˜(k)(cid:17)(cid:17), (17) 2: fork =1tondo − i∈Ωk i i − i i 3: computestatisticsα˜i(k),β˜i(k),τ˜i(k) byEq.(18); 4: endfor whererelevantstatisticsaredefinedas 5: fori=1tomdo 6: solve(αi,βi)fromEq.(19)(Newton-Raphson); α˜(k) = α˜ +ω(k)(τ˜), 7: computeτi byEq.(20); i i i 8: endfor β˜i(k) = β˜i+ψi(k)(τ˜)−ωi(k)(τ˜), and (18) 9: (optional)updateγ fromEq.(21); τ˜(k) = R˜i(k)τ˜i . 10: until{(τi,αi,βi)}mi=1 areallconverged. i R˜(k)τ˜ +1 τ˜ 11: return Θ i i − i RemarkB(, )istheBetafunction,andR˜(k)iscalculatedfrom 3 EXPERIMENTS · · i approximationEq.(15) 3.1 DataSets M-step. Compute the partial derivatives of L with respect We studied a crowdsourced affective data set acquired from to αi and βi: let ∆i be the set of images that are labeled by theAmazonMechanicalTurk(AMT)platform[8].Theaffective YEetal.:PROBABILISTICMULTIGRAPHMODELING... 7 datasetisacollectionofimagestimuliandtheiraffectivelabels Dawid and Skene [9]. Our method falls into the general includingvalence,arousal,dominanceandlikeness(degreeof category of consensus methods in the literature of statistics appreciation). Labels for each image are ordinal: 1, ... , 9 and machine learning, where the spammer filtering decision { } for the first three dimensions, and 1, ..., 7 for the likeness ismadecompletelybasedonthelabelsprovidedbyobservers. { } dimension. The study setup and collected data statistics have Thoseconsensusmethodshavebeendevelopedalongtheline beendetailedin[8],whichwedescribebrieflyhereforthesake ofDawidandSkene[9],andtheymainlydealwithcategorical ofcompleteness. labelsbymodelingeachobserverusingadesignatedconfusion At the beginning of a session, the AMT study host pro- matrix.Morerecentdevelopmentsoftheobservermodelshave vides the subject brief training on the concepts of affective beendiscussedin[17],whereabenchmarkhasshownthatthe dimensions. Here are descriptions used for valence, arousal, Dawid-Skenemethodisstillquitecompetitiveinunsupervised dominance,andlikeness. settings according to a number of real-world data sets for • Valence:degreeoffeelinghappyvs.unhappy whichground-truthlabelsarebelievedtoexistalbeitunknown. • Arousal:degreeoffeelingexcitedvs.calm However,thismethodisnotdirectlyapplicabletoourscenario. • Dominance:degreeoffeelingsubmissivevs.dominant To enable comparison with this baseline method, we first • Likeness:howmuchyoulikeordisliketheimage convert each affective dimension into a categorical label by thresholding.Wecreatethreecategories:high,neural,andlow, Thequestionspresentedtothesubjectforeachimagearegiven each covering a continuous range of values on the scale. For belowinexactwording. example, high valence category implies a score greater than • Slide the solid bubble along each of the bars associated a neural score (i.e., 5) by more than a threshold (e.g., 0.5). with the 3 scales (Valence, Arousal, and Dominance) in Suchathresholdingapproachhasbeenadoptedindeveloping ordertoindicatehowyouACTUALLYFELTWHILEYOU affectivecategorizationsystems,e.g.[5,6]. OBSERVEDTHEIMAGE. Time duration. In the practice of data collection, the host • How did you like this image? (Like extremely, Like filtered spammers by a simple criterion—to declare a subject very much, Like slightly, Neither like nor dislike, Dislike spammer if he spends substantially less time on every task. slightly,Dislikeverymuch,Dislikeextremely) The labels provided by the identified spammers were then Each AMT subject is asked to finish a set of labeling tasks, excluded from the data set for subsequent use, and the host and each task is to provide affective labels on a single image alsodeclinedtopayforthetask.However,somesubjectswho from a prepared set, called the EmoSet. This set contains were declined to be paid wrote emails to the host arguing for around40,000imagescrawledfromtheInternetusingaffective their cases. Under this spirit, in our experiments, we form a keywords. Each task is divided into two stages. First, the baseline method that uses the average time duration of each subject views the image; and second, he/she provides ratings subjecttored-flagaspammer. in the emotion dimensions through a Web interface. Subjects Filtering based on gold standard examples. A widely used usually spend three to ten seconds to view each image, and spammer detection approach in crowdsourcing is to create a fivetotwentysecondstolabelit.Thesystemrecordsthetime small set with known ground truth labels and use it to spot durations respectively for the two stages of each task and anyonewhogivesincorrectlabels.However,suchapolicywas calculates the average cost (at a rate of about 1.4 US Dollars notimplementedinourdatacollectionprocessbecauseaswe per hour). Around 4,000 subjects were recruited in total. For arguedearlier,thereissimplynogroundtruthfortheemotion the experiments below, we retained image stimuli that have responses to an image in a general sense. On the other hand, receivedaffectivelabelsfromatleastfoursubjects.Underthis just for the sake of comparison, it seems reasonable to find a screening, the AMT data have 47,688 responses from 2,039 subsetofimagesthatevokesuchextremeemotionsthatground subjects on 11,038 images. Here, one response refers to the truthlabelscanbeaccepted.Thissubsetwillthenservetherole labelingofoneimagebyonesubjectconductedinonetask. of gold standard examples. We used our method to retrieve Because humans can naturally feel differently from each a subset of images which evoke extreme emotions with high otherintheiraffectiveexperiences,therewasnogoldstandard confidence (see Section 3.7 for confidence score and emotion criterion to identify spammers. Such a human emotion data scorecalculation).Forthevalencedimension,wewereableto set is difficult to analyze and the quality of data is hard to identify at most 101 images with valence score 8 (on the assess. Among several emotion dimensions, we found that scale of 1...9) with over 90% confidence and 37≥images with participantsweremoreconsistentinthevalencedimension.As valence score 2 with over 90% confidence. We also looked areminder,valenceistherateddegreeofpositivityofemotion ≤ atthoseimagesonebyone(asprovidedinthesupplementary evoked by looking at an image. We call the variance of the materials) and believe that within a reasonable tolerance of ratings from different subjects on the same image the within- doubtthoseimagesshouldevokeclearemotionsinthevalence task variance, while the variance of the ratings from all the dimension. Unfortunately, only a small fraction of subjects in subjects on all the images the cross-task variance. For valence ourpoolhavelabeledatleastoneimagefromthis”goldstan- and likeness, the within-task variance accounts for about 70% dard”subset.Amongthissmallgroup,theirdisparityfromthe of the cross-task variance, much smaller than for the other gold standard enables us to find three susceptible spammers. two dimensions. Therefore, the remaining experiments were To see whether these three susceptible spammers can also be focused on evaluating the regularity of image valences in the detected by our method, we find that their reliability scores data. τ [0,1]are0.11,0.22,0.35respectively.InFig.9,weplotthe ∈ distribution of τ of the entire subject pool. These three scores 3.2 BaselinesforComparison are clearly on the low end with respect to the scores of the othersubjects.Thusthethreespammersarealsoassessedtobe We discuss below several baseline methods or models with highlysusceptiblebyourmodel. whichwecompareourmethod. 8 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 In summary, while we were able to compare our method contains highly reliable subjects with high values of τ. Each with the first two baselines quantitatively, with results to be subjecttakesonerow.Fortheconvenienceofvisualization,we presentedshortly,comparisonwiththethirdbaselineislimited represent the three-dimensional emotion scores given to any duetothewaytheAMTdatawerecollected[8]. image by a particular color whose RGB values are mapped fromthevaluesinthethreedimensionsrespectively.Theemo- 3.3 ModelSetup tion labels for every image by one subject are then condensed into one color bar. The labels provided by each subject for Sinceourhypothesesincludedarandomagreementratioγthat all his images are then shown as a palette in one row. For ispre-selected,weadjustedtheparameterγ from0.3to0.48to clarity,thecolorbarsaresortedinlexicographicorderoftheir seeempiricallyhowitaffectstheresultinpractice. RGB values. One can clearly see that those labels given by the subjects from these two categories exhibit quite different patterns. The palettes of the susceptible spammers are more extreme in terms of saturation or brightness. The abnormality of label distributions of the first category naturally originates 0 1. from the fact that spammers intended to label the data by 8 exerting the minimal efforts and without paying attention to 0. thequestions. 6 0. au 3.4 BasicStatisticsofManuallyAnnotatedSpammers t 4 0. Foreachsubjectinthepool,byobservingallhisorherlabels indifferentemotiondimensions,therewasareasonablechance 2 0. of spotting abnormality solely by visualizing the distribution. 0 Ifonewereaspammer,itoftenhappenedthathisorherlabels 0. werehighlycorrelated,skewedordeviatedinanextrememan- 0.30 0.35 0.40 0.45 nerfromaneuralemotionalongdifferentdimensions.Insuch cases,itwaspossibletomanuallyexcludehisorherresponses gamma (a) fromthedataduetohisorherhighsusceptibility.Weapplied this same practice to identifying highly susceptible subjects fromthepool.Wefoundabout200susceptibleparticipants. We studied several basic statistics of this subset in com- 3 parison with the whole population: total number of tasks completed, average time duration spent on image viewing a and survey per task. The histograms of these quantities are h 2 alp plotted in Fig. 6. One can see that the annotated spammers did not necessarily spend less time or finish fewer tasks than 1 the others, and the time duration has shown only marginal sensitivity to those annotated spammers (See Fig. 6). The figuresdemonstratethatthosestatisticsarenoteffectivecriteria 0 forspammerfiltering. -1 0 1 2 3 4 Wewillusethissubsetofsusceptiblesubjectsasa”pseudo- goldstandard”setforquantitativecomparisonsofourmethod beta (b) andthebaselinesinthesubsequentstudies.Asexplainedpre- viouslyin3.2,otherchoicesofconstructingagoldstandardset Figure5.(a)Reliabilityscoresversusγ ∈ [0.3,0.48]forthetop15users either conflict the high variation nature of emotion responses whoprovidedthemostnumbersofratings.(b)Visualizationoftheestimated oryieldonlyatiny(ofsizethree)setofspammers. regularity parameters of each worker at a given γ. Green dots are for workerswithhighreliabilityandreddotsforlowreliability.Theslopeofthe redlineequalsγ. 3.5 Top-K Precision Performance in Retrieving the Real Fig. 5 depicts how the reliability parameter τ varies with Spammers γ for different workers in our data set. Results are shown for We conducted experiments on each affective dimension, and the top 15 users who provided the most numbers of ratings. evaluated whether the subjects with the lowest estimated τ Generallyspeaking,ahigherγ correspondstoahigherchance weresupposedtoberealspammersaccordingtothe”pseudo- of agreement between workers purely out of random. From gold standard” subset constructed in Section 3.4. Since there the figure, we can see that a worker providing more ratings wasnogoldstandardtocorrectlyclassifywhetheronesubject is not necessarily more reliable. It is quite possible that some wastrulyaspammerornot,wehavebeenagnostichere.Based workers took advantage of the AMT study to earn monetary on that subset, we were able to partially evaluate the top-K compensation without paying enough attention to the actual precision in retrieving the real spammers, especially the most questions. susceptibleones. InTable2,wedemonstratethevalence,arousal,anddomi- Specifically, we computed the reliability parameter τ for nancelabelsfortwocategoriesofsubjects.Onthetop,thefirst each subject and chose the K subjects with the lowest values category contains susceptible spammers with low estimated as the most susceptible spammers. Because τ depends on the reliabilityparameterτ;andonthebottom,thesecondcategory random agreement rate γ, we computed τ’s using 10 values YEetal.:PROBABILISTICMULTIGRAPHMODELING... 9 0.025 0.6 0.20 Annotated Spammers Annotated Spammers Annotated Spammers Other Subjects 0.5 Other Subjects Other Subjects 0.020 0.15 Relative Density00..001105 Relative Density000...234 Relative Density0.10 0.05 0.005 0.1 0.000 0.0 0.00 0 100 200 300 400 500 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 Counts Image Viewing Duration (sec) Survey Duration (sec) Figure6.Normalizedhistogramofbasicstatisticsincludingtotalnumberoftaskscompletedandaveragetimedurationspentateachofthetwostages pertask. τi αi βi reportedemotions(sorted) 0.19 1.17 2.43 0.08 0.75 2.20 0.08 1.16 2.50 0.09 0.67 1.70 0.03 0.94 1.90 0.17 0.72 1.47 0.06 1.14 2.50 0.17 0.86 1.79 0.04 1.01 2.63 0.03 1.08 2.84 0.92 2.29 1.49 0.94 2.55 1.98 0.95 2.61 1.68 0.92 2.40 1.66 0.91 2.21 1.40 0.92 2.45 1.97 0.93 2.38 1.69 0.93 1.76 1.40 0.91 2.44 1.86 0.92 2.30 1.85 0.92 2.45 1.82 0.91 1.64 1.29 0.90 1.68 1.12 0.91 2.72 2.22 Table2 OraclesintheAMTdataset.Upper:maliciousoracleswhoseαi/βiisamongthelowest30,meanwhile|∆i|isgreaterthan10.Lower:reliableoracles whoseτiisamongthetop30,meanwhileαi/βi>1.2.TheirreportedemotionsarevisualizedbyRGBcolors.TheestimatesofΘisbasedonthe valencedimension. 10 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 000000 time duration method, used in the practice of AMT host, is 1.1.1.1.1.1. better than the Dawin-Skene method, yet substantially worse thantheperformanceofourmethod. 888888 0.0.0.0.0.0. GGGLLLBBBAAA aaavvveeerrr... We also tested this same method of identifying spammers GGGLLLBBBAAA γγγ===000...333 GGGGGGLLLLLLBBBBBBAAAAAA γγγγγγ======000000......343434747474 using affective dimensions other than valence. As shown in DDDSSS nnnnnn 0.60.60.60.60.60.6 DDDuuurrraaatttiiiooonnn Fig. 8, the two most discerning dimensions were valence and oooooo ecisiecisiecisiecisiecisiecisi arousal. It is not surprising that people can reach relatively PrPrPrPrPrPr 444444 higherconsensuswhenratingimagesbythesetwodimensions 0.0.0.0.0.0. than by dominance or likeness. Dominance is much more likely to draw on evidence from context and social situation 222222 0.0.0.0.0.0. in most circumstances and hence less likely to have its nature determinedtoalargerextentbythestimulusitself. 000000 0.0.0.0.0.0. 000000......000000 000000......222222 000000......444444 000000......666666 000000......888888 111111......000000 3.6 RecallPerformanceinRetrievingtheSimulatedSpam- mers RRRRRReeeeeeccccccaaaaaallllllllllll Theevaluationoftop-Kprecisionwaslimitedintworespects: (1) the susceptible subjects were identified because we could Figure7.TheagnosticPrecision-Recallcurve(byvalence)basedonman- clearly observe their abnormality in terms of the multivariate ually annotated spammers. The top 20, top 40 and top 60 precision is 100%,95%,78%respectively(blackline).Itisexpectedthatprecisiondrops distribution of provided labels. If the participant labeled the quickly with increasing recalls, because the manually annotation process data by acting exactly the same as the distribution of the canonlyidentifyaspecialtypeofspammers,whileothertypesofspammers population,wecouldnotmanuallyidentifyhim/herusingthe can be identified by the algorithm. The PR curves at γ = 0.3,0.37,0.44 arealsoplotted.Twobaselinesarecompared:theDawidandSkene(DS) aforementionedmethodology.(2)Westillneedtodetermineif approachandthetimedurationbasedapproach. oneisaspammer,howlikelywearetospothim/her. In this study, we simulated several highly “intelligent” 0000 spammers, who labeled the data by exactly following the 1.1.1.1. label distribution of the whole population. Every time, we valence 0.80.80.80.8 adroomuisnaalnce gTehneelraabteedls 1o0f ssipmaumlamteedrss,pwamhomrearnsdwoemrelynolatboevleedrla5p0piinmga.gWese. likeness onononon 0.60.60.60.6 mixedthoselabelsofthesimulatedspammerswiththeexisting sisisisi data set, and then conducted our method again to determine cicicici PrePrePrePre 0.40.40.40.4 hsiomwulaactecdurastpeamoumrearps.prWoaechrepweaastewditthhisrepsproecctestso1fi0ndtiimngesthine ordertoestimatetheτ distributionofthesimulatedspammers. 2222 0.0.0.0. Results are reported Fig. 9. We drew the histogram of the estimatedreliabilityofallrealworkersandcomparedthemto 0000 0.0.0.0. theestimatedreliabilityofsimulatedspammers(inthetablein- cludedinFig.9).Wenotedthatmorethanhalfofthesimulated 0000....0000 0000....2222 0000....4444 0000....6666 0000....8888 1111....0000 spammerswereidentifiedashighlysusceptiblebasedontheτ RRRReeeeccccaaaallllllll estimation ( 0.2), and none of them were supposed to have ≤ a high reliability score ( 0.6). This result validates that our ≥ method is robust enough to spot the “intelligent” spammers, Figure8.TheagnosticPrecision-Recallcurvebasedonmanuallyannotated even if they disguise themselves as random labelers within a spammerscomputedfromdifferentaffectivedimensions:valence,arousal, dominance,andlikeness. population. 3.7 Qualitative Comparison Based on Controversial Ex- of γ evenly spaced out over interval [0.3,0.48]. The average amples value of τ was then used for ranking. The Precision Recall To re-rank the emotion dimensions and likenesses of stimuli Curves are shown in Fig. 7. Our method achieves high top- withthereliabilityofthesubjectaccountedfor,weadoptedthe K precision by retrieving the most susceptible subjects from the pool according to the average τ. In particular, the top- following formula to find the stimuli with “reliably” highest 20 precision is 100%, the top-40 precision is 95%, and the ratings.Assumeeachratingai [0,1].Wedefinethefollowing ∈ top-60 precision is 78%. Clearly, our algorithm has yielded toreplacetheusualaverage: rrsbeeuysssuuficllexttpisnatigwtbγγleeltl=oona0le0i.sg3..3,nI70en.d3iFs7iw,gb0.eit.t7h4t,e4wrtahetnehdaahlnusuosmtiphnaelgonttohtPjuherdeecrcogirmtsrwieoesonnptaoRconerndcoaisntlslhgCerτeucm.raTvolhlessest, bk :=(cid:124)(cid:80)(cid:80)i∈Ωi∈(cid:123)kΩ(cid:122)τkiaτi(ik)(cid:125)·(cid:124) 1−i∈(cid:89)Ω(cid:123)k(cid:122)(1−τi)(cid:125), (22) est.score indicating that a proper level of the random agreement rate confidence (cid:0) (cid:81) (cid:1) canbeimportantforachievingthebestperformance.Thetwo where 1− i∈Ωk(1−τi) ∈ [0,1] is the cumulative confidence baseline methods are clearly not competitive in this evalua- scoreforimagek.Thisadjustedratingbk notonlyallowsmore tion. The Dawin-Skene method [9], widely used in processing reliablesubjectstoplayabiggerroleviatheweightedaverage crowdsourced data with objective ground truth labels, drops (thefirsttermoftheproduct)butalsomodulatestheweighted quicklytoaremarkablylowprecisionevenatalowrecall.The average by the cumulative confidence score for the image.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.