Table Of ContentIEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016 1
Probabilistic Multigraph Modeling for Improving
the Quality of Crowdsourced Affective Data
Jianbo Ye, Jia Li, Michelle G. Newman, Reginald B. Adams, Jr. and James Z. Wang
(cid:70)
7Abstract—We proposed a probabilistic approach to joint modeling of is the availability of large-scale data collected from human
1participants’ reliability and humans’ regularity in crowdsourced affective subjects, remedying the high complexity and heterogeneity
0studies. Reliability measures how likely a subject will respond to a ques- that the real-world has to offer. With the growing attention
2tion seriously; and regularity measures how often a human will agree on affective computing (initiated from the seminal discussion
with other seriously-entered responses coming from a targeted popula-
n [3] to recent communications [4]), multiple data-driven ap-
tion. Crowdsourcing-based studies or experiments, which rely on human
a proaches have been developed to understand what particular
self-reported affect, pose additional challenges as compared with typical
J
environmental factors drive the feelings of humans [5, 6], and
crowdsourcingstudiesthatattempttoacquireconcretenon-affectivelabelsof
6objects.Thereliabilityofparticipantshasbeenmassivelypursuedfortypical how those effects differ among various sociological structures
non-affectivecrowdsourcing studies,whereas the regularityof humansin andbetweenhumangroups.
]
Lanaffectiveexperimentinitsownrighthasnotbeenthoroughlyconsidered. One crucial hurdle for those affective computing ap-
MIthasbeenoftenobservedthatdifferentindividualsexhibitdifferentfeelings proaches is the lack of full-spectrum annotated stimuli data
on the same test question, which does not have a sole correct response
at a large scale. To address this bottleneck, crowdsourcing-
.in the first place. High reliability of responses from one individual thus
t basedapproachesarehighlyhelpfulforcollectinguncontrolled
acannot conclusively result in high consensus across individuals. Instead,
humandatafromanonymousparticipants[7].Inarecentstudy
tgloballytestingconsensusofapopulationisofinteresttoinvestigators.Built
s reported in [8], anonymous subjects from the Internet were
[upontheagreementmultigraphamongtasksandworkers,ourprobabilistic
modeldifferentiatessubjectregularityfrompopulationreliability.Wedemon- recruited to annotate a set of visual stimuli (images): at each
2stratethemethod’seffectivenessforin-depthrobustanalysisoflarge-scale time point, after being presented with an image stimulus,
vcrowdsourcedaffectivedata,includingemotionandaestheticassessments participants were asked to assess their personal psychologi-
6collectedbypresentingvisualstimulitohumansubjects. cal experiences using ordinal scales for each of the affective
9
dimensions: valence, arousal, dominance and likeness (which
0Index Terms—Emotions, human subjects, crowdsourcing, probabilistic
1graphicalmodel,visualstimuli means the degree of appreciation in our context). This study
0 also collected demographics data to analyze individual dif-
. ference predictors of affective responses. Because labeling a
11 INTRODUCTION large number of visual stimuli can become tedious, even with
0
7Humans’ sensitivity to affective stimuli intrinsically varies crowdsourcing, each image stimulus was examined by only a
1from one person to another. Differences in gender, age, soci- few subjects. This study allowed tens of thousands of images
:ety,culture,personality,socialstatus,andpersonalexperience to obtain at least one label from a participant, which created
v
ican contribute to its high variability between people. Further, a large data set for environmental psychology and automated
X
inconsistencies may also exist for the same individual across emotionanalysisofimages.
renvironmental contexts and current mood or affective state. Oneinterestingquestiontoinvestigate,however,iswhether
a
The causal effects and factors for such affective experiences the affective labels provided by subjects are reliable. A related
have been extensively investigated, as evident in the litera- question is how to separate spammers from reliable subjects,
ture on psychological and human studies, where controlled or at least to narrow the scope of data to a highly reliable
experiments are commonly conducted within a small group subgroup. Here, spammers are defined as those participants
ofhumansubjects—toensurethereliabilityofcollecteddata. who provide answers without serious consideration of the
To complement the shortcomings of those controlled experi- presented questions. No answer from a statistical perspective
ments, ecological psychology aims to understand how objects isknownyetforcrowdsourcedaffectivedata.
and things in our surrounding environments effect human A great difficulty in analyzing affective data is caused by
behaviorsandaffectiveexperiences,inwhichreal-worldstudies the absence of ground truth in the first place, that is, there is
are favored over those within artificial laboratory environ- no correct answer for evoked emotion. It is generally accepted
ments[1,2].Thekeyingredientofthoseecologicalapproaches that even the most reliable subjects can naturally have varied
emotions.Indeed,withvariabilityamonghumanresponsesan-
Manuscriptreceived ;revised .
J.YeandJ.Z.WangarewiththeCollegeofInformationSciencesandTechnology, ticipated,psychologicalstudiesoftencareaboutquestionssuch
The Pennsylvania State University, University Park, PA 16802, USA. J. Li is as where humans are emotionally consistent and where they
withtheDepartmentofStatistics,ThePennsylvaniaStateUniversity,University
are not, and which subgroups of humans are more consistent
Park,PA16802,USA.M.NewmanandR.B.Adams,Jr.arewiththeDepartment
thananother.Givenapopulation,many,ifnotthevastmajority
ofPsychology,ThePennsylvaniaStateUniversity,UniversityPark,PA16802,
USA.(e-mails:{jxy198,jiali,mgn1,radams,jwang}@psu.edu) of stimuli may not have a consensus emotion at all. Majority
2 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016
framework. With this method, investigators running affective
AnnotatorID Valence Reliability
humansubjectstudiescansubstantiallyreduceoreliminatethe
3474 5.1/8 0.08
contaminationcausedbyspammers,henceimprovethequality
2500 0.0/8 0.56 andusefulnessofcollecteddata(Fig.2).
3475 0.0/8 0.34
1.1 RelatedWork
2540 8.0/8 0.04
Estimating the reliability of subjects is necessary in
ImageConfidence:75%( 90%)
≤ crowdsourcing-baseddatacollectionbecausetheincentivesof
Figure 1. An example illustrating one may need to acquire more reliable participantsandtheinterestofresearchersdiverge.Therewere
labels,ensuringtheimageconfidenceismorethan0.9. twolevelsofassumptionsexploredforthecrowdsourceddata,
which we name as the first-order assumption (A1) and the
second-order assumption (A2). Let a task be the provision
votingor(weighted)averagingtoforcean”objectivetruth”of
of emotion responses for one image. Consider a task or test
theemotionalresponseorprobablyforthesakeofconvenience,
conductedbyanumberofparticipants.Theirresponseswithin
asisroutinelydoneinaffectivecomputingsothatclassification
thistaskformasubgroupofdata.
on a single quantity can be carried out, is a crude treatment
bound to erase or disregard information essential for many A1 There exists a true label of practical interest for each task.
interesting psychological studies, e.g., to discover connections The dependencies between collected labels are mediated
betweenvariedaffectiveresponsesandvarieddemographics. by this unobserved true label, of which noisy labels are
The involvement of spammers as participating subjects otherwiseconditionallyindependent.
introduces an extra source of variation to the emotional re- A2 The uncertainty model for a subgroup of data does not
sponses, which unfortunately is tangled with the ”appropri- depend on its actual specified task. The performance of a
ate”variation.Ifresponsesassociatedwithanimagestimulus participant is consistent across subgroups of data subject
containanswersbyspammers,theinter-annotatorvariationfor toasinglefixedeffect.
the specific question could be as large as the variation across Existing approaches that model the complexities of tasks
different questions, reducing the robustness of any analysis. or reliability of participants often require one or both of these
An example is shown in Fig. 1. Most annotators labeling this twoassumptions.UndertheumbrellaofassumptionA1,most
image are deemed unreliable, and two of them are highly probabilisticapproachesusingtheobservermodels [9,10,11,
susceptibleasspammersaccordingtoourmodel.Investigators 12] focus on estimating the ground truth from multiple noisy
mayberecommendedtoeliminatethisimageoracquiremore labels. For example, the modeling of one reliability parameter
reliablelabelsforitsuse.Yet,oneshouldnotbeswayedbythis persubjectisanestablishedpracticeforestimatingtheground
example into the practice of discarding images that solicited truth label [12]. For the case of categorical labels, modeling
responsesofalargerange.Certainimagesarecontroversialin of one free parameter per class per subject is a more general
nature and will stimulate quite different emotions to different approach [9, 13]. Our approach does not model the ground
viewers. Our system acquired the reliability scores shown in truthof labels, hence it isnot viableto compareour approach
Fig.1byexaminingtheentiredataset;thedataonthisimage with other methods in this regard. Instead, we sidestep this
alonewouldnotbeconclusive,infact,farfromso. issue to tackle whether the labels from one subject can agree
Facing the intertwined ”appropriate” and ”inappropriate” withlabelsfromanotheronasingletask.Agreementisjudged
variationsinthesubjectsaswellasthevariationsintheimages, subjecttoapreselectedcriterion.Suchtreatmentmaybemore
we are motivated to unravel the sources of uncertainties by realistic as a means to process sparse ordinal labels for each
taking a global approach. The judgment on the reliability of task.
a subject cannot be a per-image decision, and has to leverage Assumption A2 is also widely exploited among methods,
the whole data. Our model was constructed to integrate these often conditioned on A1. It assumes that all of the tasks have
uncertainties, attempting to discern them with the help of the same level of difficulty [14, 15]. Modeling one difficulty
big data. In addition, due to the lack of ground truth labels, parameter per task has been explored in [16] for categorical
we model the relational data that code whether two subjects’ labels.However,inourapproach,taskdifficultyismodeledas
emotion responses on an image agree, bypassing the thorny arandomeffectwithoutsubscribingatask-specificparameter.
questionsofwhatthetruelabelsareandiftheyexistatall. Wisely choosing the modeling complexity and assumptions
Forthesakeofautomatedemotionanalysisofimages,one should be based on availability and purity of data. As sug-
alsoneedstonarrowthescopetopartsofdata,eachofwhich gested in [17], more complexity in a model could challenge
havesufficientnumberofqualifiedlabels.Ourworkcomputes the statistical estimation subject to the constraint of real data.
imageconfidences,whichcansupportoff-linedatafilteringor Choices with respect to our model attempted to properly
guideon-linebudgetedcrowdsourcingpractices. analyzetheaffectivedataweobtained.
Insummary,systematicanalysisofcrowdsourcedaffective Ifthemutualagreementratebetweentwoparticipantsdoes
data is of great importance to human subject studies and not depend on the actual specified task (i.e., when A2 holds),
affective computing, while remains an open question. To sub- we can essentially convert the resulting problem to a graph
stantially address the aforementioned challenges and expand mining problem, where subjects are vertices, agreements are
the evidential space for psychological studies, we propose a edges, and the proximity between subjects is modeled by
probabilistic approach, called Gated Latent Beta Allocation how likely they agree with each other in a general sense.
(GLBA). This method computes maximum a posteriori prob- Probabilisticmodelsforsuchrelationaldatacanbetracedback
ability (MAP) estimates of each subject’s reliability and reg- toearlystochasticblockmodels[18,19],latentspacemodel[20],
ularity based on a variational expectation-maximization (EM) andtheirlaterextensionswithmixedmembership[21,22]and
YEetal.:PROBABILISTICMULTIGRAPHMODELING... 3
rawavg.:4.06outof8 4.1 2.94 4.78 3.33 4.25 1.9 4.54 2.75 4.53 3
→ → → → →
new:2.51outof8
→
4.06→3.03 4.05 2.87 4.7 2.06 5.08 3.94 5.24 3.93
→ → → →
5.02→3.58 5.7 3.87 5.6 3 5.17 3.19 5.32 2.98 5.38 3.76
→ → → → →
Figure 2. Images shown are considered of lower valence than their average valence ratings (i.e., evoking a higher degree of negative emotions) after
processingthedatasetusingourproposedmethod.Ourmethodeliminatesthecontaminationintroducedbyspammers.Therangeofvalenceratingsis
between0and8.
2.63 3.77 2.8 4.14 3.0 4.7 4.4 6.21 4.7 6.26
→ → → → →
Figure3.Imagesshownareconsideredofhighervalencethan theiraveragevalenceratings(i.e.,evoking ahigherdegreeof positiveemotions)after
processing the data set using our proposed method. Our method again eliminates the contamination introduced by spammers. The range of valence
ratingsisbetween0and8.
nonparametric Bayes [23]. We adopt the idea of mixed mem- 1.2 OurContributions
berships wherein two particular modes of memberships are
To our knowledge, this is the first attempt to connect prob-
modeledforeachsubject,onebeingthereliablemodeandthe
abilistic observer models with probabilistic graphs, and to
other the random mode. For the random mode, the behavior
exploremodelingatthiscomplexityfromthejointperspective.
is assumed to be shared across different subjects, whereas the
Wesummarizeourcontributionsasfollows:
regularbehaviorsofsubjectsinthereliablemodeareassumed
tobedifferent.Therefore,wecanextendthisframeworkfrom • Wedevelopedaprobabilisticmultigraphmodeltoanalyze
graphtomultigraphintheinterestofcrowdsourceddataanal- crowdsourced data and its approximate variational EM
ysis.Specifically,dataarecollectedassubgroups,eachofwhich algorithm for estimation. The new method, accepting the
iscomposedofasmallagreementgraphsforasingletask,such intrinsicvariationinsubjectiveresponses,doesnotassume
thatthecovariatewithinasubgroupismodeled.Ourapproach the existence of ground truth labels, in stark contrast
does not rely on A2. Instead, it models the random effects to previous work having devoted much effort to obtain
addedtosubjects’performanceineachtaskviathemultigraph objectivetruelabels.
approach. Assumption A1 and A2 implies a bipartite graph • Ourmethodexploitstherelationaldataintheconstruction
structurebetweentasksandsubjects.Incontrast,ourapproach and application of the statistical model. Specifically, in-
starts from the multigraph structure among subjects that is steadofthedirectlabels,thepair-wisestatusofagreement
coordinatedbytasks.Findingtheproperandflexiblestructure between labels given by different subjects is used. As
thatdatapossessiscrucialformodeling[24]. a result, the multigraph agreement model is naturally
applicabletomoreflexibletypesofresponses,easilygoing
4 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016
beyond binary and categorical labels. Our work serves as well as that between ai + 1 and aj + 1 in order to reduce
aproofofconceptforthisnewrelationalperspective. the effect of imposing discrete values on the answers that are
• Our experiments have validated the effectiveness of our by nature continuous. If the condition does not hold, they
approachonreal-worldaffectivedata.Becauseourexper- disagree and I(k) = 0. Here we assume that if two scores for
i,j
imental setup was of a larger scale and more challenging the same image are within a 20% percentile interval, they are
than settings addressed by existing methods, we believe considered to reach an agreement. Compared with setting a
ourmethodcanfillsomegapsfordemandsinthepractical threshold on their absolute difference, such rule adapts to the
world,forinstance,whengoldstandardsarenotavailable. non-uniformity of score distribution. Two subjects can agree
with each other by chance or they indeed experience similar
2 THE METHOD emotionsinresponsetothesamevisualstimulus.
While the choice of the percentile threshold δ is inevitably
In this section, we describe our proposed method. Let us
subjective,theselectioninourexperimentswasguidedbythe
present the mathematical notations first. A symbol with sub-
scriptomittedalwaysindicatesanarray,e.g.,x=(...,xi,...). desire to trade-off the preservation of the original continuous
scaleofthescores(favoringsmallvalues)andasufficientlevel
Thearithmeticoperationsperformoverarraysintheelement-
wisemanner,e.g.,x+y =(...,xi+yi,...).Randomvariables of error tolerance (favoring large values). This threshold con-
trols the sparsity level of the multi-graph, and influences the
are denoted as capital English letters. The tilde sign indicates
thevalueofparametersinthelastiterationofEM,e.g.,θ˜.Given marginal distribution of estimated parameters. Alternatively,
afunctionfθ,wedenotefθ˜byf˜θ orsimplyf˜,iftheparameter one may assess different values of the threshold and make a
θ˜is implied. Additional notations, as summarized in Table 1, selection based on some other criteria of preference (if exist)
appliedtothefinalresults.
willbeexplainedinmoredetailslater.
Table1
Symbolsanddescriptionsofparameters,randomvariables,andstatistics.
Symbols Descriptions 2.2 GatedLatentBetaAllocation
Oi subjecti
τi rateofsubjectreliability This subsection describes the basic probabilistic graphical
αi,βi shapeofsubjectregularity model we used to jointly model subject reliability, which is
γ rateofagreementbychance independent from the supplied questions, and regularity. We
Θ unionofparameters refrain from carrying out a full Bayesian inference because it
Tj(k) whetherOj reliablyresponse isimpracticaltoendusers.Instead,weusethemode(s)ofthe
Ji(k) rateofOiagreeingwithotherreliableresponses posterioraspointestimates.
(k)
Ii,j whetherOiagreeswiththeresponsesfromOj We assume each subject i has a reliability parameter τi
ωi(k)(·) cumulativedegreeofresponsesagreedbyOi [0,1]andregularityparametersαi,βi >0characterizinghiso∈r
ψ(k)(·) cumulativedegreeofresponses heragreementbehaviorwiththepopulation,fori = 1,...,m.
i
rj(k)(·) aratioamplifiesordiscountsthereliabilityofOj We also use parameter γ for the rate of agreement between
τ˜i(k) sufficientstatisticsofposteriorTi(k),givenΘ˜ subjects out of pure chance. Let Θ = ({τi,αi,βi}mi=1,γ) be
α˜(k),β˜(k) sufficientstatisticsofposteriorJ(k),givenΘ˜ the set of parameters. Let Ωk be the a random sub-sample
i i i from subjects 1,...,m who labeled the stimulus k, where
{ }
k = 1,...,n. We also assume sets Ωk’s are created inde-
pendently from each other. For each image k, every subject
2.1 AgreementMultigraph pair from Ω2, i.e., (i,j) with i = j, has a binary indicator
Werepresentthedataasadirectedmultigraph,whichdoesnot I(k) 0,1k coding whether t(cid:54)heir opinions agree on the
assume a particular type of crowdsourced response. Suppose rei,sjpec∈tiv{e sti}mulus. We assume I(k) are generated from the
we have prepared m questions in the study, the answers can i,j
following probabilistic process with two latent variables. The
be binary, categorical, ordinal, and multidimensional. Given a
subject pair (i,j) who are asked to look at the k-th question, firstlatentvariableTj(k)indicateswhethersubjectOj isreliable
or not. Given that it is binary, a natural choice of model is the
one designs an agreement protocol that determines whether
the answer from subject i agrees with that from subject j. If Bernoulli distribution. The second latent variable Ji(k), lying
subjecti’sagreeswithsubjectj’sontaskk,thenwesetI(k) = between 0 and 1, measures the extent subject Oi agrees with
i,j
1.OInthoeruwricsaes,eI,i(,wkj)e=ar0e.givenordinaldatafrommultiplechan- tthereizoetdhebryreαliiaabnledrβeisptoonmseosd.eWlJei(uks)ebBeceatausdeisittriisbuatwioindeplayraumseed-
nels, we define I(k) = 1 if (sum of) the percentile difference parametricdistributionforquantitiesoninterval[0,1]andthe
i,j shape of the distribution is relatively flexible. In a nutshell,
betweentwoanswersai,aj ∈{1,...,A}satisfies T(k)isalatentswitch(aka,gate)thatcontrolswhetherI(k)can
j i,j
1(cid:12)(cid:12)P (cid:104)a(k)(cid:105) P (cid:104)a(k)(cid:105)(cid:12)(cid:12)+ 1(cid:12)(cid:12)P (cid:104)a(k)+1(cid:105) P (cid:104)a(k)+1(cid:105)(cid:12)(cid:12) δ, be used for the posterior inference of the latent variable J(k).
2(cid:12) i − j (cid:12) 2(cid:12) i − j (cid:12)≤ Hence, we call our model Gated Latent Beta Allocation (GLBiA).
(1)
ThepercentileP[]iscalculatedfromthewholepoolofanswers AgraphicalillustrationofthemodelisshowninFig.4.
·
for each discrete value, and δ = 0.2. In the above equation, We now present the mathematical formulation of the
we measure the percentile difference between ai and aj as model.Fork =1,...,n,wegenerateasetofrandomvariables
YEetal.:PROBABILISTICMULTIGRAPHMODELING... 5
independentlyvia Θ=({τi,αi,βi}mi=1,γ)[26].Theparameterγ isassumedtobe
pre-selectedbytheuseranddoesnotneedtobeestimated.To
Tj(k) i.i.d. ∼ Bernoulli(τj), j ∈Ωk , (2) regularize the other parameters in estimation, we use the em-
Ji(k) i.i.d. ∼ Beta(αi,βi),(cid:16) i∈(cid:17)Ωk , (3) ppirriiocraslBayesapproachtochoosepriors.Assumethefollowing
(cid:12) Bernoulli J(k) ifT(k) =1
Ii(,kj)(cid:12)(cid:12)Tj(k),Ji(k) ∼ Bernoulli(γ)i ifTj(k) =0 (4) τi ∼ Beta(τ0,1−τ0), (6)
j
αi+βi Gamma(2,s0). (7)
wihearnedthielaΩsktrwanitdhokm=p1ro,.ce..ss,nh,oalnddsfγoirsatnhyerjat∈eoΩf¬kaig:r=eeΩmken−t By empirical Bayes, τ0, s∼0 are adjusted. For the ease of nota-
b{y}chance∈if one of i,j turns out to be unreliable. Here {Ii(,kj)} tions,wedefinetwoauxiliaryfunctionsωi(k)(·)andψi(k)(·):
areobserveddata. ωi(k)(x):= (cid:88) xjIi(,kj), ψi(k)(x):= (cid:88) xj . (8)
j∈Ω¬ki j∈Ωk
τ m α ,β m
{ j}j=1 γ { i i}i=1 Similarly,wedefinetheirsiblings
ω¯(k)(x)=ω(k)(1 x), ψ¯(k)(x)=ψ(k)(1 x). (9)
i i − i i −
Wealsodefinetheauxiliaryfunctionrj()as
T(k) I(k) J(k) ·
j j Ω i,j i i Ωk rj(k)(x)= (cid:89) (cid:18)xγi(cid:19)Ii(,kj)(cid:18)11−xγi(cid:19)1−Ii(,kj). (10)
∈ k ∈ i∈Ω¬j −
k
k = 1,...,n
Nowwedefinethefulllikelihoodfunction:
Figure4.ProbabilisticgraphicalmodeloftheproposedGatedLatentBeta
Allocation. Lk(Θ;T(k),J(k),I(k)):= (cid:89) (cid:16)(τj)Tj(k)(1 τj)1−Tj(k)(cid:17)
−
j∈Ωk
If a spammer is in the subject pool, his or her reliability
(cid:16) (cid:17)α(k)(cid:16) (cid:17)β(k)
parameter τi is zero, though others can still agree with his J(k) i 1 J(k) i φ(k)
or her answers by chance at rate γ. On the other hand, if (cid:89) i − i i , (11)
· B(α ,β )
one is very reliable yet often provides controversial answers, i∈Ωk i i
his reliability τi can be one, while he typically disagrees with
others,indicatedbyhishighirregularityE[J(k)] = αi 0. whereauxiliaryvariablessimplifyingtheequationsare
i αi+βi ≈ (cid:16) (cid:17)
We are interested in finding both types of subjects. However, α(k) = α +ω(k) T(k) ,
i i i
mostofsubjectslieinbetweenthesetwoextremes.
(cid:16) (cid:17)
Asaninterestingnote,Eq.(4)isasymmetric,meaningthat β(k) = β +ψ(k) ω(k) T(k) ,
Idi(e,kjfi)n(cid:54)=itioInj(,ksi)oifstphoestswiboleq,uaasncteintiaersi.oWtheaptrsohpoousledtoneavcehrieovcecusrymby- φi(ik) = γiω¯i(k)(Ti(k))−(1−i γ)ψ¯i(k)(T(k))−ω¯i(k)(T(k)) ,
metry in the final model by using the conditional distribution
andB(, )istheBetafunction.Consequently,assumetheprior
of Ii(,kj) and Ij(,ki) given that Ii(,kj) = Ij(,ki), and call this model likeliho·od· isLΘ(Θ),theMAPestimateofΘistominimize
the symmetrized model. With details omitted, we state that
conditioned on Ti(k), Tj(k), Ji(k), and Jj(k), the symmetrized L(Θ;T,J,I):=LΘ(Θ)(cid:89)n Lk(Θ;T(k),J(k),I(k)). (12)
modelisstillaBernoullidistribution:
k=1
I(k)
i,j ∼ We solve the estimation using variational EM method with a
(cid:32) (cid:32) (cid:33)(cid:33)
Bernoulli H (cid:16)Ji(k)(cid:17)Ti(k)γ1−Ti(k),(cid:16)Jj(k)(cid:17)Tj(k)γ1−Tj(k) , fitoxeadpp(rτo0x,sim0)aatendthveaproysintegriγo.rTbhyeaidfaecatoorfivzaabrilaetitoenmapllmateet,hwohdossies
probabilitydistributionminimizesitsKLdivergencetothetrue
(5)
posterior. Once the approximate posterior is solved, it is then
where pq usedintheE-stepintheEMalgorithmasthealternativetothe
H(p,q)= . trueposterior.TheusualM-stepisunchanged.EachtimeΘis
pq+(1 p)(1 q)
− − estimated, we adjust prior (τ0,s0) to match the mean of the
Wmoedtealcfkolresitmhepliincifteyr.ence and estimation of the asymmetric MAP estimates of {τi} and (cid:26)αi+2 βi(cid:27) respective until they
aresufficientlyclose.
E-step.Weuse thefactorizedQ-approximationwith varia-
2.3 VariationalEM
tionalprinciple:
Variational inference is an optimization based strategy for
approximating posterior distribution in complex distribu- p (cid:16)T(k),J(k)(cid:12)(cid:12)I(k)(cid:17) (cid:89) q∗ (cid:16)T(k)(cid:17) (cid:89) q∗ (cid:16)J(k)(cid:17).
tions [25]. Since the full posterior is highly intractable, we Θ (cid:12) ≈ Tj,Θ j Ji,Θ i
j∈Ωk i∈Ωk
consider to use variational EM to estimate the parameters (13)
6 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016
Let subject i. We set ∂L/∂αi = 0 and ∂L/∂βi = 0 for each i,
•
whichreads
(cid:16) (cid:17)
q∗ T(k)
eTxjp,Θ(cid:16)EJj,T¬j(cid:104)∝logLk(cid:16)Θ;T(k),J(k),I(k)(cid:17)(cid:105)(cid:17) , (14) (cid:18)αis+0βi −log(αi+βi)(cid:19)· 11
whosedistributioncanbewrittenas = (cid:88) ∇B(α˜i(k),β˜i(k)) ∇B(αi,βi)
B(α˜(k),β˜(k)) − B(αi,βi)
Bernoulli(cid:32)τjRj(τkj)R+j(k1)−τj(cid:33), =kk(cid:88)∈∈∆∆ii ΨΨ((iαβ˜˜i(i(kk))))i−−ΨΨ((αα˜˜i(i(kk))++ββ˜˜ii((kk))))
(cid:104) (cid:16) (cid:17)(cid:105)
gwehseterde lboygJRohj(kn)so=naEnJd(cid:80)Koi∈tzΩ[¬k2j7l]o,gtheri(gke)(oJm(ke)tr)icm. Aeasnscuagn- −|∆i|· ΨΨ((αβi))−ΨΨ((ααi++ββi)) , (19)
i i i
−
benumericallyapproximatedby
where Ψ(x) [log(x 1),logx] is the Digamma function.
∈ −
TheabovetwoequationscanbepracticallysolvedbyNewton-
R(k) (cid:89) 1 (cid:32)αi(k)(cid:33)Ii(,kj)(cid:32) βi(k) (cid:33)1−Ii(,kj), Raphsonmethodwithaprojectedmodification(ensuringα,β
j ≈ α(k)+β(k) γ 1 γ alwaysaregreaterthanzero).
i∈Ω¬kj i i − Compute the derivatives of L with respect to τi and set
ifbothαi(k) andβi(k) aresufficientlylargerthan1. (15) ∂L/∂τi =0,whichreads
• Let τi = ∆ 1+1τ0+ (cid:88) τ˜i(k) . (20)
q∗ (J(k)) | i| k∈∆i
Ji,Θ(cid:16) i ∝(cid:104) (cid:16) (cid:17)(cid:105)(cid:17) (16) Compute the derivatives of L w.r.t. γ and set to zero, which
exp E logL Θ;T(k),J(k),I(k) ,
T,J¬i k reads
(cid:80) ω¯(k)(τ˜(k))
γ = i∈Ωk i i . (21)
whosedistributionis (cid:80) ψ¯(k)(τ˜(k))
i∈Ωk i i
Beta(α +ω(k)(τ),β +ψ(k)(τ) ω(k)(τ)). Inpractice,theupdateformulaforγ needsnottobeusedifγ
i i i i − i ispre-fixed.SeeAlgorithm1fordetails.
Given parameter Ω˜ = τ˜i,α˜i,β˜i i=1, we can compute the
{ } 2.4 TheAlgorithm
approximateposteriorexpectationoftheloglikelihood,which
reads We present our final algorithm to estimate all parameters by
knowing the multigraph data I(k) . Our algorithm is de-
{ i,j }
E logL (Θ;T(k),J(k),I(k)) signedbasedonEqs.(19),(20),and(21).IneachEMiteration,
T,J|Θ˜,I k ≈
there are two loops: one for collecting relevant statistics for
const.+logL (Θ)+
Θ
(cid:88) (cid:16)τ˜i(k)logτj +(1−τ˜i(k))log(1−τj)(cid:17)+ eeastcihmsautebsgrfaoprhe,aacnhdstuhebjeoctht.erPlfeoarsree-rceofmerputotinAglgthoeritphamram1eftoerr
j∈Ωk details.
(cid:88) (cid:42) αi ,∇B(α˜i(k),β˜i(k))(cid:43) Algorithm1VariationalEMalgorithmofGLBA
i(cid:88)∈ΩklogBβ(αi i,βi)B+(lα˜ogi(kγ),β(cid:88)˜i(k))ω¯i(k)−(cid:16)τ˜i(k)(cid:17)+ OInuptuIpntu:ittiA:alsimsuabutijleotnci-tg:prτaa0rpa=hm{0eI.t5ike,,jrαs∈iΘ={0=β,1i(}{=}(τiτ,ij,i∈α=Ωik,1,β.0i0),<}imi==γ1,<1γ,).0..5.,m
i∈Ωk i∈Ωk 1: repeat
log(1 γ) (cid:88) (cid:16)ψ¯(k)(cid:16)τ˜(k)(cid:17) ω¯(k)(cid:16)τ˜(k)(cid:17)(cid:17), (17) 2: fork =1tondo
− i∈Ωk i i − i i 3: computestatisticsα˜i(k),β˜i(k),τ˜i(k) byEq.(18);
4: endfor
whererelevantstatisticsaredefinedas 5: fori=1tomdo
6: solve(αi,βi)fromEq.(19)(Newton-Raphson);
α˜(k) = α˜ +ω(k)(τ˜), 7: computeτi byEq.(20);
i i i
8: endfor
β˜i(k) = β˜i+ψi(k)(τ˜)−ωi(k)(τ˜), and (18) 9: (optional)updateγ fromEq.(21);
τ˜(k) = R˜i(k)τ˜i . 10: until{(τi,αi,βi)}mi=1 areallconverged.
i R˜(k)τ˜ +1 τ˜ 11: return Θ
i i − i
RemarkB(, )istheBetafunction,andR˜(k)iscalculatedfrom 3 EXPERIMENTS
· · i
approximationEq.(15) 3.1 DataSets
M-step. Compute the partial derivatives of L with respect We studied a crowdsourced affective data set acquired from
to αi and βi: let ∆i be the set of images that are labeled by theAmazonMechanicalTurk(AMT)platform[8].Theaffective
YEetal.:PROBABILISTICMULTIGRAPHMODELING... 7
datasetisacollectionofimagestimuliandtheiraffectivelabels Dawid and Skene [9]. Our method falls into the general
includingvalence,arousal,dominanceandlikeness(degreeof category of consensus methods in the literature of statistics
appreciation). Labels for each image are ordinal: 1, ... , 9 and machine learning, where the spammer filtering decision
{ }
for the first three dimensions, and 1, ..., 7 for the likeness ismadecompletelybasedonthelabelsprovidedbyobservers.
{ }
dimension. The study setup and collected data statistics have Thoseconsensusmethodshavebeendevelopedalongtheline
beendetailedin[8],whichwedescribebrieflyhereforthesake ofDawidandSkene[9],andtheymainlydealwithcategorical
ofcompleteness. labelsbymodelingeachobserverusingadesignatedconfusion
At the beginning of a session, the AMT study host pro- matrix.Morerecentdevelopmentsoftheobservermodelshave
vides the subject brief training on the concepts of affective beendiscussedin[17],whereabenchmarkhasshownthatthe
dimensions. Here are descriptions used for valence, arousal, Dawid-Skenemethodisstillquitecompetitiveinunsupervised
dominance,andlikeness. settings according to a number of real-world data sets for
• Valence:degreeoffeelinghappyvs.unhappy whichground-truthlabelsarebelievedtoexistalbeitunknown.
• Arousal:degreeoffeelingexcitedvs.calm However,thismethodisnotdirectlyapplicabletoourscenario.
• Dominance:degreeoffeelingsubmissivevs.dominant To enable comparison with this baseline method, we first
• Likeness:howmuchyoulikeordisliketheimage convert each affective dimension into a categorical label by
thresholding.Wecreatethreecategories:high,neural,andlow,
Thequestionspresentedtothesubjectforeachimagearegiven
each covering a continuous range of values on the scale. For
belowinexactwording.
example, high valence category implies a score greater than
• Slide the solid bubble along each of the bars associated
a neural score (i.e., 5) by more than a threshold (e.g., 0.5).
with the 3 scales (Valence, Arousal, and Dominance) in
Suchathresholdingapproachhasbeenadoptedindeveloping
ordertoindicatehowyouACTUALLYFELTWHILEYOU
affectivecategorizationsystems,e.g.[5,6].
OBSERVEDTHEIMAGE.
Time duration. In the practice of data collection, the host
• How did you like this image? (Like extremely, Like
filtered spammers by a simple criterion—to declare a subject
very much, Like slightly, Neither like nor dislike, Dislike
spammer if he spends substantially less time on every task.
slightly,Dislikeverymuch,Dislikeextremely)
The labels provided by the identified spammers were then
Each AMT subject is asked to finish a set of labeling tasks,
excluded from the data set for subsequent use, and the host
and each task is to provide affective labels on a single image
alsodeclinedtopayforthetask.However,somesubjectswho
from a prepared set, called the EmoSet. This set contains
were declined to be paid wrote emails to the host arguing for
around40,000imagescrawledfromtheInternetusingaffective
their cases. Under this spirit, in our experiments, we form a
keywords. Each task is divided into two stages. First, the
baseline method that uses the average time duration of each
subject views the image; and second, he/she provides ratings
subjecttored-flagaspammer.
in the emotion dimensions through a Web interface. Subjects
Filtering based on gold standard examples. A widely used
usually spend three to ten seconds to view each image, and
spammer detection approach in crowdsourcing is to create a
fivetotwentysecondstolabelit.Thesystemrecordsthetime
small set with known ground truth labels and use it to spot
durations respectively for the two stages of each task and
anyonewhogivesincorrectlabels.However,suchapolicywas
calculates the average cost (at a rate of about 1.4 US Dollars
notimplementedinourdatacollectionprocessbecauseaswe
per hour). Around 4,000 subjects were recruited in total. For
arguedearlier,thereissimplynogroundtruthfortheemotion
the experiments below, we retained image stimuli that have
responses to an image in a general sense. On the other hand,
receivedaffectivelabelsfromatleastfoursubjects.Underthis
just for the sake of comparison, it seems reasonable to find a
screening, the AMT data have 47,688 responses from 2,039
subsetofimagesthatevokesuchextremeemotionsthatground
subjects on 11,038 images. Here, one response refers to the
truthlabelscanbeaccepted.Thissubsetwillthenservetherole
labelingofoneimagebyonesubjectconductedinonetask.
of gold standard examples. We used our method to retrieve
Because humans can naturally feel differently from each
a subset of images which evoke extreme emotions with high
otherintheiraffectiveexperiences,therewasnogoldstandard
confidence (see Section 3.7 for confidence score and emotion
criterion to identify spammers. Such a human emotion data
scorecalculation).Forthevalencedimension,wewereableto
set is difficult to analyze and the quality of data is hard to identify at most 101 images with valence score 8 (on the
assess. Among several emotion dimensions, we found that scale of 1...9) with over 90% confidence and 37≥images with
participantsweremoreconsistentinthevalencedimension.As valence score 2 with over 90% confidence. We also looked
areminder,valenceistherateddegreeofpositivityofemotion ≤
atthoseimagesonebyone(asprovidedinthesupplementary
evoked by looking at an image. We call the variance of the
materials) and believe that within a reasonable tolerance of
ratings from different subjects on the same image the within-
doubtthoseimagesshouldevokeclearemotionsinthevalence
task variance, while the variance of the ratings from all the
dimension. Unfortunately, only a small fraction of subjects in
subjects on all the images the cross-task variance. For valence
ourpoolhavelabeledatleastoneimagefromthis”goldstan-
and likeness, the within-task variance accounts for about 70%
dard”subset.Amongthissmallgroup,theirdisparityfromthe
of the cross-task variance, much smaller than for the other
gold standard enables us to find three susceptible spammers.
two dimensions. Therefore, the remaining experiments were
To see whether these three susceptible spammers can also be
focused on evaluating the regularity of image valences in the
detected by our method, we find that their reliability scores
data. τ [0,1]are0.11,0.22,0.35respectively.InFig.9,weplotthe
∈
distribution of τ of the entire subject pool. These three scores
3.2 BaselinesforComparison are clearly on the low end with respect to the scores of the
othersubjects.Thusthethreespammersarealsoassessedtobe
We discuss below several baseline methods or models with
highlysusceptiblebyourmodel.
whichwecompareourmethod.
8 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016
In summary, while we were able to compare our method contains highly reliable subjects with high values of τ. Each
with the first two baselines quantitatively, with results to be subjecttakesonerow.Fortheconvenienceofvisualization,we
presentedshortly,comparisonwiththethirdbaselineislimited represent the three-dimensional emotion scores given to any
duetothewaytheAMTdatawerecollected[8]. image by a particular color whose RGB values are mapped
fromthevaluesinthethreedimensionsrespectively.Theemo-
3.3 ModelSetup tion labels for every image by one subject are then condensed
into one color bar. The labels provided by each subject for
Sinceourhypothesesincludedarandomagreementratioγthat
all his images are then shown as a palette in one row. For
ispre-selected,weadjustedtheparameterγ from0.3to0.48to
clarity,thecolorbarsaresortedinlexicographicorderoftheir
seeempiricallyhowitaffectstheresultinpractice.
RGB values. One can clearly see that those labels given by
the subjects from these two categories exhibit quite different
patterns. The palettes of the susceptible spammers are more
extreme in terms of saturation or brightness. The abnormality
of label distributions of the first category naturally originates
0
1. from the fact that spammers intended to label the data by
8 exerting the minimal efforts and without paying attention to
0. thequestions.
6
0.
au 3.4 BasicStatisticsofManuallyAnnotatedSpammers
t
4
0. Foreachsubjectinthepool,byobservingallhisorherlabels
indifferentemotiondimensions,therewasareasonablechance
2
0. of spotting abnormality solely by visualizing the distribution.
0 Ifonewereaspammer,itoftenhappenedthathisorherlabels
0. werehighlycorrelated,skewedordeviatedinanextrememan-
0.30 0.35 0.40 0.45 nerfromaneuralemotionalongdifferentdimensions.Insuch
cases,itwaspossibletomanuallyexcludehisorherresponses
gamma
(a) fromthedataduetohisorherhighsusceptibility.Weapplied
this same practice to identifying highly susceptible subjects
fromthepool.Wefoundabout200susceptibleparticipants.
We studied several basic statistics of this subset in com-
3 parison with the whole population: total number of tasks
completed, average time duration spent on image viewing
a and survey per task. The histograms of these quantities are
h 2
alp plotted in Fig. 6. One can see that the annotated spammers
did not necessarily spend less time or finish fewer tasks than
1 the others, and the time duration has shown only marginal
sensitivity to those annotated spammers (See Fig. 6). The
figuresdemonstratethatthosestatisticsarenoteffectivecriteria
0
forspammerfiltering.
-1 0 1 2 3 4 Wewillusethissubsetofsusceptiblesubjectsasa”pseudo-
goldstandard”setforquantitativecomparisonsofourmethod
beta
(b) andthebaselinesinthesubsequentstudies.Asexplainedpre-
viouslyin3.2,otherchoicesofconstructingagoldstandardset
Figure5.(a)Reliabilityscoresversusγ ∈ [0.3,0.48]forthetop15users either conflict the high variation nature of emotion responses
whoprovidedthemostnumbersofratings.(b)Visualizationoftheestimated
oryieldonlyatiny(ofsizethree)setofspammers.
regularity parameters of each worker at a given γ. Green dots are for
workerswithhighreliabilityandreddotsforlowreliability.Theslopeofthe
redlineequalsγ.
3.5 Top-K Precision Performance in Retrieving the Real
Fig. 5 depicts how the reliability parameter τ varies with Spammers
γ for different workers in our data set. Results are shown for We conducted experiments on each affective dimension, and
the top 15 users who provided the most numbers of ratings. evaluated whether the subjects with the lowest estimated τ
Generallyspeaking,ahigherγ correspondstoahigherchance weresupposedtoberealspammersaccordingtothe”pseudo-
of agreement between workers purely out of random. From gold standard” subset constructed in Section 3.4. Since there
the figure, we can see that a worker providing more ratings wasnogoldstandardtocorrectlyclassifywhetheronesubject
is not necessarily more reliable. It is quite possible that some wastrulyaspammerornot,wehavebeenagnostichere.Based
workers took advantage of the AMT study to earn monetary on that subset, we were able to partially evaluate the top-K
compensation without paying enough attention to the actual precision in retrieving the real spammers, especially the most
questions. susceptibleones.
InTable2,wedemonstratethevalence,arousal,anddomi- Specifically, we computed the reliability parameter τ for
nancelabelsfortwocategoriesofsubjects.Onthetop,thefirst each subject and chose the K subjects with the lowest values
category contains susceptible spammers with low estimated as the most susceptible spammers. Because τ depends on the
reliabilityparameterτ;andonthebottom,thesecondcategory random agreement rate γ, we computed τ’s using 10 values
YEetal.:PROBABILISTICMULTIGRAPHMODELING... 9
0.025 0.6 0.20
Annotated Spammers Annotated Spammers Annotated Spammers
Other Subjects 0.5 Other Subjects Other Subjects
0.020
0.15
Relative Density00..001105 Relative Density000...234 Relative Density0.10
0.05
0.005
0.1
0.000 0.0 0.00
0 100 200 300 400 500 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16
Counts Image Viewing Duration (sec) Survey Duration (sec)
Figure6.Normalizedhistogramofbasicstatisticsincludingtotalnumberoftaskscompletedandaveragetimedurationspentateachofthetwostages
pertask.
τi αi βi reportedemotions(sorted)
0.19 1.17 2.43
0.08 0.75 2.20
0.08 1.16 2.50
0.09 0.67 1.70
0.03 0.94 1.90
0.17 0.72 1.47
0.06 1.14 2.50
0.17 0.86 1.79
0.04 1.01 2.63
0.03 1.08 2.84
0.92 2.29 1.49
0.94 2.55 1.98
0.95 2.61 1.68
0.92 2.40 1.66
0.91 2.21 1.40
0.92 2.45 1.97
0.93 2.38 1.69
0.93 1.76 1.40
0.91 2.44 1.86
0.92 2.30 1.85
0.92 2.45 1.82
0.91 1.64 1.29
0.90 1.68 1.12
0.91 2.72 2.22
Table2
OraclesintheAMTdataset.Upper:maliciousoracleswhoseαi/βiisamongthelowest30,meanwhile|∆i|isgreaterthan10.Lower:reliableoracles
whoseτiisamongthetop30,meanwhileαi/βi>1.2.TheirreportedemotionsarevisualizedbyRGBcolors.TheestimatesofΘisbasedonthe
valencedimension.
10 IEEETRANSACTIONSONAFFECTIVECOMPUTING,,VOL.1,NO.1,JANUARY2016
000000 time duration method, used in the practice of AMT host, is
1.1.1.1.1.1.
better than the Dawin-Skene method, yet substantially worse
thantheperformanceofourmethod.
888888
0.0.0.0.0.0. GGGLLLBBBAAA aaavvveeerrr... We also tested this same method of identifying spammers
GGGLLLBBBAAA γγγ===000...333
GGGGGGLLLLLLBBBBBBAAAAAA γγγγγγ======000000......343434747474 using affective dimensions other than valence. As shown in
DDDSSS
nnnnnn 0.60.60.60.60.60.6 DDDuuurrraaatttiiiooonnn Fig. 8, the two most discerning dimensions were valence and
oooooo
ecisiecisiecisiecisiecisiecisi arousal. It is not surprising that people can reach relatively
PrPrPrPrPrPr 444444 higherconsensuswhenratingimagesbythesetwodimensions
0.0.0.0.0.0.
than by dominance or likeness. Dominance is much more
likely to draw on evidence from context and social situation
222222
0.0.0.0.0.0. in most circumstances and hence less likely to have its nature
determinedtoalargerextentbythestimulusitself.
000000
0.0.0.0.0.0.
000000......000000 000000......222222 000000......444444 000000......666666 000000......888888 111111......000000 3.6 RecallPerformanceinRetrievingtheSimulatedSpam-
mers
RRRRRReeeeeeccccccaaaaaallllllllllll
Theevaluationoftop-Kprecisionwaslimitedintworespects:
(1) the susceptible subjects were identified because we could
Figure7.TheagnosticPrecision-Recallcurve(byvalence)basedonman-
clearly observe their abnormality in terms of the multivariate
ually annotated spammers. The top 20, top 40 and top 60 precision is
100%,95%,78%respectively(blackline).Itisexpectedthatprecisiondrops distribution of provided labels. If the participant labeled the
quickly with increasing recalls, because the manually annotation process data by acting exactly the same as the distribution of the
canonlyidentifyaspecialtypeofspammers,whileothertypesofspammers
population,wecouldnotmanuallyidentifyhim/herusingthe
can be identified by the algorithm. The PR curves at γ = 0.3,0.37,0.44
arealsoplotted.Twobaselinesarecompared:theDawidandSkene(DS) aforementionedmethodology.(2)Westillneedtodetermineif
approachandthetimedurationbasedapproach. oneisaspammer,howlikelywearetospothim/her.
In this study, we simulated several highly “intelligent”
0000 spammers, who labeled the data by exactly following the
1.1.1.1.
label distribution of the whole population. Every time, we
valence
0.80.80.80.8 adroomuisnaalnce gTehneelraabteedls 1o0f ssipmaumlamteedrss,pwamhomrearnsdwoemrelynolatboevleedrla5p0piinmga.gWese.
likeness
onononon 0.60.60.60.6 mixedthoselabelsofthesimulatedspammerswiththeexisting
sisisisi data set, and then conducted our method again to determine
cicicici
PrePrePrePre 0.40.40.40.4 hsiomwulaactecdurastpeamoumrearps.prWoaechrepweaastewditthhisrepsproecctestso1fi0ndtiimngesthine
ordertoestimatetheτ distributionofthesimulatedspammers.
2222
0.0.0.0. Results are reported Fig. 9. We drew the histogram of the
estimatedreliabilityofallrealworkersandcomparedthemto
0000
0.0.0.0. theestimatedreliabilityofsimulatedspammers(inthetablein-
cludedinFig.9).Wenotedthatmorethanhalfofthesimulated
0000....0000 0000....2222 0000....4444 0000....6666 0000....8888 1111....0000
spammerswereidentifiedashighlysusceptiblebasedontheτ
RRRReeeeccccaaaallllllll estimation ( 0.2), and none of them were supposed to have
≤
a high reliability score ( 0.6). This result validates that our
≥
method is robust enough to spot the “intelligent” spammers,
Figure8.TheagnosticPrecision-Recallcurvebasedonmanuallyannotated
even if they disguise themselves as random labelers within a
spammerscomputedfromdifferentaffectivedimensions:valence,arousal,
dominance,andlikeness. population.
3.7 Qualitative Comparison Based on Controversial Ex-
of γ evenly spaced out over interval [0.3,0.48]. The average
amples
value of τ was then used for ranking. The Precision Recall
To re-rank the emotion dimensions and likenesses of stimuli
Curves are shown in Fig. 7. Our method achieves high top-
withthereliabilityofthesubjectaccountedfor,weadoptedthe
K precision by retrieving the most susceptible subjects from
the pool according to the average τ. In particular, the top- following formula to find the stimuli with “reliably” highest
20 precision is 100%, the top-40 precision is 95%, and the ratings.Assumeeachratingai [0,1].Wedefinethefollowing
∈
top-60 precision is 78%. Clearly, our algorithm has yielded toreplacetheusualaverage:
rrsbeeuysssuuficllexttpisnatigwtbγγleeltl=oona0le0i.sg3..3,nI70en.d3iFs7iw,gb0.eit.t7h4t,e4wrtahetnehdaahlnusuosmtiphnaelgonttohtPjuherdeecrcogirmtsrwieoesonnptaoRconerndcoaisntlslhgCerτeucm.raTvolhlessest, bk :=(cid:124)(cid:80)(cid:80)i∈Ωi∈(cid:123)kΩ(cid:122)τkiaτi(ik)(cid:125)·(cid:124) 1−i∈(cid:89)Ω(cid:123)k(cid:122)(1−τi)(cid:125), (22)
est.score
indicating that a proper level of the random agreement rate confidence
(cid:0) (cid:81) (cid:1)
canbeimportantforachievingthebestperformance.Thetwo where 1− i∈Ωk(1−τi) ∈ [0,1] is the cumulative confidence
baseline methods are clearly not competitive in this evalua- scoreforimagek.Thisadjustedratingbk notonlyallowsmore
tion. The Dawin-Skene method [9], widely used in processing reliablesubjectstoplayabiggerroleviatheweightedaverage
crowdsourced data with objective ground truth labels, drops (thefirsttermoftheproduct)butalsomodulatestheweighted
quicklytoaremarkablylowprecisionevenatalowrecall.The average by the cumulative confidence score for the image.