
Clicktionary: A Web-based Game for Exploring the Atoms of Object Recognition

D. Linsley*, S. Eberhardt, T. Sharma, P. Gupta, T. Serre
Cognitive, Linguistic & Psychological Sciences Department, Brown University, Providence, RI, USA
*drew [email protected]

Abstract

Understanding what visual features and representations contribute to human object recognition may provide scaffolding for more effective artificial vision systems. While recent advances in Deep Convolutional Networks (DCNs) have led to systems approaching human accuracy, it is unclear if they leverage the same visual features as humans for object recognition. We introduce Clicktionary, a competitive web-based game for discovering features that humans use for object recognition: One participant from a pair sequentially reveals parts of an object in an image until the other correctly identifies its category. Scoring image regions according to their proximity to correct recognition yields maps of visual feature importance for individual images. We find that these "realization" maps exhibit only weak correlation with relevance maps derived from DCNs or image salience algorithms. Cueing DCNs to attend to features emphasized by these maps improves their object recognition accuracy. Our results thus suggest that realization maps identify visual features that humans deem important for object recognition but are not adequately captured by DCNs. To rectify this shortcoming, we propose a novel web-based application for acquiring realization maps at scale, with the aim of improving the state of the art in object recognition.

Figure 1. Humans and DCNs utilize different visual features for object recognition: "Realization" maps are derived from human observers playing Clicktionary, in which regions are scored according to their relevance for correct recognition and compared with feature importance maps obtained for DCNs. [The panels show an example "Glasses" image alongside its human realization map and DCN importance map.]

1. Introduction

Advances in Deep Convolutional Networks (DCNs) have led to systems rivaling human accuracy in basic object recognition tasks [9]. While a growing body of work indicates that this surge in performance carries concomitant improvement in fitting both neural data in higher areas of the primate visual cortex (reviewed in [35]) and human psychophysical data during object recognition [14], key differences remain. It has been suggested that the processing depth achieved by our visual system arises through cortical feedback rather than through static processing stages [6]. It has also been shown that DCNs do not generalize well to atypical scenes, such as when objects are presented outside of their usual context [6, 24] or when features strongly diagnostic for recognition are absent [31]. This raises the possibility that DCNs may leverage entirely different visual strategies than humans during object recognition.

The most direct evidence for qualitatively different visual strategies used by humans and DCNs was demonstrated recently [31]. The authors located patches in several object images where human recognition was correct, but any further reduction of these patches (i.e., by cropping or down-sampling) led to incorrect responses. These were called the minimally reducible configurations of the images for recognition, or MIRCs. Computer vision algorithms including DCNs failed to exhibit the same dramatic all-or-nothing recognition of MIRCs as human participants. In other words, the object features visible in MIRCs were more diagnostic of object category for humans than for DCNs. Although promising, this approach required about 14,000 participants to identify MIRCs for 10 images, which in total yielded fewer than 200 MIRCs at varying resolutions and sizes. This makes it difficult to identify critical visual features in object images and seriously limits the applicability of this method for gathering enough data for a more systematic analysis over many image exemplars and object categories.
To address these limitations, we created Clicktionary, a collaborative web-based game for identifying visual features that are necessary for object recognition. Pairs of human participants work together to identify objects: One player reveals diagnostic image regions while the other tries to recognize the object as quickly as possible. Amassing game-play data across many participants yields importance maps for individual images, in which each pixel is scored according to its contribution towards "realizing" an object's category. This is illustrated in the middle panel of Fig. 1: the hotter the pixel, the more likely it caused recognition.

Here we compare the visual features used by humans, DCNs, and other computational models of vision. Using realization maps from Clicktionary we show that: (1) Realization maps derived from humans during object recognition are dissimilar to importance maps derived from DCNs and salience maps derived from attention models. (2) Using realization maps to guide DCNs towards object features that are diagnostic for humans leads to a small but significant improvement in recognition accuracy. (3) Motivated by these findings, we introduce clickme.ai, a web-based application for gathering realization maps at scale and measuring their impact on DCN performance in real time.

2. Related Work

Behavioral studies: A central goal in vision science is to understand what features the human visual system uses to process complex visual scenes. Classic approaches to uncover the internal representations contributing to behavior include reverse correlation methods. These generally involve analyzing the statistical relationship between recognition and visual perturbations applied at different locations in a stimulus across many trials. While these methods have helped identify visual features that are diagnostic for faces and other synthetic object stimuli [26, 20], they typically require thousands of trials per subject to deduce these representations despite little shape variability in the studied object classes. Thus, they seem impractical for characterizing the visual representations used for categorizing general classes of objects with higher variability in stimulus appearance, location, or lighting.

Another way of exploring visual feature importance is by recording eye fixations. Patterns of eye fixations represent observers' efforts to center their high-acuity fovea on salient or task-relevant information, and capture diagnostic features in images [36, 8, 12, 13]. However, it remains difficult and costly to acquire large-scale eye-tracking data, leading researchers instead to track computer mouse movements during task-free viewing of images to estimate local saliency cues for fixations [18, 11].

Cognitive psychologists have classically used similarity judgments between image pairs as the gateway to studying visual representations. Recent work has compared similarity judgments derived from humans and from representative DCNs and found good agreement between the two [22]. Related work has also evaluated the ability of DCNs to predict the memorability [5] and typicality [17] of individual images.
Computational models: A growing body of research focuses on understanding the nature of the visual features used by DCNs for object recognition. These methods fall into one of two groups. Sensitivity analyses use either gradient-based approaches [37, 28] or systematic perturbations of the stimulus to estimate local pixel-wise contributions of visual features to a classification decision [38]. Decision analyses such as layer-wise relevance propagation (LRP) estimate global pixel responsibility for the classification decision. Representative methods from both approaches are used here to derive importance maps from DCNs for comparison with those derived from humans.

Web-based games for data collection: There is a long history of leveraging the wisdom of the crowd through web-based applications to gather high-quality data for computer vision studies. Closely related to our proposed Clicktionary game are the ESP game for identifying objects in real-world images [33] and the Peek-a-boom game for locating them [34]. In both of these games, participants work together to recognize an image of an object. We take particular inspiration from the mechanics of Peek-a-boom, where one participant in a pair reveals parts of an image to get the other participant to respond in some way about it. However, the mechanics of Peek-a-boom and Clicktionary differ in several key ways: (1) Because the goal of Peek-a-boom was to localize objects in images, its interface provided a large cursor for unveiling regions of low-resolution images. A higher-resolution interface was needed to study the importance of local visual features for object recognition. (2) Peek-a-boom players solve problems by sending each other visual "hints" about what should be located, and rate the accuracy of each other's guesses. This is likely to confound any interpretation of the results in terms of the diagnosticity of visual features. (3) More generally, there were no constraints on how Peek-a-boom players revealed images. This meant that players could adopt strategies that would interfere with the goals of the current study, such as "salt and peppering" the screen with clicks or waiting long intervals between clicks.

Other web-based games have used interfaces similar to Clicktionary's to explore various questions about human perception. These include a game for annotating image properties, in which players take turns outlining important objects in real-world scenes and guessing their identities [30]. Similarly, visual question answering (VQA) was explored in a game where participants de-blurred parts of a scene image that were important for answering a question about it [4]. The game style has also been used to answer questions in the biological and physical sciences, such as predicting protein structure [2] or neuronal connectivity [15].

Human-in-the-loop computing: Despite rapid advances in machine performance on vision tasks, humans remain the gold standard, particularly when making decisions on images depicting atypical views or containing many occlusions. Researchers have found that human perceptual judgments in cases like these can augment the performance of models for vision. This has led to significant gains in face recognition and localization [25], action recognition [32], and object detection and segmentation [21, 27].

3. Identifying features for object recognition

Clicktionary: Clicktionary was constructed to support the identification of visual features that cause recognition decisions in humans. Upon starting the game, players provided informed consent, read instructions, and were placed into a virtual waiting room. The waiting room contained a scoreboard listing the performance of the most successful teams to play the game. Our hope was to provide an incentive for players to compete with each other in order to collect the highest quality behavioral data possible. Participants were automatically paired in the waiting room (if no one else entered the waiting room within 120 seconds, players played against a DCN opponent; these data were not included in the current study). Once the game began, players collaborated over a series of rounds to name the category of an object image (Fig. 2).

Figure 2. Overview of Clicktionary. Pairs of participants, one teacher and one student, played together to categorize objects. The teacher used the mouse to reveal image regions while the student performed a visual categorization task. Scoring revealed regions according to their temporal proximity to correct recognition yielded per-participant "realization maps". Averaging across participants yielded realization maps that identified visual features causing recognition. [The screenshot shows the teacher and student views, with pace indicators ("Your pace" vs. "Best pace") and real-time status messages such as "Your partner is typing…" and "Your partner is clicking…".]
Players accomplished this by alternating between two different roles over many game rounds. In each round, one player was the student and the other the teacher. The job of the teacher was to reveal the areas of an image that were most informative for visual categorization. Teachers did this by clicking on the image and dragging their mouse, like a paint brush, to reveal its informative parts. Images were 300x300 pixels, and each revealed image patch was 18x18 pixels; these patches are hereafter referred to as "bubbles". Teachers were instructed to choose their first click carefully because subsequent bubbles were placed automatically at random intervals between 50 ms and 300 ms. The center of each bubble was constrained to lie within the radius of the previous one. Translucent blue boxes on the teacher's image marked the image regions visible to the student.
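To make these reveal dynamics concrete before turning to the student's side of a round, here is a minimal Python sketch of the automatic bubble placement under our reading of the description above; the function names and the use of a random offset (where the game follows the teacher's cursor) are our own illustration, not the Clicktionary implementation.

    import numpy as np

    IMAGE_SIZE = 300   # images are 300x300 pixels
    BUBBLE_SIZE = 18   # each revealed patch ("bubble") is 18x18 pixels
    RADIUS = BUBBLE_SIZE / 2.0

    def next_bubble_center(prev_center, rng):
        """Sample the next bubble center within the radius of the previous bubble.

        The game follows the teacher's cursor subject to this constraint; here a
        random offset stands in for the cursor position.
        """
        angle = rng.uniform(0.0, 2.0 * np.pi)
        dist = rng.uniform(0.0, RADIUS)
        x = float(np.clip(prev_center[0] + dist * np.cos(angle), 0, IMAGE_SIZE - 1))
        y = float(np.clip(prev_center[1] + dist * np.sin(angle), 0, IMAGE_SIZE - 1))
        return (x, y)

    def bubble_trail(first_click, n_bubbles, seed=0):
        """Generate a trail of bubble centers starting at the teacher's first click.

        In the game, each bubble after the first is placed automatically after a
        random 50-300 ms delay.
        """
        rng = np.random.default_rng(seed)
        centers = [first_click]
        delays_ms = rng.uniform(50, 300, size=max(n_bubbles - 1, 0))
        for _ in range(len(delays_ms)):
            centers.append(next_bubble_center(centers[-1], rng))
        return centers, delays_ms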
In parallel, the student began each round viewing a blacked-out image and a text box requesting a guess of the category of the object in the image. As the teacher bubbled in image regions, the corresponding locations of the blacked-out image were unveiled. Students were instructed to name the basic-level category of the object. For instance, the desired response for an image of a border collie was "Dog". However, we also accepted subordinate-level category labels in order to expedite game play.

To give players an incentive to work as quickly and efficiently as possible, their team's performance versus the average performance of the top-10 teams was visible throughout the game. Team performance was measured as the number of image bubbles placed by teachers before students recognized the image. Although there was no explicit penalty for wrong answers, participants achieved better scores by avoiding them. If they finished the game in the top-10 they were congratulated and shown their ranking.

Incorrect guesses caused a red outline to appear around the image viewed by both student and teacher. Following a correct guess, the images were briefly outlined in green before participants began the next round. If the student could not figure out the object's class, he or she pressed a skip button that penalized the team's performance with the equivalent of 100 bubbles. Student and teacher switched roles after each round. The game was played for 110 rounds, with a different image each round, and each pair played on a random ordering of images.

Although participants could not communicate, we included features to make the game feel more collaborative. These were real-time notifications of what each player in the pair was doing at any point in time: clicking, typing, giving correct or incorrect responses, or thinking about what part of the image to reveal first.

We ran two versions of the Clicktionary game, referred to hereafter as Experiment 1 and Experiment 2. Both experiments included the ten images used in [31], for which we also had the MIRCs graciously provided by the authors. Beyond these, each experiment had participants judge a different set of 100 object images taken from the validation set of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC; [23]). Images for each experiment were selected from 10 categories, 5 animate and 5 inanimate. For Experiment 1, we chose representative categories for animate objects: border collie, bald eagle, great white shark, and sorrel; and for inanimate objects: airliner, school bus, speedboat, sports car, and trailer truck. For Experiment 2, we chose the 5 animate and 5 inanimate categories that were the most difficult for VGG16 [29] to categorize. These were English foxhound, husky, miniature poodle, night snake, and polecat; and cassette player, missile, screen, sunglasses, and water jug.

Games lasted about 20 minutes and participants were only allowed to play once. A total of 46 participants took part in Experiment 1, and 14 participants took part in Experiment 2. Participants were recruited through Amazon Mechanical Turk or from introductory cognitive science classes, and were reimbursed approximately $8.00/hr.

We created realization maps through a two-step procedure. First, the click path recorded between each pair of participants for an image was scored to emphasize pixels closer to recognition: realization(x, y) = i / j, where the realization score at a bubble centered at (x, y) is the current bubble index, i, divided by the total number, j, of bubbles revealed by the participant pair on that image. We used this formulation because it favors visual features that cause recognition: bubbles that come right before recognition receive higher scores than those that do not, because we consider these features to reflect a tipping point in recognition. Our score is also similar in spirit to the image features captured in MIRCs [31]. Second, these per-pair maps were averaged across participant pairs to yield a single realization map per image (Fig. 2).
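As an illustration of this scoring rule, the following sketch (ours, not the authors' code) converts a recorded click path into a per-pair realization map and averages maps across pairs; the input format, the maximum rule for overlapping bubbles, and the function names are assumptions based on the description above.

    import numpy as np

    def realization_map(bubble_centers, image_size=300, bubble_size=18):
        """Score one pair's click path: the i-th of j bubbles gets score i / j,
        so bubbles revealed closer to the moment of recognition weigh more.

        How overlapping bubbles combine is not specified in the text; keeping the
        maximum score per pixel is our assumption.
        """
        score_map = np.zeros((image_size, image_size), dtype=np.float32)
        half = bubble_size // 2
        j = len(bubble_centers)
        for i, (x, y) in enumerate(bubble_centers, start=1):
            y0, y1 = max(0, int(y) - half), min(image_size, int(y) + half)
            x0, x1 = max(0, int(x) - half), min(image_size, int(x) + half)
            score_map[y0:y1, x0:x1] = np.maximum(score_map[y0:y1, x0:x1], i / j)
        return score_map

    def mean_realization_map(click_paths, **kwargs):
        """Average the per-pair maps into one realization map for an image."""
        return np.mean([realization_map(p, **kwargs) for p in click_paths], axis=0)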
Minimally reducible image configurations: Another source of information regarding important object features can be inferred from the results reported in [31]. The MIRCs recovered for 10 images contain features that were critical for recognizing the objects those images contained. We converted MIRCs into heatmaps to support a systematic comparison with other measures of feature importance derived from both human participants and DCNs. This involved placing a 2D Gaussian over the location where each MIRC was discovered. The height of each MIRC Gaussian was the inverse of the ratio between its pixel width and the pixel width of the smallest MIRC for that image. In other words, MIRCs spanning large areas made smaller contributions to the MIRC heatmaps than MIRCs spanning small areas. This scoring system was adopted to maximize our chances of identifying the visual features contributing to MIRCs. Each pixel in a MIRC heatmap was computed as the mean intensity value across these Gaussians. One of the images used in [31] depicted an image category not in ILSVRC 2012 and was excluded from analysis.
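The sketch below illustrates this conversion as we read it; the MIRC input format and the Gaussian spread (sigma set to half the MIRC width) are assumptions, since the text specifies only the Gaussian heights and the final averaging.

    import numpy as np

    def mirc_heatmap(mircs, image_size=300):
        """Convert the MIRCs found for one image into a heatmap.

        `mircs` is a list of (center_x, center_y, width_px) tuples. Each MIRC
        contributes a 2D Gaussian whose height is the smallest MIRC width divided
        by that MIRC's width, so small MIRCs weigh more; the heatmap is the mean
        of these Gaussians. Setting sigma to half the MIRC width is an assumption.
        """
        ys, xs = np.mgrid[0:image_size, 0:image_size]
        min_width = min(w for _, _, w in mircs)
        gaussians = []
        for cx, cy, w in mircs:
            height = min_width / w       # inverse of the width ratio
            sigma = w / 2.0              # assumed spatial spread
            g = height * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
            gaussians.append(g)
        return np.mean(gaussians, axis=0)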
Models of object recognition: The primary objective of this study is to compare importance maps derived from human participants and from DCNs. In support of this we produced DCN heatmaps of object importance for each of the images used in Clicktionary. This was done using VGG16, a variant of the popular VGG architecture [29]. We calculated heatmaps for this model using a sensitivity analysis, as done in [38], and a decision analysis (LRP) [1].

Models of attention: As controls, we considered the extent to which realization maps could be explained by different models of attention. Theory holds that attention is driven by both bottom-up and top-down mechanisms [10]. We therefore compared realization maps to importance maps extracted from models of both kinds of attention, referred to hereafter as bottom-up saliency [10] and DeepGaze II [16].

4. Experiments

Realization maps: We systematically compared the features used by humans to categorize representative objects to those used by DCNs. We use rank-order correlation throughout to measure these associations because realization maps are sparse and do not satisfy normality assumptions (no click map passed a Kolmogorov-Smirnov test for normality). All tests for significance used independent-sample t-tests with two-tailed p-values.

We measured the inter-subject and inter-experiment consistency of the realization maps extracted from the Clicktionary game. Across the two experiments, we observed that these realization maps are strongly stereotyped. Inter-subject consistency was computed using a randomization test (n = 1000) that correlated realization maps derived from participants randomly assigned to halves. There was a correlation of ρ = 0.79 (p < 0.001) in Experiment 1 and ρ = 0.60 (p < 0.001) in Experiment 2. We also measured consistency between participants in Experiment 1 and Experiment 2 on the MIRC images that both groups saw. Again, there was a strong correspondence between realization maps derived from these images for participants across the two experiments (ρ = 0.55, p < 0.001). Motivated by this agreement, all analyses hereafter use a pool of subjects from both experiments unless otherwise noted.
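A minimal sketch of such a split-half randomization test is shown below, assuming per-pair realization maps are stacked into one array; applying scipy's Spearman correlation to flattened, averaged maps is our interpretation of the rank-order correlation described above.

    import numpy as np
    from scipy.stats import spearmanr

    def split_half_consistency(pair_maps, n_iter=1000, seed=0):
        """Randomization test of inter-subject consistency.

        `pair_maps` has shape (n_pairs, H, W): one realization map per participant
        pair for the same image (or image set). On each iteration the pairs are
        split at random into halves, each half is averaged, and the two mean maps
        are compared with a rank-order (Spearman) correlation.
        """
        rng = np.random.default_rng(seed)
        n = len(pair_maps)
        rhos = []
        for _ in range(n_iter):
            order = rng.permutation(n)
            half_a = pair_maps[order[: n // 2]].mean(axis=0).ravel()
            half_b = pair_maps[order[n // 2:]].mean(axis=0).ravel()
            rho, _ = spearmanr(half_a, half_b)
            rhos.append(rho)
        return float(np.mean(rhos))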
Despite the strong agreement we observed between Clicktionary participants, we found that realization maps were affected by player performance. Applying a median split to the number of bubbles it took before the student in a pair recognized an image revealed two qualitatively different realization maps (Fig. 4). Maps from the efficient group (i.e., number of bubbles below the median) were sparser than those from the less efficient group. More work is needed to understand whether these differences indicate increased noise in the latter group or belie different sets of visual features selected by each.

Figure 4. Realization map features vary according to how efficiently a participant pair recognized images. On the left, the mean realization map from pairs with above-median efficiency in recognizing glasses (i.e., faster recognition). On the right, the mean realization map from below-median pairs. The image shown is representative of the typical differences between these groups.

Humans versus machines: We separately compared importance maps from humans, DCNs, and attention models on images taken from ImageNet and on the MIRC images. Strikingly, there was only a weak albeit significant relationship between realization maps and LRP derived from VGG16 (ρ = 0.33, p < 0.001; Fig. 5). By contrast, realization maps were not correlated with a sensitivity analysis performed on VGG16 (ρ = 0.03, n.s.). Realization maps were also not correlated with bottom-up salience (ρ = 0.00, n.s.) or with eye fixations predicted by DeepGaze II (ρ = 0.03, n.s.; Fig. 5; also see the supplementary information for representative heatmaps derived from human participants in Experiments 1 and 2).

Figure 5. Correlations between human realization maps and feature importance maps derived from vision models. Each dot represents the mean correlation for an individual image category. Independent-sample t-tests measured deviation from 0. ***: p < 0.001.

Figure 3. Heatmaps depicting object feature importance for humans and machines. From left to right: realization maps on images from Experiments 1 and 2 (the black box was added for emphasis), layer-wise relevance propagation (LRP) maps from VGG16, sensitivity analysis on VGG16, bottom-up salience, predicted salience from DeepGaze II, and MIRCs. [Rows show example images: Glasses, Bike, Tie, and Bug; heatmaps share a normalized score colorbar from 0 to 1.]

As a follow-up, we explored the similarity of human and machine feature importance maps on the images used to identify MIRCs [31]. This allowed us to include MIRCs in the comparison and understand their overlap with the feature importance maps discussed here. We found that DCNs, attention models, and realization maps qualitatively share at least some overlap with the regions occupied by MIRCs (Fig. 3), although the correlation was weak (Fig. 6; correlations depicted in the first column). As with the ImageNet images, we found a significant, albeit weak, correspondence between realization maps and LRP (ρ = 0.32, p < 0.001; see Fig. 6 for details on each experiment). However, neither the sensitivity analysis nor the attention models were correlated with realization maps, and LRP was not correlated with the sensitivity analysis (Fig. 6).

Figure 6. Mean pairwise correlations between realization maps derived from humans and DCNs using images from [31]. Cells are colored according to the strength and direction of association between maps. The rank-order correlations are:

                             MIRC    Real. Exp. 1   Real. Exp. 2   VGG16 LRP   VGG16 sens.   Bottom-up
    Realization maps Exp. 1  -0.01
    Realization maps Exp. 2   0.09      0.55***
    VGG16 LRP                 0.08      0.28*          0.36***
    VGG16 sensitivity        -0.03      0.03           0.03          0.05
    Bottom-up salience        0.00     -0.01          -0.02         -0.01        0.00
    DeepGaze II               0.05      0.04           0.03          0.01        0.00          0.00

These findings support our key assertion: the visual strategies used by humans and DCNs during object recognition are not aligned. Realization maps capture mostly distinct information from DCNs, as measured by either LRP or a sensitivity analysis. There were even more pronounced differences between realization maps and the feature importance predicted by attention algorithms (the algorithm for predicting bottom-up salience maps was tuned to have qualitatively similar sparsity to the realization maps). Given the motivation for our construction of realization maps, we expected a strong relationship with MIRCs. While the absence of this relationship may be due to the coarseness of MIRC heatmaps or the small sample size used, the current evidence does not support realization maps as an alternate method for acquiring MIRCs.

Driving DCNs towards human realization: What explains the weak overlap between importance maps derived from human participants and those derived from DCNs? It could reflect basic mechanistic differences between humans and DCNs, such as a different number of processing layers or the presence or absence of top-down signals [6]. Another possibility is that it is indicative of biases and other limitations associated with the datasets and routines used for training DCNs. If this is the case, one would expect that cueing DCNs to attend to features that are diagnostic for human object recognition would improve object classification performance.

We tested this hypothesis on VGG16 and VGG19 top-1 classification accuracy on the ImageNet images used in Clicktionary. We chose top-1 classification (i.e., where classifier chance ≈ 1/1000) because VGG has not yet reached ceiling accuracy there (unlike top-5 classification). We compared the use of realization maps derived from human participants against salience maps derived from representative attention algorithms for cueing DCNs to attend to relevant object features during inference. In each case, DCN activity for an image was modulated with a heatmap for that image: Z ⊙ a, where Z is each model's activity after the first layer of convolutions, and a is a heatmap scaled to the interval [1, 2] and resized with bilinear interpolation to match the height and width of Z. Each model's top-1 accuracy was recorded after modulation. Rescaling heatmaps was crucial, as it ensured amplification of important features instead of attenuation, which can lead to artifacts in the activations that are pathologically interpreted by downstream layers.
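To make the modulation Z ⊙ a concrete, here is a minimal numpy sketch of this operation as described above; the activation layout (channels, height, width) and the function names are our own, and a framework implementation would apply this immediately after the first convolutional layer.

    import numpy as np

    def rescale_heatmap(heatmap, out_h, out_w):
        """Scale a heatmap to [1, 2] and bilinearly resize it to (out_h, out_w).

        Values of at least 1 guarantee that emphasized features are amplified
        rather than the remainder being attenuated, as described in the text.
        """
        span = heatmap.max() - heatmap.min()
        h = (heatmap - heatmap.min()) / (span + 1e-8) + 1.0   # -> [1, 2]
        ys = np.linspace(0, h.shape[0] - 1, out_h)
        xs = np.linspace(0, h.shape[1] - 1, out_w)
        y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
        y1 = np.minimum(y0 + 1, h.shape[0] - 1)
        x1 = np.minimum(x0 + 1, h.shape[1] - 1)
        wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
        top = h[y0][:, x0] * (1 - wx) + h[y0][:, x1] * wx
        bot = h[y1][:, x0] * (1 - wx) + h[y1][:, x1] * wx
        return top * (1 - wy) + bot * wy

    def modulate_activity(Z, heatmap):
        """Elementwise modulation Z * a, broadcast over feature channels.

        Z is the network's activity after the first layer of convolutions, laid
        out here as (channels, H, W).
        """
        a = rescale_heatmap(heatmap, Z.shape[1], Z.shape[2])
        return Z * a[None, :, :]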
Figure 7. Modulating VGG16 and VGG19 activity with feature importance maps. (A) Realization maps and predicted fixations from DeepGaze II, but not salience maps, improve VGG16 classification accuracy. A similar result was observed for VGG19. Note that here, 0% means no difference between an importance-map-modulated VGG and its baseline performance. Asterisks above bars are tests of classification accuracy improvement over baseline across Exp. 1 and Exp. 2. Asterisks attached to an arrow are pairwise comparisons. Error bars are S.E.M. *: p < 0.05; **: p < 0.01; ***: p < 0.001. (B) Realization map augmentation of VGG16 correlates with the association between its LRPs and realization maps.

We found that modulating VGG16 and VGG19 activity with realization maps yielded a small but statistically significant improvement in top-1 accuracy. VGG16 modulated with realization maps was on average 3.49 percentage points better in top-1 accuracy than baseline VGG16 (Fig. 7A; left panel, realization map deviation from 0: p = 0.025). Similarly, VGG19 was on average 4.00 percentage points better than baseline VGG19 (Fig. 7A; right panel, realization map deviation from 0: p = 0.004). Note that there was a larger improvement for images in Experiment 2 than in Experiment 1, likely reflecting that Experiment 2's images were more difficult to recognize, as expected.

These gains mean that DCNs benefit from focusing on features that trigger object recognition in humans. Supporting this, the gains in accuracy we observed for each object category were associated with the Fisher-transformed correlation coefficients between realization maps and LRPs (ρ = 0.41, p = 0.069; Fig. 7B). This means that in our current implementation DCNs received the greatest benefit from realization maps when they were already attuned to the features emphasized by those maps.
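The category-level analysis can be sketched as follows; the input containers and names are hypothetical, and applying a Fisher z-transform before averaging within a category is our reading of "Fisher-transformed correlation coefficients".

    import numpy as np
    from scipy.stats import spearmanr

    def per_category_association(rho_by_category, gain_by_category):
        """Relate per-category accuracy gains to realization-LRP agreement.

        `rho_by_category` maps a category to the per-image Spearman correlations
        between its realization maps and LRP maps; `gain_by_category` maps the
        same categories to their change in top-1 accuracy under realization-map
        modulation.
        """
        cats = sorted(rho_by_category)
        # Fisher z-transform stabilizes correlation coefficients before averaging
        # within a category.
        mean_z = [float(np.mean(np.arctanh(rho_by_category[c]))) for c in cats]
        gains = [gain_by_category[c] for c in cats]
        return spearmanr(mean_z, gains)   # rank-order association across categories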
The change in DCN accuracy was not significantly different from 0 when models were modulated with bottom-up salience. VGG19 modulated with bottom-up salience was also significantly less accurate than when it was modulated with realization maps or with predicted fixations from DeepGaze II (Fig. 7A; right panel, realization map deviation from bottom-up salience: p = 0.023; DeepGaze II deviation from bottom-up salience: p = 0.026). Although fixations predicted by DeepGaze II led to a similar improvement in classification accuracy as realization maps, these gains were mostly accrued on different object categories: there was weak overlap between the categories improved by realization maps and those improved by DeepGaze II (ρ = 0.36, p = 0.022). We believe that future work incorporating feature maps into training (instead of only during inference, as we do here) can provide more clarity on these differences. Specifically, realization maps are qualitatively sparser than DeepGaze II maps, which could provide important gains in training efficiency.

5. Discussion

Clicktionary is a novel approach to estimating feature importance maps derived from humans. The proposed method overcomes some of the main limitations of existing psychophysical methods, including reverse correlation and other image classification methods (see [19] for a review). We have described what is, to our knowledge, the first systematic study of feature importance maps derived from human participants using natural images over multiple object categories.

We found little or no overlap between realization maps and the image salience maps predicted by attention models. This suggests that realization maps reflect computational mechanisms that are somewhat distinct from those that guide attention and eye movements. It is important to note that realization maps are of course generated through a process that necessarily depends on attention: teachers' bubbles reflect a continuous ranking of the importance of image features for recognition. However, this is distinct from typical methods for measuring (or predicting) attention in two ways. First, salience is usually measured passively or for a search task, whereas realization maps take advantage of the teacher's ability to select important features, trading salience for feature diagnosticity. Second, the interplay of teacher and student identifies the point at which visible features become sufficient to trigger recognition. Our current method for visualizing realization maps across participant pairs remains relatively simple, potentially disregarding important information through averaging. We expect that future work on a more nuanced approach for measuring realization maps from Clicktionary will prove useful for further characterizing the visual strategies of both humans and DCNs.

We found a weak association between realization maps and importance maps in VGG16 derived from LRP. This indicates that there exists at least some overlap between the visual features used by humans and DCNs for object categorization. However, the magnitude of this association was around half of what it was between Clicktionary participants, suggesting that the visual representations used by DCNs and humans are still meaningfully different.

Surprisingly, emphasizing realization maps in DCNs during inference nonetheless improved classification accuracy. In addition, the strength of this augmentation was predicted by the amount of overlap between realization maps and LRPs. These results therefore indicate that driving DCNs towards human recognition is beneficial, and that realization maps can support this optimization.

One way of achieving this is to train computer vision models that can predict the location of realization maps in images. Recent work has demonstrated support for this approach: models that leverage DCN features achieve state-of-the-art accuracy in predicting eye fixations [16]. We hypothesize that models of this ilk will have similar success in predicting realization maps. Unfortunately, while the experiments reported here are significantly larger in scale than any others for measuring visual feature sensitivity in humans, they do not yield enough data for training these kinds of models.

To rectify this, we present clickme.ai, a web-based application for gathering realization maps at scale (Fig. 8). Upon arriving at the website, players are presented with an object image and asked to play the role of teacher, using the mouse to place bubbles on its most critical features. The student in this case is a DCN, attempting to classify the object from the bubbled image regions within a time limit. Players compete for prizes to get the DCN to recognize the most images.

Figure 8. A depiction of the interface at clickme.ai for large-scale acquisition of realization maps and continuous measurement of their impact on DCN performance. Participants draw bubbles over important regions in an object image as a DCN tries to recognize it before a time limit elapses. After accruing many of these realization maps, the application trains a model for predicting realization maps in images. These maps are applied to VGG16 during training to drive it towards human perception.

After accruing a significant amount of these human-DCN realization maps, clickme.ai triggers a two-step training sequence. First, a DCN designed for predicting eye fixations is trained to predict the location of participants' bubbles [3]. Second, a VGG16 is cued to hot image regions in the predicted realization maps during training for object recognition. The resulting accuracy is posted to clickme.ai to provide a running tally of the benefit of predicted realization maps on accuracy over a baseline with no modulation. Thus, in contrast to popular adversarial learning methods that have achieved spectacular success in recent years (e.g., [7]), clickme.ai represents an attempt at "cooperative training", in which a human in the loop trains an auxiliary system that will yield more effective models for object recognition.

Overall, Clicktionary makes significant contributions to our understanding of biological vision and reveals a significant gap between feature importance for humans and machines. These findings will inspire new directions in DCN research and narrow the gap between biological and computational vision.

Acknowledgments

The original idea for Clicktionary was suggested by Prof. Todd Zickler (Harvard University). We are also thankful to Matthias Kümmerer for running DeepGaze on our images, and to Danny Harari and Shimon Ullman for providing the MIRC stimuli used in [31]. We are also indebted to Junkyung Kim and Matthew Ricci for comments on the manuscript. This work was supported by an NSF early career award (IIS-1252951) and a DARPA young faculty award (N66001-14-1-4037).
References

[1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10(7):e0130140, 10 July 2015.
[2] S. Cooper, F. Khatib, A. Treuille, J. Barbero, J. Lee, M. Beenen, A. Leaver-Fay, D. Baker, Z. Popović, and F. Players. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756–760, 5 Aug. 2010.
[3] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In 23rd International Conference on Pattern Recognition (ICPR), 2016.
[4] A. Das, H. Agrawal, C. Lawrence Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? 11 June 2016.
[5] R. Dubey, J. Peterson, A. Khosla, M.-H. Yang, and B. Ghanem. What makes an object memorable? In Proceedings of the IEEE International Conference on Computer Vision, pages 1089–1097, 2015.
[6] S. Eberhardt, J. Cader, and T. Serre. How deep is the feature analysis underlying rapid visual categorization? In Neural Information Processing Systems, 2016.
[7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. 20 Dec. 2014.
[8] M. R. Greene, T. Liu, and J. M. Wolfe. Reconsidering Yarbus: a failure to predict observers' task from eye movement patterns. Vision Res., 62:1–8, 1 June 2012.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[10] L. Itti and C. Koch. Computational modelling of visual attention. Nat. Rev. Neurosci., 2(3):194–203, 2001.
[11] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1072–1080, June 2015.
[12] Y. V. Jiang, B.-Y. Won, and K. M. Swallow. First saccadic eye movement reveals persistent attentional guidance by implicit learning. J. Exp. Psychol. Hum. Percept. Perform., 40(3):1161–1173, June 2014.
[13] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. CVPR, 2009.
[14] S. R. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, and T. Masquelier. Humans and deep networks largely agree on which kinds of variation make object recognition harder. 21 Apr. 2016.
[15] J. S. Kim, M. J. Greene, A. Zlateski, K. Lee, M. Richardson, S. C. Turaga, M. Purcaro, M. Balkam, A. Robinson, B. F. Behabadi, M. Campos, W. Denk, H. S. Seung, and EyeWirers. Space-time wiring specificity supports direction selectivity in the retina. Nature, 509(7500):331–336, 15 May 2014.
[16] M. Kümmerer, T. S. A. Wallis, and M. Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. 5 Oct. 2016.
[17] B. M. Lake, W. Zaremba, R. Fergus, and T. M. Gureckis. Deep neural networks predict category typicality ratings for images. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, 2015.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. pages 740–755.
[19] R. F. Murray. Classification images: A review. J. Vis., 11(5):2, 1 Jan. 2011.
[20] K. J. Nielsen, N. K. Logothetis, and G. Rainer. Object features used by humans and monkeys to identify rotated shapes. J. Vis., 8(2):9.1–15, 22 Feb. 2008.
[21] D. P. Papadopoulos, A. D. F. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. pages 361–376.
[22] J. C. Peterson, J. T. Abbott, and T. L. Griffiths. Adapting deep network features to capture psychological representations. 6 Aug. 2016.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 1 Sept. 2014.
[24] B. Saleh, A. Elgammal, and J. Feldman. The role of typicality in object classification: Improving the generalization capacity of convolutional neural networks. 9 Feb. 2016.
[25] W. J. Scheirer, S. E. Anthony, K. Nakayama, and D. D. Cox. Perceptual annotation: Measuring human vision to improve computer vision. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1679–1686, Aug. 2014.
[26] P. G. Schyns, L. Bonnar, and F. Gosselin. Show me the features! Understanding recognition from the use of visual information. Psychol. Sci., 13(5):402–409, Sept. 2002.
[27] K. Shanmuga Vadivel, T. Ngo, M. Eckstein, and B. S. Manjunath. Eye tracking assisted extraction of attentionally important objects from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3241–3250, 2015.
[28] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Intl. Conf. on Learning Representations (ICLR), pages 1–14, 2015.
[30] J. Steggink and C. G. M. Snoek. Adding semantics to image-region annotations with the Name-It-Game. Multimedia Systems, 17(5):367–378, 9 Dec. 2010.
[31] S. Ullman, L. Assif, E. Fetaya, and D. Harari. Atoms of recognition in human and computer vision. pages 1–6, 2016.
[32] E. Vig, M. Dorr, and D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. pages 84–97.
[33] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '04, pages 319–326, New York, NY, USA, 2004. ACM.
[34] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '06, pages 55–64, New York, NY, USA, 2006. ACM.
[35] D. L. K. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci., 19(3):356–365, 23 Feb. 2016.
[36] A. L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. arXiv preprint arXiv:1311.2901, pages 818–833, 2013.
[38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. 22 Dec. 2014.
