Noname manuscript No. (will be inserted by the editor)

Predicting Entry-Level Categories

Vicente Ordonez · Wei Liu · Jia Deng · Yejin Choi · Alexander C. Berg · Tamara L. Berg

Received: date / Accepted: date

Abstract Entry-level categories — the labels people use to name an object — were originally defined and studied by psychologists in the 1970s and 80s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word "naturalness" mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7,404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural, human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.

Keywords Recognition · Categorization · Entry-Level Categories · Psychology

V. Ordonez (corresponding author) · W. Liu · A. C. Berg · T. L. Berg
Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
E-mail: [email protected]

J. Deng
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA

Y. Choi
Computer Science and Engineering, University of Washington, Seattle, WA, USA

Fig. 1 Example translation between a WordNet-based object category prediction (recognition prediction: grampus griseus) and what people might call the depicted object ("What should I call it?": dolphin).

1 Introduction

Computational visual recognition is beginning to work. Although far from solved, algorithms have now advanced to the point where they can recognize or localize thousands of object categories with reasonable accuracy (Deng et al., 2010; Perronnin et al., 2012; Krizhevsky et al., 2012; Dean et al., 2013; Simonyan and Zisserman, 2014; Szegedy et al., 2014). Russakovsky et al. (2014) present an overview of recent advances in classification and localization for up to 1000 object categories. While one could predict any one of many relevant labels for an object, the question of "What should I actually call it?" is becoming important for large-scale visual recognition. For instance, if a classifier were lucky enough to get the example in Figure 1 correct, it might output grampus griseus, while most people would simply call this object a dolphin. We propose to develop categorization systems that are aware of these kinds of human naming choices.
This notion is closely related to ideas of basic and entry-level categories formulated by psychologists such as Eleanor Rosch (Rosch, 1978) and Stephen Kosslyn (Jolicoeur et al., 1984). Rosch defines basic-level categories as roughly those categories at the highest level of generality that still share many common attributes and have fewer distinctive attributes. An example of a basic-level category is bird, where most instances share attributes like having feathers, wings, and beaks. Super-ordinate, more general, categories such as animal will share fewer attributes and demonstrate more variability. Subordinate, more specific, categories such as American robin will share even more attributes like shape, color, and size. Rosch studied basic-level categories through human experiments, e.g. asking people to enumerate common attributes for a given category. The work of Jolicoeur et al. (1984) further studied the way people identify categories, defining the concept of entry-level categories. Entry-level categories are essentially the categories that people naturally use to identify objects. The more prototypical an object, the more likely it is to have its entry point at the basic-level category. For less typical objects the entry point might be at a lower level of abstraction. For example, an American robin and a penguin are both members of the same basic-level bird category. However, the American robin is more prototypical, sharing many features with other birds, and thus its entry-level category coincides with its basic-level category of bird, while the entry-level category for a penguin is at a lower level of abstraction (see Figure 2).

Fig. 2 An American robin (superordinates: animal, vertebrate; basic level: bird; entry level: bird; subordinate: American robin) is a more prototypical type of bird, hence its entry-level category coincides with its basic-level category, while for a penguin (superordinates: animal, vertebrate; basic level: bird; entry level: penguin; subordinate: Chinstrap penguin), a less prototypical example of bird, the entry-level category is at a lower level of abstraction.

So, while objects are members of many categories – e.g. Mr Ed is a palomino, but also a horse, an equine, an odd-toed ungulate, a placental mammal, a mammal, and so on – most people looking at Mr Ed would tend to call him a horse, his entry-level category (unless they are fans of the show). Our paper focuses on the problem of object naming in the context of entry-level categories. We consider two related tasks: 1) learning a mapping from fine-grained/encyclopedic categories – e.g., leaf nodes in WordNet (Fellbaum, 1998) – to what people are likely to call them (entry-level categories), and 2) learning to map from the outputs of thousands of noisy computer vision classifiers/detectors evaluated on an image to what a person is likely to call a depicted object.

Evaluations show that our models can effectively emulate the naming choices of human observers. Furthermore, we show that using noisy vision estimates for image content, our system can output words that are significantly closer to human annotations than either raw visual classifier predictions or the results of using a state-of-the-art hierarchical classification system (Deng et al., 2012) that can output object labels at varying levels of abstraction, from very specific terms to very general categories.

1.1 Insights into Entry-Level Categories

At first glance, the task of finding entry-level categories may seem like a linguistic problem of finding a hypernym of any given word. Although there is a considerable conceptual connection between entry-level categories and hypernyms, there are two notable differences:

1. Although "bird" is a hypernym of both "penguin" and "sparrow", "bird" may be a good entry-level category for "sparrow", but not for "penguin". This phenomenon — that some members of a category are more prototypical than others — is discussed in Prototype Theory (Rosch, 1978).
2. Entry-level categories are not confined by (inherited) hypernyms, in part because encyclopedic knowledge is different from common-sense knowledge. For example, "rhea" is not a kind of "ostrich" in the strict taxonomical sense. However, due to their visual similarity, people generally refer to a "rhea" as an "ostrich". Adding to the challenge is that although extensive, WordNet is neither complete nor practically optimal for our purpose. For example, according to WordNet, "kitten" is not a kind of "cat", and "tulip" is not a kind of "flower".

In fact, both of the above points have a connection to
visual information of objects, as visually similar objects are more likely to belong to the same entry-level category. In this work, we present the first extensive study that (1) characterizes entry-level categories in the context of translating encyclopedic visual categories to natural names that people commonly use, and (2) provides methods to predict entry-level categories for input images guided by semantic word knowledge or by using a large-scale corpus of images with text.

1.2 Paper Overview

Our paper is divided as follows. Section 2 presents a summary of related work. Section 3 introduces a large-scale image categorization system based on convolutional network activations. In Section 4 we learn translations from subordinate concepts to entry-level concepts. In Section 5 we propose two models that can take an image as input and predict entry-level concepts. Finally, in Section 6 we provide experimental evaluations.

The major additions in this journal version compared to our previous publication (Ordonez et al., 2013) include an expanded discussion of related work in Section 2. We have also replaced the large-scale image categorization system based on hand-crafted SIFT + LLC features used in our previous work with a system based on state-of-the-art convolutional network activations obtained using the Caffe framework (Jia, 2013) (Section 3). Sections 4 and 5 have been updated accordingly and include additional qualitative examples. Section 6 contains a more thorough evaluation studying the effect of the number of predicted outputs on precision and recall, and an additional extrinsic evaluation of the system in a sentence retrieval application.

2 Related work
Questions about entry-level categories are directly relevant to recent work on the connection between computer vision outputs and (generating) natural language descriptions of images (Farhadi et al., 2010; Ordonez et al., 2011; Kuznetsova et al., 2012; Mitchell et al., 2012; Yang et al., 2011; Gupta et al., 2012; Kulkarni et al., 2013; Hodosh et al., 2013; Ramnath et al., 2014; Mason and Charniak, 2014; Kuznetsova et al., 2014). Previous works have not directly addressed naming preference choices for entry-level categories when generating sentences. Often the computer vision label predictions are used directly during surface realization (Mitchell et al., 2012; Kulkarni et al., 2013), resulting in non-human-like namings in the constructed sentences even when handling a relatively small number of categories (i.e. Pascal VOC categories like potted-plant, tv-monitor or person). For these methods, our entry-level category predictions could be used to generate more natural names for objects. Other methods handle naming choices indirectly in a data-driven fashion by borrowing human references from other visually similar objects (Kuznetsova et al., 2012, 2014; Mason and Charniak, 2014).

Our work is also related to previous works that aim to discover visual categories from large-scale data. The works of Yanai and Barnard (2005) and Barnard and Yanai (2006) learn models for a set of categories by exploring images with loosely associated text from the web. We learn our set of categories directly as a subset of the WordNet (Fellbaum, 1998) hierarchy, or from the nouns used in a large set of carefully selected image captions that directly refer to images. The more recent works of Chen et al. (2013) and Divvala et al. (2014) present systems capable of learning any type of visual concept from images on the web, including efforts to learn simple common-sense relationships between visual concepts (Chen et al., 2013). We provide a related output in our work, learning mappings between entry-level categories and subordinate/leaf-node categories. The recent work of Feng et al. (2015) proposes that entry-level categorization can be viewed as lexical semantic knowledge, and presents a global inference formulation to map all encyclopedic categories to their entry-level categories collectively.

On a technical level, our work is related to Deng et al. (2012), which tries to "hedge" predictions of visual content by optimally backing off in the WordNet hierarchy. One key difference is that our approach uses a reward function over the WordNet hierarchy that is non-monotonic along paths from the root to the leaves. Another difference is that we have replaced the underlying leaf node classifiers from Deng et al. (2012) with recent convolutional network activation features. Our approach also allows mappings to be learned from a WordNet leaf node, l, to natural word choices that are not along a path from l to the root, "entity". In evaluations, our results significantly outperform those of Deng et al. (2012) because, although optimal in some sense, they are not optimal with respect to how people describe image content.

Our work is also related to the growing challenge of harnessing the ever increasing number of pre-trained recognition systems, thus avoiding "starting from scratch" whenever developing new applications. It is wasteful not to take advantage of the CPU weeks (Felzenszwalb et al., 2010; Krizhevsky et al., 2012), months (Deng et al., 2010, 2012), or even millennia (Le et al., 2012) invested in developing recognition models for increasingly large labeled datasets (Everingham et al., 2010; Russell et al., 2008; Xiao et al., 2010; Deng et al., 2009; Torralba et al., 2008). However, for any specific end-user application, the categories of objects, scenes, and attributes labeled in a particular dataset may not be the most useful predictions. One benefit of our work can be seen as exploring the problem of translating the outputs of a vision system trained with one vocabulary of labels (WordNet leaf nodes) to labels in a new vocabulary (commonly used visually descriptive nouns).

Our proposed methods take into account several sources of structure and information: the structure of WordNet, frequencies of word use in large amounts of web text, outputs of a large-scale visual recognition system, and large amounts of paired image and text data. In particular, we use the SBU Captioned Photo Dataset (Ordonez et al., 2011), which consists of 1 million images with natural language descriptions, and Google n-gram frequencies collected for all words on the web. Taking all of these resources together, we are able to study patterns of choice of entry-level categories at a much larger scale than previous psychology experiments.

3 A Large-Scale Image Categorization System

Large-scale image categorization has improved drastically in recent years. The computer vision community has moved from handling 101 categories (Fei-Fei et al., 2007) to 100,000 categories (Dean et al., 2013) in a few years. Large-scale datasets like ImageNet (Deng et al., 2009) and recent progress in training deep layered architectures (Krizhevsky et al., 2012) have significantly improved the state of the art. We leverage a system based on these as the starting point for our work.

For features, we use activations from an internal layer of a convolutional network, following the approach of Donahue et al. (2013). In particular, we use the pre-trained reference model from the Caffe framework (Jia, 2013), which is in turn based on the model from Krizhevsky et al. (2012). This model was trained on the 1,000 ImageNet categories from the ImageNet Large Scale Visual Recognition Challenge 2012. We compute the 4,096 activations in the 7th layer of this network for images in 7,404 leaf node categories from ImageNet and use them as features to train a linear SVM for each category. We further use a validation set to calibrate the output scores of each SVM with Platt scaling (Platt, 1999).
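As a concrete illustration, the sketch below shows a minimal version of this per-category classifier stage, assuming the 4,096-dimensional activation features have already been extracted; the use of scikit-learn and all variable names are our own choices rather than part of the original system (its method='sigmoid' option implements Platt scaling):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def train_calibrated_classifier(feats_pos, feats_neg):
    """Train a linear SVM on ConvNet activations for one leaf
    category and calibrate its scores with Platt scaling."""
    X = np.vstack([feats_pos, feats_neg])        # (n, 4096) layer-7 activations
    y = np.hstack([np.ones(len(feats_pos)),      # 1 = category present
                   np.zeros(len(feats_neg))])    # 0 = category absent
    svm = LinearSVC(C=1.0)
    # method='sigmoid' fits a Platt sigmoid on held-out folds,
    # mapping raw SVM margins to scores in [0, 1].
    model = CalibratedClassifierCV(svm, method='sigmoid', cv=3)
    model.fit(X, y)
    return model

# model.predict_proba(X_new)[:, 1] then plays the role of the
# calibrated per-category score for new images.
```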
4 Translating Encyclopedic Concepts to Entry-Level Concepts

Our objective in this section is to discover mappings from subordinate encyclopedic concepts (ImageNet leaf categories, e.g. Chlorophyllum molybdites) to output concepts that are more natural (e.g. mushroom). In Section 4.1 we present an approach that relies on the WordNet hierarchy and the frequency of words in a web-scale corpus. In Section 4.2 we follow an approach that uses visual recognition models learned on a paired image-caption dataset.

4.1 Language-based Translation

We first consider a translation approach that relies only on language-based information: the hierarchical semantic structure from WordNet (Fellbaum, 1998) and text statistics from the Google Web 1T corpus (Brants and Franz, 2006). We posit that the frequencies of terms computed from massive amounts of text on the web reflect the "naturalness" of concepts. We use the n-gram counts of the Google Web 1T corpus as a proxy for naturalness. Specifically, for a synset w, we quantify naturalness, φ(w), as the log of the count for the most commonly used synonym in w. As possible translation concepts for a given category, v, we consider all nodes, w, in v's inherited hypernym structure (all of the synsets along the WordNet path from v to the root).

We define a translation function, τ(v,λ), that maximizes a trade-off between naturalness, φ(w), and semantic proximity, ψ(w,v), measuring the distance between leaf node v and node w in the WordNet hypernym structure:

    τ(v,λ) = argmax_{w ∈ Π(v)} [φ(w) − λ ψ(w,v)],    (1)

where Π(v) is the set of (inherited) hypernyms from v to the root, including v. For instance, given an input category v = King penguin we consider all categories along its set of inherited hypernyms, e.g. penguin, seabird, bird, animal (see Figure 3). An ideal prediction for this concept would be penguin.

Fig. 3 Our first categorical translation model uses the WordNet hierarchy to find a hypernym that is close to the leaf node concept (semantic distance) and has a large naturalness score based on its n-gram frequency (nodes in this sample hierarchy are annotated with their n-gram frequencies, e.g. animal 656M down to grampus griseus 0.08M). The green arrows indicate the ideal category that would correspond to the entry-level category for each leaf node.
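A minimal sketch of this translation function is shown below, assuming precomputed lookup tables for hypernym paths, n-gram counts, and WordNet path lengths; the dictionary names here are hypothetical:

```python
import math

def translate(v, lam, hypernym_paths, ngram_count, path_length):
    """Pick the hypernym of leaf synset v that best trades off
    naturalness (log n-gram count) against semantic distance,
    implementing tau(v, lambda) from Equation (1)."""
    best_w, best_score = None, float('-inf')
    for w in hypernym_paths[v]:            # Pi(v): v and its hypernyms up to the root
        phi = math.log(ngram_count[w])     # naturalness phi(w)
        psi = path_length[(w, v)]          # semantic distance psi(w, v)
        score = phi - lam * psi
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```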
To control how the overall system trades off naturalness vs. semantic proximity, we perform a line search to set λ. For this purpose we use a held-out set of subordinate-category/entry-level-category pairs (x_i, y_i) collected using Amazon Mechanical Turk (MTurk) (for details refer to Section 6.1). Our objective is to maximize the number of correct translations predicted by our model:

    Φ(D,λ) = Σ_i 1[τ(x_i,λ) = y_i],    (2)

where 1[·] is the indicator function. We show the relationship between λ and vocabulary size in Figure 4(a), and between λ and overall translation accuracy, Φ(D,λ), in Figure 4(b). As we increase λ, Φ(D,λ) increases initially and then decreases, as too much generalization or specificity reduces the naturalness of the predictions. For example, generalizing from grampus griseus to dolphin is good for "naturalness", but generalizing all the way to "entity" decreases "naturalness". In Figure 4(b) the red line shows accuracy for predicting the most agreed-upon word for a synset, while the cyan line shows the accuracy for predicting any word collected from any user. Our experiment also supports that entry-level categories seem to lie at a certain level of abstraction where there is a discontinuity: going beyond this level of abstraction suddenly makes our predictions considerably worse (see Figure 4(b)). Rosch (1978) indeed argues, in the context of basic-level categories, that basic cuts in categorization happen precisely at these discontinuities, where there are bundles of information-rich functional and perceptual attributes.

Fig. 4 Left: the relationship between parameter λ and the target vocabulary size. Right: the relationship between parameter λ and agreement accuracy with human-labeled synsets, evaluated against the most agreed human label (red) and any human label (cyan).

4.2 Visual-based Translation

Next, we try to make use of pre-trained visual classifiers to improve translations between input concepts and entry-level concepts. For a given leaf synset, v, we sample a set of n = 100 images from ImageNet. For each image, i, we predict some potential entry-level nouns, N_i, using the pre-trained visual classifiers that we will describe later in Section 5.2. We use the union of this set of labels, N = N_1 ∪ N_2 ∪ ... ∪ N_n, as keyword annotations for synset v and rank them using a TFIDF information retrieval measure, considering each category v as a document for computing the inverse document frequency (IDF) term. We pick the most highly ranked noun for each node, v, as its entry-level categorical translation (see an example in Figure 5).

Fig. 5 Given instances of the category Friesian, Holstein, Holstein-Friesian, the vision system pre-trained with candidate entry-level categories ranks a set of candidate keywords (e.g. cow 1.9071, orange_tree 1.1851, stall 0.6136, mushroom 0.5630, pasture 0.3825, black_bear 0.3321, sheep 0.3156, puppy 0.3015, pedestrian_bridge 0.2409, nest 0.2353) and outputs the most relevant, in this case cow.
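The ranking step can be sketched as follows, assuming the per-image noun predictions N_1, ..., N_n for each synset are already available; the TFIDF weighting shown (raw term frequency times log inverse document frequency) is one standard variant and may differ in detail from the exact measure we use:

```python
import math
from collections import Counter

def visual_translation(noun_sets_per_category):
    """Rank candidate nouns for each leaf synset by TFIDF and pick
    the top one, treating each category's pooled noun predictions
    as one 'document'. noun_sets_per_category maps a synset id to
    the list of noun sets N_1..N_n predicted for its sampled images."""
    # Document frequency: in how many categories does each noun appear?
    df = Counter()
    for nsets in noun_sets_per_category.values():
        for noun in set().union(*nsets):
            df[noun] += 1
    n_docs = len(noun_sets_per_category)

    translations = {}
    for synset, nsets in noun_sets_per_category.items():
        tf = Counter(noun for ns in nsets for noun in ns)  # term frequency within category
        scores = {noun: tf[noun] * math.log(n_docs / df[noun]) for noun in tf}
        translations[synset] = max(scores, key=scores.get)
    return translations
```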
5 Predicting Entry-Level Concepts for Images

In Section 4 we proposed models to translate from one linguistic concept, e.g. grampus griseus, to a more natural concept, e.g. dolphin. Our objective in this section is to explore methods that can take an image as input and predict entry-level labels for the depicted objects. The models we propose are: 1) a method that combines "naturalness" measures from text statistics with direct estimates of visual content computed at leaf nodes and inferred for internal nodes (Section 5.1), and 2) a method that learns visual models for entry-level category prediction directly from a large collection of images with associated captions (Section 5.2).

5.1 Linguistically-guided Naming

We estimate image content for an image, I, using the pre-trained models from Section 3. These models predict the presence or absence of 7,404 leaf node concepts in ImageNet (WordNet). Following the approach of Deng et al. (2012), we compute estimates of visual content for internal nodes by hierarchically accumulating all predictions below a node:¹

    f(v,I) = f̂(v,I)                     if v is a leaf node,
             Σ_{v′ ∈ Z(v)} f̂(v′,I)      if v is an internal node,    (3)

where Z(v) is the set of all leaf nodes under node v and f̂(v,I) is a score predicting the presence of leaf node category v from our large-scale image categorization system introduced in Section 3.

Similar to our approach in Section 4.1, we define for every node in the ImageNet hierarchy a trade-off function between "naturalness" φ (n-gram counts) and specificity ψ̃ (relative position in the WordNet hierarchy):

    γ(v,λ̂) = φ(v) − λ̂ ψ̃(v),    (4)

where φ(v) is computed as the log counts of the nouns and compound nouns in the text corpus from the SBU Captioned Photo Dataset (Ordonez et al., 2011), and ψ̃(v) is an upper bound on the semantic distance ψ(·,·) from equation (1), equal to the maximum path length in the WordNet structure from node v to a leaf node below it. We parameterize this trade-off by λ̂.

¹ This function might bias decisions toward internal nodes. Other alternatives could be explored to estimate internal node scores.

   Input Concept                                      | Language-based | Visual-based | Human
 1 eastern kingbird                                   | bird           | bird         | bird
 2 cactus wren                                        | bird           | bird         | bird
 3 buzzard, Buteo buteo                               | hawk           | hawk         | hawk
 4 whinchat, Saxicola rubetra                         | chat           | bird         | bird
 6 Weimaraner                                         | dog            | dog          | dog
 7 Gordon setter                                      | dog            | dog          | dog
 8 numbat, banded anteater, anteater                  | anteater       | dog          | anteater
 9 rhea, Rhea americana                               | bird           | grass        | ostrich
10 Africanized bee, killer bee, Apis mellifera        | bee            | bee          | bee
11 conger, conger eel                                 | eel            | fish         | fish
12 merino, merino sheep                               | sheep          | sheep        | sheep
13 Europ. black grouse, heathfowl, Lyrurus tetrix     | bird           | bird         | bird
14 yellowbelly marmot, rockchuck, Marm. flaviventris  | marmot         | male         | squirrel
15 snorkeling, snorkel diving                         | swimming       | seaturtle    | snorkel
16 cologne, cologne water, eau de cologne             | essence        | bottle       | perfume

Fig. 6 Translations from ImageNet leaf node synset categories to entry-level categories using our automatic approaches from Sections 4.1 (left) and 4.2 (center), and crowd-sourced human annotations from Section 6.1 (right).

For entry-level category prediction in images, we would like to maximize both "naturalness" and estimates of image content. For example, text-based "naturalness" will tell us that both cat and dog are good entry-level categories, but a confident visual prediction for German shepherd in an image tells us that dog is a much better entry-level prediction than cat for that image.

Therefore, for an input image, we want to output a set of concepts that have large scores for both "naturalness" and the content estimate. For our experiments we output the top K WordNet synsets with the highest f_nat scores:

    f_nat(v,I,λ̂) = f(v,I) · γ(v,λ̂).    (5)

As we change λ̂ we expect a similar behavior as in our language-based concept translations (Section 4.1): we can tune λ̂ to control the degree of specificity while trying to preserve "naturalness" using n-gram counts. We compare our framework to the "hedging" technique of Deng et al. (2012) for different settings of λ̂. For a side-by-side comparison we modify hedging to output the top K synsets based on its scoring function.
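Putting equations (3)-(5) together, entry-level prediction for one image can be sketched as follows, assuming calibrated leaf classifier scores and precomputed per-node naturalness and specificity tables; all container names here are hypothetical:

```python
import heapq

def predict_entry_level(leaf_scores, nodes, leaves_under,
                        log_count, height, lam_hat, K=5):
    """Linguistically-guided naming: score every WordNet node by
    f_nat(v, I) = f(v, I) * gamma(v, lam_hat) and return the top K.

    leaf_scores  -- dict: leaf synset -> calibrated score f_hat(v, I)
    leaves_under -- dict: node -> set of leaf synsets below it, Z(v)
    log_count    -- dict: node -> log caption/n-gram count, phi(v)
    height       -- dict: node -> max path length to a leaf, psi_tilde(v)
    """
    scored = []
    for v in nodes:
        # Equation (3): accumulate leaf scores for internal nodes.
        if v in leaf_scores:
            f = leaf_scores[v]
        else:
            f = sum(leaf_scores[u] for u in leaves_under[v])
        # Equation (4): trade naturalness against specificity.
        gamma = log_count[v] - lam_hat * height[v]
        # Equation (5): combine content estimate and naturalness.
        scored.append((f * gamma, v))
    return [v for _, v in heapq.nlargest(K, scored)]
```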
Fig. 7 Relationship between average precision agreement and working vocabulary size (on a set of 1000 images) for the hedging method (Deng et al., 2012) (red) and our linguistically-guided naming method using text statistics from the generic Google Web 1T dataset (magenta) and from the SBU Captioned Dataset (Section 5.1). We use K = 5 to generate this plot and a random set of 1000 images from the SBU Captioned Dataset.

Here, the working vocabulary is the unique set of predicted labels output by each method on this test set. Results demonstrate (Figure 7) that under different parameter settings we consistently obtain much higher levels of precision for predicting entry-level categories than hedging (Deng et al., 2012). We also obtain an additional gain in performance over our previous work (Ordonez et al., 2013) by relying on the dataset-specific text statistics of the SBU Captioned Dataset rather than the more generic Google Web 1T corpus.

5.2 Visually-guided Naming

In the previous section we relied on the WordNet structure to compute estimates of image content, especially for internal nodes. However, this is not always a good measure of content because: 1) the WordNet hierarchy doesn't encode knowledge about some semantic relationships between objects (i.e. functional or contextual relationships), and 2) even with the vast coverage of the 7,404 ImageNet leaf nodes, we are missing models for many potentially important entry-level categories that are not at the leaf level.

As an alternative, we can directly train models for entry-level categories from data where people have provided entry-level labels – in the form of nouns present in visually descriptive image captions. We postulate that these nouns represent examples of entry-level labels because they have been naturally produced by people to describe what is present in an image. For this task, we leverage the SBU Captioned Photo Dataset (Ordonez et al., 2011), which contains 1 million captioned images. We transform this dataset into a set D = {X(j), Y(j) | X(j) ∈ X, Y(j) ∈ Y}, where X = [0,1]^s is a vector of estimates of visual content for the s = 7,404 ImageNet leaf node categories and Y = {0,1}^d is a set of binary output labels for d target categories.
Input content estimates are provided by the SVM content predictors based on the convolutional network activation features described in Section 3. We run these SVM predictors over the whole image, as opposed to the max-pooling approach over bounding boxes from our previous paper (Ordonez et al., 2013), so that we have a more uniform comparison to our linguistically-guided naming approach (Section 5.1), which does the same. There was some minor drop in performance when running our models exclusively on the whole image. Compared to our previous work, our visually-guided naming approach still obtains a significant gain from using the ConvNet features introduced in Section 3.

For training our d target categories, we obtain the labels Y from the million captions by running a POS-tagger (Bird, 2006) and defining Y(j) = {y_ij} such that:

    y_ij = 1 if the caption for image j contains noun i,
    y_ij = 0 otherwise.    (6)

The POS-tagger helps clean up some word sense ambiguity due to polysemy, by only selecting those instances where a word is used as a noun. d is determined experimentally from the data by learning models for the most frequent nouns in this dataset. This provides us with a target vocabulary that is both likely to contain entry-level categories (because we expect entry-level category nouns to occur commonly in our visual descriptions) and to contain sufficient images for training effective recognition models. We use up to 10,000 images for training each model. Since we are using human labels from real-world data, the frequency of words in our target vocabulary follows a power-law distribution; hence we only have a very large amount of training data for a few of the most commonly occurring noun concepts.
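A minimal sketch of this labeling step is shown below, using NLTK (Bird, 2006) as the POS tagger; the tokenizer/tagger choices and the vocabulary-size cutoff shown here are our assumptions:

```python
from collections import Counter
import nltk  # requires the 'punkt' and tagger models via nltk.download()

def build_noun_labels(captions, vocab_size=1169):
    """Turn image captions into binary noun labels (Equation 6):
    y_ij = 1 iff the caption of image j uses noun i."""
    image_nouns = []
    for caption in captions:
        tags = nltk.pos_tag(nltk.word_tokenize(caption.lower()))
        # Keep only tokens tagged as nouns (NN, NNS, ...), reducing
        # polysemy problems from verb/adjective uses of the same word.
        nouns = {word for word, tag in tags if tag.startswith('NN')}
        image_nouns.append(nouns)

    # Target vocabulary: the most frequent nouns across all captions.
    counts = Counter(n for nouns in image_nouns for n in nouns)
    vocab = [n for n, _ in counts.most_common(vocab_size)]

    # Binary label matrix Y: one row per image, one column per noun.
    Y = [[1 if noun in nouns else 0 for noun in vocab]
         for nouns in image_nouns]
    return vocab, Y
```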
Specifically, we learn linear SVMs followed by Platt scaling for each of our target concepts, and keep the d = 1,169 best performing models. Our scoring function f_svm for a target concept v_i is then:

    f_svm(v_i, I, θ_i) = 1 / (1 + exp(a_i θ_i^⊤ X + b_i)),    (7)

where θ_i are the model parameters for predicting concept v_i, and a_i and b_i are the Platt scaling parameters learned for each target concept v_i on a held-out validation set.

We learn the parameters θ_i by minimizing the squared hinge loss with ℓ1 regularization:

    R(θ_i) = (1/2) ‖θ_i‖₁ + c Σ_{j=1}^{|D|} max(0, 1 − y_ij θ_i^⊤ X(j))².    (8)

The ℓ1 penalty provides a natural way of modeling the relationships between the input and output label spaces that encourages sparseness (examples in Figure 8). We find c = 0.01 to yield good results for our problem and use this value for training all individual models.
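For illustration, one way to train a single per-noun model with exactly this penalty/loss combination is sketched below using scikit-learn; fitting a logistic regression on held-out SVM margins recovers Platt parameters playing the role of a_i and b_i (the split and names are our own):

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def train_concept_model(X_train, y_train, X_val, y_val, c=0.01):
    """Train one per-noun model: an l1-regularized squared-hinge-loss
    linear SVM (Equation 8), then Platt scaling on a held-out set to
    obtain the a_i, b_i parameters of Equation (7)."""
    svm = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=c)
    svm.fit(X_train, y_train)   # rows of X are the s = 7,404 leaf scores

    # Platt scaling: fit a sigmoid to validation-set margins. Note that
    # sklearn's sigmoid is 1/(1 + exp(-(w*m + q))), so a_i, b_i in
    # Equation (7) correspond to -w and -q.
    margins = svm.decision_function(X_val).reshape(-1, 1)
    platt = LogisticRegression()
    platt.fit(margins, y_val)

    def score(X):
        m = svm.decision_function(X).reshape(-1, 1)
        return platt.predict_proba(m)[:, 1]   # calibrated f_svm(v_i, I)
    return svm, score
```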
One of the drawbacks of using the ImageNet hierarchy to aggregate estimates of visual concepts (Section 5.1) is that it ignores more complex relationships between concepts. Here, our data-driven approach to the problem implicitly discovers these relationships. For instance, a concept like tree has a co-occurrence relationship with bird that may be useful for prediction. A chair is often occluded by the objects sitting on it, but evidence of those types of objects (e.g. people or cat) or of co-occurring objects (e.g. table) can help us predict the presence of a chair. See Figure 8 for some example learned relationships.

Fig. 8 Entry-level categories with their corresponding top-weighted leaf node features after training an SVM on our noisy data, and a visualization of weights grouped by an arbitrary categorization of leaf nodes: vegetation (green), birds (orange), instruments (blue), structures (brown), mammals (red), others (black). For example, the entry-level category tree receives high weights from leaf nodes such as European silver fir, baobab, and Japanese black pine.

Given this large dataset of images with noisy visual predictions and text labels, we manage to learn quite good estimators of high-level content, even for categories with relatively high intra-class variation (e.g. girl, boy, market, house). We show some results of images with predicted output labels in Figure 9.

Fig. 9 Sample predictions from our experiments on a test set (PR curve, most confident correct predictions, and most confident wrong predictions) for categories such as house, market, girl, boy, cat, and bird. Note that image labels come from caption nouns, so some images marked as correct predictions might not depict the target concept, whereas some images marked as wrong predictions might actually depict the target category.

6 Experimental Evaluation

We evaluate two results from our paper – models that learn general translations from encyclopedic concepts to entry-level concepts (Section 6.1) and models that predict entry-level concepts for images (Section 6.2). We additionally provide an extrinsic evaluation of our naming prediction methods by using them in a sentence retrieval application (Section 6.3).

6.1 Evaluating Translations

We obtain translations from ImageNet synsets to entry-level categories using Amazon Mechanical Turk (MTurk). In our experiments, users are presented with a 2x5 array of images sampled from an ImageNet synset, x_i, and asked to label the depicted concept. Results are obtained for 500 ImageNet synsets and aggregated across 8 users per task. We found agreement (measured as at least 3 of 8 users in agreement) among users for 447 of the 500 concepts, indicating that even though there are many potential labels for each synset (e.g. Sarcophaga carnaria could conceivably be labeled as fly, dipterous insect, insect, arthropod, etc.), people have a strong preference for particular categories. We denote our resulting set of reference translations as D = {(x_i, y_i)}, where each element pair corresponds to a translation from a leaf node x_i to an entry-level word y_i.

We show sample results from each of our methods to learn concept translations in Figure 6. In some cases language-based translation fails. For example, whinchat (a type of bird) translates to "chat", most likely because of the inflated counts for the most common use of "chat". Visual-based translation fails when it learns to weight context words highly, for example "snorkeling" → "water" or "African bee" → "flower", even when we try to account for common context words using TFIDF. Finally, even humans are not always correct; for example, "Rhea americana" looks like an ostrich but is not taxonomically one, and even for categories like "marmot" most people named it "squirrel". Overall, our language-based translation (Section 4.1) agrees with human-supplied translations 37% of the time, and the visual-based translation (Section 4.2) agrees 33% of the time, indicating that translation learning is a non-trivial task. Our visual-based translation benefits significantly from using the ConvNet features (Section 3), compared to the 21% agreement that we previously reported in Ordonez et al. (2013). Note that our visual-based translation, unlike our language-based translation, does not use the WordNet semantic hierarchy to constrain the output categories to the set of inherited hypernyms of the input category.

This experiment expands on previous studies in psychology (Rosch, 1978; Jolicoeur et al., 1984). Readily available and inexpensive online crowdsourcing enables us to gather these labels for a much larger set of (500) concepts than previous experiments and to learn generalizations for a substantially larger set of ImageNet synsets.

6.2 Evaluating Image Entry-Level Predictions

We measure the accuracy of our proposed entry-level category prediction methods by evaluating how well we can predict nouns freely associated with images by users on Amazon Mechanical Turk. We initially selected two evaluation image sets. Dataset A contains 1000 images selected at random from the million-image dataset. Dataset B contains 1000 images selected from images displaying high confidence in concept predictions. We additionally collected annotations for another 2000 images so that we can tune the trade-off parameters in our models. Both sets are completely disjoint from the sets of images used for learning. For each image, we instruct 3 users on MTurk to write down any nouns that are relevant to the image content. Because these annotations are free associations, we observe a large and varied set of associated nouns – 3,610 distinct nouns total in our evaluation sets. This makes noun prediction extremely challenging!

For evaluation, we measure how well we can predict all nouns associated with an image by Turkers (Figure 10) and how well we can predict the nouns commonly associated by Turkers (assigned by at least 2 of 3 Turkers, Figure 11). For reference, we compute the precision of one human annotator against the other two and found that on Dataset A humans were able to predict what the previous annotators labeled with 0.35 precision, and with 0.45 precision for Dataset B.

Results show precision and recall for prediction on each of our datasets, comparing: leaf node classification performance (flat classifier), the outputs of hedging (Deng et al., 2012), and our proposed entry-level category predictors (linguistically-guided naming, Section 5.1, and visually-guided naming, Section 5.2). Qualitative examples for Dataset A are shown in Figure 13 and for Dataset B in Figure 14.
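A minimal sketch of this precision/recall-at-K protocol follows; the per-image averaging shown is our assumption, and exact tie-breaking details may differ:

```python
def precision_recall_at_k(predictions, gold_sets, k=5):
    """Average per-image precision and recall of the top-k predicted
    nouns against the set of nouns freely associated by annotators."""
    precisions, recalls = [], []
    for preds, gold in zip(predictions, gold_sets):
        topk = set(preds[:k])
        hits = len(topk & gold)
        precisions.append(hits / len(topk))
        recalls.append(hits / len(gold) if gold else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n
```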
Fig. 10 Precision-recall curves on Dataset A (left) and Dataset B (right) for different entry-level prediction methods (visually-guided naming, linguistically-guided naming, hedging (Deng et al., 2012), and flat classifiers) when using the top K categorical predictions for K = 1, 3, 5, 10, 15, 20, 50. The ground truth is the union of labels from all users for each image.

Fig. 11 Precision-recall curves on Dataset A (left) and Dataset B (right) for the same methods and values of K as in Figure 10. The ground truth is the set of labels where at least two users agreed.

Performance at this task on Dataset B is in general better than performance on Dataset A. This is unsurprising, since Dataset B contains images which have confident classifier scores. Surprisingly, the difference in performance is not extreme, and performance on both sets is admirable for this challenging task. When compared to our previous work (Ordonez et al., 2013), which relies on SIFT + LLC features, we found that the inclusion of ConvNet features provided a significant improvement in the performance of the visually-guided naming predictions, but it did not improve the results of the methods that use the WordNet semantic hierarchy, i.e. both hedging (Deng et al., 2012) and our linguistically-guided naming method.

On the two datasets we find the visually-guided naming model (Section 5.2) to perform better than the linguistically-guided naming prediction (Section 5.1). In addition, we outperform both leaf node classification and the hedging technique (Deng et al., 2012).

We additionally collected a third test set, Dataset C, consisting of random ImageNet images belonging to the 7,404 leaf node categories.