Predicting Entry-Level Categories
Vicente Ordonez · Wei Liu · Jia Deng · Yejin Choi · Alexander C. Berg · Tamara L. Berg

Received: date / Accepted: date
Abstract Entry-level categories — the labels people use to name an object — were originally defined and studied by psychologists in the 1970s and 80s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word "naturalness" mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7,404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.

Keywords Recognition · Categorization · Entry-Level Categories · Psychology

V. Ordonez (✉) · W. Liu · A. C. Berg · T. L. Berg
Department of Computer Science
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
E-mail: vicente@cs.unc.edu

J. Deng
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI, USA

Y. Choi
Computer Science and Engineering
University of Washington, Seattle, WA, USA

Fig. 1 Example translation between a WordNet based object category prediction (grampus griseus) and what people might call the depicted object (dolphin).

1 Introduction

Computational visual recognition is beginning to work. Although far from solved, algorithms have now advanced to the point where they can recognize or localize thousands of object categories with reasonable accuracy (Deng et al., 2010; Perronnin et al., 2012; Krizhevsky et al., 2012; Dean et al., 2013; Simonyan and Zisserman, 2014; Szegedy et al., 2014). Russakovsky et al. (2014) present an overview of recent advances in classification and localization for up to 1000 object categories. While one could predict any one of many relevant labels for an object, the question of "What should I actually call it?" is becoming important for large-scale visual recognition. For instance, if a classifier were lucky enough to get the example in Figure 1 correct, it might output grampus griseus, while most people would simply call this object a dolphin. We propose to develop categorization systems that are aware of these kinds of human naming choices.
Fig. 2 An American Robin is a more prototypical type of bird, hence its entry-level category coincides with its basic-level category, while for a penguin, which is a less prototypical example of bird, the entry-level category is at a lower level of abstraction. (Left, American robin: Superordinates: animal, vertebrate; Basic Level: bird; Entry Level: bird; Subordinates: American robin. Right, Chinstrap penguin: Superordinates: animal, vertebrate; Basic Level: bird; Entry Level: penguin; Subordinates: Chinstrap penguin.)

This notion is closely related to ideas of basic and entry-level categories formulated by psychologists such as Eleanor Rosch (Rosch, 1978) and Stephen Kosslyn (Jolicoeur et al., 1984). Rosch defines basic-level categories as roughly those categories at the highest level of generality that still share many common attributes and have fewer distinctive attributes. An example of a basic-level category is bird, where most instances share attributes like having feathers, wings, and beaks. Super-ordinate, more general, categories such as animal will share fewer attributes and demonstrate more variability. Subordinate, more specific, categories such as American Robin will share even more attributes like shape, color, and size. Rosch studied basic-level categories through human experiments, e.g. asking people to enumerate common attributes for a given category. The work of Jolicoeur et al. (1984) further studied the way people identify categories, defining the concept of entry-level categories. Entry-level categories are essentially the categories that people naturally use to identify objects. The more prototypical an object, the more likely it will have its entry point at the basic-level category. For less typical objects the entry point might be at a lower level of abstraction. For example, an American robin and a penguin are both members of the same basic-level bird category. However, the American robin is more prototypical, sharing many features with other birds, and thus its entry-level category coincides with its basic-level category of bird, while the entry-level category for a penguin would be at a lower level of abstraction (see Figure 2).

So, while objects are members of many categories – e.g. Mr Ed is a palomino, but also a horse, an equine, an odd-toed ungulate, a placental mammal, a mammal, and so on – most people looking at Mr Ed would tend to call him a horse, his entry-level category (unless they are fans of the show). Our paper focuses on the problem of object naming in the context of entry-level categories. We consider two related tasks: 1) learning a mapping from fine-grained/encyclopedic categories – e.g., leaf nodes in WordNet (Fellbaum, 1998) – to what people are likely to call them (entry-level categories), and 2) learning to map from outputs of thousands of noisy computer vision classifiers/detectors evaluated on an image to what a person is likely to call a depicted object.

Evaluations show that our models can effectively emulate the naming choices of human observers. Furthermore, we show that using noisy vision estimates for image content, our system can output words that are significantly closer to human annotations than either raw visual classifier predictions or the results of using a state of the art hierarchical classification system (Deng et al., 2012) that can output object labels at varying levels of abstraction, from very specific terms to very general categories.

1.1 Insights into Entry-Level Categories

At first glance, the task of finding entry-level categories may seem like a linguistic problem of finding a hypernym of any given word. Although there is a considerable conceptual connection between entry-level categories and hypernyms, there are two notable differences:

1. Although "bird" is a hypernym of both "penguin" and "sparrow", "bird" may be a good entry-level category for "sparrow", but not for "penguin". This phenomenon — that some members of a category are more prototypical than others — is discussed in Prototype Theory (Rosch, 1978).

2. Entry-level categories are not confined by (inherited) hypernyms, in part because encyclopedic knowledge is different from common sense knowledge. For example, "rhea" is not a kind of "ostrich" in the strict taxonomical sense. However, due to their visual similarity, people generally refer to a "rhea" as an "ostrich". Adding to the challenge is that although extensive, WordNet is neither complete nor practically optimal for our purpose. For example, according to WordNet, "kitten" is not a kind of "cat", and "tulip" is not a kind of "flower".

In fact, both of the above points have a connection to visual information of objects, as visually similar objects are more likely to belong to the same entry-level category. In this work, we present the first extensive study that (1) characterizes entry-level categories in the context of translating encyclopedic visual categories to natural names that people commonly use, and (2) provides methods to predict entry-level categories for input images guided by semantic word knowledge or by using a large-scale corpus of images with text.
1.2 Paper Overview

Our paper is divided as follows. Section 2 presents a summary of related work. Section 3 introduces a large-scale image categorization system based on convolutional network activations. In Section 4 we learn translations from subordinate concepts to entry-level concepts. In Section 5 we propose two models that can take an image as input and predict entry-level concepts. Finally, in Section 6 we provide experimental evaluations.

The major additions in this journal version compared to our previous publication (Ordonez et al., 2013) include an expanded discussion about related work in Section 2. We have also replaced the large scale image categorization system based on hand-crafted SIFT + LLC features used in our previous work with a system based on state-of-the-art convolutional network activations obtained using the Caffe framework (Jia, 2013) (Section 3). Sections 4 and 5 have been updated accordingly and include additional qualitative examples. Section 6 contains a more thorough evaluation studying the effect of the number of predicted outputs on precision and recall, and an additional extrinsic evaluation of the system in a sentence retrieval application.

2 Related work

Questions about entry-level categories are directly relevant to recent work on the connection between computer vision outputs and (generating) natural language descriptions of images (Farhadi et al., 2010; Ordonez et al., 2011; Kuznetsova et al., 2012; Mitchell et al., 2012; Yang et al., 2011; Gupta et al., 2012; Kulkarni et al., 2013; Hodosh et al., 2013; Ramnath et al., 2014; Mason and Charniak, 2014; Kuznetsova et al., 2014). Previous works have not directly addressed naming preference choices for entry-level categories when generating sentences. Often the computer vision label predictions are used directly during surface realization (Mitchell et al., 2012; Kulkarni et al., 2013), resulting in choosing non-humanlike namings for constructing sentences even when handling a relatively small number of categories (i.e. Pascal VOC categories like potted-plant, tv-monitor or person). For these methods, our entry-level category predictions could be used to generate more natural names for objects. Other methods handle naming choices indirectly in a data-driven fashion by borrowing human references from other visually similar objects (Kuznetsova et al., 2012, 2014; Mason and Charniak, 2014).

Our work is also related to previous works that aim to discover visual categories from large-scale data. The works of Yanai and Barnard (2005) and Barnard and Yanai (2006) learn models for a set of categories by exploring images with loosely associated text from the web. We learn our set of categories directly as a subset of the WordNet (Fellbaum, 1998) hierarchy, or from the nouns used in a large set of carefully selected image captions that directly refer to images. The more recent works of Chen et al. (2013) and Divvala et al. (2014) present systems capable of learning any type of visual concept from images on the web, including efforts to learn simple commonsense relationships between visual concepts (Chen et al., 2013). We provide a related output in our work, learning mappings between entry-level categories and subordinate/leaf-node categories. The recent work of Feng et al. (2015) proposes that entry-level categorization can be viewed as lexical semantic knowledge, and presents a global inference formulation to map all encyclopedic categories to their entry-level categories collectively.

On a technical level, our work is related to (Deng et al., 2012), which tries to "hedge" predictions of visual content by optimally backing off in the WordNet hierarchy. One key difference is that our approach uses a reward function over the WordNet hierarchy that is non-monotonic along paths from the root to the leaves. Another difference is that we have replaced the underlying leaf node classifiers from Deng et al. (2012) with recent convolutional network activation features. Our approach also allows mappings to be learned from a WordNet leaf node, l, to natural word choices that are not along a path from l to the root, "entity". In evaluations, our results significantly outperform those of (Deng et al., 2012) because, although optimal in some sense, they are not optimal with respect to how people describe image content.

Our work is also related to the growing challenge of harnessing the ever increasing number of pre-trained recognition systems, thus avoiding "starting from scratch" whenever developing new applications. It is wasteful not to take advantage of the CPU weeks (Felzenszwalb et al., 2010; Krizhevsky et al., 2012), months (Deng et al., 2010, 2012), or even millennia (Le et al., 2012) invested in developing recognition models for increasingly large labeled datasets (Everingham et al., 2010; Russell et al., 2008; Xiao et al., 2010; Deng et al., 2009; Torralba et al., 2008). However, for any specific end-user application, the categories of objects, scenes, and attributes labeled in a particular dataset may not be the most useful predictions. One benefit of our work can be seen as exploring the problem of translating the outputs of a vision system trained with one vocabulary of labels (WordNet leaf nodes) to labels in a new vocabulary (commonly used visually descriptive nouns).

Our proposed methods take into account several sources of structure and information: the structure of WordNet, frequencies of word use in large amounts of web text, outputs of a large-scale visual recognition system, and large amounts of paired image and text data. In particular, we use the SBU Captioned Photo Dataset (Ordonez et al., 2011), which consists of 1 million images with natural language descriptions, and Google n-gram frequencies collected for all words on the web. Taking all of these resources together, we are able to study patterns for choice of entry-level categories at a much larger scale than previous psychology experiments.
3 A Large-Scale Image Categorization System

Large-scale image categorization has improved drastically in recent years. The computer vision community has moved from handling 101 categories (Fei-Fei et al., 2007) to 100,000 categories (Dean et al., 2013) in a few years. Large-scale datasets like ImageNet (Deng et al., 2009) and recent progress in training deep layered architectures (Krizhevsky et al., 2012) have significantly improved the state-of-the-art. We leverage a system based on these as the starting point for our work.

For features, we use activations from an internal layer of a convolutional network, following the approach of (Donahue et al., 2013). In particular, we use the pre-trained reference model from the Caffe framework (Jia, 2013), which is in turn based on the model from Krizhevsky et al. (2012). This model was trained on the 1,000 ImageNet categories from the ImageNet Large Scale Visual Recognition Challenge 2012. We compute the 4,096 activations in the 7th layer of this network for images in 7,404 leaf node categories from ImageNet and use them as features to train a linear SVM for each category. We further use a validation set to calibrate the output scores of each SVM with Platt scaling (Platt, 1999).
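To make this stage concrete, the following is a minimal sketch of training one calibrated classifier, assuming the 4,096-dimensional layer-7 activations have already been extracted into feature matrices; the helper names and the use of scikit-learn and SciPy are our illustrative choices, not part of the original pipeline.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import LinearSVC

def train_calibrated_svm(X_train, y_train, X_val, y_val):
    """Train a linear SVM for one leaf node category and fit Platt
    scaling parameters (a, b) on a held-out validation set, so that
    p = 1 / (1 + exp(a * s + b)) maps raw SVM scores s into [0, 1]."""
    svm = LinearSVC(C=1.0)
    svm.fit(X_train, y_train)            # y_train in {0, 1}
    s = svm.decision_function(X_val)     # raw margin scores

    def nll(params):                     # negative log-likelihood on val set
        a, b = params
        p = np.clip(1.0 / (1.0 + np.exp(a * s + b)), 1e-12, 1 - 1e-12)
        return -np.sum(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return svm, a, b

def calibrated_score(svm, a, b, X):
    """Calibrated, probability-like score for new images."""
    return 1.0 / (1.0 + np.exp(a * svm.decision_function(X) + b))
```

In practice scikit-learn's CalibratedClassifierCV (with method='sigmoid') wraps this same Platt procedure; we spell out the sigmoid fit only to make the calibration step explicit.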
4 Translating Encyclopedic Concepts to Entry-Level Concepts

Our objective in this section is to discover mappings between subordinate encyclopedic concepts (ImageNet leaf categories, e.g. Chlorophyllum molybdites) and output concepts that are more natural (e.g. mushroom). In Section 4.1 we present an approach that relies on the WordNet hierarchy and frequency of words in a web scale corpus. In Section 4.2 we follow an approach that uses visual recognition models learned on a paired image-caption dataset.

4.1 Language-based Translation

We first consider a translation approach that relies only on language-based information: the hierarchical semantic structure from WordNet (Fellbaum, 1998) and text statistics from the Google Web 1T corpus (Brants and Franz, 2006). We posit that the frequencies of terms computed from massive amounts of text on the web reflect the "naturalness" of concepts. We use the n-gram counts of the Google Web 1T corpus (Brants and Franz, 2006) as a proxy for naturalness. Specifically, for a synset w, we quantify naturalness as φ(w), the log of the count for the most commonly used synonym in w. As possible translation concepts for a given category, v, we consider all nodes w in v's inherited hypernym structure (all of the synsets along the WordNet path from v to the root).

Fig. 3 Our first categorical translation model uses the WordNet hierarchy to find a hypernym that is close to the leaf node concept (semantic distance, ψ(w,v)) and has a large naturalness score, φ(w), based on its n-gram frequency. The green arrows indicate the ideal category that would correspond to the entry-level category for each leaf node in this sample semantic hierarchy, annotated with n-gram frequencies, e.g. animal (656M), bird (366M), penguin (88M), dolphin (6.4M), grampus griseus (0.08M).

We define a translation function, τ(v,λ), that maximizes a trade-off between naturalness, φ(w), and semantic proximity, ψ(w,v), measuring the distance between leaf node v and node w in the WordNet hypernym structure:

τ(v,λ) = argmax_w [φ(w) − λψ(w,v)], w ∈ Π(v),   (1)

where Π(v) is the set of (inherited) hypernyms from v to the root, including v. For instance, given an input category v = King penguin we consider all categories along its set of inherited hypernyms, e.g. penguin, seabird, bird, animal (see Figure 3). An ideal prediction for this concept would be penguin. To control how the overall system trades off naturalness vs semantic proximity, we perform line search to set λ. For this purpose we use a held out set of subordinate-category, entry-level category pairs (x_i, y_i) collected using Amazon Mechanical Turk (MTurk) (for details refer to Section 6.1). Our objective is to maximize the number of correct translations predicted by our model:

Φ(D,λ) = Σ_i 1[τ(x_i,λ) = y_i],   (2)

where 1[·] is the indicator function. We show the relationship between λ and vocabulary size in Figure 4(a), and between λ and overall translation accuracy, Φ(D,λ), in Figure 4(b). As we increase λ, Φ(D,λ) increases initially and then decreases as too much generalization or specificity reduces the naturalness of the predictions. For example, generalizing from grampus griseus to dolphin is good for "naturalness", but generalizing all the way to "entity" decreases "naturalness". In Figure 4(b) the red line shows accuracy for predicting the most agreed upon word for a synset, while the cyan line shows the accuracy for predicting any word collected from any user.
Our experiment also supports that entry-level categories seem to lie at a certain level of abstraction where there is a discontinuity. Going beyond this level of abstraction suddenly makes our predictions considerably worse (see Figure 4(b)). Rosch (1978) indeed argues in the context of basic level categories that basic cuts in categorization happen precisely at these discontinuities, where there are bundles of information-rich functional and perceptual attributes.

Fig. 4 Left: the relationship between parameter λ and the target vocabulary size. Right: the relationship between parameter λ and agreement accuracy with human labeled synsets, evaluated against the most agreed human label (red) and any human label (cyan).
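As a concrete illustration of equations (1) and (2), the sketch below walks the inherited hypernym path with NLTK's WordNet interface and line-searches λ on the MTurk reference pairs; the naturalness argument (e.g. a log_ngram_count function built from Web 1T counts) and the grid of λ values are assumptions of this sketch.

```python
from nltk.corpus import wordnet as wn

def hypernym_path(synset):
    """Π(v): the synsets on the path from v up to the root, including v
    (taking the longest path when WordNet offers several)."""
    return max(synset.hypernym_paths(), key=len)[::-1]

def translate(synset, lam, naturalness):
    """Eq. (1): pick the hypernym w maximizing φ(w) − λψ(w, v),
    where ψ(w, v) is the number of edges between v and w."""
    path = hypernym_path(synset)
    scores = [naturalness(w) - lam * dist for dist, w in enumerate(path)]
    return path[scores.index(max(scores))]

def line_search(pairs, naturalness, lams):
    """Eq. (2): choose the λ that maximizes agreement with the
    human-supplied (x_i, y_i) reference translations."""
    def num_correct(lam):
        return sum(translate(x, lam, naturalness).lemma_names()[0] == y
                   for x, y in pairs)
    return max(lams, key=num_correct)

# Illustrative usage (the λ grid is arbitrary):
# lam = line_search(pairs, log_ngram_count, lams=[0.5 * i for i in range(120)])
# print(translate(wn.synset('king_penguin.n.01'), lam, log_ngram_count))
```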
trained models from Section 3. These models predict pres-
ence or absence of 7,404 leaf node concepts in ImageNet
4.2 Visual-basedTranslation (WordNet). Following the approach of Deng et al. (2012),
wecomputeestimatesofvisualcontentforinternalnodesby
Next, we try to make use of pre-trained visual classifiers hierarchicallyaccumulatingallpredictionsbelowanode:1
to improve translations between input concepts and entry-
levelconcepts.Foragivenleafsynset,v,wesampleasetof fˆ(v,I), ifvisaleafnode,
n=100imagesfromImageNet.Foreachimage,i,wepre- f(v,I)= (cid:80) fˆ(v(cid:48),I), ifvisaninternalnode, (3)
dictsomepotentialentry-levelnouns,Ni,usingpre-trained v(cid:48)∈Z(v)
visual classifiers that we will describe later in Section 5.2.
where Z(v) is the set of all leaf nodes under node v and
WeusetheunionofthissetoflabelsN =N ∪N ...∪N
1 2 n fˆ(v,I) is a score predicting the presence of leaf node cat-
as keyword annotations for synset v and rank them using
egory v from our large scale image categorization system
a TFIDF information retrieval measure. We consider each
introducedinSection3.SimilartoourapproachinSection
category v as a document for computing the inverse docu-
4.1, we define for every node in the ImageNet hierarchy a
mentfrequency(IDF)term.Wepickthemosthighlyranked
trade-off function between “naturalness” φ (ngram counts)
nounforeachnode,v,asitsentry-levelcategoricaltransla-
and specificity ψ˜ (relative position in the WordNet hierar-
tion(seeanexampleinFigure5).
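A sketch of this ranking step follows; predict_nouns (returning the noun set N_i for one image from the classifiers of Section 5.2) is assumed, and the raw-count TFIDF variant shown is one common choice rather than necessarily the exact formula used in our implementation.

```python
import math
from collections import Counter

def entry_level_by_tfidf(images_per_synset, predict_nouns):
    """Treat each synset v as a document whose terms are the nouns
    predicted for its sampled images; return the top TFIDF noun."""
    tf = {v: Counter(n for img in imgs for n in predict_nouns(img))
          for v, imgs in images_per_synset.items()}
    # Document frequency: in how many synsets each noun appears at all.
    df = Counter(n for counts in tf.values() for n in counts)
    num_docs = len(tf)
    return {v: max(counts,
                   key=lambda n: counts[n] * math.log(num_docs / df[n]))
            for v, counts in tf.items()}
```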
5 Predicting Entry-Level Concepts for Images

In Section 4 we proposed models to translate from one linguistic concept, e.g. grampus griseus, to a more natural concept, e.g. dolphin. Our objective in this section is to explore methods that can take an image as input and predict entry-level labels for the depicted objects. The models we propose are: 1) a method that combines "naturalness" measures from text statistics with direct estimates of visual content computed at leaf nodes and inferred for internal nodes (Section 5.1), and 2) a method that learns visual models for entry-level category prediction directly from a large collection of images with associated captions (Section 5.2).

5.1 Linguistically-guided Naming

We estimate image content for an image, I, using the pre-trained models from Section 3. These models predict presence or absence of 7,404 leaf node concepts in ImageNet (WordNet). Following the approach of Deng et al. (2012), we compute estimates of visual content for internal nodes by hierarchically accumulating all predictions below a node:¹

f(v,I) = f̂(v,I), if v is a leaf node,
f(v,I) = Σ_{v′∈Z(v)} f̂(v′,I), if v is an internal node,   (3)

where Z(v) is the set of all leaf nodes under node v and f̂(v,I) is a score predicting the presence of leaf node category v from our large scale image categorization system introduced in Section 3. Similar to our approach in Section 4.1, we define for every node in the ImageNet hierarchy a trade-off function between "naturalness" φ (n-gram counts) and specificity ψ̃ (relative position in the WordNet hierarchy):

γ(w,λ̂) = φ(w) − λ̂ψ̃(w),   (4)

where φ(w) is computed as the log counts of the nouns and compound nouns in the text corpus from the SBU Captioned Dataset (Ordonez et al., 2011), and ψ̃(w) is an upper bound on ψ(w,v) from equation (1), equal to the maximum path length in the WordNet structure from node v to node w. We parameterize this trade-off by λ̂.

¹ This function might bias decisions toward internal nodes. Other alternatives could be explored to estimate internal node scores.
Input Concept | Language-based Translation | Visual-based Translation | Human Translation
1  eastern kingbird | bird | bird | bird
2  cactus wren | bird | bird | bird
3  buzzard, Buteo buteo | hawk | hawk | hawk
4  whinchat, Saxicola rubetra | chat | bird | bird
6  Weimaraner | dog | dog | dog
7  Gordon setter | dog | dog | dog
8  numbat, banded anteater, anteater | anteater | dog | anteater
9  rhea, Rhea americana | bird | grass | ostrich
10 Africanized bee, killer bee, Apis mellifera | bee | bee | bee
11 conger, conger eel | eel | fish | fish
12 merino, merino sheep | sheep | sheep | sheep
13 Europ. black grouse, heathfowl, Lyrurus tetrix | bird | bird | bird
14 yellowbelly marmot, rockchuck, Marm. flaviventris | marmot | male | squirrel
15 snorkeling, snorkel diving | swimming | seaturtle | snorkel
16 cologne, cologne water, eau de cologne | essence | bottle | perfume

Fig. 6 Translations from ImageNet leaf node synset categories to entry-level categories using our automatic approaches from Sections 4.1 (left) and 4.2 (center), and crowd-sourced human annotations from Section 6.1 (right).
For entry-level category prediction in images, we would like to maximize both "naturalness" and estimates of image content. For example, text based "naturalness" will tell us that both cat and dog are good entry-level categories, but a confident visual prediction for German shepherd for an image tells us that dog is a much better entry-level prediction than cat for that image.

Therefore, for an input image, we want to output a set of concepts that have a large prediction for both "naturalness" and content estimate score. For our experiments we output the top K WordNet synsets with the highest f_nat scores:

f_nat(v,I,λ̂) = f(v,I) γ(v,λ̂).   (5)

As we change λ̂ we expect a similar behavior as in our language-based concept translations (Section 4.1). We can tune λ̂ to control the degree of specificity while trying to preserve "naturalness" using n-gram counts. We compare our framework to the "hedging" technique of Deng et al. (2012) for different settings of λ̂. For a side by side comparison we modify hedging to output the top K synsets based on their scoring function. Here, the working vocabulary is the unique set of predicted labels output by each method on this test set. Results demonstrate (Figure 7) that under different parameter settings we consistently obtain much higher levels of precision for predicting entry-level categories than hedging (Deng et al., 2012). We also obtain an additional gain in performance over our previous work (Ordonez et al., 2013) by relying on the dataset-specific text statistics of the SBU Captioned Dataset rather than the more generic Google Web 1T corpus.

Fig. 7 Relationship between average precision agreement and working vocabulary size (on a set of 1000 images) for the hedging method (Deng et al., 2012) (red) and our linguistically-guided naming method that uses text statistics from the generic Google Web 1T dataset (magenta) and from the SBU Caption Dataset (Sec. 5.1). We use K = 5 to generate this plot and a random set of 1000 images from the SBU Captioned Dataset.
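Putting the pieces together, equation (5) amounts to ranking every node by the product of its accumulated content score and its naturalness/specificity trade-off. In the sketch below (reusing content_scores from the previous sketch), log_noun_count and max_path_down stand in for φ and ψ̃ and are assumptions of this illustration.

```python
def predict_entry_level(hierarchy, leaf_scores, log_noun_count,
                        max_path_down, lam_hat, k=5):
    """Eq. (5): return the top-K synsets ranked by
    f_nat(v, I, lam_hat) = f(v, I) * gamma(v, lam_hat)."""
    f = content_scores(hierarchy, leaf_scores)
    def f_nat(v):
        gamma = log_noun_count(v) - lam_hat * max_path_down(v)  # Eq. (4)
        return f[v] * gamma
    return sorted(f, key=f_nat, reverse=True)[:k]
```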
5.2 Visually-guided Naming

In the previous section we rely on WordNet structure to compute estimates of image content, especially for internal nodes. However, this is not always a good measure of content because: 1) the WordNet hierarchy doesn't encode knowledge about some semantic relationships between objects (i.e. functional or contextual relationships), and 2) even with the vast coverage of 7,404 ImageNet leaf nodes we are missing models for many potentially important entry-level categories that are not at the leaf level.

As an alternative, we can directly train models for entry-level categories from data where people have provided entry-level labels – in the form of nouns present in visually descriptive image captions. We postulate that these nouns represent examples of entry-level labels because they have been naturally annotated by people to describe what is present in an image. For this task, we leverage the SBU Captioned Photo Dataset (Ordonez et al., 2011), which contains 1 million captioned images. We transform this dataset into a set D = {X(j), Y(j) | X(j) ∈ X, Y(j) ∈ Y}, where X = [0,1]^s is a vector of estimates of visual content for s = 7,404 ImageNet leaf node categories and Y = [0,1]^d is a set of binary output labels for d target categories.

Input content estimates are provided by the SVM content predictors based on convolutional network activation features described in Section 3. We run these SVM predictors over the whole image, as opposed to the max-pooling approach over bounding boxes from our previous paper (Ordonez et al., 2013), so that we have a more uniform comparison to our linguistically-guided naming approach (Section 5.1), which does the same. There was some minor drop in performance when running our models exclusively on the whole image. Compared to our previous work, our visually-guided naming approach still has a significant gain from using the ConvNet features introduced in Section 3.

For training our d target categories, we obtain labels Y from the million captions by running a POS-tagger (Bird, 2006) and defining Y(j) = {y_ij} such that:

y_ij = 1 if the caption for image j has noun i, and y_ij = 0 otherwise.   (6)

The POS-tagger helps clean up some word sense ambiguity due to polysemy, by only selecting those instances where a word is used as a noun.
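A minimal sketch of this labeling step using NLTK (the toolkit described by Bird, 2006); the exact tokenizer and tagger models, and the assumption that target nouns are single tokens, are our simplifications.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def noun_labels(captions, vocabulary):
    """Eq. (6): for each caption, a binary vector marking which target
    nouns occur in the caption while tagged as nouns."""
    index = {noun: i for i, noun in enumerate(vocabulary)}
    labels = []
    for caption in captions:
        tagged = nltk.pos_tag(nltk.word_tokenize(caption.lower()))
        y = [0] * len(vocabulary)
        for word, tag in tagged:
            if tag.startswith('NN') and word in index:  # noun usages only
                y[index[word]] = 1
        labels.append(y)
    return labels
```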
d is determined experimentally from data by learning models for the most frequent nouns in this dataset. This provides us with a target vocabulary that is both likely to contain entry-level categories (because we expect entry-level category nouns to commonly occur in our visual descriptions) and to contain sufficient images for training effective recognition models. We use up to 10,000 images for training each model. Since we are using human labels from real-world data, the frequency of words in our target vocabulary follows a power-law distribution. Hence we only have a very large amount of training data for a few of the most commonly occurring noun concepts. Specifically, we learn linear SVMs followed by Platt scaling for each of our target concepts. We keep d = 1,169 of the best performing models. Our scoring function f_svm for a target concept v_i is then:

f_svm(v_i, I, θ_i) = 1 / (1 + exp(a_i θ_i⊤X + b_i)),   (7)

where θ_i are the model parameters for predicting concept v_i, and a_i and b_i are Platt scaling parameters learned for each target concept v_i on a held out validation set. We learn the parameters θ_i by minimizing the squared hinge loss with ℓ1 regularization:

R(θ_i) = ½‖θ_i‖_1 + c Σ_{j=1}^{|D|} max(0, 1 − y_ij θ_i⊤X(j))².   (8)

The ℓ1 regularization provides a natural way of modeling the relationships between the input and output label spaces that encourages sparseness (examples in Figure 8). We find c = 0.01 to yield good results for our problem and use this value for training all individual models.
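This per-concept objective is close to what scikit-learn's LinearSVC exposes directly (squared hinge loss with an ℓ1 penalty); a minimal sketch, with the identification of the paper's c with the library's C parameter (and the library's exact scaling of the two terms) being our assumption:

```python
from sklearn.svm import LinearSVC

def train_concept_model(X, y, c=0.01):
    """Sparse linear model for one target concept v_i: squared hinge
    loss with l1 regularization, in the spirit of Eq. (8). X holds the
    7,404-dim leaf node content estimates; y the binary caption labels.
    Platt scaling (Eq. 7) is then fit on a held-out validation set."""
    model = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=c)
    model.fit(X, y)
    return model
```

The ℓ1 penalty is what yields the sparse weight vectors visualized in Figure 8, where each entry-level category keeps only a small set of strongly weighted leaf node features.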
cause we expect entry-level category nouns to commonly
Fig.9 Samplepredictionsfromourexperimentsonatestsetforeach
occurinourvisualdescriptions)andtocontainsufficientim-
typeofcategory.Notethatimagelabelscomefromcaptionnouns,so
agesfortrainingeffectiverecognitionmodels.Weuseupto
someimagesmarkedascorrectpredictionsmightnotdepictthetar-
10,000imagesfortrainingeachmodel.Sinceweareusing getconceptwhereassomeimagesmarkedaswrongpredictionsmight
humanlabelsfromreal-worlddata,thefrequencyofwords actuallydepictthetargetcategory.
Fig. 8 Entry-level categories with their corresponding top weighted leaf node features after training an SVM on our noisy data, and a visualization of weights grouped by an arbitrary categorization of leaf nodes: vegetation (green), birds (orange), instruments (blue), structures (brown), mammals (red), others (black). For example, the top weighted leaf nodes for tree include iron tree, snag, European silver fir, baobab, and Japanese black pine; those for water include riverbank, fishbowl, diving duck, and pier; and those for dog_house include kennel, chicken coop, Newfoundland, and shadow box.
Given this large dataset of images with noisy visual predictions and text labels, we manage to learn quite good estimators of high-level content, even for categories with relatively high intra-class variation (e.g. girl, boy, market, house). We show some results of images with predicted output labels for a group of images in Figure 9.

Fig. 9 Sample predictions from our experiments on a test set for each type of category (house, market, girl, boy, cat, bird), showing a PR curve together with the most confident correct and wrong predictions. Note that image labels come from caption nouns, so some images marked as correct predictions might not depict the target concept, whereas some images marked as wrong predictions might actually depict the target category.
6 Experimental Evaluation

We evaluate two results from our paper – models that learn general translations from encyclopedic concepts to entry-level concepts (Section 6.1) and models that predict entry-level concepts for images (Section 6.2). We additionally provide an extrinsic evaluation of our naming prediction methods by using them for a sentence retrieval application (Section 6.3).

6.1 Evaluating Translations

We obtain translations from ImageNet synsets to entry-level categories using Amazon Mechanical Turk (MTurk). In our experiments, users are presented with a 2x5 array of images sampled from an ImageNet synset, x_i, and asked to label the depicted concept. Results are obtained for 500 ImageNet synsets and aggregated across 8 users per task. We found agreement (measured as at least 3 of 8 users in agreement) among users for 447 of the 500 concepts, indicating that even though there are many potential labels for each synset (e.g. Sarcophaga carnaria could conceivably be labeled as fly, dipterous insect, insect, arthropod, etc.) people have a strong preference for particular categories. We denote our resulting set of reference translations as D = {(x_i, y_i)}, where each element pair corresponds to a translation from a leaf node x_i to an entry-level word y_i.

We show sample results from each of our methods to learn concept translations in Figure 6. In some cases language-based translation fails. For example, whinchat (a type of bird) translates to "chat", most likely because of the inflated counts for the most common use of "chat". Visual-based translation fails when it learns to weight context words highly, for example "snorkeling" → "water", or "African bee" → "flower", even when we try to account for common context words using TFIDF. Finally, even humans are not always correct; for example, "Rhea americana" looks like an ostrich, but is not taxonomically one. Even for categories like "marmot" most people named it "squirrel". Overall, our language-based translation (Section 4.1) agrees 37% of the time with human supplied translations and the visual-based translation (Section 4.2) agrees 33% of the time, indicating that translation learning is a non-trivial task. Our visual-based translation benefits significantly from using ConvNet features (Section 3) compared to the 21% agreement that we previously reported in Ordonez et al. (2013). Note that our visual-based translation, unlike our language-based translation, does not use the WordNet semantic hierarchy to constrain the output categories to the set of inherited hypernyms of the input category.

This experiment expands on previous studies in psychology (Rosch, 1978; Jolicoeur et al., 1984). Readily available and inexpensive online crowdsourcing enables us to gather these labels for a much larger set of (500) concepts than previous experiments and to learn generalizations for a substantially larger set of ImageNet synsets.

6.2 Evaluating Image Entry-Level Predictions

We measure the accuracy of our proposed entry-level category prediction methods by evaluating how well we can predict nouns freely associated with images by users on Amazon Mechanical Turk. We initially selected two evaluation image sets. Dataset A contains 1000 images selected at random from the million image dataset. Dataset B contains 1000 images selected from images displaying high confidence in concept predictions. We additionally collected annotations for another 2000 images so that we can tune trade-off parameters in our models. Both sets are completely disjoint from the sets of images used for learning. For each image, we instruct 3 users on MTurk to write down any nouns that are relevant to the image content. Because these annotations are free associations we observe a large and varied set of associated nouns – 3,610 distinct nouns total in our evaluation sets. This makes noun prediction extremely challenging!

For evaluation, we measure how well we can predict all nouns associated with an image by Turkers (Figure 10) and how well we can predict the nouns commonly associated by Turkers (assigned by at least 2 of 3 Turkers, Figure 11). For reference, we compute the precision of one human annotator against the other two and find that on Dataset A humans were able to predict what the previous annotators labeled with 0.35 precision, and with 0.45 precision for Dataset B.

Results show precision and recall for prediction on each of our datasets, comparing: leaf node classification performance (flat classifier), the outputs of hedging (Deng et al., 2012), and our proposed entry-level category predictors (linguistically guided naming, Section 5.1, and visually guided naming, Section 5.2). Qualitative examples for Dataset A are shown in Figure 13 and for Dataset B in Figure 14. Performance at this task on Dataset B is in general better than performance on Dataset A. This is unsurprising since Dataset B contains images which have confident classifier scores.
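For concreteness, precision and recall of the top-K predictions against the Turker noun sets can be computed as below; the micro-averaging over images is our assumption, since the exact averaging behind the plots is not spelled out here.

```python
def precision_recall_at_k(predictions, ground_truth, k):
    """predictions: image id -> ranked list of predicted nouns;
    ground_truth: image id -> set of nouns supplied by Turkers."""
    hits = n_predicted = n_relevant = 0
    for img, ranked in predictions.items():
        top_k = set(ranked[:k])
        truth = ground_truth[img]
        hits += len(top_k & truth)
        n_predicted += len(top_k)
        n_relevant += len(truth)
    return hits / n_predicted, hits / n_relevant
```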
Fig. 10 Precision-recall curves on Dataset A (left) and Dataset B (right) for different entry-level prediction methods (visually-guided naming, linguistically-guided naming, hedging (Deng et al., 2012), and flat classifiers) when using the top K categorical predictions for K = 1, 3, 5, 10, 15, 20, 50. The ground truth is the union of labels from all users for each image.

Fig. 11 Precision-recall curves on Dataset A (left) and Dataset B (right) for different entry-level prediction methods when using the top K categorical predictions for K = 1, 3, 5, 10, 15, 20, 50. The ground truth is the set of labels where at least two users agreed.
Surprisingly, their difference in performance is not extreme, and performance on both sets is admirable for this challenging task. When compared to our previous work (Ordonez et al., 2013), which relies on SIFT + LLC features, we found that the inclusion of ConvNet features provided a significant improvement in the performance of the visually-guided naming predictions, but it did not improve the results using the WordNet semantic hierarchy for both hedging (Deng et al., 2012) and our linguistically-guided naming method.

On the two datasets we find the visually-guided naming model (Section 5.2) to perform better than the linguistically-guided naming prediction (Section 5.1). In addition, we outperform both leaf node classification and the hedging technique (Deng et al., 2012).

We additionally collected a third test set, Dataset C, consisting of random ImageNet images belonging to the 7,404