Learning From Noisy Large-Scale Datasets With Minimal Supervision*

Andreas Veit1  Neil Alldrin3  Gal Chechik3  Ivan Krasin3  Abhinav Gupta2,3  Serge Belongie1
1 Department of Computer Science & Cornell Tech, Cornell University
2 The Robotics Institute, Carnegie Mellon University
3 Google Inc.

arXiv:1701.01619v1 [cs.CV] 6 Jan 2017

* Work done during an internship at Google Research.

Abstract

We present an approach to effectively use millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images to learn powerful image representations. One common approach to combine clean and noisy data is to first pre-train a network using the large noisy dataset and then fine-tune with the clean dataset. We show this approach does not fully leverage the information contained in the clean set. Thus, we demonstrate how to use the clean annotations to reduce the noise in the large dataset before fine-tuning the network using both the clean set and the full set with reduced noise. The approach comprises a multi-task network that jointly learns to clean noisy annotations and to accurately classify images. We evaluate our approach on the recently released Open Images dataset, containing ~9 million images, multiple annotations per image and over 6000 unique classes. For the small clean set of annotations we use a quarter of the validation set with ~40k images. Our results demonstrate that the proposed approach clearly outperforms direct fine-tuning across all major categories of classes in the Open Images dataset. Further, our approach is particularly effective for the large number of classes with a medium level of noise in their annotations (20-80% false positive annotations).

Figure 1. Sample images and annotations from the Open Images validation set illustrating the variety of images and the noise in the annotations. We are concerned with the task of training a robust multi-label image classifier from the noisy annotations. While the image annotations are simple lists of classes, our model implicitly learns the structure in the label space. For illustrative purposes, the structure is sketched as a graph with green and red edges denoting strong positive and negative relations. Our proposed approach produces both a cleaned version of the dataset as well as a robust image classifier.

1. Introduction

Deep convolutional neural networks (ConvNets) proliferate in current machine vision. One of the biggest bottlenecks in scaling their learning is the need for massive and clean collections of semantic annotations for images. Today, even after five years of success of ImageNet [8], there is still no publicly available dataset containing an order of magnitude more clean labeled data. To tackle this bottleneck, other training paradigms have been explored that aim to bypass the need for training with expensive, manually collected annotations. Examples include unsupervised learning [16], self-supervised learning [9, 23, 24, 30] and learning from noisy annotations [6, 22].

Most of these approaches make the strong assumption that all annotations are noisy and that no clean data is available. In reality, typical learning scenarios are closer to semi-supervised learning: images have noisy or missing annotations, and a small fraction of images also have clean annotations. This is the case, for example, when noisy annotations are mined from the web and then a small fraction is sent to costly human verification.

In this paper, we explore how to effectively and efficiently leverage a small amount of clean annotations in conjunction with large amounts of noisily annotated data, in particular to train convolutional neural networks. One common approach is to pre-train a network with the noisy data and then fine-tune it with the clean dataset to obtain better performance.

Figure 2. High-level overview of our approach. Noisy input labels are cleaned and then used as targets for the final classifier. The label cleaning network and the multi-label classifier are jointly trained and share visual features from a deep convnet. The cleaning network is supervised by the small set of clean annotations (not shown) while the final classifier utilizes both the clean data and the much larger noisy data.

We evaluate the approach on the recently-released large-scale Open Images dataset [15]. The results demonstrate that the proposed approach significantly improves performance over traditional fine-tuning methods. Moreover, we show that direct fine-tuning sometimes hurts performance when only limited rated data is available. In contrast, our method improves performance across the full range of label noise levels, and is most effective for classes having 20% to 80% false positive annotations in the training set. The method performs well across a range of categories, showing consistent improvement on classes in all eight high-level categories of Open Images (vehicles, products, art, person, sport, food, animal, plant).

This paper makes the following contributions. First, we introduce a semi-supervised learning framework for multi-label image classification that facilitates small sets of clean annotations in conjunction with massive sets of noisy annotations. Second, we provide a first benchmark on the recently released Open Images dataset. Third, we demonstrate that the proposed learning approach is more effective in leveraging small labeled data than traditional fine-tuning.
We argue that this common fine-tuning approach does not fully leverage the information contained in the clean annotations. We propose an alternative approach: instead of using the small clean dataset to learn visual representations directly, we use it to learn a mapping between noisy and clean annotations. We argue that this mapping not only learns the patterns of noise, but also captures the structure in the label space. The learned mapping between noisy and clean annotations allows us to clean the noisy dataset and to fine-tune the network using both the clean set and the full dataset with reduced noise. The proposed approach comprises a multi-task network that jointly learns to clean noisy annotations and to accurately classify images (Figure 2).

In particular, we consider an image classification problem with the goal of annotating images with all concepts present in the image. When considering label noise, two aspects are worth special attention. First, many multi-label classification approaches assume that classes are independent. However, the label space is typically highly structured, as illustrated by the examples in Figure 1. We therefore model the label-cleaning network as conditionally dependent on all noisy input labels. Second, many classes can have multiple semantic modes. For example, the class coconut may be assigned to an image containing a drink, a fruit or even a tree. To differentiate between these modes, the input image itself needs to be taken into account. Our model therefore captures the dependence of annotation noise on the input image by making the learned cleaning network conditionally dependent on image features.

2. Related Work

This paper introduces an algorithm to leverage a large corpus of noisily labeled training data in conjunction with a small set of clean labels to train a multi-label image classification model. Therefore, we restrict this discussion to learning from noisy annotations in image classification. For a comprehensive overview of label noise taxonomy and noise-robust algorithms we refer to [11].

Approaches to learn from noisily labeled data can generally be categorized into two groups. Approaches in the first group aim to directly learn from noisy labels and focus mainly on noise-robust algorithms, e.g., [3, 20], and label cleansing methods to remove or correct mislabeled data, e.g., [4]. Frequently, these methods face the challenge of distinguishing difficult from mislabeled training samples. Second, semi-supervised learning (SSL) approaches tackle these shortcomings by combining the noisy labels with a small set of clean labels [32]. SSL approaches use label propagation such as constrained bootstrapping [7] or graph-based approaches [10]. Our work follows the semi-supervised paradigm, however focusing on learning a mapping between noisy and clean labels and then exploiting that mapping for training deep neural networks.

Within the field of training deep neural networks there are three streams of research related to our work. First, various methods have been proposed to explicitly model label noise with neural networks. Natarajan et al. [22] and Sukhbaatar et al. [26] both model noise that is conditionally independent from the input image. This assumption does not take into account the input image and is thus not able to distinguish effectively between different visual modes and related noise. The closest work in this stream of research is from Xiao et al. [31], who propose an image-conditioned noise model. They first aim to predict the type of noise for each sample (out of a small set of types: no noise, random noise, structured label-swapping noise) and then attempt to remove it. Our proposed model is also conditioned on the input image, but differs from these approaches in that it does not explicitly model specific types of noise and is designed for multiple labels per image, not only single labels. Also related is the work of Misra et al. [21], who model noise arising from missing, but visually present, labels. While their method is conditioned on the input image and is designed for multiple labels per image, it does not take advantage of cleaned labels, and their focus is on missing labels, while our approach can address both incorrect and missing labels.

Second, transfer learning has become common practice in modern computer vision. There, a network is pre-trained on a large dataset of labeled images, say ImageNet, and then used for a different but related task by fine-tuning on a small task-specific dataset, for tasks like image classification and retrieval [25] and image captioning [29]. Unlike these works, our approach aims to train a network from scratch using noisy labels and then facilitates a small set of clean labels to fine-tune the network.

Third, the proposed approach has surface resemblance to student-teacher models and model compression, where a student, or compressed, model learns to imitate a teacher model of generally higher capacity or with privileged information [2, 5, 14, 19]. In our framework, we train a ConvNet with two classifiers on top, a cleaning network and an image classifier, where the output of the cleaning network is the target of the image classifier. The cleaning network has access to the noisy labels in addition to the visual features, which could be considered privileged information. In our setup the two networks are trained in one joint model.

3. Our Approach

Our goal is to train a multi-label image classifier using a large dataset with relatively noisy labels, where additionally a small subset of the dataset has human-verified labels available. This setting naturally occurs when collecting images from the web, where only a small subset can be verified by experts. Formally, we have a very large training dataset T comprising tuples of noisy labels y and images I, T = {(y_i, I_i), ...}, and a small dataset V of triplets of verified labels v, noisy labels y and images I, V = {(v_i, y_i, I_i), ...}. The two sets differ significantly in size, with |T| >> |V|. For instance, in our experiments, T exceeds V by three orders of magnitude. Each label y or v is a sparse d-dimensional vector with a binary annotation for each of the d classes indicating whether it is present in the image or not. Since the labels in T contain significant label noise and V is too small to train a ConvNet, our goal is to design an efficient and effective approach to leverage the quality of the labels in V and the size of T.

Table 1. Breakdown of the ground-truth annotations in the validation set of the Open Images Dataset by high-level category. The dataset spans a wide range of everyday categories, from man-made products to personal activities as well as coarse and fine-grained natural species.

    high-level category    unique labels    annotations
    vehicles                     944           240,449
    products                     850           132,705
    art                          103            41,986
    person                       409            55,417
    sport                        446            65,793
    food                         862           140,383
    animal                      1064           187,147
    plant                        517            87,542
    others                      1388           322,602

3.1. Multi-Task Label Cleaning Architecture

We propose a multi-task neural network architecture that jointly learns to reduce the label noise in T and to annotate images with accurate labels. An overview of the model architecture is given in Figure 3. The model comprises a fully convolutional neural network [12, 17, 18] f with two classifiers g and h. The first classifier is a label cleaning network denoted as g that models the structure in the label space and learns a mapping from the noisy labels y to the human-verified labels v, conditional on the input image. We denote the cleaned labels output by g as c^, so that c^ = g(y, I). The second classifier is an image classifier denoted as h that learns to annotate images by imitating the first classifier using only the image as input. We denote the predicted labels output by h as p^, so that p^ = h(I).

Figure 3. Overview of our approach to train an image classifier from a very large set of training samples with noisy labels (orange) and a small set of samples which additionally have human verification (green). The model contains a label cleaning network that learns to map noisy labels to clean labels, conditioned on visual features from an InceptionV3 ConvNet. The label cleaning network is supervised by the human-verified labels and follows a residual architecture so that it only needs to learn the difference between the noisy and clean labels. The image classifier shares the same visual features and learns to directly predict clean labels supervised by either (a) the output of the label cleaning network or (b) the human-rated labels, if available.

The image classifier h is shown in the bottom row of Figure 3. First, a sample image is processed by the convolutional network to compute high-level image features. Then, these features are passed through a fully-connected layer w followed by a sigmoid sigma, h = sigma(w(f(I))). The image classifier outputs p^, a d-dimensional vector in [0, 1]^d encoding the likelihood of the visual presence of each of the d classes.

The label cleaning network g is shown in the top row of Figure 3. In order to model the label structure and noise conditional on the image, the network has two separate inputs: the noisy labels y as well as the visual features f(I). The sparse noisy label vector is treated as a bag of words and projected into a low-dimensional label embedding that encodes the set of labels. The visual features are similarly projected into a low-dimensional embedding. To combine the two modalities, the embedding vectors are concatenated and transformed with a hidden linear layer, followed by a projection back into the high-dimensional label space.

The label cleaning network is supervised by the verified labels of all samples i in the human-rated set V. The cleaning loss is based on the difference between the cleaned labels c^_i and the corresponding ground-truth verified labels v_i,

    L_clean = sum_{i in V} |c^_i - v_i|        (2)

We choose the absolute distance as error measure since the label vectors are very sparse; other measures, such as the squared error, tend to smooth the labels.
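For concreteness, the two heads described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the paper's TensorFlow implementation: the weight matrices are random stand-ins for learned parameters, the embedding size is invented, and `feats` stands in for the InceptionV3 features f(I). The cleaning head already applies the residual skip connection and clipping that the paper uses, c^ = clip(y + g'(y, f(I)), [0, 1]).

```python
import numpy as np

rng = np.random.default_rng(0)
d, feat_dim, emb_dim = 6012, 2048, 64   # d matches the 6012 classes; other sizes are illustrative

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical random parameters standing in for learned weights.
W_cls = rng.normal(0.0, 0.01, (feat_dim, d))           # classifier head w
W_lab = rng.normal(0.0, 0.01, (d, emb_dim))            # label embedding (bag-of-words projection)
W_vis = rng.normal(0.0, 0.01, (feat_dim, emb_dim))     # visual-feature embedding
W_hid = rng.normal(0.0, 0.01, (2 * emb_dim, emb_dim))  # hidden linear layer on the concatenation
W_out = rng.normal(0.0, 0.01, (emb_dim, d))            # projection back into label space

def classifier_h(feats):
    """Image classifier h: p_hat = sigmoid(w(f(I)))."""
    return sigmoid(feats @ W_cls)

def cleaning_g(y_noisy, feats):
    """Label cleaning network g: a residual correction of the noisy labels,
    conditioned on both the label embedding and the visual features, followed
    by the identity skip connection and clipping to the valid label space."""
    emb = np.concatenate([y_noisy @ W_lab, feats @ W_vis])
    residual = (emb @ W_hid) @ W_out                  # g'(y, f(I))
    return np.clip(y_noisy + residual, 0.0, 1.0)      # c^ = clip(y + g'(y, f(I)), [0, 1])

feats = rng.normal(size=feat_dim)   # stand-in for ConvNet features f(I)
y = np.zeros(d)
y[[10, 200, 4711]] = 1.0            # sparse binary noisy-label vector
c_hat = cleaning_g(y, feats)
p_hat = classifier_h(feats)
```

Because of the skip connection, an untrained residual module (near-zero weights) already returns labels close to the noisy input, which matches the paper's observation that the network "predicts reasonable outputs right from the beginning."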
Another key detail of the label cleaning network is an identity skip connection that adds the noisy labels from the training set to the output of the cleaning module. The skip connection is inspired by the approach of He et al. [13], but differs in that the residual cleaning module has the visual features as side input. Due to the residual connection, the network only needs to learn the difference between the noisy and clean labels instead of regressing the entire label vector. This simplifies the optimization and enables the network to predict reasonable outputs right from the beginning. To remain in the valid label space, the outputs are clipped to 0 and 1. Denoting the residual cleaning module as g', the label cleaning network g computes cleaned labels

    c^ = clip(y + g'(y, f(I)), [0, 1])        (1)

3.2. Model Training

To learn the model parameters, we formulate two losses that we minimize jointly using stochastic gradient descent: a label cleaning loss L_clean that captures the quality of the cleaned labels c^ and a classification loss L_classify that captures the quality of the predicted labels p^. The calculation of the loss terms is illustrated on the right side of Figure 3.

For the image classifier, the supervision depends on the source of the training sample. For all samples j from the noisy dataset T, the classifier is supervised by the cleaned labels c^_j produced by the label cleaning network. For samples i where human ratings are available, i in V, supervision comes directly from the verified labels v_i. To allow for multiple annotations per image, we choose the cross-entropy as classification loss to capture the difference between the predicted labels p^ and the target labels,

    L_classify = - sum_{i in V} v_i log(p^_i) - sum_{j in T} c^_j log(p^_j)        (3)

It is worth noting that the vast majority of training examples come from set T. Thus, the second summation in Equation 3 dominates the overall loss of the model. To prevent a trivial solution, in which the cleaning network and classifier both learn to predict label vectors of all zeros, c^_j = p^_j = {0}^d, the classification loss is only propagated to p^_j. The cleaned labels c^_j are treated as constants with respect to the classification and only incur gradients from the cleaning loss.

To train the cleaning network and image classifier jointly, we sample training batches that contain samples from T as well as V in a ratio of 9:1. This allows us to utilize the large number of samples in T while giving enough supervision to the cleaning network from V.

4. Experiments

4.1. Dataset

We evaluate our proposed model on the recently released Open Images dataset [15]. The dataset is uniquely suited for our task as it contains a very large collection of images with relatively noisy annotations and a small validation set with human verifications. The dataset is multi-label and massively multi-class in the sense that each image contains multiple annotations and the vocabulary contains several thousand unique classes. In particular, the training set contains 9,011,219 images with a total of 79,156,606 annotations, an average of 8.78 annotations per image. The validation set contains another 167,056 images with a total of 2,047,758 annotations, an average of 12.26 annotations per image. The dataset contains 6012 unique classes and each class has at least 70 annotations over the whole dataset.

One key distinction from other datasets is that the classes in Open Images are not evenly distributed. Some high-level classes such as 'vehicle' have over 900,000 annotations in the dataset, while many fine-grained classes are very sparse; e.g., 'honda nsx' only occurs 70 times. Figure 4(a) shows the distribution of class frequencies over the validation set. Further, many classes are highly related to each other. To differentiate our evaluation between clusters of semantically closely related classes, we group classes with respect to their associated high-level category. Table 1 gives an overview of the main categories and their statistics over the validation set.

The annotations of Open Images are generated automatically by an image classifier similar to the Google Cloud Vision API(1); the validation set additionally contains human verifications for a subset of these machine-generated annotations. Figure 4(b) shows the distribution of the quality of the automated annotations. While some classes only have correct annotations, others do not have any. However, the noise is not random, since the label space is highly structured; see Figure 1 for examples.

For our experiments, we use the training set as the large corpus of images with only noisy labels, T. Further, we split the validation set into two parts: one quarter of about 40 thousand images is used in our cleaning approach, providing both noisy and human-verified labels, V. The remaining three quarters are held out and used only for validation.

Figure 4. Label statistics for the Open Images dataset. (a) Classes are heavily skewed in terms of number of annotations, e.g., 'vehicle' occurs over 900,000 times whereas 'honda nsx' only occurs 70 times. (b) Classes also vary significantly in annotation quality. While more than 70% of the ~80M annotations in the dataset are correct, the majority of the 6012 classes have lower quality. Generally, common classes tend to have higher annotation quality.

4.2. Evaluation Task and Metrics

We evaluate our approach using multi-label image classification, i.e., predicting a score for each class-image pair indicating the likelihood that the concept described by the class is present in the image.

There is no standard evaluation procedure yet for classification on the Open Images dataset. Thus, we choose the widely used average precision (AP) as the metric to evaluate performance. The AP for each class c is

    AP_c = ( sum_{k=1}^{N} Precision(k, c) * rel(k, c) ) / (number of positives)        (4)

where Precision(k, c) is the precision for class c when retrieving k annotations and rel(k, c) is an indicator function that is 1 iff the ground truth for class c and the image at rank k is positive. N is the size of the validation set. We report the mean average precision (MAP), which takes the average over the APs of all d = 6012 classes, MAP = (1/d) sum_{c=1}^{d} AP_c. Further, because we care more about the model performance on commonly occurring classes, we also report a class-agnostic average precision, AP_all. This metric considers every annotation equally by treating all annotations as coming from one single class.

Evaluation on Open Images comes with the challenge that the validation set is collected by verifying the automatically generated annotations. As such, human verification only exists for a subset of the classes of each image. This raises the question of how to treat classes without verification. One option is to consider classes with missing human verification as negative examples. However, we observe that a large number of the highly ranked annotations are likely correct but not verified. Treating them as negatives would penalize models that differ substantially from the baseline model. Thus, we choose instead to ignore classes without human verification in our metrics. This means the measured precision at full recall for all approaches is very close to the precision of the model that was used to annotate the dataset; see the PR curve in Figure 6(a).

(1) https://cloud.google.com/vision/

Table 2. Comparison of models in terms of AP_all and MAP on the held-out subset of the Open Images validation set. Our approach outperforms competing methods. See Sections 4.2 and 4.3 for more details on the metrics and model variants.

    Model                                     AP_all     MAP
    Baseline                                   83.82    61.82
    Misra et al. [21] visual classifier        83.55    61.85
    Misra et al. [21] relevance classifier     83.79    61.89
    Fine-tuning with mixed labels              84.80    61.90
    Fine-tuning with clean labels              85.88    61.53
    Our approach with pre-training             87.68    62.36
    Our approach trained jointly               87.67    62.38

4.3. Baselines and Model Variants

As the baseline model for our evaluation, we train a network solely on the noisy labels from the training set. We refer to this model as the baseline and use it as the starting point for all other model variants. We compare the following approaches.

Fine-tune with clean labels: A common approach is to use the clean labels directly to supervise the last layer. This approach converges quickly because the dataset for fine-tuning is very small; however, many classes have very few training samples, making it prone to overfitting.

Fine-tune with mix of clean and noisy labels: This addresses the shortcoming of limited training samples. We fine-tune the last layer with a mix of training samples from the small clean and the large noisy set (in a 1 to 9 ratio).

Our approach with pre-trained cleaning network: We compare two different variants of our approach. Both are trained as described in Section 3.2; they only differ with respect to their initialization. For the first variant, we initially train just the label cleaning network on the human-rated data. Then, subsequently, we train the cleaning network and the classification layer jointly.

Our approach trained jointly: To reduce the overhead of pre-training the cleaning network, we also train a second variant in which the cleaning network and the classification layer are trained jointly right from the beginning.

Misra et al.: Finally, we compare to the approach of Misra et al. [21]. As expected, our method performs better, since their model does not utilize the clean labels and their noise model focuses on missing labels.

Figure 5. Performance gain of our approach with respect to how common a class is and how noisy its annotations are in the dataset. We sort the classes along the x-axis, group them into 10 equally sized groups and compute the MAP gain over the baseline within each group. (a) Our approach is most effective for classes that occur frequently. (b) Our approach improves performance across all levels of annotation quality. It shows the largest gain for classes with 20% to 80% false annotations, classes that contain sufficient negative and positive examples in the human-rated set.

Table 3. Mean average precision for classes grouped according to high-level categories of the Open Images Dataset. Our method consistently performs best across all categories.

    Model                            vehicles  products    art  person  sport   food  animal  plant
    Baseline                            56.92     61.51  68.28   59.46  62.84  61.79   61.14  59.00
    Fine-tuning with mixed labels       57.00     61.56  68.23   59.49  63.12  61.77   61.27  59.14
    Fine-tuning with clean labels       56.93     60.94  68.12   58.39  62.56  61.60   61.18  58.90
    Our approach with pre-training      57.15     62.31  68.89   60.03  63.60  61.87   61.26  59.45
    Our approach trained jointly        57.17     62.31  68.98   60.05  63.61  61.87   61.27  59.36

Figure 6. Precision-recall curves for all methods, measured over all annotations and for the major categories of products and animals. In general, our method performs best, followed by fine-tuning with clean labels, fine-tuning with a mix of clean and noisy labels, and the baseline model. Over all classes, we see improvements across all confidence levels. For products the main improvements come from annotations with high confidence. For animals we observe mainly gains in the lower confidence regime. It is worthy of note that there is virtually no difference between pre-training the cleaning network and learning it jointly.

4.4. Training Details

For our base model, we use an Inception v3 network architecture [27], implemented with the TensorFlow framework [1] and optimized with RMSprop [28] with learning rate 0.045 and exponential learning rate decay of 0.94 every 2 epochs. As our only modification to the architecture, we replace the final softmax with a 6012-way sigmoid layer. The network is supervised with a binary cross-entropy loss between the sparse training labels and the network's predictions. We trained the baseline model on 50 NVIDIA K40 GPUs using the noisy labels from the Open Images training set. We stopped training after 49 million mini-batches (each containing 32 images). This network is the starting point for all model variants(2).

The four different fine-tuning variants are trained for an additional 4 million batches each. The learning rate for the last classification layer is initialized to 0.001. For the cleaning network it is set higher, to 0.015, because its weights are initialized randomly.

(2) Our pre-trained baseline model will be made available to the public.
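Putting the pieces of Sections 3.2 and 4.4 together, one training step can be sketched as follows. This is a schematic numpy/stdlib sketch, not the paper's TensorFlow implementation: the toy label vectors are invented, `make_batch` only mimics the reported 9:1 mixing of T and V, and `classify_loss` mirrors Equation 3 with its targets treated as constants (no gradient machinery is shown).

```python
import random
import numpy as np

def clean_loss(c_hat, v):
    """Eq. 2: absolute distance between cleaned labels and verified labels (set V)."""
    return float(np.sum(np.abs(c_hat - v)))

def classify_loss(p_hat, target, eps=1e-7):
    """Eq. 3: cross-entropy between predictions p_hat and the target labels
    (verified labels v for samples in V, cleaned labels c_hat for samples in T).
    The targets are plain constants here, matching the paper's stop-gradient:
    no gradient from this loss reaches the cleaning network."""
    p = np.clip(p_hat, eps, 1.0 - eps)
    return float(-np.sum(target * np.log(p)))

def make_batch(noisy_set, verified_set, batch_size=32):
    """Sample a training batch with samples from T and V in a 9:1 ratio."""
    n_clean = batch_size // 10
    batch = random.sample(verified_set, n_clean) + random.sample(noisy_set, batch_size - n_clean)
    random.shuffle(batch)
    return batch

def total_loss(l_clean, l_classify, w_clean=0.1, w_classify=1.0):
    """Weighted combination with the 0.1 / 1.0 weights reported in Section 4.4."""
    return w_clean * l_clean + w_classify * l_classify

# Toy vectors for a single sample from T (targets are the cleaned labels).
v     = np.array([1.0, 0.0, 1.0, 0.0])   # verified labels
c_hat = np.array([0.9, 0.1, 0.8, 0.0])   # cleaned labels
p_hat = np.array([0.7, 0.2, 0.6, 0.1])   # classifier predictions
L = total_loss(clean_loss(c_hat, v), classify_loss(p_hat, c_hat))
```

Note how an all-zero target vector would drive `classify_loss` to zero, which is exactly the trivial solution the stop-gradient on the cleaned labels is meant to prevent.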
For the approach with a pre-trained cleaning network, the cleaning network is first trained with a learning rate of 0.015 until convergence; the rate is then set to 0.001 once it is trained jointly with the classifier. To balance the losses, we weight L_clean with 0.1 and L_classify with 1.0.

4.5. Results

We first analyze the overall performance of the proposed approach. Table 2 shows mean average precision as well as class-agnostic average precision. Generally, the performance in terms of AP_all is higher than for MAP, indicating that the average precision for common classes is higher than for rare classes. Considering all annotations equally, AP_all, we see clear improvements of all variants over the baseline. Further, the two variants of the proposed approach perform very similarly to each other and demonstrate a significant lead over direct fine-tuning.

The results in terms of MAP show a different picture. Instead of improving performance, fine-tuning on the clean data directly even hurts performance. This means the improvement in AP_all is due to a few very common classes, while performance on the majority of classes decreases. For many classes, the limited number of annotations in the clean label set seems to lead to overfitting. Fine-tuning on clean and noisy annotations alleviates the problem of overfitting, however, at a cost in overall performance. Our approach, on the other hand, does not face the problem of overfitting. Again, our two variants perform very similarly and both demonstrate significant improvements over the baseline and direct fine-tuning. The consistent improvement over all annotations and over all classes shows that our approach is clearly more effective than direct fine-tuning at extracting the information from the clean label set.

The similar performance of the variants with and without a pre-trained cleaning network indicates that pre-training is not required and our approach can be trained jointly.

4.5.1 Effect of label frequency and annotation quality

The difference in performance between AP_all and MAP implies that how common a class is plays a key role in its performance and in how much it benefits from label cleaning. We take a closer look at how class frequency and annotation quality affect the performance of our approach.
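The gap between AP_all and MAP discussed here comes directly from Equation 4; the small numpy sketch below (with invented toy scores and labels) makes the per-class computation concrete. MAP averages this quantity over the d classes, while AP_all applies it once to all annotations pooled into a single ranked list.

```python
import numpy as np

def average_precision(scores, labels):
    """Eq. 4: AP_c = sum_k Precision(k, c) * rel(k, c) / (number of positives),
    with annotations ranked by descending classifier confidence."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(labels, dtype=float)[order]               # rel(k, c) at each rank k
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float(np.sum(precision_at_k * rel) / np.sum(rel))

# Toy example: 5 validation images, binary ground truth for one class.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # classifier confidences
labels = [1,   0,   1,   0,   0]     # human-verified ground truth
ap = average_precision(scores, labels)   # (1/1 + 2/3) / 2
```

Because rare classes contribute to MAP with the same weight as common ones, a method that only helps a few frequent classes can raise AP_all while leaving MAP flat, which is the effect analyzed in this subsection.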
all 7 e e e Imagefromvalidationset Stoupp5erpseretdoicftions BaselineDirectlyfine-tunoncleanlabels OurApproach Imagefromvalidationset Stoupp5erpseretdoicftions BaselineDirectlyfine-tunoncleanlabels OurApproach Imagefromvalidationset Stoupp5erpseretdoicftions BaselineDirectlyfine-tunoncleanlabels OurApproach Sports Red Musician Athlete Night Music Individualsports Color Performance Muscle Lighting Entertainment Humanaction Light Person Ballgame Building Guitarist Teamsport Room Singing Screenshot Temple Stage Clothing Plant Geological- Fashion Garden phenomenon Black Botany Atmosphere Darkness Tree Darkness Light Landplant Red Stage Flower Night Entertainment Yard Vehicle Performingarts Backyard Screenshot Sports Sea Figure7.Examplesfromthehold-outportionoftheOpenImagesvalidationset. Weshowthetop5mostconfidentpredictionsofthe baseline model, directly fine-tuning on clean labels and our approach, along with whether the prediction is correct of incorrect. Our approachconsistentlyremovesfalsepredictionsmadebythebaselinemodel. Examplegainsaretheremovalof‘teamsport’andrecall of‘muscle’intheupperleft. Thisisaverytypicalexampleasmostsportimagesareannotatedwith‘ballgame’and‘teamsport’inthe dataset.Directlyfine-tuningachievesmixedresults.Sometimesitperformssimilartoourapproachandremovesfalselabels,butforothers itevenrecallsmorefalselabels.Thisillustratesthechallengeofoverfittingfordirectly-finetuning. Figure5(a)showstheperformanceimprovementofour egories, shown in Table 1, range from man-made objects approach over the baseline with respect to how common a such as vehicles to persons and activities to natural cate- class is. The x-axis shows the 6012 unique classes in in- goriessuchasplants. Table3showsthemeanaveragepre- creasingorderfromraretocommon. Wegrouptheclasses cision.Ourapproachclearlyimprovesoverthebaselineand along the axis into 10 equally sized groups The result re- directfine-tuning. 
Similarresultsareobtainedforclassag- vealsthatourapproachisabletoachieveperformancegains nosticaverageprecision,wherewealsoshowtheprecision- acrossalmostalllevelsoffrequency. Ourmodelismostef- recall curves for the major categories of products and an- fective for very common classes and shows improvement imals in Figure 6. For products the main improvements for all but a small subset of rare classes. Surprisingly, for comefromhigh-confidencelabels,whereas,foranimalswe veryrareclasses,mostlyfine-grainedobjectcategories,we observemainlygainsinthelowerconfidenceregime. againobserveanimprovement. Figure 5(b) shows the performance improvement with 5.Conclusion respect to the annotation quality. The x-axis shows the classes in increasing order from very noisy annotations to How to effectively leverage a small set of clean labels alwayscorrectannotations. Ourapproachimprovesperfor- inthepresenceofamassivedatasetwithnoisylabels? We mance across all levels of annotation quality. The largest showthatusingthecleanlabelstodirectlyfine-tuneanet- gains are for classes with medium levels of annotation worktrainedonthenoisylabelsdoesnotfullyleveragethe noise. For classes with very clean annotations the perfor- informationcontainedinthecleanlabelset. Wepresentan mance is already very high, limiting the potential for fur- alternative approach in which the clean labels are used to ther gains. For very noisy classes nearly all automatically reduce the noise in the large dataset before fine-tuning the generated annotations are incorrect. This means the label networkusingboththecleanlabelsandthefulldatasetwith cleaningnetworkreceivesalmostnosupervisionforwhata reduced noise. We evaluate on the recently released Open positivesampleis. Classeswithmediumannotationquality Images dataset showing that our approach outperforms di- contain sufficient negative as well as positive examples in rectfine-tuningacrossallmajorcategoriesofclasses. thehumanratedsetandhavepotentialforimprovement. 
There are a couple of interesting directions for future work. The cleaning network in our setup combines the label and image modalities with a concatenation and two 4.5.2 Performance on high-level categories of Open fully connected layers. Future work could explore higher Imagesdataset capacity interactions such as bilinear pooling. Further, in Now we evaluate the performance on the major sub- ourapproachtheinputandoutputvocabularyoftheclean- categories of classes in the Open Images dataset. The cat- ingnetworkisthesame. Futureworkcouldaimtolearna 8 mappingofnoisylabelsinonedomainintocleanlabelsin [18] J.Long, E.Shelhamer, andT.Darrell. Fullyconvolutional anotherdomainsuchasFlickrtagstoobjectcategories. networksforsemanticsegmentation. InProceedingsofthe IEEEConferenceonComputerVisionandPatternRecogni- References tion,pages3431–3440,2015. [19] D.Lopez-Paz,L.Bottou,andy.v.BernhardScho¨lkopfand [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, VladimirVapnik, journal=CoRR. Unifyingdistillationand C.Citro,G.S.Corrado,A.Davis,J.Dean,M.Devin,etal. privilegedinformation. Tensorflow:Large-scalemachinelearningonheterogeneous [20] N. Manwani and P. S. Sastry. Noise tolerance under risk distributedsystems.arXivpreprintarXiv:1603.04467,2016. minimization. CoRR,abs/1109.5231,2013. [2] J.BaandR.Caurana. Dodeepnetsreallyneedtobedeep? [21] I.Misra,C.L.Zitnick,M.Mitchell,andR.Girshick. Seeing CoRR,abs/1312.6184,2014. throughtheHumanReportingBias: VisualClassifiersfrom [3] E.BeigmanandB.B.Klebanov. Learningwithannotation NoisyHuman-CentricLabels. InCVPR,2016. noise. InACL/IJCNLP,2009. [22] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. [4] C.E.BrodleyandM.A.Friedl.Identifyingmislabeledtrain- Learningwithnoisylabels. InNIPS,2013. ingdata. CoRR,abs/1106.0219,1999. [23] L.Pinto,D.Gandhi,Y.Han,Y.-L.Park,andA.Gupta. The [5] C.Bucilu,R.Caruana,andA.Niculescu-Mizil.Modelcom- curious robot: Learning visual representations via physical pression. 
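To make the label/image fusion discussed above concrete, the following sketches a cleaning network that concatenates the two modalities and passes them through two fully connected layers. This is an illustrative reimplementation, not the paper's code: the hidden size, initialization scale, ReLU activation, and sigmoid output are our assumptions; only the number of classes (6012) and the concatenation-plus-two-FC structure come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, EMBED_DIM, HIDDEN = 6012, 2048, 512  # EMBED_DIM and HIDDEN are illustrative

# Illustrative random weights for the two fully connected layers.
W1 = rng.standard_normal((NUM_CLASSES + EMBED_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, NUM_CLASSES)) * 0.01
b2 = np.zeros(NUM_CLASSES)

def clean_labels(noisy_labels, image_embedding):
    """Fuse label and image modalities by concatenation, then two FC layers.

    noisy_labels: (NUM_CLASSES,) binary/soft annotation vector.
    image_embedding: (EMBED_DIM,) visual feature vector.
    Returns per-class scores in (0, 1), read as cleaned multi-label targets."""
    x = np.concatenate([noisy_labels, image_embedding])
    h = np.maximum(0.0, x @ W1 + b1)          # first FC layer with ReLU
    logits = h @ W2 + b2                      # second FC layer
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid for multi-label output

cleaned = clean_labels(rng.random(NUM_CLASSES), rng.random(EMBED_DIM))
```

Because the input and output vocabularies here are identical, swapping the first block of the input (and retraining) is all that a cross-domain variant, such as Flickr tags in and object categories out, would require structurally.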
