SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

Junting Pan, Elisa Sayrol and Xavier Giro-i-Nieto
Image Processing Group, Universitat Politecnica de Catalunya (UPC), Barcelona, Catalonia/Spain

Cristian Canton Ferrer
Microsoft, Redmond (WA), USA

Jordi Torres
Barcelona Supercomputing Center (BSC), Barcelona, Catalonia/Spain

Kevin McGuinness and Noel E. O'Connor
Insight Center for Data Analytics, Dublin City University, Dublin, Ireland

Abstract

We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples. The first stage of the network consists of a generator model whose weights are learned by back-propagation computed from a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The resulting prediction is processed by a discriminator network trained to solve a binary classification task between the saliency maps generated by the generative stage and the ground truth ones. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE. Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.

Figure 1: Example of saliency map generation where the proposed system (SalGAN) outperforms a standard binary cross entropy (BCE) prediction model. (Panels: Ground Truth, BCE, SalGAN.)

1. Motivation

Visual saliency prediction in computer vision aims at estimating the locations in an image that attract the attention of humans. Salient points in an image are understood as the result of a bottom-up process where a human observer explores the image for a few seconds with no particular task in mind. These data are traditionally measured with eye-trackers, and more recently with mouse clicks [12] or webcams [15]. The salient points over the image are usually aggregated and represented in a saliency map, a single-channel image obtained by convolving each point with a Gaussian kernel (see the sketch at the end of this section). As a result, a gray-scale image or heat map is generated to represent the probability of each corresponding pixel in the image capturing human attention. These saliency maps have been used as soft-attention guides for other computer vision tasks, and also directly for user studies in fields like marketing.

This paper introduces the usage of generative adversarial networks (GANs) [7] for visual saliency prediction. In this context, training is driven by two agents: first, the generator, which creates a synthetic sample matching the data distribution modelled by a training dataset; second, the discriminator, which distinguishes between a real sample drawn directly from the training data set and one created by the generator. In our case, this data distribution corresponds to pairs of real images and their corresponding visual saliency maps.

The generator, named in our work SalGAN, is solely responsible for generating the saliency map for a given image, and uses a deep convolutional neural network (DCNN) to produce such maps. This network is initially trained with a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The model is refined with a discriminator network trained to solve a binary classification task between the saliency maps generated by SalGAN and the real ones used as ground truth. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a BCE content loss in a single-tower and single-task model.
The number of different metrics used to evaluate visual saliency prediction is broad and diverse, and the discussion about them is rich and active. Pioneering metrics have saturated with the introduction of deep learning models trained on large datasets, boosting the proposal of new metrics and challenges for the community. The widely adopted MIT300 benchmark evaluates its results on 8 different metrics, and the more recent LSUN challenge on the SALICON dataset considers four of them. Also recently, the Information Gain (IG) value has been presented as a powerful relative metric in the field. The diversity of metrics in our current deep learning era has also resulted in a diversity of loss functions between state-of-the-art solutions. Models based on back-propagation training have adopted loss functions capable of capturing the different features assessed by the multiple metrics. Our approach, though, differs from other state-of-the-art solutions in that we focus on exploring the benefits of introducing the agnostic loss function proposed in generative adversarial training.

This paper explores adversarial training for visual saliency prediction, showing how the simple binary classification between real and synthetic samples significantly benefits a wide range of visual saliency metrics, without needing to specify a tailored loss function. Our results achieve state-of-the-art performance with a simple DCNN whose parameters are refined with a discriminator. As a secondary contribution, we show the benefits of using the binary cross entropy (BCE) loss function and downsampled saliency maps when training this DCNN.

The remainder of this paper is organized as follows. Section 2 reviews the state-of-the-art models for visual saliency prediction, discussing the loss functions they are based upon, their relations with the different metrics, as well as their complexity in terms of architecture and training. Section 3 presents SalGAN, our deep convolutional neural network based on a convolutional encoder-decoder architecture, as well as the discriminator network used during its adversarial training. Section 4 describes the training process of SalGAN and the loss functions used. Section 5 includes the experiments and results of the presented techniques. Finally, Section 6 closes the paper by drawing the main conclusions.
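Before moving on, the ground-truth construction mentioned at the start of this section, aggregating fixation points and convolving them with a Gaussian kernel, can be made concrete with a short sketch. This is an illustration only: the function name and the bandwidth `sigma` are our own choices, not values prescribed by this paper.

```python
# Minimal sketch (illustration only): build a ground-truth saliency map by
# placing fixation points on a zero map and convolving with a Gaussian
# kernel, as described in Section 1. The bandwidth `sigma` is an assumed
# value chosen for illustration.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=19.0):
    """fixations: iterable of (row, col) pixel coordinates of fixations."""
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for r, c in fixations:
        fixation_map[int(r), int(c)] = 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    # Normalize to [0, 1] so each pixel reads as a fixation probability,
    # matching the normalization described later in Section 4.1.
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency
```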
2. Related work

Saliency prediction has received interest from the research community for many years. The seminal work by Itti et al. [10] proposed to predict saliency maps by considering low-level features at multiple scales and combining them to form a saliency map. Harel et al. [8], also starting from low-level feature maps, introduced a graph-based saliency model that defines Markov chains over various image maps and treats the equilibrium distribution over map locations as activation and saliency values. Judd et al. [14] presented a bottom-up, top-down model of saliency based not only on low-level but also mid- and high-level image features. Borji [1] combined the low-level feature saliency maps of the previous best bottom-up models with top-down cognitive visual features and learned a direct mapping from those features to eye fixations.

As in many other fields in computer vision, a number of deep learning solutions have very recently been proposed that significantly improve performance. For example, as stated in [4], among the top 10 results in the MIT saliency benchmark [2], as of March 2016, six models were based on deep learning. When considering AUC-Judd as the reference metric, results as of September 2016 on the MIT300 benchmark include only one non-deep-learning technique (Boolean map saliency [31]) among the top ten.

The Ensemble of Deep Networks (eDN) [30] represented an early architecture that automatically learns the representations for saliency prediction, blending feature maps from different layers. Their network might be considered a shallow network given the number of layers. In [25] shallow and deeper networks were compared; DCNNs have shown better results even when pre-trained on datasets built for other purposes. DeepGaze [19] provided a deeper network using the well-known AlexNet [16], with weights pre-trained on ImageNet [6] and with a readout network on top whose inputs consisted of some layer outputs of AlexNet. The output of the network is blurred, center biased, and converted to a probability distribution using a softmax. A new version called DeepGaze 2 [21], using VGG-19 [27] and trained with the SALICON dataset [12], has recently achieved the best results in the MIT saliency benchmark. Kruthiventi et al. [18] also used pre-trained VGG networks and the SALICON dataset to obtain saliency prediction maps with outstanding results; their network uses additional inception modules [28] to model fine and coarse saliency details. Huang et al. [9], in the so-called SALICON net, obtained better results by using VGG rather than AlexNet or GoogleNet [28]. In their proposal they considered two networks with fine and coarse inputs, whose feature map outputs are concatenated.

Following this idea of modeling local and global saliency features, Liu et al. [22] proposed a multiresolution convolutional neural network that is trained from image regions centered on fixation and non-fixation locations over multiple resolutions. Diverse top-down visual features can be learned in higher layers, and bottom-up visual saliency can also be inferred by combining information over multiple resolutions. These ideas are further developed in the recent work called DSCLRCN [24], where the proposed model learns saliency-related local features at each image location in parallel and then learns to simultaneously incorporate global context and scene context to infer saliency, with a model that effectively learns long-term spatial interactions and scene contextual modulation. Their experiments have obtained outstanding results in the MIT Saliency Benchmark. Cornia et al. [5] proposed an architecture that combines features extracted at different levels of a DCNN. Their model is composed of three main blocks: a feature extraction DCNN, a feature encoding network, which weights low and high-level feature maps, and a prior learning network. They also introduce a loss function inspired by three objectives: to measure similarity with the ground truth, to keep invariance of predictive maps to their maximum, and to give importance to pixels with high ground truth fixation probability. In fact, choosing an appropriate loss function has become an issue that can lead to improved results. Thus, another interesting contribution of Huang et al. [9] lies in minimizing loss functions based on metrics that are differentiable, such as NSS, CC, SIM and KL divergence, to train the network (see [26] and [20] for the definition of these metrics; a thorough comparison of metrics can be found in [3]). In Huang's work [9] KL divergence gave the best results. Jetley et al. [11] also tested loss functions based on probability distances, such as the χ² divergence, total variation distance, KL divergence and Bhattacharyya distance, by considering saliency map models as generalized Bernoulli distributions. The Bhattacharyya distance was found to give the best results. Finally, the work by Johnson et al. [13] defines a perceptual loss combining a per-pixel loss together with another loss term based on the semantics of the image. It is applied to image style transfer, but it may also be used in models for saliency prediction.

In our work we present a network architecture that takes a different approach, and we consider loss functions that are well adapted to our model. In particular, other proposals use losses based on MSE or similar [5], [25], [17], whereas we use BCE and compare it with MSE as a reference.

3. Architecture

The architecture of the presented SalGAN is based on two deep convolutional neural network (DCNN) modules, namely the generator and discriminator, whose combined efforts aim at predicting a visual saliency map for a given input image. This section provides details on the structure of both modules, the considered loss functions, and the initialization before beginning adversarial training.

Figure 2: Overall architecture of the proposed saliency map estimation system. (The generator maps the image stimuli to a predicted saliency map through VGG convolutions, max pooling, upsampling, and a final sigmoid; the discriminator receives the image stimuli plus a predicted or ground truth saliency map and produces the adversarial cost, which is combined with the BCE cost.)

3.1. Generator

SalGAN follows a convolutional encoder-decoder architecture, where the encoder part includes max pooling layers that decrease the size of the feature maps, while the decoder part uses upsampling layers followed by convolutional filters to construct an output that is the same resolution as the input. Figure 2 shows the architecture of the system.

The encoder part of the network is identical in architecture to VGG-16 [27], omitting the final pooling and fully connected layers. The network is initialized with the weights of a VGG-16 model trained on the ImageNet data set for object classification [6]. Only the last two groups of convolutional layers in VGG-16 are modified during the training for saliency prediction, while the earlier layers remain fixed from the original VGG-16 model. We fix these weights to save computational resources during training, even at the possible expense of some loss in performance.

The decoder architecture is structured in the same way as the encoder, but with the ordering of layers reversed, and with pooling layers being replaced by upsampling layers. Again, ReLU non-linearities are used in all convolution layers, and a final 1×1 convolution layer with sigmoid non-linearity is added to produce the saliency map. The weights for the decoder are randomly initialized.
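The following is a minimal PyTorch sketch of such a generator, written from our reading of the text above (the official implementation at the project page is the authoritative reference). The VGG-16 encoder drops the final pooling layer, the early layers are frozen, and the decoder mirrors the encoder with pooling replaced by upsampling; the exact decoder channel widths and the nearest-neighbour upsampling mode are assumptions, since the paper only states that the decoder mirrors the encoder.

```python
# A sketch of the Section 3.1 generator under stated assumptions; not the
# authors' code. Decoder channel widths and upsampling mode are assumed.
import torch.nn as nn
import torchvision

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        # Encoder: VGG-16 convolutional layers without the final max pool
        # (the fully connected classifier is not part of `features`).
        self.encoder = vgg.features[:-1]
        # Freeze everything before conv4_1 (index 17 in torchvision's
        # VGG-16 `features`): only the last two conv groups are trained.
        for p in self.encoder[:17].parameters():
            p.requires_grad = False

        def conv_chain(channels):
            # 3x3 convolution + ReLU for each consecutive channel pair.
            layers = []
            for c_in, c_out in zip(channels, channels[1:]):
                layers += [nn.Conv2d(c_in, c_out, 3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers

        # Decoder: encoder mirrored, with upsampling in place of pooling,
        # ending in a 1x1 convolution with sigmoid non-linearity.
        self.decoder = nn.Sequential(
            *conv_chain([512, 512, 512, 512]), nn.Upsample(scale_factor=2),
            *conv_chain([512, 512, 512, 256]), nn.Upsample(scale_factor=2),
            *conv_chain([256, 256, 256, 128]), nn.Upsample(scale_factor=2),
            *conv_chain([128, 128, 64]),       nn.Upsample(scale_factor=2),
            *conv_chain([64, 64]),
            nn.Conv2d(64, 1, kernel_size=1),  # final 1x1 convolution
            nn.Sigmoid(),                     # pixel-wise probabilities
        )

    def forward(self, x):
        # The encoder downsamples by 16 in each spatial dimension; the
        # four upsampling stages restore the input resolution.
        return self.decoder(self.encoder(x))
```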
3.2. Discriminator

Table 1 gives the architecture and layer configuration for the discriminator. In short, the network is composed of six 3×3 kernel convolutions interspersed with three pooling layers (↓2), and followed by three fully connected layers. The convolution layers all use ReLU activations, while the fully connected layers employ tanh activations, with the exception of the final layer, which uses a sigmoid activation.

Table 1: Architecture of the discriminator network.

layer   | depth | kernel | stride | pad | activation
conv1_1 | 3     | 1×1    | 1      | 1   | ReLU
conv1_2 | 32    | 3×3    | 1      | 1   | ReLU
pool1   | -     | 2×2    | 2      | 0   | -
conv2_1 | 64    | 3×3    | 1      | 1   | ReLU
conv2_2 | 64    | 3×3    | 1      | 1   | ReLU
pool2   | -     | 2×2    | 2      | 0   | -
conv3_1 | 64    | 3×3    | 1      | 1   | ReLU
conv3_2 | 64    | 3×3    | 1      | 1   | ReLU
pool3   | -     | 2×2    | 2      | 0   | -
fc4     | 100   | -      | -      | -   | tanh
fc5     | 2     | -      | -      | -   | tanh
fc6     | 1     | -      | -      | -   | sigmoid
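Table 1 translates almost line by line into code. The sketch below is one possible PyTorch reading of it, assuming the 4-channel RGBS input described in Section 4 (the RGB image concatenated with a predicted or ground truth saliency map); `nn.LazyLinear` is used only to avoid hardcoding the flattened feature size, which Table 1 does not specify.

```python
# A sketch of the Table 1 discriminator; not the authors' code.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 3, 1, stride=1, padding=1), nn.ReLU(True),    # conv1_1
            nn.Conv2d(3, 32, 3, stride=1, padding=1), nn.ReLU(True),   # conv1_2
            nn.MaxPool2d(2, stride=2),                                 # pool1
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(True),  # conv2_1
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(True),  # conv2_2
            nn.MaxPool2d(2, stride=2),                                 # pool2
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(True),  # conv3_1
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(True),  # conv3_2
            nn.MaxPool2d(2, stride=2),                                 # pool3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.Tanh(),  # fc4
            nn.Linear(100, 2), nn.Tanh(),   # fc5
            nn.Linear(2, 1), nn.Sigmoid(),  # fc6
        )

    def forward(self, image, saliency):
        # Concatenate image and saliency along the channel axis (RGBS).
        x = torch.cat([image, saliency], dim=1)
        return self.classifier(self.features(x))
```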
4. Training

The filter weights in SalGAN have been trained over a perceptual loss [13] resulting from combining a content loss and an adversarial loss. The content loss follows a classic approach in which the predicted saliency map is pixel-wise compared with the corresponding one from the ground truth. The adversarial loss depends on the real/synthetic prediction of the discriminator over the generated saliency map.

4.1. Content loss

The content loss is computed on a per-pixel basis, where each value of the predicted saliency map is compared with its corresponding peer from the ground truth map. Given an image I of dimensions N = W × H, we represent the saliency map S as a vector of probabilities S ∈ R^N, where S_j is the probability of pixel I_j being fixated. A content loss function L(S, Ŝ) is defined between the predicted saliency map Ŝ and its corresponding ground truth S.

The first considered content loss is the mean squared error (MSE) or Euclidean loss, defined as:

    L_{MSE} = \frac{1}{N} \sum_{j=1}^{N} (S_j - \hat{S}_j)^2    (1)

In our work, MSE is used as a baseline reference, as it has been adopted directly or with some variations in other state-of-the-art solutions for visual saliency prediction [25, 17, 5]. Solutions based on MSE aim at maximizing the peak signal-to-noise ratio (PSNR). These works tend to filter high spatial frequencies in the output, thereby favoring blurred contours. MSE corresponds to computing the Euclidean distance between the predicted saliency and the ground truth.

Ground truth saliency maps are normalized so that each value is in the range [0, 1]. Saliency values can therefore be interpreted as estimates of the probability that a particular pixel is attended by an observer. It is therefore tempting to induce a multinomial distribution on the predictions using a softmax on the final layer. Clearly, however, more than a single pixel may be attended, making it more appropriate to treat each predicted value as independent of the others. We therefore propose to apply an element-wise sigmoid to each output in the final layer, so that the pixel-wise predictions can be thought of as probabilities for independent binary random variables. An appropriate loss in such a setting is the binary cross entropy, which is the average of the individual binary cross entropies (BCE) across all pixels:

    L_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ S_j \log \hat{S}_j + (1 - S_j) \log (1 - \hat{S}_j) \right]    (2)
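Both content losses can be written compactly. The following sketch implements Eqs. (1) and (2), with an optional downsampling factor anticipating the experiments of Section 5.1; average pooling as the downsampling operator is our assumption, as the paper does not state the resampling method.

```python
# A sketch of the Section 4.1 content losses under stated assumptions.
import torch.nn.functional as F

def content_loss(pred, target, loss="bce", downsample=1):
    """pred, target: (batch, 1, H, W) tensors with values in [0, 1]."""
    if downsample > 1:
        # Downsampled variant of Section 5.1; e.g. downsample=4 maps a
        # 256x192 saliency map to 64x48. Average pooling is assumed.
        pred = F.avg_pool2d(pred, kernel_size=downsample)
        target = F.avg_pool2d(target, kernel_size=downsample)
    if loss == "mse":
        return F.mse_loss(pred, target)          # Eq. (1)
    return F.binary_cross_entropy(pred, target)  # Eq. (2), mean over pixels
```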
4.2. Adversarial loss

Generative adversarial networks (GANs) [7] are commonly used to generate images with realistic statistical properties. The idea is to simultaneously fit two parametric functions. The first of these functions, known as the generator, is trained to transform samples from a simple distribution (e.g. Gaussian) into samples from a more complicated distribution (e.g. natural images). The second function, the discriminator, is trained to distinguish between samples from the true distribution and generated samples. Training proceeds by alternating between training the discriminator using generated and real samples, and training the generator, keeping the discriminator weights constant and backpropagating the error through the discriminator to update the generator weights.

The saliency prediction problem has some important differences from the above scenario. First, the objective is to fit a deterministic function that generates realistic saliency values from images, rather than realistic images from random noise. As such, in our case the input to the generator is not random noise but an image. Second, it is clear that knowledge of the image that a saliency map corresponds to is essential to evaluate its quality. We therefore include both the image and the saliency map as inputs to the discriminator. Finally, when using generative adversarial networks to generate realistic images, there is generally no ground truth to compare against. In our case, however, the corresponding ground truth saliency map is available. When updating the parameters of the generator function, we found that using a loss function that combines the error from the discriminator with the cross entropy with respect to the ground truth improved the stability and convergence rate of the adversarial training. The final loss function for the generator during adversarial training can be formulated as:

    L_{GAN} = \alpha \cdot L_{BCE} - \log D(I, \hat{S})    (3)

where D(I, Ŝ) is the probability of fooling the discriminator, so that the loss associated with the generator grows when the chances of fooling the discriminator are lower. In our experiments, we used a hyperparameter of α = 0.05. During the training of the discriminator, no content loss is available and the sign of the adversarial term is switched.

At train time, we first bootstrap the generator function by training for 15 epochs using only BCE, which is computed with respect to the downsampled output and ground truth saliency. After this, we add the discriminator and begin adversarial training. The input to the discriminator network is an RGBS image of size 256×192×4 containing both the source image channels and the (predicted or ground truth) saliency.

We train the networks on the 10,000 images from the SALICON training set using a batch size of 32. This was the largest batch size we could use given the memory constraints of our hardware. During the adversarial training, we alternate the training of the generator and discriminator after each iteration (batch). We used L2 weight regularization (i.e. weight decay) when training both the generator and discriminator (λ = 1×10⁻⁴). We used AdaGrad for optimization, with an initial learning rate of 3×10⁻⁴.
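A compact sketch of this training procedure is given below, using the generator loss of Eq. (3) with α = 0.05 and the optimizer settings stated above (AdaGrad, initial learning rate 3×10⁻⁴, weight decay 10⁻⁴). Here `generator`, `discriminator` and `loader` are placeholders (e.g. the modules sketched in Sections 3.1 and 3.2 and a SALICON data loader); the 15-epoch BCE-only bootstrap is omitted, and performing both updates inside one batch is a simplification of the paper's per-iteration alternation.

```python
# A sketch of the alternating adversarial training of Section 4.2;
# `generator`, `discriminator` and `loader` are assumed placeholders.
import torch
import torch.nn.functional as F

alpha = 0.05  # BCE/adversarial tradeoff from Section 4.2
eps = 1e-7    # numerical safety inside the logs (our addition)

opt_g = torch.optim.Adagrad(generator.parameters(), lr=3e-4, weight_decay=1e-4)
opt_d = torch.optim.Adagrad(discriminator.parameters(), lr=3e-4, weight_decay=1e-4)

for image, gt in loader:  # batches of images and ground-truth saliency maps
    # Discriminator update: real pairs vs. generated pairs. With no
    # content loss available, the sign of the adversarial term switches.
    pred = generator(image).detach()
    d_loss = -(torch.log(discriminator(image, gt) + eps)
               + torch.log(1 - discriminator(image, pred) + eps)).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: Eq. (3), content term plus adversarial term.
    # The downsampled BCE of Section 5.1 could be substituted here.
    pred = generator(image)
    g_loss = (alpha * F.binary_cross_entropy(pred, gt)
              - torch.log(discriminator(image, pred) + eps).mean())
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```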
5. Experiments

The presented SalGAN model for visual saliency prediction was assessed and compared from different perspectives. First, the impact of using BCE and the downsampled saliency maps is assessed. Second, the gain of the adversarial loss is measured and discussed, both from a quantitative and a qualitative point of view. Finally, the performance of SalGAN is compared to both published and unpublished works to position it with respect to the current state-of-the-art. The experiments aimed at finding the best configuration for SalGAN were run using the train and validation partitions of the SALICON dataset [12]. This is a large dataset built by collecting mouse clicks on a total of 20,000 images from the Microsoft Common Objects in Context (MS-CoCo) dataset [23]. We have adopted this dataset for our experiments because it is the largest one available for visual saliency prediction. In addition to SALICON, Section 5.3 also presents results on MIT300, the benchmark with the largest number of submissions.

5.1. Non-adversarial training

The two content losses presented in Section 4.1, MSE and BCE, were compared to define a baseline upon which we later assess the impact of the adversarial training. The first two rows of Table 3 show how a simple change from MSE to BCE brings a consistent improvement in all metrics. This improvement suggests that treating saliency prediction as multiple binary classification problems is more appropriate than treating it as a standard regression problem, in spite of the fact that the target values are not binary. Minimizing cross entropy is equivalent to minimizing the KL divergence between the predicted and target distributions, which is a reasonable objective if both predictions and targets are interpreted as probabilities.

Based on the superiority of the BCE-based loss compared with MSE, we also explored the impact of computing the content loss over downsampled versions of the saliency map. This technique reduces the required computational resources at both training and test times and, as shown in Table 2, not only does it not decrease performance, but it can actually improve it. Given these results, we chose to train SalGAN on saliency maps downsampled by a factor of 1/4, which in our architecture corresponds to saliency maps of 64×48.

Table 2: Impact of downsampled saliency maps (at 15 epochs) evaluated over SALICON validation. BCE/x refers to a downsample factor of 1/x over a saliency map of 256×192.

      | sAUC↑ | AUC-B↑ | NSS↑  | CC↑   | IG↑
BCE   | 0.752 | 0.825  | 2.473 | 0.761 | 0.712
BCE/2 | 0.750 | 0.820  | 2.527 | 0.764 | 0.592
BCE/4 | 0.755 | 0.831  | 2.511 | 0.763 | 0.825
BCE/8 | 0.754 | 0.827  | 2.503 | 0.762 | 0.831

5.2. Adversarial gain

The gain achieved by introducing the adversarial loss into the perceptual loss (see Section 4.2) was assessed by using BCE as the content loss over saliency maps of 64×48. The first row of results in Table 3 refers to a baseline defined by training SalGAN with the BCE content loss for 15 epochs only. Then, two options are considered: 1) training based on BCE only (2nd row), or 2) introducing the adversarial loss (3rd and 4th rows).

Table 3: Best results through epochs obtained with non-adversarial (MSE and BCE) and adversarial (GAN) training. BCE/4 and GAN/4 refer to downsampled saliency maps. Saliency maps assessed on SALICON validation.

      | sAUC↑ | AUC-B↑ | NSS↑  | CC↑   | IG↑
MSE   | 0.728 | 0.820  | 1.680 | 0.708 | 0.628
BCE   | 0.753 | 0.825  | 2.562 | 0.772 | 0.824
BCE/4 | 0.757 | 0.833  | 2.580 | 0.772 | 1.067
GAN/4 | 0.773 | 0.859  | 2.589 | 0.786 | 1.243

Figure 3 compares validation set accuracy metrics for training with the combined GAN and BCE loss versus BCE alone as the number of epochs increases. In the case of the AUC metrics (Judd and Borji), increasing the number of epochs does not lead to significant improvements when using BCE alone. The combined BCE/GAN loss, however, continues to improve performance with further training. After 100 and 120 epochs, the combined GAN/BCE loss shows substantial improvements over BCE for five of six metrics.

Figure 3: SALICON validation set accuracy metrics (AUC Judd, Sim, AUC Borji, CC, NSS and IG (Gaussian), plotted against training epoch) for GAN+BCE vs BCE only on varying numbers of epochs. AUC shuffled is omitted as the trend is identical to that of AUC Judd.

The single metric for which GAN training fails to improve performance is normalized scanpath saliency (NSS). The reason for this may be that GAN training tends to produce a smoother and more spread out estimate of saliency, which better matches the statistical properties of real saliency maps, but may increase the false positive rate. As noted by Bylinskii et al. [3], NSS is very sensitive to such false positives.
The impact of increased false positives depends on the final application. In applications where the saliency map is used as a multiplicative attention model (e.g. in retrieval applications, where spatial features are importance weighted), false positives are often less important than false negatives, since while the former include more distractors, the latter remove potentially useful features. Note also that NSS is differentiable, so it could potentially be optimized directly when important for a particular application.

5.3. Comparison with the state-of-the-art

SalGAN is compared in Table 4 to several other algorithms from the state-of-the-art. The comparison is based on the evaluations run by the organizers of the SALICON and MIT300 benchmarks on a test dataset whose ground truth is not public. The two benchmarks offer complementary features: while SALICON is a much larger dataset with 5,000 test images, MIT300 has attracted the participation of many more researchers. In both cases, SalGAN was trained using the 15,000 images contained in the training (10,000) and validation (5,000) partitions of the SALICON dataset. Notice that while both datasets aim at capturing visual saliency, the acquisition of data differed: SALICON ground truth was generated from crowdsourced mouse clicks, while MIT300 was built with eye trackers on a limited and controlled group of users. Table 4 compares SalGAN not only with published models, but also with works that have not been peer reviewed. SalGAN presents very competitive results in both datasets, as it improves or equals the performance of all other models in at least one metric.

Table 4: Comparison of SalGAN with other state-of-the-art solutions on the SALICON (test) and MIT300 benchmarks according to their public leaderboards on November 10, 2016. Values in brackets correspond to performances worse than SalGAN. (*) indicates citations to non-peer-reviewed texts.

SALICON (test)   | AUC-J↑ | Sim↑   | EMD↓   | AUC-B↑  | sAUC↑   | CC↑     | NSS↑    | KL↓
DSCLRCN [24] (*) | -      | -      | -      | 0.884   | 0.776   | 0.831   | 3.157   | -
SalGAN           | -      | -      | -      | 0.884   | 0.772   | 0.781   | 2.459   | -
ML-NET [5]       | -      | -      | -      | (0.866) | (0.768) | (0.743) | 2.789   | -
SalNet [25]      | -      | -      | -      | (0.858) | (0.724) | (0.609) | (1.859) | -

MIT300               | AUC-J↑ | Sim↑   | EMD↓   | AUC-B↑  | sAUC↑  | CC↑    | NSS↑   | KL↓
Humans               | 0.92   | 1.00   | 0.00   | 0.88    | 0.81   | 1.0    | 3.29   | 0.00
DeepGaze II [21] (*) | 0.88   | (0.46) | (3.98) | 0.86    | 0.72   | (0.52) | (1.29) | (0.96)
DSCLRCN [24] (*)     | 0.87   | 0.68   | 2.17   | (0.79)  | 0.72   | 0.80   | 2.35   | 0.95
DeepFix [17] (*)     | 0.87   | 0.67   | 2.04   | (0.80)  | (0.71) | 0.78   | 2.26   | 0.63
SALICON [9]          | 0.87   | (0.60) | (2.62) | 0.85    | 0.74   | 0.74   | 2.12   | 0.54
SalGAN               | 0.86   | 0.63   | 2.29   | 0.81    | 0.72   | 0.73   | 2.04   | 1.07
PDP [11]             | (0.85) | (0.60) | (2.58) | (0.80)  | 0.73   | (0.70) | 2.05   | 0.92
ML-NET [5]           | (0.85) | (0.59) | (2.63) | (0.75)  | (0.70) | (0.67) | 2.05   | (1.10)
DeepGaze I [19]      | (0.84) | (0.39) | (4.97) | 0.83    | (0.66) | (0.48) | (1.22) | (1.23)
iSEEL [29] (*)       | (0.84) | (0.57) | (2.72) | 0.81    | (0.68) | (0.65) | (1.78) | 0.65
SalNet [25]          | (0.83) | (0.52) | (3.31) | 0.82    | (0.69) | (0.58) | (1.51) | 0.81
BMS [31]             | (0.83) | (0.51) | (3.35) | 0.82    | (0.65) | (0.55) | (1.41) | 0.81

5.4. Qualitative results

The impact of adversarial training has also been explored from a qualitative perspective by observing the resulting saliency maps. Figure 4 shows three examples from the MIT300 dataset, highlighted by Bylinskii et al. [4] as particular challenges for existing saliency algorithms. The areas highlighted in yellow in the images on the left are regions that are typically missed by algorithms. In the first example, we see that SalGAN successfully detects the often missed hand of the magician and face of the boy as being salient. The second example illustrates a failure case where SalGAN, like other algorithms, also fails to place sufficient saliency on the area around the white ball (though this region does have more saliency than most of the rest of the image). The final example illustrates what we believe to be one of the limitations of this dataset: the ground truth places most of the saliency on the smaller text at the bottom of the sign. This is because observers tend to spend more time attending this area (reading the text), not because it is the first area that observers tend to attend. Existing metrics, however, tend to be agnostic to the order in which areas are attended, a limitation we hope to look into in the future.

Figure 4: Example images from MIT300 containing a salient region (marked in yellow) that is often missed by computational models, and the saliency map estimated by SalGAN.

Figure 5 illustrates the effect of adversarial training on the statistical properties of the generated saliency maps. Shown are two close-up sections of a saliency map from cross entropy training (left) and adversarial training (right). Training on BCE alone produces saliency maps that, while they may be locally consistent with the ground truth, are often less smooth and have complex level sets. GAN training, on the other hand, produces much smoother and simpler level sets.

Figure 5: Close-up comparison of output from training on BCE loss vs combined BCE/GAN loss. Left: saliency map from network trained with BCE loss. Right: saliency map from proposed GAN training.

Finally, Figure 6 shows some qualitative results comparing the results from training with MSE, BCE, and BCE/GAN against the ground truth for images from the SALICON validation set.

Figure 6: Qualitative results of SalGAN on the SALICON validation set. (Columns: Images, Ground Truth, MSE, BCE, SalGAN.)
6. Conclusions

In this work we have shown how adversarial training over a deep convolutional neural network can achieve state-of-the-art performance with a simple encoder-decoder architecture. A BCE-based content loss was shown to be effective both for initializing the generator and as a regularization term for stabilizing adversarial training. Our experiments showed that adversarial training improved all bar one saliency metric when compared to further training on cross entropy alone.

It is worth pointing out that although we use a VGG-16-based encoder-decoder model as the generator in this paper, the proposed GAN training approach is generic and could be applied to improve the performance of other deep saliency models. For example, the authors of the DSCLRCN model observed that ResNet-based architectures outperformed VGG-based ones for saliency prediction, and that using multiple image scales improves accuracy. Similar modifications to our model would likely provide similar gains. Further improvements could also likely be achieved by carefully tuning hyperparameters, and in particular the tradeoff between the BCE and GAN losses in Equation 3, which we did not attempt to optimize for this paper. Finally, an ensemble of models is also likely to improve performance, at the cost of additional computation at prediction time.

Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.

7. Acknowledgments

The Image Processing Group at UPC is supported by the projects TEC2013-43935-R and TEC2016-75976-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). The Image Processing Group at UPC is a SGR14 Consolidated Research Group recognized and sponsored by the Catalan Government (Generalitat de Catalunya) through its AGAUR office. The contribution from the Barcelona Supercomputing Center has been supported by project TIN2015-65316 of the Spanish Ministry of Science and Innovation and contract 2014-SGR-1051 by Generalitat de Catalunya. This material is based upon works supported by Science Foundation Ireland under Grant No 15/SIRG/3283. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GeForce GTX Titan Z used in this work, together with the support of the BSC/UPC NVIDIA GPU Center of Excellence. We also want to thank all the members of the X-theses group for their advice, especially Victor Garcia and Santi Pascual for their proofreading.

References

[1] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[2] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT saliency benchmark. http://saliency.mit.edu/.
[3] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? ArXiv preprint:1604.03605, 2016.
[4] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In European Conference on Computer Vision (ECCV), 2016.
[5] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Neural Information Processing Systems (NIPS), 2006.
[9] X. Huang, C. Shen, X. Boix, and Q. Zhao. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[10] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), (20):1254–1259, 1998.
[11] S. Jetley, N. Murray, and E. Vig. End-to-end saliency mapping via probability distribution prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), 2016.
[14] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), 2009.
[15] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] S. S. Kruthiventi, K. Ayush, and R. V. Babu. DeepFix: A fully convolutional neural network for predicting human eye fixations. ArXiv preprint:1510.02927, 2015.
[18] S. S. Kruthiventi, V. Gudisa, J. Dholakiya, and R. V. Babu. Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] M. Kümmerer, L. Theis, and M. Bethge. DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet. In International Conference on Learning Representations (ICLR), 2015.
[20] M. Kümmerer, L. Theis, and M. Bethge. Information-theoretic model comparison unifies saliency metrics. Proceedings of the National Academy of Sciences (PNAS), 112(52):16054–16059, 2015.
[21] M. Kümmerer, T. S. Wallis, and M. Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. ArXiv preprint:1610.01563, 2016.
[22] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[24] N. Liu and J. Han. A deep spatial contextual long-term recurrent convolutional network for saliency detection. ArXiv preprint:1610.01708, 2016.
[25] J. Pan, E. Sayrol, X. Giró-i-Nieto, K. McGuinness, and N. E. O'Connor. Shallow and deep convolutional networks for saliency prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[26] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit. Saliency and human fixations: state-of-the-art and study of comparison metrics. In IEEE International Conference on Computer Vision (ICCV), 2013.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[29] H. R. Tavakoli, A. Borji, J. Laaksonen, and E. Rahtu. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. ArXiv preprint:1610.06449, 2016.
[30] E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[31] J. Zhang and S. Sclaroff. Saliency detection: a Boolean map approach. In IEEE International Conference on Computer Vision (ICCV), 2013.
