SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

Junting Pan, Elisa Sayrol and Xavier Giro-i-Nieto
Image Processing Group
Universitat Politecnica de Catalunya (UPC)
Barcelona, Catalonia/Spain
xavier.giro@upc.edu

Cristian Canton Ferrer
Microsoft
Redmond (WA), USA
cristian.canton@microsoft.com

Jordi Torres
Barcelona Supercomputing Center (BSC)
Barcelona, Catalonia/Spain
jordi.torres@bsc.es

Kevin McGuinness and Noel E. O'Connor
Insight Center for Data Analytics
Dublin City University
Dublin, Ireland
kevin.mcguinness@insight-centre.org
Abstract

We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples. The first stage of the network consists of a generator model whose weights are learned by back-propagation computed from a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The resulting prediction is processed by a discriminator network trained to solve a binary classification task between the saliency maps generated by the generative stage and the ground truth ones. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE. Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.

[Figure 1: examples with columns Ground Truth, BCE, SalGAN.]

Figure 1: Example of saliency map generation where the proposed system (SalGAN) outperforms a standard binary cross entropy (BCE) prediction model.

1. Motivation

Visual saliency prediction in computer vision aims at estimating the locations in an image that attract the attention of humans. Salient points in an image are understood as the result of a bottom-up process where a human observer explores the image for a few seconds with no particular task in mind. These data are traditionally measured with eye-trackers, and more recently with mouse clicks [12] or webcams [15]. The salient points over the image are usually aggregated and represented in a saliency map, a single-channel image obtained by convolving each point with a Gaussian kernel. As a result, a gray-scale image or heat map is generated to represent the probability of each corresponding pixel in the image capturing the human attention. These saliency maps have been used as soft-attention guides for other computer vision tasks, and also directly for user studies in fields like marketing.

This paper introduces the usage of generative adversarial networks (GANs) [7] for visual saliency prediction. In this context, training is driven by two agents. First, the generator creates a synthetic sample matching the data distribution modelled by a training dataset; second, the discriminator distinguishes between a real sample drawn directly from the training dataset and one created by the generator. In our case, this data distribution corresponds to pairs of real images and their corresponding visual saliency maps.
The generator, named in our work SalGAN, is solely responsible for generating the saliency map for a given image, and uses a deep convolutional neural network (DCNN) to produce such maps. This network is initially trained with a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The model is refined with a discriminator network trained to solve a binary classification task between the saliency maps generated by SalGAN and the real ones used as ground truth. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a BCE content loss in a single-tower and single-task model.

The range of different metrics used to evaluate visual saliency prediction is broad and diverse, and the discussion about them is rich and active. Pioneering metrics have saturated with the introduction of deep learning models trained on large datasets, boosting the proposal of new metrics and challenges for the community. The widely adopted MIT300 benchmark evaluates its results on 8 different metrics, and the more recent LSUN challenge on the SALICON dataset considers four of them. Also recently, the Information Gain (IG) value has been presented as a powerful relative metric in the field. The diversity of metrics in our current deep learning era has also resulted in a diversity of loss functions between state-of-the-art solutions. Models based on back-propagation training have adopted loss functions capable of capturing the different features assessed by the multiple metrics. Our approach, though, differs from other state-of-the-art solutions, as we focus on exploring the benefits of introducing the agnostic loss function proposed in generative adversarial training.

This paper explores adversarial training for visual saliency prediction, showing how the simple binary classification between real and synthetic samples significantly benefits a wide range of visual saliency metrics, without needing to specify a tailored loss function. Our results achieve state-of-the-art performance with a simple DCNN whose parameters are refined with a discriminator. As a secondary contribution, we show the benefits of using the binary cross entropy (BCE) loss function and downsampled saliency maps when training this DCNN.

The remainder of the work is organized as follows. Section 2 reviews the state-of-the-art models for visual saliency prediction, discussing the loss functions they are based upon, their relations with the different metrics, as well as their complexity in terms of architecture and training. Section 3 presents SalGAN, our deep convolutional neural network based on a convolutional encoder-decoder architecture, as well as the discriminator network used during its adversarial training. Section 4 describes the training process of SalGAN and the loss functions used. Section 5 includes the experiments and results of the presented techniques. Finally, Section 6 closes the paper by drawing the main conclusions.

2. Related work

Saliency prediction has received interest from the research community for many years. Seminal work by Itti et al. [10] proposed to predict saliency maps by considering low-level features at multiple scales and combining them to form a saliency map. Harel et al. [8], also starting from low-level feature maps, introduced a graph-based saliency model that defines Markov chains over various image maps, and treats the equilibrium distribution over map locations as activation and saliency values. Judd et al. [14] presented a bottom-up, top-down model of saliency based not only on low-level but also mid- and high-level image features. Borji [1] combined low-level feature saliency maps of previous best bottom-up models with top-down cognitive visual features and learned a direct mapping from those features to eye fixations.

As in many other fields in computer vision, a number of deep learning solutions have very recently been proposed that significantly improve performance. For example, as stated in [4], among the top 10 results in the MIT saliency benchmark [2], as of March 2016, six models were based on deep learning. When considering AUC-Judd as the reference metric, results as of September 2016 on the MIT300 benchmark include only one non-deep learning technique (Boolean map saliency [31]) among the top ten.

The Ensemble of Deep Networks (eDN) [30] represented an early architecture that automatically learns the representations for saliency prediction, blending feature maps from different layers. Their network might be considered shallow given its number of layers. In [25], shallow and deeper networks were compared; DCNNs have shown better results even when pre-trained on datasets built for other purposes. DeepGaze [19] provided a deeper network using the well-known AlexNet [16], with weights pre-trained on ImageNet [6] and with a readout network on top whose inputs consisted of some layer outputs of AlexNet. The output of the network is blurred, center biased, and converted to a probability distribution using a softmax. A new version called DeepGaze 2 [21], using VGG-19 [27] and trained with the SALICON dataset [12], has recently achieved the best results in the MIT saliency benchmark. Kruthiventi et al. [18] also used pre-trained VGG networks and the SALICON dataset to obtain saliency prediction maps with outstanding results. This network uses additional inception modules [28] to model fine and coarse saliency details. Huang et al. [9], in the so-called SALICON net, obtained better results by using VGG rather than AlexNet or GoogLeNet [28]. In their proposal they considered two networks with fine and coarse inputs, whose feature map outputs are concatenated.

Following this idea of modeling local and global saliency features, Liu et al. [22] proposed a multiresolution convolutional neural network that is trained from image regions centered on fixation and non-fixation locations over multiple resolutions. Diverse top-down visual features can be learned in higher layers, and bottom-up visual saliency can also be inferred by combining information over multiple resolutions.
[Figure 2: block diagram of the system. The generator (Conv-VGG layers, max pooling, upsampling, and a sigmoid output) maps the image stimuli to a predicted saliency map and is trained with the BCE cost; the discriminator (Conv-Scratch layers, max pooling, and fully connected layers) receives the image stimuli together with either the predicted or the ground truth saliency map and produces the adversarial cost.]

Figure 2: Overall architecture of the proposed saliency map estimation system.
These ideas are further developed in their recent work called DSCLRCN [24], where the proposed model learns saliency-related local features at each image location in parallel, and then learns to simultaneously incorporate global context and scene context to infer saliency. They incorporate a model to effectively learn long-term spatial interactions and scene contextual modulation to infer image saliency. Their experiments have obtained outstanding results in the MIT Saliency Benchmark. Cornia et al. [5] proposed an architecture that combines features extracted at different levels of a DCNN. Their model is composed of three main blocks: a feature extraction DCNN, a feature encoding network that weights low and high-level feature maps, and a prior learning network. They also introduce a loss function inspired by three objectives: to measure similarity with the ground truth, to keep invariance of predictive maps to their maximum, and to give importance to pixels with high ground truth fixation probability. In fact, choosing an appropriate loss function has become an issue that can lead to improved results. Thus, another interesting contribution of Huang et al. [9] lies in minimizing loss functions based on metrics that are differentiable, such as NSS, CC, SIM and KL divergence, to train the network (see [26] and [20] for the definition of these metrics; a thorough comparison of metrics can be found in [3]). In Huang's work [9], KL divergence gave the best results. Jetley et al. [11] also tested loss functions based on probability distances, such as χ2 divergence, total variation distance, KL divergence and Bhattacharyya distance, by considering saliency map models as generalized Bernoulli distributions. The Bhattacharyya distance was found to give the best results. Finally, the work by Johnson et al. [13] defines a perceptual loss combining a per-pixel loss together with another loss term based on the semantics of the image. It is applied to image style transfer, but it may be used in models for saliency prediction.

In our work we present a network architecture that takes a different approach, and we consider loss functions that are well adapted to our model. In particular, other proposals use losses based on MSE or similar [5], [25], [17], whereas we use BCE and compare it with MSE as a reference.

3. Architecture

The architecture of the presented SalGAN is based on two deep convolutional neural network (DCNN) modules, namely the generator and the discriminator, whose combined efforts aim at predicting a visual saliency map for a given input image. This section provides details on the structure of both modules, the considered loss functions, and the initialization before beginning adversarial training.

3.1. Generator

SalGAN follows a convolutional encoder-decoder architecture, where the encoder part includes max pooling layers that decrease the size of the feature maps, while the decoder part uses upsampling layers followed by convolutional filters to construct an output that is the same resolution as the input. Figure 2 shows the architecture of the system.

The encoder part of the network is identical in architecture to VGG-16 [27], omitting the final pooling and fully connected layers. The network is initialized with the weights of a VGG-16 model trained on the ImageNet dataset for object classification [6]. Only the last two groups of convolutional layers in VGG-16 are modified during the training for saliency prediction, while the earlier layers remain fixed from the original VGG-16 model. We fix these weights to save computational resources during training, even at the possible expense of some loss in performance.
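To make the encoder-decoder concrete, the following is a minimal PyTorch sketch of a generator with this structure. It is illustrative rather than the authors' released implementation: the decoder widths, the frozen-layer boundary index, and all names are our assumptions based on the description above.

```python
import torch.nn as nn
from torchvision.models import vgg16

def up_block(in_ch, out_ch, n_convs):
    # A decoder block: n_convs 3x3 convolutions + ReLU, then 2x upsampling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Upsample(scale_factor=2))
    return layers

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # VGG-16 convolutional blocks with ImageNet weights; the final
        # max pooling (and the fully connected layers) are omitted.
        self.encoder = vgg16(pretrained=True).features[:-1]
        # Keep conv1-conv3 fixed; only conv4_* and conv5_* are fine-tuned.
        for p in self.encoder[:17].parameters():
            p.requires_grad = False
        # Decoder mirroring the encoder, pooling replaced by upsampling,
        # ending in a 1x1 convolution with a sigmoid non-linearity.
        self.decoder = nn.Sequential(
            *up_block(512, 512, 3),
            *up_block(512, 256, 3),
            *up_block(256, 128, 3),
            *up_block(128, 64, 2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, 3, 192, 256)
        return self.decoder(self.encoder(x))   # (B, 1, 192, 256)
```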
The decoder architecture is structured in the same way as the encoder, but with the ordering of layers reversed, and with pooling layers being replaced by upsampling layers. Again, ReLU non-linearities are used in all convolution layers, and a final 1×1 convolution layer with sigmoid non-linearity is added to produce the saliency map. The weights for the decoder are randomly initialized.

3.2. Discriminator

Table 1 gives the architecture and layer configuration for the discriminator. In short, the network is composed of six 3×3 kernel convolutions interspersed with three pooling layers (↓2), and followed by three fully connected layers. The convolution layers all use ReLU activations, while the fully connected layers employ tanh activations, with the exception of the final layer, which uses a sigmoid activation.

layer    depth  kernel  stride  pad  activation
conv1_1  3      1×1     1       1    ReLU
conv1_2  32     3×3     1       1    ReLU
pool1    -      2×2     2       0    -
conv2_1  64     3×3     1       1    ReLU
conv2_2  64     3×3     1       1    ReLU
pool2    -      2×2     2       0    -
conv3_1  64     3×3     1       1    ReLU
conv3_2  64     3×3     1       1    ReLU
pool3    -      2×2     2       0    -
fc4      100    -       -       -    tanh
fc5      2      -       -       -    tanh
fc6      1      -       -       -    sigmoid

Table 1: Architecture of the discriminator network.
4. Training

The filter weights in SalGAN have been trained over a perceptual loss [13] resulting from combining a content loss and an adversarial loss. The content loss follows a classic approach in which the predicted saliency map is pixel-wise compared with the corresponding one from the ground truth. The adversarial loss depends on the real/synthetic prediction of the discriminator over the generated saliency map.

4.1. Content loss

The content loss is computed on a per-pixel basis, where each value of the predicted saliency map is compared with its corresponding peer from the ground truth map. Given an image I of dimensions N = W × H, we represent the saliency map S ∈ R^N as a vector of probabilities, where S_j is the probability of pixel I_j being fixated. A content loss function L(S, Ŝ) is defined between the predicted saliency map Ŝ and its corresponding ground truth S.

The first considered content loss is mean squared error (MSE) or Euclidean loss, defined as:

    L_{MSE} = \frac{1}{N} \sum_{j=1}^{N} (S_j - \hat{S}_j)^2.    (1)

In our work, MSE is used as a baseline reference, as it has been adopted directly or with some variations in other state-of-the-art solutions for visual saliency prediction [25, 17, 5]. Solutions based on MSE aim at maximizing the peak signal-to-noise ratio (PSNR). These works tend to filter high spatial frequencies in the output, favoring in this way blurred contours. MSE corresponds to computing the Euclidean distance between the predicted saliency and the ground truth.

Ground truth saliency maps are normalized so that each value is in the range [0, 1]. Saliency values can therefore be interpreted as estimates of the probability that a particular pixel is attended by an observer. It is therefore tempting to induce a multinomial distribution on the predictions using a softmax on the final layer. Clearly, however, more than a single pixel may be attended, making it more appropriate to treat each predicted value as independent of the others. We therefore propose to apply an element-wise sigmoid to each output in the final layer so that the pixel-wise predictions can be thought of as probabilities for independent binary random variables. An appropriate loss in such a setting is the binary cross entropy, which is the average of the individual binary cross entropies (BCE) across all pixels:

    L_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left( S_j \log \hat{S}_j + (1 - S_j) \log (1 - \hat{S}_j) \right).    (2)
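As a sketch, the two content losses translate directly into code; here `pred` is the sigmoid output of the generator and `gt` the normalized ground truth map, both tensors of values in [0, 1] (an illustrative example, not the released training code):

```python
import torch
import torch.nn.functional as F

def mse_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Equation (1): mean squared (Euclidean) error over the N pixels.
    return ((gt - pred) ** 2).mean()

def bce_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Equation (2): average of the per-pixel binary cross entropies,
    # treating each pixel as an independent binary random variable.
    return F.binary_cross_entropy(pred, gt)
```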
4.2. Adversarial loss

Generative adversarial networks (GANs) [7] are commonly used to generate images with realistic statistical properties. The idea is to simultaneously fit two parametric functions. The first of these functions, known as the generator, is trained to transform samples from a simple distribution (e.g. Gaussian) into samples from a more complicated distribution (e.g. natural images). The second function, the discriminator, is trained to distinguish between samples from the true distribution and generated samples. Training proceeds alternating between training the discriminator using generated and real samples, and training the generator, by keeping the discriminator weights constant and backpropagating the error through the discriminator to update the generator weights.

The saliency prediction problem has some important differences from the above scenario. First, the objective is to fit a deterministic function that generates realistic saliency values from images, rather than realistic images from random noise. As such, in our case the input to the generator is not random noise but an image. Second, it is clear that knowledge of the image that a saliency map corresponds to is essential to evaluate quality. We therefore include both the image and the saliency map as inputs to the discriminator. Finally, when using generative adversarial networks to generate realistic images, there is generally no ground truth to compare against. In our case, however, the corresponding ground truth saliency map is available. When updating the parameters of the generator function, we found that using a loss function that is a combination of the error from the discriminator and the cross entropy with respect to the ground truth improved the stability and convergence rate of the adversarial training. The final loss function for the generator during adversarial training can be formulated as:

    L_{GAN} = \alpha \cdot L_{BCE} - \log D(I, \hat{S}),    (3)

where D(I, Ŝ) is the probability of fooling the discriminator, so that the loss associated to the generator grows when the chances of fooling the discriminator are lower. In our experiments, we used a hyperparameter of α = 0.05. During the training of the discriminator, no content loss is available and the sign of the adversarial term is switched.
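In code, Equation (3) and the switched-sign discriminator objective could be sketched as follows, where `disc` maps an image stacked with a saliency map to D(I, S) ∈ (0, 1); α = 0.05 as above, while the small `eps` guarding the logarithms is our addition:

```python
import torch
import torch.nn.functional as F

ALPHA, EPS = 0.05, 1e-7

def generator_adv_loss(disc, image, pred, gt):
    # Equation (3): alpha * L_BCE - log D(I, S_hat).
    content = F.binary_cross_entropy(pred, gt)
    d_fake = disc(torch.cat([image, pred], dim=1))
    return ALPHA * content - torch.log(d_fake + EPS).mean()

def discriminator_loss(disc, image, pred, gt):
    # No content term; the sign of the adversarial term is switched:
    # real (image, ground truth) pairs are pushed towards 1, generated
    # pairs towards 0. detach() keeps gradients out of the generator.
    d_real = disc(torch.cat([image, gt], dim=1))
    d_fake = disc(torch.cat([image, pred.detach()], dim=1))
    return -(torch.log(d_real + EPS).mean()
             + torch.log(1.0 - d_fake + EPS).mean())
```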
At train time, we first bootstrap the generator function by training for 15 epochs using only BCE, which is computed with respect to the downsampled output and ground truth saliency. After this, we add the discriminator and begin adversarial training. The input to the discriminator network is an RGBS image of size 256×192×4, containing both the source image channels and the (predicted or ground truth) saliency.

We train the networks on the 10,000 images from the SALICON training set using a batch size of 32. This was the largest batch size we could use given the memory constraints of our hardware. During the adversarial training, we alternate the training of the generator and discriminator after each iteration (batch). We used L2 weight regularization (i.e. weight decay) when training both the generator and discriminator (λ = 1×10⁻⁴). We used AdaGrad for optimization, with an initial learning rate of 3×10⁻⁴.
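Putting the pieces together, the schedule described above could be condensed as below. This sketch reuses the `generator`, `discriminator` and loss helpers from the previous sketches, and assumes a `loader` yielding (image, ground truth) batches of 32; it is not the authors' released training script.

```python
import torch

# AdaGrad with L2 weight decay, as reported: lr 3e-4, lambda 1e-4.
opt_g = torch.optim.Adagrad(generator.parameters(), lr=3e-4, weight_decay=1e-4)
opt_d = torch.optim.Adagrad(discriminator.parameters(), lr=3e-4, weight_decay=1e-4)

for epoch in range(120):
    for step, (image, gt) in enumerate(loader):
        if epoch < 15:
            # Bootstrap phase: BCE content loss only, no discriminator.
            loss = bce_loss(generator(image), gt)
            opt_g.zero_grad(); loss.backward(); opt_g.step()
        elif step % 2 == 0:
            # Adversarial phase: alternate discriminator and generator
            # updates after each iteration (batch).
            d_loss = discriminator_loss(discriminator, image, generator(image), gt)
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        else:
            g_loss = generator_adv_loss(discriminator, image, generator(image), gt)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```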
5. Experiments

The presented SalGAN model for visual saliency prediction was assessed and compared from different perspectives. First, the impact of using BCE and the downsampled saliency maps is assessed. Second, the gain of the adversarial loss is measured and discussed, both from a quantitative and a qualitative point of view. Finally, the performance of SalGAN is compared with the current state-of-the-art, considering both published and unpublished works. The experiments aimed at finding the best configuration for SalGAN were run using the train and validation partitions of the SALICON dataset [12]. This is a large dataset built by collecting mouse clicks on a total of 20,000 images from the Microsoft Common Objects in Context (MS-CoCo) dataset [23]. We have adopted this dataset for our experiments because it is the largest one available for visual saliency prediction. In addition to SALICON, Section 5.3 also presents results on MIT300, the benchmark with the largest number of submissions.

5.1. Non-adversarial training

The two content losses presented in Section 4.1, MSE and BCE, were compared to define a baseline upon which we later assess the impact of the adversarial training. The first two rows of Table 3 show how a simple change from MSE to BCE brings a consistent improvement in all metrics. This improvement suggests that treating saliency prediction as multiple binary classification problems is more appropriate than treating it as a standard regression problem, in spite of the fact that the target values are not binary. Minimizing cross entropy is equivalent to minimizing the KL divergence between the predicted and target distributions, which is a reasonable objective if both predictions and targets are interpreted as probabilities.

        sAUC↑  AUC-B↑  NSS↑   CC↑    IG↑
BCE     0.752  0.825   2.473  0.761  0.712
BCE/2   0.750  0.820   2.527  0.764  0.592
BCE/4   0.755  0.831   2.511  0.763  0.825
BCE/8   0.754  0.827   2.503  0.762  0.831

Table 2: Impact of downsampled saliency maps (at 15 epochs) evaluated over SALICON validation. BCE/x refers to a downsample factor of 1/x over a saliency map of 256×192.

Given the superior results of the BCE-based loss compared with MSE, we also explored the impact of computing the content loss over downsampled versions of the saliency map. This technique reduces the required computational resources at both training and test time and, as shown in Table 2, not only does it not decrease performance, but it can actually improve it. Given these results, we chose to train SalGAN on saliency maps downsampled by a factor of 1/4, which in our architecture corresponds to saliency maps of 64×48.
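For instance, with a downsample factor of 1/4 the ground truth can simply be resized before applying the BCE of Equation (2); a minimal sketch, assuming maps stored as (batch, 1, 192, 256) tensors of values in [0, 1]:

```python
import torch.nn.functional as F

def bce_downsampled(pred_small, gt, factor=4):
    # pred_small: generator output at 64x48; gt: ground truth at 256x192.
    gt_small = F.interpolate(gt, scale_factor=1.0 / factor,
                             mode='bilinear', align_corners=False)
    return F.binary_cross_entropy(pred_small, gt_small.clamp(0.0, 1.0))
```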
5.2. Adversarial gain

The gain achieved by introducing the adversarial loss into the perceptual loss (see Section 4.2) was assessed by using BCE as a content loss and feature maps of 64×48. The first row of results in Table 3 refers to a baseline defined by training SalGAN with the BCE content loss for 15 epochs only. Later, two options are considered: 1) training based on BCE only (2nd row), or 2) introducing the adversarial loss (3rd and 4th rows).
        sAUC↑  AUC-B↑  NSS↑   CC↑    IG↑
MSE     0.728  0.820   1.680  0.708  0.628
BCE     0.753  0.825   2.562  0.772  0.824
BCE/4   0.757  0.833   2.580  0.772  1.067
GAN/4   0.773  0.859   2.589  0.786  1.243

Table 3: Best results through epochs obtained with non-adversarial (MSE and BCE) and adversarial (GAN) training. BCE/4 and GAN/4 refer to downsampled saliency maps. Saliency maps assessed on SALICON validation.

[Figure 3: six line plots of Sim, AUC Judd, AUC Borji, CC, NSS and IG (Gaussian) over training epochs 20-120, comparing GAN+BCE against BCE only.]

Figure 3: SALICON validation set accuracy metrics for GAN+BCE vs BCE on varying numbers of epochs. AUC shuffled is omitted as the trend is identical to that of AUC Judd.

Figure 3 compares validation set accuracy metrics for training with the combined GAN and BCE loss versus BCE alone as the number of epochs increases. In the case of the AUC metrics (Judd and Borji), increasing the number of epochs does not lead to significant improvements when using BCE alone. The combined BCE/GAN loss, however, continues to improve performance with further training. After 100 and 120 epochs, the combined GAN/BCE loss shows substantial improvements over BCE for five of six metrics.

The single metric for which GAN training fails to improve performance is normalized scanpath saliency (NSS). The reason for this may be that GAN training tends to produce a smoother and more spread out estimate of saliency, which better matches the statistical properties of real saliency maps, but may increase the false positive rate. As noted by Bylinskii et al. [3], NSS is very sensitive to such false positives. The impact of increased false positives depends on the final application. In applications where the saliency map is used as a multiplicative attention model (e.g. in retrieval applications, where spatial features are importance weighted), false positives are often less important than false negatives, since while the former include more distractors, the latter remove potentially useful features. Note also that NSS is differentiable, so it could potentially be optimized directly when important for a particular application, as in the sketch below.
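Since NSS is differentiable, a sketch of how it could serve as a training objective (our illustration, not part of SalGAN) is to z-score the predicted map, average it at fixated pixels, and minimize its negative:

```python
import torch

def nss(saliency, fixations, eps=1e-7):
    # saliency: predicted map; fixations: binary map of fixated pixels.
    s = (saliency - saliency.mean()) / (saliency.std() + eps)  # z-score
    return (s * fixations).sum() / (fixations.sum() + eps)     # mean at fixations

# Used as a loss, one would minimize -nss(pred, fixations).
```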
5.3. Comparison with the state-of-the-art

SalGAN is compared in Table 4 to several other algorithms from the state-of-the-art. The comparison is based on the evaluations run by the organizers of the SALICON and MIT300 benchmarks on test datasets whose ground truth is not public. The two benchmarks offer complementary features: while SALICON is a much larger dataset with 5,000 test images, MIT300 has attracted the participation of many more researchers. In both cases, SalGAN was trained using the 15,000 images contained in the training (10,000) and validation (5,000) partitions of the SALICON dataset. Notice that while both datasets aim at capturing visual saliency, the acquisition of the data differed: SALICON ground truth was generated based on crowdsourced mouse clicks, while MIT300 was built with eye trackers on a limited and controlled group of users. Table 4 compares SalGAN not only with published models, but also with works that have not been peer reviewed. SalGAN presents very competitive results on both datasets, as it improves or equals the performance of all other models in at least one metric.
SALICON (test)        AUC-J↑  Sim↑    EMD↓    AUC-B↑   sAUC↑    CC↑      NSS↑     KL↓
DSCLRCN [24] (*)      -       -       -       0.884    0.776    0.831    3.157    -
SalGAN                -       -       -       0.884    0.772    0.781    2.459    -
ML-NET [5]            -       -       -       (0.866)  (0.768)  (0.743)  2.789    -
SalNet [25]           -       -       -       (0.858)  (0.724)  (0.609)  (1.859)  -

MIT300                AUC-J↑  Sim↑    EMD↓    AUC-B↑   sAUC↑    CC↑      NSS↑     KL↓
Humans                0.92    1.00    0.00    0.88     0.81     1.0      3.29     0.00
DeepGaze II [21] (*)  0.88    (0.46)  (3.98)  0.86     0.72     (0.52)   (1.29)   (0.96)
DSCLRCN [24] (*)      0.87    0.68    2.17    (0.79)   0.72     0.80     2.35     0.95
DeepFix [17] (*)      0.87    0.67    2.04    (0.80)   (0.71)   0.78     2.26     0.63
SALICON [9]           0.87    (0.60)  (2.62)  0.85     0.74     0.74     2.12     0.54
SalGAN                0.86    0.63    2.29    0.81     0.72     0.73     2.04     1.07
PDP [11]              (0.85)  (0.60)  (2.58)  (0.80)   0.73     (0.70)   2.05     0.92
ML-NET [5]            (0.85)  (0.59)  (2.63)  (0.75)   (0.70)   (0.67)   2.05     (1.10)
DeepGaze I [19]       (0.84)  (0.39)  (4.97)  0.83     (0.66)   (0.48)   (1.22)   (1.23)
iSEEL [29] (*)        (0.84)  (0.57)  (2.72)  0.81     (0.68)   (0.65)   (1.78)   0.65
SalNet [25]           (0.83)  (0.52)  (3.31)  0.82     (0.69)   (0.58)   (1.51)   0.81
BMS [31]              (0.83)  (0.51)  (3.35)  0.82     (0.65)   (0.55)   (1.41)   0.81

Table 4: Comparison of SalGAN with other state-of-the-art solutions on the SALICON (test) and MIT300 benchmarks, according to their public leaderboards on November 10, 2016. Values in brackets correspond to performances worse than SalGAN. (*) indicates citations to non-peer reviewed texts.
5.4. Qualitative results

The impact of adversarial training has also been explored from a qualitative perspective by observing the resulting saliency maps. Figure 4 shows three examples from the MIT300 dataset, highlighted by Bylinskii et al. [4] as being particular challenges for existing saliency algorithms. The areas highlighted in yellow in the images on the left are regions that are typically missed by algorithms. In the first example, we see that SalGAN successfully detects the often missed hand of the magician and the face of the boy as being salient. The second example illustrates a failure case where SalGAN, like other algorithms, also fails to place sufficient saliency on the area around the white ball (though this region does have more saliency than most of the rest of the image). The final example illustrates what we believe to be one of the limitations of this dataset: the ground truth places most of the saliency on the smaller text at the bottom of the sign. This is because observers tend to spend more time attending this area (reading the text), and not because it is the first area that observers tend to attend. Existing metrics, however, tend to be agnostic to the order in which areas are attended, a limitation we hope to look into in the future.

[Figure 4: images with columns Ground truth, SalGAN.]

Figure 4: Example images from MIT300 containing a salient region (marked in yellow) that is often missed by computational models, and the saliency map estimated by SalGAN.

Figure 5 illustrates the effect of adversarial training on the statistical properties of the generated saliency maps. Shown are two close-up sections of a saliency map from cross entropy training (left) and adversarial training (right). Training on BCE alone produces saliency maps that, while they may be locally consistent with the ground truth, are often less smooth and have complex level sets. GAN training, on the other hand, produces much smoother and simpler level sets.

Figure 5: Close-up comparison of output from training on BCE loss vs combined BCE/GAN loss. Left: saliency map from network trained with BCE loss. Right: saliency map from proposed GAN training.

Finally, Figure 6 shows some qualitative results comparing the results from training with MSE, BCE, and BCE/GAN against the ground truth for images from the SALICON validation set.

[Figure 6: examples with columns Images, Ground Truth, MSE, BCE, SalGAN.]

Figure 6: Qualitative results of SalGAN on the SALICON validation set.

6. Conclusions
In this work we have shown how adversarial training over a deep convolutional neural network can achieve state-of-the-art performance with a simple encoder-decoder architecture. A BCE-based content loss was shown to be effective both for initializing the generator and as a regularization term for stabilizing adversarial training. Our experiments showed that adversarial training improved all bar one saliency metric when compared to further training on cross entropy alone.

It is worth pointing out that although we use a VGG-16 based encoder-decoder model as the generator in this paper, the proposed GAN training approach is generic and could be applied to improve the performance of other deep saliency models. For example, the authors of the DSCLRCN model observed that ResNet-based architectures outperformed VGG-based ones for saliency prediction, and that using multiple image scales improves accuracy. Similar modifications to our model would likely provide similar gains. Further improvements could also likely be achieved by carefully tuning hyperparameters, and in particular the tradeoff between the BCE and GAN losses in Equation 3, which we did not attempt to optimize for this paper. Finally, an ensemble of models is also likely to improve performance, at the cost of additional computation at prediction time.

Our results can be reproduced with the source code and trained models available at https://imatge-upc.github.io/saliency-salgan-2017/.

7. Acknowledgments

The Image Processing Group at UPC is supported by the projects TEC2013-43935-R and TEC2016-75976-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). The Image Processing Group at UPC is a SGR14 Consolidated Research Group recognized and sponsored by the Catalan Government (Generalitat de Catalunya) through its AGAUR
office. The contribution from the Barcelona Supercomputing Center has been supported by project TIN2015-65316 of the Spanish Ministry of Science and Innovation and contract 2014-SGR-1051 of the Generalitat de Catalunya. This material is based upon works supported by Science Foundation Ireland under Grant No. 15/SIRG/3283. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GeForce GTX Titan Z used in this work, together with the support of the BSC/UPC NVIDIA GPU Center of Excellence. We also want to thank all the members of the X-theses group for their advice, especially Victor Garcia and Santi Pascual for their proofreading.

References

[1] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[2] Z. Bylinskii, T. Judd, A. Ali Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT saliency benchmark. http://saliency.mit.edu/.
[3] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? ArXiv preprint:1610.01563, 2016.
[4] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In European Conference on Computer Vision (ECCV), 2016.
[5] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[8] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Neural Information Processing Systems (NIPS), 2006.
[9] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[10] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), (20):1254-1259, 1998.
[11] S. Jetley, N. Murray, and E. Vig. End-to-end saliency mapping via probability distribution prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), 2016.
[14] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), 2009.
[15] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[17] S. S. Kruthiventi, K. Ayush, and R. V. Babu. DeepFix: A fully convolutional neural network for predicting human eye fixations. ArXiv preprint:1510.02927, 2015.
[18] S. S. Kruthiventi, V. Gudisa, J. Dholakiya, and R. V. Babu. Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] M. Kümmerer, L. Theis, and M. Bethge. DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet. In International Conference on Learning Representations (ICLR), 2015.
[20] M. Kümmerer, L. Theis, and M. Bethge. Information-theoretic model comparison unifies saliency metrics. Proceedings of the National Academy of Sciences (PNAS), 112(52):16054-16059, 2015.
[21] M. Kümmerer, T. S. Wallis, and M. Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. ArXiv preprint:1610.01563, 2016.
[22] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[24] N. Liu and J. Han. A deep spatial contextual long-term recurrent convolutional network for saliency detection. ArXiv preprint:1610.01708, 2016.
[25] J. Pan, E. Sayrol, X. Giró-i-Nieto, K. McGuinness, and N. E. O'Connor. Shallow and deep convolutional networks for saliency prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[26] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit. Saliency and human fixations: State-of-the-art and study of comparison metrics. In IEEE International Conference on Computer Vision (ICCV), 2013.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[29] H. R. Tavakoli, A. Borji, J. Laaksonen, and E. Rahtu. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. ArXiv preprint:1610.06449, 2016.
[30] E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[31] J. Zhang and S. Sclaroff. Saliency detection: A boolean map approach. In IEEE International Conference on Computer Vision (ICCV), 2013.