Human Attention Estimation for Natural Images: An Automatic Gaze Refinement Approach

Jinsoo Choi, Tae-Hyun Oh, Student Members, IEEE, and In So Kweon, Member, IEEE

Jinsoo Choi, Tae-Hyun Oh and In So Kweon (corresponding author) are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea. E-mail: [email protected], [email protected], [email protected]

Abstract—Photo collections and their applications today attempt to reflect user interactions in various forms, and they aim to capture the user's intention with minimum effort. Human interest regions in an image carry powerful information about the user's behavior and can be used in many photo applications. Research on human visual attention has been conducted in the form of gaze tracking and computational saliency models in the computer vision community, and has shown considerable progress. This paper presents an integration of implicit gaze estimation and a computational saliency model to effectively estimate human attention regions in images on the fly. Furthermore, our method estimates human attention via implicit calibration and incremental model updating without any active participation from the user. We also present extensive analysis and possible applications for personal photo collections.

Index Terms—Photo collections, computational saliency, gaze tracking, human attention.

Fig. 1: Illustration of key observations. [Top] Computational saliency models show high false positives on natural scene photos, but also show true positive responses. [Middle] Implicitly calibrated gaze estimation with a simple configuration shows consistent estimates despite marginal errors. [Bottom] Although people share common viewing traits, gaze patterns differ by individual. Images are from Judd et al. [27]. The ground truth fixation regions are overlaid on the images, while red dots indicate gaze prediction positions. Please refer to the color version.

I. INTRODUCTION

People today can access abundant photo collections in social networks and cloud services (e.g., Facebook, Instagram, Flickr, iPhoto, Picasa). Users can rely on automatic photo applications embedded in photo collection services (e.g., face tagging, tag region suggestion), which allow various forms of interaction. In fact, photo applications today attempt to produce interactive results reflecting human intentions by exploiting the behavior traits people show while accessing photo collections.

With similar motivations, computational modeling of human visual attention has recently attracted the interest of analysts and psychologists as well as computer vision researchers. In particular, attention modeling on visual data has been studied in the form of memorability [24], aesthetics [14], interestingness [19] or saliency [27].
Since human attention regions can be obtained by automatically quantifying subtle cues in human visual behavior, these models have been applied in various fields including computer vision applications [39], [30], [49], [33], [45], [10], and also non-engineering fields such as marketing and advertisement [43]. However, predicting interestingness (or regions of interest) remains challenging with learned or designed computational models alone, because it is highly subjective and dependent on specific categories [19]. Hence, some personal photo collection applications involve explicit user interactions [31], [51] such as active sketching, dragging of photo components, tagging or labeling to reflect the user's subjective attention preferences and produce user-oriented results.

Apart from other interaction-based methods, we pursue a simple method that captures individual user preferences by gaze estimation, without requiring the user's notice or active participation, and with only an off-the-shelf device, i.e., a web-cam. The key idea is to integrate rough gaze estimates with a computational attention model (in this paper, a computational saliency model). As illustrated in Fig. 1, the key observations are as follows. Firstly, in spite of the difficulty of accurately predicting salient regions in natural scenes, modern saliency models show reasonable agreement with salient human attention regions by virtue of the advances in computational saliency research (see Fig. 1-[Top]). Notice that the saliency maps respond to local image features, leading to extensive false positives, but they also show appropriate responses at the true attention locations obtained by an accurate eye tracking device. Secondly, implicitly calibrated gaze estimates from a simple configuration show rough but plausible consistency with expensive eye tracker results (ground truth) (see Fig. 1-[Middle]). Lastly, the gaze estimates of individuals show different fixation behaviors, while they also reflect common viewing traits (see Fig. 1-[Bottom]).

Based on these observations, we develop a fully incremental and simple human attention map estimation method by integrating existing saliency models with an implicit gaze estimation method. This integration allows our method to reflect personal attention preferences, which can be used in a number of personal photo applications. Moreover, our method does not require any active participation. In a task as challenging as accurate human attention estimation on natural scene images, we exploit a behavior humans naturally perform when encountering a photo: looking at it, i.e., gaze behavior. In short, our method requires neither a training step nor the user to actively execute a specified interaction guideline. The only thing a user has to do in our framework is view photos in a photo collection (which, as mentioned, is a highly natural process).

The preliminary version of this paper appeared in [3]. We extend [3] by addressing our motivation along with a detailed explanation of gaze estimation via implicit calibration in Sec. III-A and the refinement approach in Sec. III-B. The experimental setup and analysis have been entirely revised in Sec. IV, and we show a number of applications in Sec. V. Overall, our contributions can be summarized as follows:
• We develop a human visual attention estimation method that works on the fly for natural image collections without any special hardware settings, based on implicit calibration and incremental model updating. While users use photo collection services, our proposed method can learn user-specific parameters without the user noticing.
• We experiment systematically to show the validity and flexibility of the proposed method with 10 subjects.
• To demonstrate how the proposed method can be utilized in practice, we show three possible applications for personal photo collections: saliency map refinement, tag candidate region suggestion, and photo retrieval by personal interest location query.
• If this paper is accepted, we will publish our dataset collected on top of MIT1003 [27] (which is used for our experiments). The dataset includes information no other gaze dataset provides: actual human eye images and gaze estimation data obtained from the implicit calibration of 10 individuals.

II. RELATED WORK

Gaze estimation. Gaze estimation is the process of estimating where a user is looking, which can be used for constructing an attentive human-computer interaction interface [22], [47]. Gaze estimation methods boil down to either model-based or appearance-based methods (thorough reviews of camera-based gaze estimation methods, including commercial products, can be found in Hansen et al. [20]). Model-based methods [20] typically involve specialized hardware and explicit 3D geometric eye modeling, which require sophisticated and extensive calibration. Most of the time, model-based approaches require the person to wear a head-mounted, close-up IR camera device, whereas appearance-based methods simply involve a single camera placed anywhere in the PC environment. We do not want to rely on specialized acquisition settings, since people usually do not install IR cameras or specialized configurations around their PC environment just for photo applications. We therefore adopt an appearance-based method with an off-the-shelf web-cam, which is commonly set up or built into PCs nowadays.

Works addressing gaze estimation by utilizing saliency include Sugano et al. [45] and Chen et al. [12]. While Chen et al. achieve relatively higher accuracy, they address a model-based setup with special camera settings and require a long data capture time. Sugano et al. construct a gaze estimator using eye images captured from a user watching a few video clips. Video-based approaches, including Sugano et al., Feng et al. [16] and Katti et al. [28], benefit from temporal cues in the form of temporal saliency aggregation or tracking. Our work involves independent photos instead of videos, so we cannot benefit from reliable saliency support based on inter-frame temporal relationships. Despite this, by aggregating independent saliency maps via incremental soft-clustering, we can enhance the gaze estimation quality even with a few independent images and can reduce data redundancy. As for appearance-based gaze estimation on independent images, Alnajar et al. [1] propose a data-driven approach, where a rough input gaze pattern is transformed to the most similar template pattern in a predefined database. This method requires numerous human fixation data with high accuracy.

Computational saliency models. Predicting saliency for natural scene photos remains a challenging task. Over the years, visual saliency research has aimed to computationally model and predict human attention behavior given visual data. Many attention models have been developed under a pioneering work, Feature Integration Theory (FIT) [46], which deals with which visual features are important and how they should be combined to reveal conspicuity. Biologically inspired by the center-surround structure of receptive fields, Itti et al. [25] were the first to computationally search patterns from multiple features on top of FIT. Following this pioneering work, most saliency models [21], [26], [18], [4], [9], [41], [27], [23], [17], [32], [5], [50], [40] are integration approaches that compute saliency by combining multiple visual features extracted from images according to an attention model criterion: rarity [5], spectral analysis [23], graph models [21], information theoretic models [9], etc. For thorough reviews of saliency estimation, one can refer to Borji et al. [6], [7], [8] and Riche et al. [38].

Apart from biologically inspired models and heuristic feature-integration methods, there have been machine learning approaches that learn the relationship between visual features and human attention behavior. Judd et al. [27] and Kienzle et al. [29] learn an SVM from low-, mid- and high-level features and human fixation data to construct a saliency model.
Recently, deep neural networks by Shen et al. [42] and Vig et al. [48] have shown relatively improved results in natural scenes due to their ability to represent complex models without explicitly learning high-level context models such as car or face detectors.

Our motivation is supported by Judd et al. [27] and Alnajar et al. [1], which reflect the fact that humans show similar viewing patterns when they look at a stimulus (we call this commonness). In general, computational saliency models represent this commonness and are used extensively in our work. In our approach, saliency maps are used to infer gaze positions in the gaze estimation step. Our method then integrates the roughly estimated gaze positions with a computational saliency baseline and uses them, for each independent image, as a guide to obtain a refined attention map. Through this integration, our method can effectively estimate person-specific attention regions in natural image collections without any special settings or tedious processes.

III. AUTOMATIC GAZE REFINEMENT APPROACH

In order to estimate person-specific attention on natural scene images without special settings, we proceed in two stages: (1) initial gaze estimation via implicit calibration and (2) integration of the initial gaze estimates with computational saliency maps.

A. Initial gaze estimation via implicit calibration

In this stage, our goal is to construct a gaze estimator via implicit calibration. Specifically, our algorithm learns the user's gaze behavior while the user views photos in a collection. Consider a person viewing a few images denoted by {I_1, ..., I_N}, where I_n ∈ R^{p×q}, N denotes the number of images, and p×q denotes the image dimensions. While the person is viewing an image, a single camera captures the viewer's face. We capture M frames of the person's frontal view per image seen (we use a fixed M for simplicity of the presentation; in practice it can be an arbitrary value), so a total of NM frames of the person's face is captured. We extract the eye images using an Active Shape Model (ASM) [35] and use the gray-scale eye images in vectorized form e_nm ∈ R^d, where n and m denote the indices of the n-th image and the m-th eye image respectively, and d denotes the eye feature vector dimension.
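The eye feature is nothing more than a down-sampled gray-scale eye patch flattened into a vector. As a rough illustration only (the paper's pipeline ran in MATLAB with ASM; here OpenCV and numpy are assumed, the eye bounding box is assumed to come from any landmark detector, and the 3x5 patch size follows Sec. IV):

    import cv2
    import numpy as np

    def eye_feature(frame_bgr, eye_box, patch_size=(5, 3)):
        # Crop an eye region and return the down-sampled gray-scale
        # feature vector e_nm (3x5 = 15-dimensional, as in Sec. IV).
        # frame_bgr: HxWx3 webcam frame.
        # eye_box:   (x, y, w, h) eye bounding box from a landmark detector (assumed given).
        x, y, w, h = eye_box
        patch = frame_bgr[y:y + h, x:x + w]
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, patch_size, interpolation=cv2.INTER_AREA)  # 3 rows x 5 cols
        return small.astype(np.float32).ravel() / 255.0                     # d = 15 feature vector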
In order to regress the relationship between eye images and fixation locations, i.e., a gaze estimator, we need not only eye feature data but also the corresponding fixation positions. Since the user does not undergo an explicit calibration scheme for gaze training and simply views images, we regard the saliency maps of the viewed images as fixation position probability distributions, based on our observations discussed in Sec. I and II. We apply a computational saliency model that combines bottom-up low-level features such as orientation, color and intensity with top-down cognitive visual features [4]. (We mainly use the method suggested by Borji et al. [4], but any saliency method can be used in our framework; we report the effects of our algorithm with respect to multiple baseline saliency methods in Sec. V.) However, a single saliency map per se may not be a reliable fixation position indicator. Imagine that a user naturally looked at the same position in different images and we aggregate those images' saliency maps. Then it is highly likely that a vivid peak of saliency appears around that position in the aggregated map, due to the consensus of the independent saliency maps. Moreover, since the saliency map [4] is computed from a combination of low- and high-level features, it is expected to produce high true positive responses (i.e., salient regions near human fixation positions) and thus to serve as a reliable fixation position indicator when aggregated.

Aggregation stage. We cluster the collection of eye features with a Gaussian Mixture Model (GMM) [2] to reduce redundant data and effectively train the gaze estimator. Simultaneously, this allows us to estimate weights for saliency aggregation with respect to the clustered eye features. Each Gaussian component is represented as

  P(e_nm | z_nm = c) = N(e_nm | µ_c, Σ_c),   (1)

where z_nm denotes a latent variable of e_nm for cluster membership, and µ_c and Σ_c are the mean and covariance of the c-th cluster respectively. Since the most computationally expensive part of our algorithm is this eye feature clustering, batch GMM [2] is used only at the initial stage, while we use incremental GMM [11] for later updates to reduce computation (the update rules for eye feature clustering are derived in the appendix). We set the number of clusters K to 30 throughout the entire experimental procedure.

Aggregation of the saliency maps is done by taking the weights given to the eye feature vectors by the GMM and applying them to the corresponding collection of saliency maps. Specifically, for the training images {I_n}, we compute the corresponding normalized saliency maps {s_n}, where s_n ∈ [0,1]^{p×q}. The saliency maps are aggregated with weights from the eye feature membership probabilities as

  p_c = Σ_{n,m} P(e_nm | z_nm = c) s_n / Σ_{n,m} P(e_nm | z_nm = c),   (2)

producing a fixation position probability map p_c ∈ R^{p×q}. This yields reliable pairs of fixation position probability maps and eye feature cluster centers by suppressing irrelevant saliency effects.
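A minimal sketch of the aggregation in Eqs. (1)-(2), with scikit-learn's GaussianMixture standing in for the batch/incremental GMM described above and the posterior responsibilities used as the weights (variable names are illustrative, not from the paper):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def aggregate_saliency(eye_feats, img_index, saliency_maps, n_clusters=30):
        # eye_feats:     (N*M, d) eye feature vectors e_nm
        # img_index:     (N*M,)   index n of the image shown when e_nm was captured
        # saliency_maps: (N, p, q) normalized saliency maps s_n in [0, 1]
        # Returns cluster centers mu_c (K, d) and fixation probability maps p_c (K, p, q).
        gmm = GaussianMixture(n_components=n_clusters, covariance_type='diag').fit(eye_feats)
        resp = gmm.predict_proba(eye_feats)          # (N*M, K) cluster membership weights
        S = saliency_maps[img_index]                 # (N*M, p, q) saliency of the image seen per sample
        # Eq. (2): per-cluster weighted average of saliency maps
        num = np.einsum('ik,ipq->kpq', resp, S)
        p_c = num / (resp.sum(axis=0)[:, None, None] + 1e-12)
        return gmm.means_, p_c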
Learning the relationship between eye features and implicit gaze. To predict gaze positions, we learn Gaussian process regression (GPR) [37] between eye feature cluster centers and corresponding gaze positions, i.e., we construct an estimator that predicts the distribution P(g* | e*, D_g) given a new eye image e* and the training data D_g = {(µ_1, g_1), ..., (µ_K, g_K)}, where g_c ∈ R^2 denotes the 2D gaze position corresponding to the eye feature cluster center µ_c ∈ R^d. However, we only have a dataset pairing eye feature cluster centers with fixation probability maps (the aggregated saliency maps), D_p = {(µ_1, p_1), ..., (µ_K, p_K)}, instead of the set D_g of eye features and gaze points. By simply picking the maximum probability location of p_c as g_c, we construct the training set D_g.

We model and regress the x and y components of gaze g = [g_x, g_y] independently, which reduces the problem dimension and improves efficiency. For simplicity, we describe only the x component g_x; deriving g_y is analogous. We assume a noisy observation model g*_x = f(e*) + ε with noise ε ~ N(0, γ), where γ ≥ 0. GPR assumes that the data can be represented as a sample drawn from a Gaussian distribution, so f(·) is modeled by GPR with a zero-mean multivariate Gaussian distribution:

  [→g_x ; g*_x] ~ N( 0, [ K, K*^T ; K*, K** ] ),   (3)

where K ∈ R^{K×K} with K(i,j) = k(µ_i, µ_j) + γ δ_ij, K* ∈ R^{1×K} with K*(j) = k(e*, µ_j), K** ∈ R^{1×1} with K** = k(e*, e*), →g_x = [g_x^1; ...; g_x^K] from D_g, k(·,·) denotes a kernel (also called a covariance function), and δ_ij is the Kronecker delta, which is 1 if i = j and 0 otherwise. We adopt the d-th order polynomial kernel defined as

  k(u, v) = α (β + u^T v)^d,   (4)

where α > 0 and β ≥ 0. We use the 3rd-order polynomial, i.e., d = 3. The hyper-parameters to be learned in our model are then α, β and γ. We learn them from the training data D_g with the GPML toolbox [36]. Note that, since we use the clustered data {µ_c}, the computational complexity of the learning phase mainly depends on the number of clusters K and the eye feature dimension d. Hence, the hyper-parameter learning is very fast with the small number K and the down-sampled eye feature vectors.

In contrast to the squared exponential (SE) kernel, k(u, v) = α exp(−β ‖u − v‖^2), typically used in GPR, we use the polynomial kernel. It is known that the polynomial kernel leads to rapidly decaying function shapes and preserves high-frequency regression, while the SE kernel acts over a broader range due to its heavier tail and behaves like a low-pass filter that smooths out the solution space [44]. Since the eye images have similar appearances and the gaze estimation is required to be accurate, the polynomial kernel is the better choice.

After learning the hyper-parameters of the GPR, we can infer the gaze estimate. From the joint distribution in Eq. (3), we are interested in the conditional probability P(g*_x | e*, D_g) to predict the gaze component g*_x for a newly observed eye image e*. Since the conditional probability is also Gaussian, we can compute the posterior mean in closed form:

  g*_x = K* K^{-1} →g_x.   (5)

Note that K^{-1} →g_x can be computed in advance in the learning phase, so predicting g*_x only requires cheap vector multiplications. A pseudocode of this stage is shown in Alg. 1.

Algorithm 1 Training phase - Implicit gaze calibration
1: display I_n
2: capture and save eye features {e_n,1:M}
3: if n = b (batch size) then
4:   batch GMM clustering of eye features
5: else if n > b then
6:   incremental GMM clustering of eye features
7: end if
8: for c = 1:K do
9:   p_c = Σ_{n,m} P(e_nm | z_nm = c) s_n / Σ_{n,m} P(e_nm | z_nm = c)
10:  take the maximum of p_c and construct D_g
11: end for
12: Train GPR with D_g
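Once the kernel in Eq. (4) is fixed, the GPR machinery of Eqs. (3)-(5) reduces to a few matrix operations. A minimal numpy sketch (the hyper-parameters alpha, beta, gamma are assumed to have been learned already, e.g., with the GPML toolbox; one model is fit per gaze coordinate):

    import numpy as np

    def poly_kernel(U, V, alpha, beta, d=3):
        # Eq. (4): k(u, v) = alpha * (beta + u^T v)^d, evaluated for all row pairs of U and V.
        return alpha * (beta + U @ V.T) ** d

    def fit_gpr(mu, g, alpha, beta, gamma, d=3):
        # mu: (K, dim) eye-feature cluster centers; g: (K,) one gaze coordinate (x or y).
        # Precompute K^{-1} g so that test-time prediction (Eq. (5)) is a single dot product.
        K = poly_kernel(mu, mu, alpha, beta, d) + gamma * np.eye(len(mu))
        return np.linalg.solve(K, g)

    def predict_gpr(e_star, mu, Kinv_g, alpha, beta, d=3):
        # Eq. (5): posterior mean g* = k(e*, mu) K^{-1} g.
        k_star = poly_kernel(e_star[None, :], mu, alpha, beta, d)  # (1, K)
        return float(k_star @ Kinv_g)

Because Kinv_g is precomputed at training time, each new eye image only costs one kernel row and one dot product, matching the "cheap vector multiplications" noted above.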
B. Integrating initial gaze estimation and computational saliency maps

The basic notion of this stage is to estimate human attention regions on each image by comparing and contrasting the initial gaze estimation result with the computational saliency map of the image. Typically, gaze estimation via implicit calibration inevitably has lower accuracy than an explicit calibration scheme. This is because explicit calibration involves ground truth fixation points, whereas implicit calibration involves gaze points deduced from probability maps instead. For this reason, the initial gaze estimation results from the previous stage are, on their own, not reliable enough for image attention region estimation. Similarly, although many computational saliency models produce high overlap with actual human fixation positions, they also retain many false positive regions. Natural scenes have enormous variety in content and subtle characteristics that make it difficult for current state-of-the-art saliency models to predict salient human attention regions accurately. Thus, computational saliency maps alone are not reliable either for our purpose.

Despite these limitations, initial gaze estimates and computational saliency maps provide valuable information and cues. Our goal is to utilize the initial gaze estimates as a rough indicator of attention regions, and the potential true positive responses on saliency maps as a useful prior over human attention candidates for each viewed image. We integrate them to detect the true salient regions produced by the computational saliency model and produce a reliable attention estimate via a mode seeking algorithm.

The basic idea is to seek out salient regions around the initially estimated gaze positions through greedy search, which is practically done by a mean shift algorithm [13]. The mean shift algorithm requires no prior knowledge of the number or shape of the distribution. While mean shift traditionally finds a mode from samples drawn from an arbitrary mother distribution by kernel density estimation, in our problem the saliency map ŝ of the viewed image Î can be regarded directly as a 2D density map. Since we collect M eye feature vectors for a viewed image Î, we have initial gaze estimates denoted {ĝ_1^(0), ..., ĝ_M^(0)}. These gaze estimates mark where the mode seeking algorithm starts. The problem is then simplified to

  ĝ^(k+1) = argmax_{x ∈ N_h(ĝ^(k))} ŝ(x),   (6)

where ĝ^(k) denotes a gaze point at the k-th iteration and N_h(x) is the neighborhood of x within window size h. We iteratively solve Eq. (6) for each gaze point by greedily shifting to the maximum value within the window at every iteration. We terminate the algorithm once the movement has converged. Since our method is inspired by mean shift, convergence to a local maximum is guaranteed [13], and therefore convergence toward salient regions near the estimated gaze positions is guaranteed. The resulting M convergence positions on the image saliency map are convolved with a Gaussian kernel to produce the estimated attention map. A pseudocode of this stage is shown in Alg. 2.

Algorithm 2 Testing phase - Integration stage
1: Output gaze estimation results from GPR
2: for n = 1:N do
3:   mean shift (initialized at gaze points)
4:   convolve a Gaussian kernel
5: end for
6: Output attention maps
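The mode seeking of Eq. (6) amounts to repeatedly jumping to the maximum saliency value inside a local window. A small sketch of one seeker (the window size h in pixels and the iteration cap are free parameters chosen here for illustration):

    import numpy as np

    def seek_mode(saliency, start_xy, h=20, max_iter=50):
        # Greedy window-argmax mode seeking on a 2D saliency map (Eq. (6)).
        # saliency: (p, q) map of the viewed image; start_xy: initial gaze estimate (x, y).
        p, q = saliency.shape
        x, y = int(start_xy[0]), int(start_xy[1])
        for _ in range(max_iter):
            x0, x1 = max(x - h, 0), min(x + h + 1, q)
            y0, y1 = max(y - h, 0), min(y + h + 1, p)
            window = saliency[y0:y1, x0:x1]
            dy, dx = np.unravel_index(np.argmax(window), window.shape)
            nx, ny = x0 + dx, y0 + dy
            if (nx, ny) == (x, y):   # converged to a local maximum
                break
            x, y = nx, ny
        return x, y

Running this from each of the M initial gaze estimates and blurring the converged positions with a Gaussian kernel gives the attention map of Alg. 2.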
IV. EXPERIMENTS ON GAZE AND ATTENTION REGIONS

In this section, we describe how we conducted experiments for the implicit gaze estimation method and the attention region generation. The experimental procedure was divided into two parts, corresponding to the initial gaze estimation step and the integration step. We evaluate on data obtained from 10 subjects viewing photos selected at random.

Computational time was measured for batch GMM on 20 images (0.8-2.5 s), incremental GMM on 5 images (0.5 s), saliency aggregation (0.45 s), GPR training (0.1-0.5 s), and GPR testing (0.005 s) in MATLAB.

A. Initial gaze estimation evaluation

A subject views a total of 200 natural scene photos on a flat screen display (1-1.5 seconds each), just as a person would when viewing images on a PC. During every 1-1.5 seconds of image display, the web-cam captures 30 frames of the subject's frontal view. Eye images are extracted via ASM [35], and their gray-scale images resized to 3×5 are used as eye feature vectors for efficiency. The subjects sit approximately 60 cm from the screen and the images are displayed at 1200×1600 resolution. We use natural scene images from the MIT1003 dataset [27], and the displayed images are selected at random for every subject. Since our method addresses implicit calibration using independent images, we aggregate the independent saliency maps with the incremental soft-clustering method. Fig. 2 shows some examples of gaze estimation results on images. Although the gaze results provide rough attention region estimates, one can observe that some estimation error is present.

Fig. 2: Initial gaze estimation. This figure shows (Top) accurate gaze estimation examples (red dots) on natural scene images and (Bottom) misaligned gaze estimation examples. The ground truth fixation regions are overlaid on the images. Please refer to the color version.

First, we investigate gaze estimation accuracy depending on the number of images seen. Fig. 3 shows the gaze estimation error with respect to the number of images seen. Estimation error is taken as the mean error in degrees on test data. Each test image contains predefined points on which the subjects are instructed to fixate. A total of 450 test images with 9 predefined points are used to measure the mean estimation error in degrees. As shown in Fig. 3, as the number of images seen increases, the gaze estimation error falls, leading to higher accuracy. Notice that even after viewing only 20 images, the estimator reaches a reasonable accuracy of around 6 degrees of error. From 40 images, the curve already hits a plateau. This signifies that people only need to look at around 20 to 40 images to obtain reasonable gaze estimation quality.

Fig. 3: Mean error vs. number of images seen. As the user views images, online GMM incrementally clusters data for regression. The graph shows the mean error (in degrees) with respect to the number of images seen.

For quality assessment of our algorithm, we measure the gaze estimation error in comparison with combinations of different clustering and regression algorithms, including the k-means clustering used in Sugano et al. [45], k-nearest neighbor, and linear regression. Fig. 4 shows that our method produces lower error than the other baseline methods.

Fig. 4: Gaze estimation method comparison. This shows the mean estimation errors of the proposed method (GMM+GP) and other methods combining k-means or GMM clustering with GP, k-nearest neighbor and linear regression (GMM+kNN, GMM+LinReg, k-means+GP, k-means+kNN, k-means+LinReg).

B. Evaluation on attention maps

Here we evaluate our integration algorithm. To assess the quality of the attention maps, we utilize the ground truth fixation data provided in the MIT1003 dataset [27]. The fixation data provides all fixation positions of 15 subjects acquired with an accurate eye tracking device. The union of these fixation regions reflects virtually all possible human attention regions on an image, and we take this as the ground truth fixation regions. Since the ground truth fixation data contains results from multiple subjects, it represents a generalized fixation map. On the other hand, our resulting attention maps are specific to each subject, reflecting individual attention maps (since these are to be used in applications capturing user intentions). Thus we measure the Szymkiewicz-Simpson coefficient (SSC), the overlap coefficient between our individual attention maps and the ground truth fixation maps, defined as

  OV(A, F) = |A ∩ F| / min(|A|, |F|),   (7)

where |A| denotes the number of non-zero components of A, A ∈ R^{1200×1600} is the attention map estimated by our algorithm, and F ∈ R^{1200×1600} is the ground truth fixation region map.
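The overlap coefficient of Eq. (7) can be computed directly on binarized maps. A short sketch (the threshold here is a fraction of the map maximum for illustration; the evaluation below follows the percentage-thresholding scheme of [34]):

    import numpy as np

    def ssc(attention_map, fixation_map, thresh=0.5):
        # Szymkiewicz-Simpson coefficient, Eq. (7), between a thresholded
        # attention map A and the binary ground-truth fixation map F.
        A = attention_map >= thresh * attention_map.max()
        F = fixation_map > 0
        inter = np.logical_and(A, F).sum()
        return inter / max(min(A.sum(), F.sum()), 1)   # guard against empty maps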
Since our method deals with attention maps of individual subjects, the quality assessment differs from that of Judd et al. [27], which assesses general quality. We measure the SSC to observe how much each individual attention map actually contains true fixation components, by measuring its overlap with the fixation map containing virtually all fixation regions. The coefficient thus reflects how reliable our individual attention maps, obtained from a cheap setting, are compared to the ground truth fixation data collected with an accurate eye tracking device. If the coefficient is close to 1, the attention map reflects true positives and its response has high overlap with the ground truth fixation map. If the coefficient is close to 0, the estimated attention map has many false positives and low overlap with the ground truth map.

To evaluate with the SSC measure, we threshold our attention region estimation map at varying thresholds. In Fig. 5, we show example comparisons with BT [4] saliency. Our attention maps have greater overlap with the ground truth maps, while BT saliency reflects false positives. In addition, we also compare our results with a center bias map. Fig. 6 shows the comparison between our results, BT saliency, the center bias map, and our initial gaze estimation results at varying threshold percentages. Our results show higher values for all thresholds. The same thresholding scheme as in [34] was applied. We also show results using the AUC metric in Fig. 7.

Fig. 5: Attention region comparison. This figure shows (a) natural image stimuli, (b) thresholded ground truth fixation region maps, (c) our attention region estimations and (d) the BT [4] computational saliency map.

Fig. 6: Evaluation on attention maps by the Szymkiewicz-Simpson coefficient (SSC). We compare the coefficient values of a center bias map, the BT [4] computational saliency map, our initial gaze map, and our attention estimation map. The horizontal axis denotes the percentage threshold, the vertical axis the score.

Fig. 7: Evaluation on attention maps by AUC [27]. We compare the AUC values of center bias, BT [4], our initial gaze, and our attention estimation map.
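For reference, the AUC score reported in Fig. 7 (and later in Fig. 9 and Table I) treats the predicted map as a classifier of fixated versus non-fixated pixels. A minimal variant using scikit-learn is sketched below; the exact protocol of Judd et al. [27] additionally samples non-fixated locations and is not reproduced here:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_score(pred_map, fixation_map):
        # ROC AUC of a predicted attention/saliency map against binary fixations.
        # Assumes the fixation map contains both fixated and non-fixated pixels.
        y_true = (fixation_map > 0).ravel().astype(int)
        y_score = pred_map.ravel().astype(float)
        return roc_auc_score(y_true, y_score)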
The integration stage uses the initial gaze estimates as mode seeking starting points on the BT [4] saliency map. To evaluate the robustness and reliability of the mode seeking method, we apply Gaussian noise to the estimated gaze positions and measure SSCs on the resulting attention maps, as shown in Fig. 8.

Fig. 8: Effect of noise on mode seeking. We show SSC values of attention maps produced by mode seeking with three different window sizes (h1 = 0.02, h2 = 0.04, h3 = 0.07) at varying noise levels (standard deviation in normalized coordinates).

V. APPLICATIONS

Many applications in personal photo collections have used detection algorithms, attribute learning approaches and various image processing algorithms to construct tag suggestion systems, image search, photo effects, etc. By providing personal attention maps for personal photos, these applications can better represent individual preferences. In this section, we demonstrate three computer vision applications: personal saliency generation, tag region suggestion, and salient-region-based photo retrieval. The last two are built on top of the personal saliency generation application. All experiment settings and the dataset used here are consistent with the setting described in Sec. IV. The results not presented here are included in the appendix.

Personal saliency. In our framework, the user attention map generated by our method can be used as another saliency map, one that is specific to the user. We first evaluate the performance of saliency estimation according to baseline saliency methods. (In this experiment, we conduct tests by replacing the saliency estimation module used in both our implicit calibration and mode seeking steps with each state-of-the-art method.) We report AUC values [27], which are widely used to assess saliency estimation, in order to demonstrate the performance of our framework with the 13 state-of-the-art methods shown in Fig. 9. To understand individual viewing behaviors, we observe them indirectly through the independent per-subject results shown in Table II. The qualitative results corresponding to Table II are presented in the appendix.

Fig. 9: AUC values are shown for baseline methods (light blue) and our method (dark blue). The average AUC was taken on 40 evaluation photos. Our initial gaze result is shown in black.

TABLE I: Saliency performance variation according to subjects. The AUC values and their standard deviation (std.) are reported for each subject and saliency method.

Method      Before refinement   s1       s2       s3       s4       std.
GB [21]     0.8715              0.9077   0.8795   0.8943   0.8946   0.0115
IT [26]     0.8077              0.8797   0.8775   0.8858   0.8733   0.0052
CA [18]     0.7833              0.9132   0.8870   0.9056   0.8927   0.0119
BT [4]      0.8782              0.9192   0.9109   0.9102   0.9077   0.0050
AIM [9]     0.7393              0.9016   0.8671   0.8945   0.8856   0.0149
SER [41]    0.7352              0.8801   0.8727   0.8794   0.8852   0.0051
SR [23]     0.7442              0.8748   0.8855   0.8863   0.8676   0.0090
AWS [17]    0.7955              0.8833   0.8817   0.8813   0.8705   0.0059
HFT [32]    0.7976              0.8740   0.8803   0.8800   0.8680   0.0058
Judd [27]   0.8010              0.9263   0.8804   0.9001   0.9052   0.0189
LG [5]      0.7790              0.9156   0.8885   0.9109   0.9034   0.0119
QDCT [40]   0.7840              0.8593   0.8556   0.8521   0.8437   0.0067
BMS [50]    0.8258              0.8996   0.8657   0.8866   0.8811   0.0140

Region suggestion for tagging. We demonstrate a tag region suggestion application that utilizes photo contents as well as personal saliency maps to suggest preferred candidate regions. We segment the photo via an over-segmentation technique [13], assign segments that overlap with the binarized saliency map as blobs, and present their bounding boxes (BBs) as region suggestions.

Fig. 10: Region suggestion for tagging. With personal saliencies of a photo, our algorithm can provide tag candidate region suggestions.

The results of tag region suggestion are shown in Fig. 10, in which the suggestions are made based on personal saliency information. We assess our results by the PASCAL measure [15], defined as o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt), where BB_dt and BB_gt denote the estimated and ground truth BBs respectively.
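The PASCAL overlap above is the standard intersection-over-union of two boxes; a short sketch with boxes given as (x1, y1, x2, y2) corners:

    def pascal_overlap(bb_dt, bb_gt):
        # o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) for boxes (x1, y1, x2, y2).
        ix1, iy1 = max(bb_dt[0], bb_gt[0]), max(bb_dt[1], bb_gt[1])
        ix2, iy2 = min(bb_dt[2], bb_gt[2]), min(bb_dt[3], bb_gt[3])
        inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(bb_dt) + area(bb_gt) - inter
        return inter / union if union > 0 else 0.0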
By comparing with every ground truth BB, we assign the largest PASCAL measure to each estimated BB. Among multiple candidate regions, taking the median PASCAL measure for each image yields an average measure of 0.5314. This signifies that, on average, roughly more than half of all candidate regions exceed 50% overlap by the PASCAL measure. The specific experimental configurations and procedures of this application can be found in the appendix.

While existing tag candidate region suggestions (e.g., Facebook) are made mostly via face detection, our implementation enables tag candidate region suggestions on high attention regions. Thus, tags can be made on the user's regions of interest, including regions which would normally not be suggested as tags by traditional methods.

Fig. 11: Photo retrieval by saliency. We query regions via (a) mouse drags or (b) human attention maps in order to retrieve photos (c) (displayed by overlaying saliency maps on photos) from vast personal photo collections. When finding a specific photo in a vast collection is difficult, the user can query his or her intended attention region to obtain the desired photo.

Photo retrieval by saliency. Another application we demonstrate is photo retrieval by personal interest location query. When a user possesses a vast photo collection and wants to find a specific photo, it may be difficult to search among the vast number of photos or to retrieve it by text or color. Since the user is likely to remember where he or she looked in the photo, the user can instead query by personal attention regions. The implementation displays the photos with the top similarity values [27] compared with the input query. The similarity metric is defined as Sim(s, t) = ∫ min(s(x), t(x)) dx for normalized saliency maps s and t (instead of taking the integral, the actual computation is a summation over the discrete pixel grid). Fig. 11 shows sampled results of photo retrieval by personal attention queries. More qualitative results with top-ranked images are shown in the appendix. The user can simply mark interest regions with a mouse interface to search photos; it would also be possible to query by gazing at the interest location as an alternative way to query.
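In discrete form, the similarity metric is a histogram-intersection-style score between two saliency maps. A minimal retrieval sketch (here each map is normalized to sum to one, a common convention for this metric; names are illustrative):

    import numpy as np

    def saliency_similarity(s, t):
        # Sim(s, t) = sum_x min(s(x), t(x)) for normalized saliency maps s, t.
        s = s / (s.sum() + 1e-12)
        t = t / (t.sum() + 1e-12)
        return float(np.minimum(s, t).sum())

    def retrieve(query_map, collection_maps, top_k=5):
        # Rank photos in a collection by the similarity of their personal
        # saliency maps to the queried attention region; return top_k indices.
        scores = [saliency_similarity(query_map, m) for m in collection_maps]
        return np.argsort(scores)[::-1][:top_k]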
VI. CONCLUSION

In this paper, we presented human attention estimation on the fly via implicit gaze estimation and its integration with computational saliency through mode seeking. Without any special hardware, our method can estimate the user's attention regions on natural scene images. On top of this, users do not need to follow explicit interaction instructions, owing to the entirely implicit process. To demonstrate usability and practicality, we introduced three user-specific applications for personal photo collections. For future work, we wish to develop further discussion and applications of human attention.

APPENDIX

Here, we provide the technical details for efficient incremental GMM clustering and additional application results. In addition, we provide a larger table containing experimental results on attention map quality using the similarity (Sim.) metric. We also provide some failure cases seen in our applications and analyze them.

A. Eye Feature Vector Clustering

The eye feature vector clustering procedure is summarized in Alg. 3. The Gaussian mixture model (GMM) is parameterized by {D̂_c, π̂_c, µ̂_c, Σ̂_c}_{c=1}^K, where K is the number of unimodal Gaussian distributions (or clusters), {µ̂_c, Σ̂_c} denote the means and covariances of the GMM, π̂_c denotes the GMM weights, and D̂_c = Σ_{i=1}^{N̂} P(z_i = c | ê_i). While {D̂_c} is traditionally not regarded as a model-specific parameter, we explicitly store {D̂_c} to reduce the computational complexity at the incremental update stage (used in Lines 13 to 15 of Alg. 3). We measure the changes of {µ_c} between subsequent inner iterations to check convergence. If the changes are below a threshold ε = 0.01, we stop the clustering. We also fix the maximum number of iterations to 100.

During the training stage, the overall learning step including the batch clustering (Lines 1 to 3 in Alg. 3) is conducted in the background. After the initial model is constructed, the model is updated at time intervals, starting from the incremental clustering (Lines 5 to 16 in Alg. 3), followed by saliency aggregation and learning the GPR.

Algorithm 3 Eye feature vector clustering by GMM (incremental GMM)
1: // Initial stage.
2: Gather N̂ eye feature vectors {ê_i}.
3: Batch GMM [2] to produce {D̂_c, π̂_c, µ̂_c, Σ̂_c}_{c=1}^K.
4:
5: // Incremental update.
6: Gather N newly observed eye feature vectors {e_i}.
7: Initialize {D_c, π_c, µ_c, Σ_c}_{c=1}^K = {D̂_c, π̂_c, µ̂_c, Σ̂_c}_{c=1}^K.
8: repeat
9:   E-step
10:    p_{c,i} = P(z_i = c | e_i) = π_c N(e_i | µ_c, Σ_c) / Σ_{c'=1}^K π_{c'} N(e_i | µ_{c'}, Σ_{c'}).
11:  M-step
12:    D_c = Σ_{i=1}^N p_{c,i}.
13:    π_c = (D̂_c + D_c) / (N̂ + N).
14:    µ_c = (D̂_c µ̂_c + Σ_{i=1}^N p_{c,i} e_i) / (D̂_c + D_c).
15:    Σ_c = (D̂_c (Σ̂_c + (µ̂_c − µ_c)(µ̂_c − µ_c)^T) + Σ_{i=1}^N p_{c,i} (e_i − µ_c)(e_i − µ_c)^T) / (D̂_c + D_c).
16: until converged.
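A compact numpy sketch of one E/M pass of the incremental update in Alg. 3 (a simplification: diagonal covariances are assumed, a single pass is shown rather than the repeat-until-converged loop, and the returned accumulated counts are one possible bookkeeping choice not spelled out in Alg. 3):

    import numpy as np

    def incremental_gmm_update(E, D_hat, pi_hat, mu_hat, var_hat, N_hat):
        # One E/M pass of the incremental update (diagonal covariances).
        # E: (N, d) new eye features; D_hat, pi_hat: (K,); mu_hat, var_hat: (K, d).
        N, d = E.shape
        # E-step: responsibilities p_{c,i} under the current model (Line 10).
        diff = E[None, :, :] - mu_hat[:, None, :]                       # (K, N, d)
        log_pdf = -0.5 * ((diff ** 2 / var_hat[:, None, :]).sum(-1)
                          + np.log(var_hat).sum(-1)[:, None] + d * np.log(2 * np.pi))
        log_w = np.log(pi_hat)[:, None] + log_pdf
        resp = np.exp(log_w - log_w.max(0))
        resp /= resp.sum(0)                                             # (K, N)
        # M-step: blend new statistics with the stored ones (Lines 12-15).
        D = resp.sum(1)                                                 # (K,)
        pi = (D_hat + D) / (N_hat + N)
        mu = (D_hat[:, None] * mu_hat + resp @ E) / (D_hat + D)[:, None]
        var = (D_hat[:, None] * (var_hat + (mu_hat - mu) ** 2)
               + np.einsum('kn,knd->kd', resp, (E[None] - mu[:, None]) ** 2)) / (D_hat + D)[:, None]
        return D_hat + D, pi, mu, var, N_hat + N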
B. Additional Application Results

Additional results on the applications are shown in the following materials. We show more examples of the personal saliency application and compare them with state-of-the-art saliency models. The region suggestion for tagging application was conducted on images containing somewhat distinguishable objects. The ground truth objectness was recorded by a human subject, and the evaluation was done against this ground truth. Analyses of failure cases are reviewed as well. Lastly, we also show additional results on the retrieval-by-saliency-query application.

REFERENCES
[1] F. Alnajar, T. Gevers, R. Valenti, and S. Ghebreab. Calibration-free gaze estimation using human gaze patterns. In IEEE ICCV, 2013.
[2] C. M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
[3] J. Choi, T. Oh, and I. Kweon. GMM-based saliency aggregation for calibration-free gaze estimation. In ICIP, 2014.
[4] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE CVPR, 2012.
[5] A. Borji and L. Itti. Exploiting local and global patch rarities for saliency detection. In IEEE CVPR, 2012.
[6] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE TPAMI, 2013.
[7] A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE TIP, 2013.
[8] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency prediction. In IEEE ICCV, 2013.

C. Personal Saliency

Fig. 12: Personal saliencies. From left to right: input images, ground truth saliency maps, BT [4], and refined saliencies from Subjects #1-#4 using the baseline BT model. Saliency maps for individual people are shown along with the saliency maps from the BT model, which performed the best among the baseline methods.

TABLE II: AUC and similarity metric values. The AUC and similarity (Sim.) values are shown for each subject. Comparing the baseline model and our corresponding mean performance, we display the higher performance in bold. The similarity values show that people have diverse viewing patterns. On the contrary, the AUC values show that people share a certain "commonness."

         |                       AUC                              |                       Sim.
Method   | Baseline  s1      s2      s3      s4      mean    std  | Baseline  s1      s2      s3      s4      mean    std
GB       | 0.8715  0.9077  0.8795  0.8943  0.8946  0.8940  0.0115 | 0.3649  0.3770  0.2525  0.3503  0.3968  0.3442  0.0640
IT       | 0.8077  0.8797  0.8775  0.8858  0.8733  0.8791  0.0052 | 0.3284  0.3301  0.2543  0.3120  0.3544  0.3127  0.0426
CA       | 0.7833  0.9132  0.8870  0.9056  0.8927  0.8996  0.0119 | 0.3182  0.3979  0.2832  0.3662  0.4194  0.3667  0.0598
BT       | 0.8782  0.9192  0.9109  0.9102  0.9077  0.9120  0.0050 | 0.3286  0.3949  0.2989  0.3654  0.4253  0.3711  0.0540
AIM      | 0.7393  0.9016  0.8671  0.8945  0.8856  0.8872  0.0149 | 0.2724  0.3577  0.2627  0.3511  0.3857  0.3393  0.0532
SER      | 0.7352  0.8801  0.8727  0.8794  0.8852  0.8794  0.0051 | 0.3028  0.3269  0.2355  0.3259  0.3604  0.3122  0.0536
SR       | 0.7442  0.8748  0.8855  0.8863  0.8676  0.8786  0.0090 | 0.3007  0.3175  0.2449  0.3209  0.3611  0.3111  0.0484
AWS      | 0.7955  0.8833  0.8817  0.8813  0.8705  0.8792  0.0059 | 0.3178  0.3418  0.2610  0.3252  0.3707  0.3263  0.0486
HFT      | 0.7976  0.8740  0.8803  0.8800  0.8680  0.8756  0.0058 | 0.3225  0.3070  0.2484  0.3075  0.3422  0.3013  0.0389
Judd     | 0.8010  0.9263  0.8804  0.9001  0.9052  0.9030  0.0189 | 0.2902  0.4171  0.2841  0.3775  0.4514  0.3825  0.0722
LG       | 0.7790  0.9156  0.8885  0.9109  0.9034  0.9046  0.0119 | 0.2859  0.3938  0.2822  0.3887  0.4216  0.3716  0.0613
QDCT     | 0.7840  0.8593  0.8556  0.8521  0.8437  0.8527  0.0067 | 0.3116  0.3178  0.2399  0.3149  0.3313  0.3010  0.0413
BMS      | 0.8258  0.8996  0.8657  0.8866  0.8811  0.8833  0.0140 | 0.3300  0.3742  0.2612  0.3400  0.3831  0.3396  0.0555

D. Region Suggestion for Tagging

Fig. 13: Tag region suggestions. Additional tag suggestion results are shown.