A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach

Chenglong Li, Guizhao Wang, Yunpeng Ma, Aihua Zheng, Bin Luo, and Jin Tang

arXiv:1701.02829v1 [cs.CV] 11 Jan 2017

Abstract—Despite significant progress, image saliency detection still remains a challenging task in complex scenes and environments. Integrating multiple different but complementary cues, like RGB and thermal (RGB-T), may be an effective way to boost saliency detection performance. Current research in this direction, however, is limited by the lack of a comprehensive benchmark. This work contributes such an RGB-T image dataset, which includes 821 spatially aligned RGB-T image pairs and their ground truth annotations for saliency detection purposes. The image pairs have high diversity, recorded under different scenes and environmental conditions, and we annotate 11 challenges on these image pairs to enable challenge-sensitive analysis of different saliency detection algorithms. We also implement 3 kinds of baseline methods with different modality inputs to provide a comprehensive comparison platform. With this benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. In particular, we introduce a weight for each modality to describe its reliability, and integrate the weights into the graph-based manifold ranking algorithm to achieve adaptive fusion of different source data. Moreover, we incorporate cross-modality consistent constraints to integrate the different modalities collaboratively. For the optimization, we design an efficient algorithm that iteratively solves several subproblems with closed-form solutions. Extensive experiments against other baseline methods on the newly created benchmark demonstrate the effectiveness of the proposed approach, and we also provide basic insights and potential future research directions for RGB-T saliency detection.

Index Terms—RGB-T benchmark, Saliency detection, Cross-modality consistency, Manifold ranking, Fast optimization.

I. INTRODUCTION

IMAGE saliency detection is a fundamental and active problem in computer vision. It aims at automatically highlighting salient foreground objects against the background, and has received increasing attention due to its wide range of applications in computer vision and graphics, such as object recognition, content-aware retargeting, video compression, and image classification. Despite significant progress, image saliency detection still remains a challenging task in complex scenes and environments.

Recently, integrating RGB data and thermal data (RGB-T data) has proved effective in several computer vision problems, such as moving object detection [1, 2] and tracking [3]. Despite the potential of RGB-T data, however, research on RGB-T saliency detection is limited by the lack of a comprehensive image benchmark.

In this paper, we contribute a comprehensive image benchmark for RGB-T saliency detection; the following two main aspects are considered in creating this benchmark.

• A good dataset should have a reasonable size, high diversity and low bias [4]. Therefore, we use our recording system to collect 821 RGB-T image pairs in different scenes and environmental conditions, and each image pair is aligned and annotated with ground truth. In addition, the category, size, number and spatial information of the salient objects are taken into account to enhance diversity and challenge, and we present statistics of the created dataset to analyze its diversity and bias. To analyze the challenge-sensitive performance of different algorithms, we annotate 11 different challenges according to the above-mentioned factors.

• To the best of our knowledge, RGB-T saliency detection remains not well investigated. Therefore, we implement some baseline methods to provide a comparison platform. On one hand, we feed RGB or thermal images as inputs to some popular methods to achieve single-modality saliency detection. These baselines can be used to identify the importance and complementarity of RGB and thermal information when compared with RGB-T saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities into RGB-T feature representations, and employ some popular methods to achieve RGB-T saliency detection.

Salient object detection has been extensively studied in past decades, and numerous models and algorithms have been proposed based on different mathematical principles or priors [5–15].
Most methods measure saliency via local center-surround contrast and the rarity of features over the entire image [5, 12, 16, 17]. In contrast, Gopalakrishnan et al. [18] formulated salient object detection as a binary segmentation or labelling task on a graph: the most salient seeds and several background seeds were identified by the behavior of random walks on a complete graph and a k-regular graph, and a semi-supervised learning technique was then used to infer the binary labels of the unlabelled nodes. Differently, Yang et al. [19] applied the manifold ranking technique to salient object detection, which requires only seeds from one class, initialized with either boundary priors or foreground cues. They later extended this work with several improvements [20], including multi-scale graph construction and a cascade scheme on a multi-layer representation. Building on manifold ranking algorithms, Li et al. [21] generated pixel-wise saliency maps via regularized random walks ranking, and Wang et al. [22] proposed a new graph model that captures local/global contrast and effectively utilizes the boundary prior.

The authors are with Anhui University, Hefei 230601, China. Email: [email protected].

With the created benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. For each modality, we employ the idea of graph-based manifold ranking [19] for its good saliency detection performance in terms of both accuracy and speed. Then, we assign each modality a weight that describes its reliability; this makes the model capable of dealing with occasional perturbation or malfunction of individual sources, and achieves adaptive fusion of the multiple modalities. To better exploit the relations among modalities, we impose cross-modality consistent constraints on the ranking functions of the different modalities to integrate them collaboratively. Considering the manifold ranking in each modality as an individual task, our method is essentially formulated as a multi-task learning problem. For the optimization, we jointly optimize the modality weights and the ranking functions of multiple modalities by iteratively solving two subproblems with closed-form solutions.

This paper makes the following three major contributions to RGB-T image saliency detection and related applications.

• It creates a comprehensive benchmark for evaluating different RGB-T saliency detection algorithms. The benchmark dataset includes 821 aligned RGB-T image pairs with annotated ground truths, and we also present fine-grained annotations with 11 challenges to allow analysis of the challenge-sensitive performance of different algorithms. Moreover, we implement 3 kinds of baseline methods with different inputs (RGB, thermal and RGB-T) for evaluation. This benchmark will be available online for free academic usage (http://chenglongli.cn/people/lcl/journals.html).

• It proposes a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. In particular, we introduce a weight for each modality to represent its reliability, and incorporate cross-modality consistent constraints to achieve adaptive and collaborative fusion of different source data. The modality weights and ranking functions are jointly optimized by iteratively solving several subproblems with closed-form solutions.

• It presents extensive experiments against other state-of-the-art image saliency methods with 3 kinds of inputs. The evaluation results demonstrate the effectiveness of the proposed approach. Through analyzing the quantitative results, we further provide basic insights and identify the potentials of thermal information in RGB-T saliency detection.

The rest of this paper is organized as follows. Sect. II introduces details of the RGB-T saliency detection benchmark. The proposed model and the associated optimization algorithm are presented in Sect. III, and the RGB-T saliency detection approach is introduced in Sect. IV. The experimental results and analysis are shown in Sect. V. Sect. VI concludes the paper.

II. RGB-T IMAGE SALIENCY BENCHMARK

In this section, we introduce our newly created RGB-T saliency benchmark, which includes the dataset with statistical analysis, baseline methods with different inputs, and evaluation metrics.

A. Dataset

We collect 821 RGB-T image pairs with our recording system, which consists of an online thermal imager (FLIR A310) and a CCD camera (SONY TD-2073). For alignment, we uniformly select a number of point correspondences in each image pair, and compute the homography matrix by the least-squares algorithm. It is worth noting that this registration method can accurately align the image pairs for the following two reasons. First, we carefully choose planar and non-planar scenes to make the homography assumption effective. Second, since the two camera views are almost coincident in our setup, the transformation between the two views is simple. Once each image pair is aligned, we annotate the pixel-level ground truth using the more reliable modality. Fig. 1 shows some sample image pairs and their ground truths.

The image pairs in our dataset are recorded in approximately 60 scenes with different environmental conditions, and the category, size, number and spatial information of the salient objects are also taken into account to enhance diversity and challenge. Specifically, the following main aspects are considered in creating the RGB-T image dataset.

• Illumination condition. The image pairs are captured under different light conditions, such as sunny, snowy, and nighttime. The low illumination and illumination variation caused by different light conditions usually bring big challenges in RGB images.

• Background factor. Two background factors are taken into account for our dataset. First, background similar to the salient objects in appearance or temperature introduces ambiguous information. Second, it is difficult to separate objects accurately from cluttered background.

• Salient object attribute. We take different attributes of salient objects, including category (more than 60 categories), size (see the size distribution in Fig. 2 (b)) and number, into account in constructing our dataset for high diversity.

• Object location. Most methods employ the spatial information (center and boundary of an image) of the salient objects as priors, which has been verified to be effective. However, some salient objects are not at the center or cross the image boundaries, and these situations invalidate the spatial priors. We incorporate these factors into our dataset construction to increase its challenge, and Fig. 2 presents the spatial distribution of salient objects on CB and CIB.

Considering the above-mentioned factors, we annotate 11 challenges for our dataset to facilitate the challenge-sensitive evaluation of different algorithms. They are: big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias
(CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF). Tab. I shows the details, and Fig. 2 (a) presents the challenge distribution. We will analyze the performance of different algorithms on the specific challenges using these fine-grained annotations in the experimental section.

Fig. 1. Sample image pairs with annotated ground truths and challenges from our RGB-T dataset.

TABLE I
LIST OF THE ANNOTATED CHALLENGES OF OUR RGB-T DATASET.

Challenge | Description
BSO | Big Salient Object - the ratio of ground truth salient objects over the image is more than 0.26.
SSO | Small Salient Object - the ratio of ground truth salient objects over the image is less than 0.05.
LI  | Low Illumination - the environmental illumination is low.
BW  | Bad Weather - the image pairs are recorded in bad weather, such as snowy, rainy, hazy and cloudy.
MSO | Multiple Salient Objects - the number of salient objects in the image is more than 1.
CB  | Center Bias - the centers of the salient objects are far away from the image center.
CIB | Cross Image Boundary - the salient objects cross the image boundaries.
SA  | Similar Appearance - the salient objects have similar color or shape to the background.
TC  | Thermal Crossover - the salient objects have similar temperature to the background.
IC  | Image Clutter - the image is cluttered.
OF  | Out of Focus - the image is out of focus.

TABLE II
LIST OF THE BASELINE METHODS WITH THE USED FEATURES, THE MAIN TECHNIQUES AND THE PUBLISHED INFORMATION.

Algorithm   | Feature                                   | Technique                                      | Venue             | Year
MST [15]    | Lab & Intensity                           | Minimum spanning tree                          | IEEE CVPR         | 2016
RRWR [21]   | Lab                                       | Regularized random walks ranking               | IEEE CVPR         | 2015
CA [23]     | Lab                                       | Cellular Automata                              | IEEE CVPR         | 2015
GMR [19]    | Lab                                       | Graph-based manifold ranking                   | IEEE CVPR         | 2013
STM [8]     | LUV & Spatial information                 | Scale-based tree model                         | IEEE CVPR         | 2013
GR [24]     | Lab                                       | Graph regularization                           | IEEE SPL          | 2013
NFI [25]    | Lab & Orientations & Spatial information  | Nonlinear feature integration                  | Journal of Vision | 2013
MCI [26]    | Lab & Spatial information                 | Multiscale context integration                 | IEEE TPAMI        | 2012
SS-KDE [27] | Lab                                       | Sparse sampling and kernel density estimation  | SCIA              | 2011
BR [28]     | Lab & Intensity & Motion                  | Bayesian reasoning                             | ECCV              | 2010
SR [29]     | Lab                                       | Self-resemblance                               | Journal of Vision | 2009
SRM [16]    | Spectrum                                  | Spectral residual model                        | IEEE CVPR         | 2007

B. Baseline Methods

To provide a comparison platform, we implement 3 kinds of baseline methods with different modality inputs. On one hand, we feed RGB or thermal images as inputs to 12 popular methods to achieve single-modality saliency detection, including MST [15], RRWR [21], CA [23], GMR [19], STM [8], GR [24], NFI [25], MCI [26], SS-KDE [27], BR [28], SR [29] and SRM [16]. Tab. II presents the details. These baselines can be used to identify the importance and complementarity of RGB and thermal information when compared with RGB-T saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities into RGB-T feature representations, and employ the above-mentioned methods to achieve RGB-T saliency detection.

C. Evaluation Metrics

Several metrics exist to evaluate the agreement between subjective annotations and experimental predictions. In this work, we use (P)recision-(R)ecall curves (PR curves), the F_{0.3} metric and the Mean Absolute Error (MAE) to evaluate all the algorithms. Given the saliency map binarized at a threshold value from 0 to 255, precision is the ratio of correctly assigned salient pixels to all detected salient pixels, and recall is the ratio of correctly detected salient pixels to the number of ground truth salient pixels. Different from PR curves, which sweep a fixed threshold over every image, the F_{0.3} metric exploits an adaptive threshold for each image. The adaptive threshold is defined as:

T = \frac{2}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} S(i,j),   (1)

where w and h denote the width and height of the image, respectively, and S is the computed saliency map. The F-measure is defined as follows with the precision (P) and recall (R) obtained at the above adaptive threshold:

F_{\beta^2} = \frac{(1+\beta^2) \times P \times R}{\beta^2 \times P + R},   (2)

where we set \beta^2 = 0.3 to emphasize the precision, as suggested in [17]. PR curves and the F_{0.3} metric are aimed at quantitative comparison, while MAE better reflects visual comparison, estimating the dissimilarity between a saliency map S and the ground truth G:

MAE = \frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} |S(i,j) - G(i,j)|.   (3)
III. GRAPH-BASED MULTI-TASK MANIFOLD RANKING

The graph-based ranking problem is described as follows: given a graph and a node of this graph as a query, the remaining nodes are ranked based on their affinities to the given query. The goal is to learn a ranking function that defines the relevance between the unlabelled nodes and the queries.

This section introduces the graph-based multi-task manifold ranking model and the associated optimization algorithm. The optimized modality weights and ranking scores will be utilized for RGB-T saliency detection in the next section.

Fig. 2. Dataset statistics: (a) challenge distribution, (b) size distribution, (c) spatial distribution of CB, (d) spatial distribution of CIB.

A. Graph Construction

Given a pair of RGB-T images, we regard the thermal image as an additional image channel, and then employ the SLIC algorithm [30] to generate n non-overlapping superpixels. We take these superpixels as nodes to construct a graph G = (V, E), where V is the node set and E is a set of undirected edges. In this work, any two nodes in V are connected if one of the following conditions holds: 1) they are neighboring; 2) they share common boundaries with the same neighboring node; 3) they both lie on the four sides of the image, i.e., they are boundary nodes. Fig. 3 shows the details. The first and second conditions are employed to capture local smoothness cues, as neighboring superpixels tend to share similar appearance and saliency values. The third condition attempts to reduce the geodesic distance between similar superpixels. It is worth noting that one could explore more cues in the RGB and thermal data to construct an adaptive graph that makes best use of the intrinsic relationships among superpixels; we will study this issue in future work, as this paper places its emphasis on the multi-task manifold ranking algorithm.

Fig. 3. Illustration of graph construction.

If nodes V_i and V_j are connected, we assign the edge a weight:

W^k_{ij} = e^{-\gamma_k \| c^k_i - c^k_j \|},  k = 1, 2, ..., K,   (4)

where c^k_i denotes the mean feature of the i-th superpixel in the k-th modality, and \gamma_k is a scaling parameter.
B. Multi-Task Manifold Ranking with Cross-Modality Consistency

We first review the graph-based manifold ranking algorithm, which exploits the intrinsic manifold structure of data for graph labeling [31]. Given a superpixel feature set X = {x_1, ..., x_n} ∈ R^{d×n}, some superpixels are labeled as queries and the rest are ranked according to their affinities to the queries, where n denotes the number of superpixels. Let s : X → R^n denote a ranking function that assigns a ranking value s_i to each superpixel x_i, so that s can be viewed as a vector s = [s_1, ..., s_n]^T. In this work, we regard the query labels as initial superpixel saliency values, and s is thus an initial superpixel saliency vector. Let y = [y_1, ..., y_n]^T denote an indication vector, in which y_i = 1 if x_i is a query and y_i = 0 otherwise. Given G, the optimal ranking of queries is computed by solving the following optimization problem:

\min_s \frac{1}{2} \Big( \sum_{i,j=1}^{n} W_{ij} \Big\| \frac{s_i}{\sqrt{D_{ii}}} - \frac{s_j}{\sqrt{D_{jj}}} \Big\|^2 + \mu \|s - y\|^2 \Big),   (5)

where D = diag{D_{11}, ..., D_{nn}} is the degree matrix with D_{ii} = \sum_j W_{ij}, and \mu is a parameter balancing the smoothness term and the fitting term.

Then, we apply manifold ranking to multiple modalities:

\min_{s^k} \frac{1}{2} \Big( \sum_{i,j=1}^{n} W^k_{ij} \Big\| \frac{s^k_i}{\sqrt{D^k_{ii}}} - \frac{s^k_j}{\sqrt{D^k_{jj}}} \Big\|^2 + \mu \|s^k - y\|^2 \Big),  k = 1, 2, ..., K.   (6)

Eq. (6) inherently assumes that the available modalities are independent and contribute equally. This may significantly limit performance under occasional perturbation or malfunction of individual sources. Therefore, we propose a novel collaborative model for robust salient object detection that i) adaptively integrates the different modalities based on their respective reliabilities, and ii) collaboratively computes the ranking functions of multiple modalities by incorporating cross-modality consistent constraints. The multi-task manifold ranking model is formulated as follows:

\min_{s^k, r^k} \frac{1}{2} \sum_{k=1}^{K} \Big( (r^k)^2 \sum_{i,j=1}^{n} W^k_{ij} \Big\| \frac{s^k_i}{\sqrt{D^k_{ii}}} - \frac{s^k_j}{\sqrt{D^k_{jj}}} \Big\|^2 + \mu \|s^k - y\|^2 \Big) + \| \Gamma \circ (\mathbf{1} - r) \|^2 + \lambda \sum_{k=2}^{K} \| s^k - s^{k-1} \|^2,   (7)

where \Gamma = [\Gamma^1, ..., \Gamma^K]^T is an adaptive parameter vector initialized after the first iteration (see Alg. 1), and r = [r^1, ..., r^K]^T is the modality weight vector. \circ denotes the element-wise product, and \lambda is a balance parameter. The third term avoids overfitting of r, and the last term imposes the cross-modality consistent constraints; the effectiveness of introducing these two terms is illustrated in Fig. 4. With some simple algebra, Eq. (7) can be rewritten as:

\min_{s^k, r^k} \frac{1}{2} \sum_{k=1}^{K} \Big( (r^k)^2 \sum_{i,j=1}^{n} W^k_{ij} \Big\| \frac{s^k_i}{\sqrt{D^k_{ii}}} - \frac{s^k_j}{\sqrt{D^k_{jj}}} \Big\|^2 + \mu \|s^k - y\|^2 \Big) + \| \Gamma \circ (\mathbf{1} - r) \|^2 + \lambda \|CS\|_F^2,   (8)

where S = [s^1; s^2; ...; s^K] ∈ R^{nK×1}, and C ∈ R^{n(K-1)×nK} denotes the cross-modality consistent constraint matrix, defined as:

C = [ I^{2,1}  -I^2     0        ...  0
      0        I^{3,2}  -I^3     ...  0
      ...
      0        ...      I^{K,K-1}     -I^K ],   (9)

where I^k and I^{k,k-1} are identity matrices of size n × n.

Fig. 4. Illustration of the effectiveness of introducing the modality weights and the cross-modality consistency. (a) Input RGB and thermal images. (b) Results of our method without modality weights and without cross-modality consistency, shown in the first and second rows, respectively. (c) Our results and the corresponding ground truth.

C. Optimization Algorithm

We present an alternating algorithm to optimize Eq. (8) efficiently, and denote its objective by

J(S, r) = \frac{1}{2} \sum_{k=1}^{K} \Big( (r^k)^2 \sum_{i,j=1}^{n} W^k_{ij} \Big\| \frac{s^k_i}{\sqrt{D^k_{ii}}} - \frac{s^k_j}{\sqrt{D^k_{jj}}} \Big\|^2 + \mu \|s^k - y\|^2 \Big) + \| \Gamma \circ (\mathbf{1} - r) \|^2 + \lambda \|CS\|_F^2.   (10)

Given r, Eq. (10) reduces to:

J(S) = \frac{1}{2} \sum_{k=1}^{K} \Big( (r^k)^2 \sum_{i,j=1}^{n} W^k_{ij} \Big\| \frac{s^k_i}{\sqrt{D^k_{ii}}} - \frac{s^k_j}{\sqrt{D^k_{jj}}} \Big\|^2 + \mu \|s^k - y\|^2 \Big) + \lambda \|CS\|_F^2,   (11)

which we reformulate as:

J(S) = (R \circ S)^T A (R \circ S) + \mu \|S - Y\|^2 + \lambda \|CS\|^2,   (12)

where Y = [y^1; y^2; ...; y^K] ∈ R^{nK×1}, A = diag{A^1, A^2, ..., A^K} ∈ R^{nK×nK} is a block-diagonal matrix with A^k = I - (D^k)^{-1/2} W^k (D^k)^{-1/2}, and R = [r^1; ...; r^1; r^2; ...; r^2; ...; r^K; ...; r^K] ∈ R^{nK×1}. Setting the derivative of J(S) with respect to S to zero yields the closed-form solution

S = \Big( \frac{R R^T \circ A + \lambda C^T C}{\mu} + I \Big)^{-1} Y,   (13)

where I is an identity matrix of size nK × nK.

Given S, Eq. (10) reduces, for each modality, to:

J(r^k) = \frac{1}{2} (r^k)^2 (s^k)^T A^k s^k + (\Gamma^k)^2 (1 - r^k)^2,   (14)

and setting the derivative of J(r^k) with respect to r^k to zero, we obtain

r^k = \frac{1}{1 + (s^k)^T A^k s^k / (2 (\Gamma^k)^2)},  k = 1, 2, ..., K.   (15)

A sub-optimal solution can thus be achieved by alternating between the updates of S and r; the whole procedure is summarized in Alg. 1. Although the global convergence of the proposed algorithm is not proved, we empirically validate its fast convergence in our experiments. The optimized ranking functions {s^k} and modality weights {r^k} will be utilized for RGB-T saliency detection in the next section.

Algorithm 1 Optimization Procedure for Eq. (8)
Input: The matrices A^k = I - (D^k)^{-1/2} W^k (D^k)^{-1/2}, the indication vector Y, and the parameters \mu and \lambda; set r^k = 1/K, \varepsilon = 10^{-4}, maxIter = 50.
Output: S, r.
1:  for t = 1 : maxIter do
2:    Update S by Eq. (13);
3:    if t == 1 then
4:      for k = 1 : K do
5:        \Gamma^k = \sqrt{(s^k)^T A^k s^k};
6:      end for
7:    end if
8:    for k = 1 : K do
9:      Update r^k by Eq. (15);
10:   end for
11:   if |J_t - J_{t-1}| < \varepsilon then
12:     Terminate the loop.
13:   end if
14: end for
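Alg. 1 can be made concrete in a few lines of NumPy. The sketch below follows our reading of Eqs. (13) and (15), including the first-iteration initialization of Γ; all function and variable names are ours, and the released implementation may differ in details:

```python
import numpy as np

def consistency_matrix(K, n):
    """Cross-modality constraint matrix C of Eq. (9)."""
    C = np.zeros((n * (K - 1), n * K))
    I = np.eye(n)
    for k in range(K - 1):
        C[k*n:(k+1)*n, k*n:(k+1)*n] = I
        C[k*n:(k+1)*n, (k+1)*n:(k+2)*n] = -I
    return C

def multitask_ranking(A_list, y, mu, lam, max_iter=50, eps=1e-4):
    """Alternating optimization of Eq. (8), as in Alg. 1.

    A_list: per-modality matrices A^k = I - D^{-1/2} W D^{-1/2}
    y:      (n,) indication vector shared by all K modalities
    """
    K, n = len(A_list), len(y)
    A = np.zeros((n * K, n * K))               # block-diagonal A
    for k, Ak in enumerate(A_list):
        A[k*n:(k+1)*n, k*n:(k+1)*n] = Ak
    C = consistency_matrix(K, n)
    Y = np.tile(y.astype(float), K)
    r = np.full(K, 1.0 / K)                    # initial modality weights
    Gamma2 = None                              # (Gamma^k)^2, set at t = 1
    prev = np.inf
    for _ in range(max_iter):
        R = np.repeat(r, n)
        # S-update, Eq. (13): S = ((RR^T o A + lam C^T C)/mu + I)^-1 Y
        M = (np.outer(R, R) * A + lam * C.T @ C) / mu + np.eye(n * K)
        S = np.linalg.solve(M, Y)
        s = S.reshape(K, n)
        quad = np.array([s[k] @ A_list[k] @ s[k] for k in range(K)])
        if Gamma2 is None:
            Gamma2 = quad.copy()               # Gamma^k = sqrt(s^T A s)
        # r-update, Eq. (15): closed-form weight per modality
        r = 1.0 / (1.0 + quad / (2.0 * Gamma2 + 1e-12))
        J = (0.5 * (r**2 * quad).sum() + 0.5 * mu * ((s - y)**2).sum()
             + (Gamma2 * (1 - r)**2).sum() + lam * (C @ S) @ (C @ S))
        if abs(prev - J) < eps:
            break
        prev = J
    return s, r
```

With identical affinities in both modalities, the cross-modality term drives the two ranking vectors to agree, which is a convenient symmetry check for the implementation.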
IV. TWO-STAGE RGB-T SALIENCY DETECTION

In this section, we present the two-stage ranking scheme for unsupervised bottom-up RGB-T saliency detection using the proposed algorithm with boundary priors and foreground queries.

Algorithm 2 Proposed Approach
Input: One RGB-T image pair, the parameters \gamma^k, \mu_1, \mu_2 and \lambda.
Output: Saliency map \bar{s}.
1: // Graph construction
2: Construct the graph with superpixels as nodes, and calculate the affinity matrices W^k and the degree matrices D^k with the parameters \gamma^k;
3: // First-stage ranking
4: Utilize the boundary priors to generate the background queries;
5: Run Alg. 1 with the parameters \mu_1 and \lambda to obtain the initial saliency maps s^k_{fs} (Eq. (16));
6: // Second-stage ranking
7: Compute the foreground queries using the adaptive thresholds;
8: Run Alg. 1 with the parameters \mu_2 and \lambda to compute the final saliency map \bar{s} (Eq. (17)).

A. Saliency Measure

Given an input RGB-T image pair represented as a graph and some salient query nodes, the saliency of each node is defined as its ranking score computed by Eq. (8). In conventional ranking problems, the queries are manually labelled with the ground truth. In this work, we first employ the boundary prior widely used in other works [11, 19] to highlight the salient superpixels, and select highly confident superpixels (those with low ranking scores in all modalities) belonging to the foreground objects as the foreground queries. Then, we perform the proposed algorithm to obtain the final ranking results, and combine them with their modality weights to compute the final saliency map.

B. Ranking with Boundary Priors

Based on the attention theories for visual saliency [32], we regard the boundary nodes as background seeds (the labelled data) to rank the relevance of all other superpixel nodes in the first stage.

Taking the bottom image boundary as an example, we utilize the nodes on this side as labelled queries and the rest as unlabelled data, and initialize the indicator y in Eq. (8) accordingly. Given y, the ranking values {s^k_b} are computed by the proposed ranking algorithm, and we normalize s^k_b to \hat{s}^k_b with range between 0 and 1. Similarly, given the top, left and right image boundaries, we obtain the respective ranking values {\hat{s}^k_t}, {\hat{s}^k_l}, {\hat{s}^k_r}. We integrate them to compute the initial saliency map of each modality in the first stage:

s^k_{fs} = (1 - \hat{s}^k_b) \circ (1 - \hat{s}^k_t) \circ (1 - \hat{s}^k_l) \circ (1 - \hat{s}^k_r),  k = 1, 2, ..., K.   (16)

C. Ranking with Foreground Queries

Given s^k_{fs} of the k-th modality, we set an adaptive threshold T_k = max(s^k_{fs}) - \epsilon to generate the foreground queries, where max(·) indicates the maximum operation and \epsilon is a constant, fixed to 0.25 in this work.
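The first-stage combination of Eq. (16) and the adaptive foreground thresholding can be sketched as follows; `rank_fn` is a hypothetical stand-in for one run of Alg. 1 (returning normalized ranking values), not the released solver:

```python
import numpy as np

def first_stage(rank_fn, boundary_queries):
    """Eq. (16): combine the boundary-seeded rankings per modality.

    rank_fn:          callable mapping an indicator vector y to a
                      (K, n) array of ranking values in [0, 1]
    boundary_queries: dict side -> indicator vector y (bottom/top/left/right)
    """
    s_fs = None
    for y in boundary_queries.values():
        s_hat = np.asarray(rank_fn(y))
        comp = 1.0 - s_hat                   # high = far from this boundary
        s_fs = comp if s_fs is None else s_fs * comp   # element-wise product
    return s_fs

def foreground_queries(s_fs, eps=0.25):
    """Adaptive threshold T_k = max(s_fs^k) - eps, per modality."""
    T = s_fs.max(axis=1, keepdims=True) - eps
    return (s_fs > T).astype(float)
```

Superpixels ranked close to every boundary are suppressed by the product, so only nodes far from all four boundaries survive as candidate foreground queries for the second stage.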
and the modality weights r in the second stage by employing In graph construction, we empirically generate n = 300 ourrankingalgorithm.Similarto thefirststage,we normalize superpixels and set the affinity parameters γ1 =24 and γ2 = the ranking value sk of the k-th modality as ˆsk with the ss ss 12, which control the edge strength between two superpixel range between 0 and 1. Finally, the final saliency map can be nodes.InAlg.1andAlg.2,weempiricallysetλ=0.03,µ = obtained by combining the ranking values with the modality 1 0.02(thefirststage)andµ =0.06(thesecondstage).Herein, weights: 2 we use bigger balance weight in the second stage than in the K ¯s= (rkˆsk ). (17) firststageastherefinedforegroundqueriesaremoreconfident ss than the background queries according to the boundary prior. kX=1 B. Comparison Results V. EXPERIMENTS To justify the importance of thermal information, the com- In this section, we apply the proposed approach over our plementarybenefitsto image saliencydetectionandthe effec- RGB-T benchmarkandcomparewith otherbaselinemethods. tiveness of the proposed approach, we evaluate the compared 8 TABLEIII AVERAGEPRECISION,RECALL,ANDF-MEASUREOFOURMETHODAGAINSTDIFFERENTKINDSOFBASELINEMETHODSONTHENEWLYCREATED DATASET.THECODETYPEANDRUNTIME(SECOND)AREALSOPRESENTED.THEBOLDFONTSOFRESULTSINDICATETHEBESTPERFORMANCE,AND “M”ISTHEABBREVIATIONOFMATLAB. 
RGB Thermal RGB-T Algorithm Code Type Runtime P R F P R F P R F BR [28] 0.724 0.260 0.411 0.648 0.413 0.488 0.804 0.366 0.520 M&C++ 8.23 SR [29] 0.425 0.523 0.377 0.361 0.587 0.362 0.484 0.584 0.432 M 1.60 SRM [16] 0.411 0.529 0.384 0.392 0.520 0.380 0.428 0.575 0.411 M 0.76 CA [23] 0.592 0.667 0.568 0.623 0.607 0.573 0.648 0.697 0.618 M 1.14 MCI [26] 0.526 0.604 0.485 0.445 0.585 0.435 0.547 0.652 0.515 M&C++ 21.89 NFI [25] 0.557 0.639 0.532 0.581 0.599 0.541 0.564 0.665 0.544 M 12.43 SS-KDE [27] 0.581 0.554 0.532 0.510 0.635 0.497 0.528 0.656 0.515 M&C++ 0.94 GMR [19] 0.644 0.603 0.587 0.700 0.574 0.603 0.694 0.624 0.615 M 1.11 GR [24] 0.621 0.582 0.534 0.639 0.544 0.545 0.705 0.593 0.600 M&C++ 2.43 STM [8] 0.658 0.569 0.581 0.647 0.603 0.579 - - - C++ 1.54 MST [15] 0.627 0.739 0.610 0.665 0.655 0.598 - - - C++ 0.53 RRWR [21] 0.642 0.610 0.589 0.689 0.580 0.596 - - - C++ 2.99 Ours - - - - - - 0.716 0.713 0.680 M&C++ 1.39 TABLEIV AVERAGEMAESCOREOFOURMETHODAGAINSTDIFFERENTKINDSOFBASELINEMETHODSONTHENEWLYCREATEDDATASET.THEBOLDFONTSOF RESULTSINDICATETHEBESTPERFORMANCE. MAE CA NFI SS-KDE GMR GR BR SR SRM MCI STM MST RRWR Ours RGB 0.163 0.126 0.122 0.172 0.197 0.269 0.300 0.199 0.211 0.194 0.127 0.171 0.109 T 0.225 0.124 0.132 0.232 0.199 0.323 0.218 0.155 0.176 0.208 0.129 0.234 0.141 RGB-T 0.195 0.125 0.127 0.202 0.199 0.297 0.260 0.178 0.195 - - - 0.107 methods with different modality inputs on the newly created bias (CB), cross image boundary (CIB), similar appearance benchmark, which has been introduced in Sect. II. (SA), thermal crossover (TC), image clutter (IC), and out Overall performance. We first report the precision (P), of focus (OF), see Sect. II for details) to facilitate analysis recall(R) and F-measure(F) of 3 kindsof methodson entire of performance on different challenging factors, we present datasetinTab.III.Herein,asthepublicsourcecodesofSTM, the PR curves in Fig. 5. 
From the results we can see that MST and RRWR are encrypted, we only run these methods our approach outperformsother RGB-T methods with a clear on the single modality. From the evaluation results, we can margin on most of challenges except for BSO and BW. observe that the proposed approach substantially outperforms It validates the effectiveness of our method. In particular, all baseline methods. This comparison clearly demonstrates for occasional perturbation or malfunction of one modality the effectivenessof our approach for adaptively incorporating (e.g., LI, SA and TC), our method can effectively incorporate thermalinformation.Inaddition,thethermaldataareeffective another modal information to detect salient objects robustly, to boost image saliency detection and complementary to justifyingthe complementarybenefitsofmultiplesourcedata. RGB data by observing the superior performance of RGB- For BSO, CA [23] achieves a superior performance over T baselines over both RGB and thermal methods. We also ours, and it may attribute to CA takes the global color dis- reportMAEof3kindsofmethodsonentiredatasetinTab.IV, tinctionand the globalspacial distance into accountfor better the results of which are almost consistent with Tab. III. The capturing the global and local information. We can alleviate evaluation results further validate the effectiveness of the thisproblembyimprovingthegraphconstructionthatexplores proposedapproach,theimportanceofthermalinformationand more relations among superpixels. For BW, most of methods the complementary benefits of RGB-T data. havebadperformance,butMCIobtainsabigperformancegain Fig. 6 shows some sample results of our approach against over ours, the second best one. It suggests that considering other baseline methods with different inputs. 
Challenge-sensitive performance. To facilitate analysis of performance under different challenging factors, we evaluate the RGB-T methods on subsets with different attributes (big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF); see Sect. II for details), and present the PR curves in Fig. 5. From the results we can see that our approach outperforms the other RGB-T methods by a clear margin on most challenges except for BSO and BW, which validates the effectiveness of our method. In particular, under occasional perturbation or malfunction of one modality (e.g., LI, SA and TC), our method can effectively incorporate the information of the other modality to detect salient objects robustly, justifying the complementary benefits of multiple source data.

For BSO, CA [23] achieves superior performance over ours, which may be attributed to the fact that CA takes the global color distinction and the global spatial distance into account to better capture global and local information. We can alleviate this problem by improving the graph construction to explore more relations among superpixels. For BW, most methods perform badly, but MCI obtains a big performance gain over ours, the second best. This suggests that considering multiple cues, such as low-level considerations, global considerations, visual organizational rules and high-level factors, can handle extreme challenges in RGB-T saliency detection, and we will integrate them into our framework to improve its robustness.

Fig. 6. Sample results of the proposed approach and other baseline methods with different modality inputs. (a) Input RGB and thermal image pair and their ground truth. (b-i) The results of the baseline methods with RGB, thermal and RGB-T inputs. (j-m) The results of the baseline methods with RGB and thermal inputs. (n) The results of our approach.

C. Analysis of Our Approach

We discuss the details of our approach by analyzing its main components, efficiency and limitations.

Components. To justify the significance of the main components of the proposed approach, we implement two special versions for comparative analysis: 1) Ours-I, which removes the modality weights from the proposed ranking model, i.e., fixes r^k = 1/K in Eq. (8), and 2) Ours-II, which removes the cross-modality consistent constraints from the proposed ranking model, i.e., sets λ = 0 in Eq. (8).

The PR curves, the representative precision, recall and F-measure, and the MAE are presented in Fig. 7, from which we can draw the following conclusions. 1) Our method substantially outperforms Ours-I, which demonstrates the significance of the introduced weight variables for achieving adaptive fusion of different source data. 2) The complete algorithm achieves superior performance over Ours-II, validating the effectiveness of the cross-modality consistent constraints.

Fig. 7. PR curves, the representative precision, recall and F-measure, and the MAE of the proposed approach and its variants on the entire dataset.

Fig. 8. Average convergence curve of the proposed approach on the entire dataset.

Efficiency. The runtime of our approach against the other methods is presented in Tab. III. It is worth mentioning that our approach is comparable with GMR [19], mainly owing to the fast optimization of the proposed ranking model. The experiments are carried out on a desktop with an Intel i7 3.4 GHz CPU and 16 GB RAM, and are implemented on a mixed C++ and MATLAB platform without any optimization. Fig. 8 shows the convergence curve of the proposed approach. Although it involves five optimizations of the ranking model, our method costs about 1.39 seconds per image pair due to the efficiency of our optimization algorithm, which converges within approximately 5 iterations.

We also report the runtime of the other main procedures of the proposed approach at the typical resolution of 640 × 480 pixels. 1) The over-segmentation by the SLIC algorithm takes about 0.52 second. 2) The feature extraction takes approximately 0.24 second. 3) The first stage, which includes four ranking processes, costs about 0.42 second. 4) The second stage, which includes one ranking process, takes approximately 0.14 second. The over-segmentation and the feature extraction are the most time-consuming procedures (about 55% of the total). Hence, by introducing more efficient over-segmentation algorithms and feature extraction implementations, we can achieve a much better computation time for our approach.

Limitations. We also present two failure cases generated by our method in Fig. 9. The reliability weights are sometimes wrongly estimated due to the effect of a cluttered background, as shown in Fig. 9 (a), where the modality weights of the RGB and thermal data are 0.97 and 0.95, respectively. In such circumstances, our method will generate bad detection results. This problem could be tackled by incorporating a measure of background clutter when optimizing the reliability weights, and will be addressed in our future work. In addition to reliability weight computation, our approach has another major limitation: the first-stage ranking relies on the boundary prior, and thus salient objects fail to be detected when they cross image boundaries. The insufficient foreground queries obtained after the first stage usually result in bad saliency detection performance in the second stage, as shown in Fig. 9 (b). We will handle this issue by selecting more accurate boundary queries according to other prior cues in future work.

Fig. 9. Two failure cases of our method. The input RGB and thermal images, the ground truth and the results generated by our method are shown in (a) and (b), respectively.
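To make the role of the modality weights in the analysis above concrete, the following is a toy numerical sketch in the spirit of graph-based manifold ranking (GMR [19]) over two modalities. The Gaussian affinity, sigma, alpha and the fixed fusion weights r are illustrative assumptions; the paper's model (Eq. (8)) instead learns the weights jointly with the ranking and adds cross-modality consistent constraints, which this sketch omits.

```python
import numpy as np

def manifold_rank(features, queries, alpha=0.99, sigma=0.1):
    """Closed-form graph-based manifold ranking for one modality.

    features: (n, d) node descriptors; queries: (n,) binary query indicator.
    Returns f = (I - alpha * S)^{-1} y, where S is the symmetrically
    normalized Gaussian affinity matrix of the fully connected graph.
    """
    n = features.shape[0]
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    Dm12 = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = Dm12 @ W @ Dm12
    return np.linalg.solve(np.eye(n) - alpha * S, queries.astype(np.float64))

def fuse(feats_rgb, feats_t, queries, r=(0.5, 0.5)):
    """Hypothetical fusion: reliability weights r scale each modality's
    ranking before combining -- a simplification of adaptive fusion."""
    f_rgb = manifold_rank(feats_rgb, queries)
    f_t = manifold_rank(feats_t, queries)
    return r[0] * f_rgb + r[1] * f_t
```

With adaptively estimated r, a noisy modality would receive a small weight, which is the behavior that Ours-I (fixed r^k = 1/K) deliberately disables.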
D. Discussions on RGB-T Saliency Detection

We observe from the evaluations that integrating RGB and thermal data boosts saliency detection performance (see Tab. III). The improvements are even bigger when encountering certain challenges, i.e., low illumination, similar appearance, image clutter and thermal crossover (see Fig. 5), demonstrating the importance of thermal information in image saliency detection and the complementary benefits of RGB-T data.

In addition, directly integrating RGB and thermal information sometimes leads to worse results than using a single modality (see Fig. 6), as redundant information is introduced by a noisy or malfunctioning modality. We address this issue by adaptively fusing the different modal information (our method), automatically determining the contribution weights of the different modalities to alleviate the effects of redundant information.

From the evaluation results, we also observe the following research potentials for RGB-T saliency detection. 1) The ensemble of multiple models or algorithms (CA [23]) can achieve robust performance. 2) Some principles are crucial for effective saliency detection, e.g., global considerations (CA [23]), boundary priors (Ours, GMR [19], RRWR [21], MST [15]) and multiscale context (STM [8]). 3) The exploitation of more relations among pixels or superpixels is important for highlighting the salient objects (STM [8], MST [15]).

VI. CONCLUSION

In this paper, we have presented a comprehensive image benchmark for RGB-T saliency detection, which includes a dataset, three kinds of baselines and four evaluation metrics. With this benchmark, we have proposed a graph-based multi-task manifold ranking algorithm to achieve adaptive and collaborative fusion of RGB and thermal data for RGB-T saliency detection. Through analysis of the quantitative and qualitative results, we have demonstrated the effectiveness of the proposed approach, and have also provided some basic insights and potential research directions for RGB-T saliency detection.

Our future work will focus on the following aspects: 1) We will expand the benchmark to a larger one, with a larger dataset, more challenging factors and more popular baseline methods. 2) We will improve the robustness of our approach by studying other prior models [33] and graph construction.

REFERENCES

[1] B. Zhao, Z. Li, M. Liu, W. Cao, and H. Liu, "Infrared and visible imagery fusion based on region saliency detection for 24-hour-surveillance systems," in Proceedings of the IEEE International Conference on Robotics and Biomimetics, 2013.
[2] C. Li, X. Wang, L. Zhang, and J. Tang, "WELD: Weighted low-rank decomposition for robust grayscale-thermal foreground detection," IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[3] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, "Learning collaborative sparse representation for grayscale-thermal tracking," IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016.
[4] A. Torralba and A. Efros, "Unbiased look at dataset bias," in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[5] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proceedings of the Advances in Neural Information Processing Systems, 2006.
[6] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2011.
[7] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.