Cost-Effective Active Learning for Deep Image Classification

Keze Wang, Dongyu Zhang*, Ya Li, Ruimao Zhang, and Liang Lin, Senior Member, IEEE
IEEE Transactions on Circuits and Systems for Video Technology

Abstract—Recent successes in learning-based image classification heavily rely on large numbers of annotated training samples, which may require considerable human effort. In this paper, we propose a novel active learning framework that is capable of building a competitive classifier with an optimal feature representation from a limited amount of labeled training instances, in an incremental learning manner. Our approach advances existing active learning methods in two respects. First, we incorporate deep convolutional neural networks into active learning: through a properly designed framework, the feature representation and the classifier can be simultaneously updated with progressively annotated informative samples. Second, we present a cost-effective sample selection strategy that improves classification performance with fewer manual annotations. Unlike traditional methods that focus only on uncertain samples of low prediction confidence, we also exploit the large number of high-confidence samples in the unlabeled set for feature learning. Specifically, these high-confidence samples are automatically selected and iteratively assigned pseudo-labels. We therefore call our framework "Cost-Effective Active Learning" (CEAL), standing for these two advantages. Extensive experiments demonstrate that the proposed CEAL framework achieves promising results on two challenging image classification datasets, i.e., face recognition on the CACD database [1] and object categorization on Caltech-256 [2].

Index Terms—Incremental learning, active learning, deep neural networks, image classification.

This work was supported in part by the Guangdong Natural Science Foundation under Grants S2013050014548 and 2014A030313201, in part by the State Key Development Program under Grant 2016YFB1001000, and in part by the Fundamental Research Funds for the Central Universities. This work was also supported by the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase). We would like to thank Depeng Liang and Jin Xu for their preliminary contributions to this project. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Keze Wang, Dongyu Zhang, Ruimao Zhang, and Liang Lin are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou. E-mail: [email protected]. (Corresponding author: Dongyu Zhang.) Ya Li is with Guangzhou University, Guangzhou. Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE at [email protected].
I. INTRODUCTION

Aiming at improving existing models by incrementally selecting and annotating the most informative unlabeled samples, active learning (AL) has been well studied in the past few decades [3], [4], [5], [6], [7], [8], [9], [10], [11], [12] and applied to various kinds of vision tasks, such as image/video categorization [13], [14], [15], [16], [17], text/web classification [18], [19], [20], and image/video retrieval [21], [22]. In AL methods [3], [4], [5], the classifier/model is first initialized with a relatively small set of labeled training samples. It is then continuously boosted by selecting and pushing some of the most informative samples to the user for annotation. Although existing AL approaches [23], [24] have demonstrated impressive results on image classification, their classifiers/models are trained with hand-crafted features (e.g., HoG, SIFT) on small-scale visual datasets. The effectiveness of AL on more challenging image classification tasks has not been well studied.

Recently, remarkable progress on visual recognition tasks has been made by deep learning approaches [25], [26]. With sufficient labeled data [27], deep convolutional neural networks (CNNs) [25], [28] are trained to learn features directly from raw pixels, which has achieved state-of-the-art performance for image classification. However, in many real applications of large-scale image classification, labeled data is insufficient, since the tedious manual labeling process requires a great deal of time and labor. It is therefore of great practical significance to develop a framework that combines CNNs and active learning and can jointly learn features and classifiers/models from unlabeled training data with minimal human annotation. But incorporating CNNs into an active learning framework is not straightforward for real image classification tasks, due to the following two issues.

• The labeled training samples given by current AL approaches are insufficient for CNNs, as the majority of unlabeled samples are usually ignored in active learning. AL usually selects only a few of the most informative samples (e.g., samples with quite low prediction confidence) in each learning step and frequently solicits user labeling. It is thus difficult to obtain proper feature representations by fine-tuning CNNs with this minority of informative samples.

• The processing pipelines of AL and CNNs are inconsistent with each other. Most AL methods pay close attention to model/classifier training, and their strategies for selecting the most informative samples rely heavily on the assumption that the feature representation is fixed. In CNNs, however, feature learning and classifier training are jointly optimized. Because of this inconsistency, simply fine-tuning CNNs within the traditional AL framework may face a divergence problem.

Inspired by the insights and lessons of a significant amount of previous work, as well as the recently proposed technique of self-paced learning [29], [30], [31], [32], we address the above-mentioned issues by cost-effectively combining the CNN and AL via complementary sample selection. In particular, we propose a novel active learning framework called "Cost-Effective Active Learning" (CEAL), which fine-tunes the CNN with sufficient unlabeled training data and overcomes the inconsistency between AL and the CNN.

Fig. 1. Illustration of our proposed CEAL framework. CEAL progressively feeds the samples from the unlabeled dataset into the CNN. The selection criteria for clearly classified samples and for the most informative samples are then both applied to the classifier output of the CNN. After adding the user-annotated minority of uncertain samples to the labeled set and pseudo-labeling the majority of certain samples, the model (the feature representation and classifier of the CNN) is further updated.
Different from existing AL approaches, which consider only the most informative and representative samples, CEAL automatically selects and pseudo-annotates unlabeled samples. As Fig. 1 illustrates, CEAL progressively feeds the samples from the unlabeled dataset into the CNN and selects two kinds of samples for fine-tuning according to the output of the CNN's classifier. The first kind is the minority of samples with low prediction confidence, called the most informative/uncertain samples, whose predicted labels are the most uncertain. To select these uncertain samples, CEAL considers three common active learning criteria: least confidence [33], margin sampling [34], and entropy [35]. The selected samples are added to the labeled set after user labeling. The second kind is the majority of samples with high prediction confidence, called high-confidence samples, whose predicted labels are the most certain. To these samples CEAL automatically assigns pseudo-labels, at no human labor cost. These two kinds of samples are complementary to each other in representing different confidence levels of the current model on the unlabeled dataset. In the model updating stage, all the samples in the labeled set, together with the currently pseudo-labeled high-confidence samples, are exploited to fine-tune the CNN.

The proposed CEAL employs these two complementary kinds of samples to incrementally improve the model's classifier training and feature learning: the minority informative kind contributes to training more powerful classifiers, while the majority high-confidence kind helps to learn more discriminative feature representations. On one hand, although few in number, the most uncertain unlabeled samples usually have great potential impact on the classifier; selecting and annotating them for training can lead to a better decision boundary. On the other hand, though unable to significantly improve the performance of the classifier, the high-confidence unlabeled samples are close to the labeled samples in the CNN's feature space, so pseudo-labeling this majority of high-confidence samples is a reasonable form of data augmentation that lets the CNN learn robust features. In particular, the number of high-confidence samples is actually much larger than the number of the most uncertain ones. With the resulting robust feature representation, the inconsistency between AL and the CNN can be overcome.

To keep the model stable during training, many works [36], [37] have been proposed in recent years, inspired by the human learning process of gradually including samples in training from easy to complex. In this way, the training samples for later iterations are gradually determined by the model itself, based on what it has already learned [32]. In other words, the model can gradually select high-confidence samples as pseudo-labeled ones as training proceeds. The advantages of these related studies motivate us to incrementally select unlabeled samples in an easy-to-hard manner, making the pseudo-labeling process reliable. Specifically, since the classification model is usually not reliable enough in the initial iterations, we employ a high confidence threshold to define clearly classified samples and assign them pseudo-labels. As the performance of the classification model improves, the threshold decreases correspondingly.

The main contribution of this work is threefold. First, to the best of our knowledge, our work is the first to address deep image classification problems by combining an active learning framework with convolutional neural network training. Our framework can easily be extended to other similar visual recognition tasks.
Second, this work also advances active learning itself by introducing a cost-effective strategy that automatically selects and annotates high-confidence samples, improving on traditional sample selection strategies. Third, experiments on the challenging CACD [1] and Caltech-256 [2] datasets show that our approach outperforms other methods not only in classification accuracy but also in the reduction of human annotation.

The rest of the paper is organized as follows. Section II presents a brief review of related work. Section III discusses the components of our framework and the corresponding learning algorithm. Section IV presents the experimental results with in-depth empirical analysis. Section V concludes the paper.

II. RELATED WORK

The key idea of active learning is that a learning algorithm should achieve greater accuracy with fewer labeled training samples if it is allowed to choose the ones from which it learns [33]. The instance selection scheme is therefore extremely important. One of the most common strategies is uncertainty-based selection [12], [18], which measures the uncertainty of novel unlabeled samples from the predictions of previous classifiers. In [12], Lewis et al. proposed to extract, as the most uncertain instance, the sample with the largest entropy of the conditional distribution over predicted labels. The SVM-based method [18] determined the uncertain samples from the relative distance between the candidate samples and the decision boundary. Some earlier works [38], [19] also determined sample uncertainty by referring to a committee of classifiers (i.e., examining the disagreement among class labels assigned by a set of classifiers); this theoretically motivated framework is called query-by-committee in the literature [33]. All of the above-mentioned uncertainty-based methods usually ignore the majority of certain unlabeled samples and are thus sensitive to outliers.
Later methods took an information density measure into account and exploited the information in the unlabeled data when selecting samples. These approaches usually select the informative samples sequentially, relying on probability estimation [6], [39] or prior information [8] to minimize the generalization error of the trained classifier over the unlabeled data. For example, Joshi et al. [6] considered uncertainty sampling based on the estimated probability of class membership for all instances in the selection pool; such a method can effectively handle the multi-class case. In [8], context constraints are introduced as priors to guide users in tagging face images more efficiently. In parallel, a series of works [24], [7] was proposed to select the samples that maximize the increase of mutual information between the candidate instance and the remaining unlabeled instances under the Gaussian process framework. Li et al. [24] presented a novel adaptive active learning approach that combines an information density measure with an uncertainty measure to label critical instances for image classification. Moreover, the diversity of the selected instances within a given category has been taken into consideration in [40] as well; that work is also the pioneering study that extended SVM-based active learning from single mode to batch mode. Recently, Elhamifar et al. [11] further integrated uncertainty and diversity measures into a unified batch-mode framework via convex programming for unlabeled sample selection. Such an approach can be combined with any type of classifier, not only max-margin ones. Clearly, all of the above-mentioned active learning methods consider only low-confidence samples (e.g., uncertain and diverse samples), losing sight of the large majority of high-confidence samples. We hold that, owing to their majority and consistency, these high-confidence samples are also beneficial for improving the accuracy and keeping the classifier stable. We shall further demonstrate that considering these high-confidence samples can also effectively reduce the user's annotation effort.

III. COST-EFFECTIVE ACTIVE LEARNING

In this section, we develop an efficient algorithm for the proposed cost-effective active learning (CEAL) framework. Our objective is to apply CEAL to deep image classification tasks by progressively selecting complementary samples for model updating. Suppose we have a dataset of m categories and n samples, denoted as D = {x_i}_{i=1}^{n}. We denote the currently annotated samples of D as D^L and the unlabeled ones as D^U. The label of x_i is denoted as y_i = j, j ∈ {1, ..., m}, i.e., x_i belongs to the j-th category. Two remarks on our problem setting are necessary. First, in the image classification problems we investigate, almost all data are unlabeled, i.e., most of the {y_i} of D are unknown and need to be completed during the learning process. Second, D^U may be fed into the system incrementally, which means that the data scale may grow continually.

By handling manually annotated and automatically pseudo-labeled samples together, our proposed CEAL model can progressively fit the continually growing unlabeled data in such a holistic manner. CEAL for deep image classification is formulated as follows:

    min_{W, y_i, i ∈ D^U}  −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} 1{y_i = j} log p(y_i = j | x_i; W),    (1)

where 1{·} is the indicator function, so that 1{a true statement} = 1 and 1{a false statement} = 0, and W denotes the network parameters of the CNN. p(y_i = j | x_i; W) denotes the softmax output of the CNN for the j-th category, i.e., the probability that sample x_i belongs to the j-th class.

An alternating search strategy is readily employed to optimize Eq. (1). Specifically, the algorithm alternates between updating the pseudo-labels y_i, i ∈ D^U, and the network parameters W. In the following, we introduce the details of the optimization steps and give their physical interpretations. The practical implementation of CEAL is also discussed at the end.
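To make the formulation concrete, here is a minimal NumPy sketch of the objective in Eq. (1) once a label assignment is fixed; the function and array names (`ceal_loss`, `probs`, `labels`) are our own illustrative choices, not taken from the authors' code.

```python
import numpy as np

def ceal_loss(probs, labels):
    """Cross-entropy of Eq. (1) for a fixed set of (manually or pseudo-) labeled samples.

    probs:  (n, m) array, probs[i, j] = p(y_i = j | x_i; W), rows sum to 1.
    labels: (n,) array of integer class labels in {0, ..., m-1}.

    The indicator 1{y_i = j} selects one softmax entry per sample, so the
    double sum reduces to indexing the true-class probability of each row.
    """
    n = probs.shape[0]
    true_class_probs = probs[np.arange(n), labels]
    return -np.mean(np.log(true_class_probs + 1e-12))  # epsilon guards log(0)

# Toy usage: 3 samples, 4 classes.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.05, 0.90, 0.03, 0.02]])
labels = np.array([0, 2, 1])
print(ceal_loss(probs, labels))
```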
A. Initialization

Before the experiment starts, the labeled set D^L is empty. For each class, we randomly select a few training samples from D^U and manually annotate them as the starting point for initializing the CNN parameters W.

B. Complementary sample selection

Fixing the CNN parameters W, we first rank all unlabeled samples according to common active learning criteria, then manually annotate the most uncertain samples and add them to D^L. To the most certain ones we assign pseudo-labels and denote them as D^H.

Informative Sample Annotating: CEAL can be used in conjunction with any common active learning criterion, e.g., least confidence [33], margin sampling [34], and entropy [35], to select the K most informative/uncertain samples remaining in D^U. The selection criteria are based on p(y_i = j | x_i; W), the probability of x_i belonging to the j-th class. Specifically, the three selection criteria are defined as follows:

1) Least confidence: Rank all unlabeled samples in ascending order of their lc_i value, defined as

    lc_i = max_j p(y_i = j | x_i; W).    (2)

If the probability of the most probable class of a sample is low, the classifier is uncertain about that sample.

2) Margin sampling: Rank all unlabeled samples in ascending order of their ms_i value, defined as

    ms_i = p(y_i = j_1 | x_i; W) − p(y_i = j_2 | x_i; W),    (3)

where j_1 and j_2 are the first and second most probable class labels predicted by the classifier. The smaller the margin, the more uncertain the classifier is about the sample.

3) Entropy: Rank all unlabeled samples in descending order of their en_i value, defined as

    en_i = −Σ_{j=1}^{m} p(y_i = j | x_i; W) log p(y_i = j | x_i; W).    (4)

This criterion takes all class label probabilities into consideration when measuring uncertainty: the higher the entropy value, the more uncertain the sample.
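For reference, all three criteria of Eqs. (2)-(4) can be computed directly from the softmax outputs. The following NumPy sketch is our own illustration; the function names and the layout of the `probs` matrix are assumptions, not the paper's code.

```python
import numpy as np

def least_confidence(probs):
    # Eq. (2): lc_i = max_j p(y_i = j | x_i; W); smaller = more uncertain.
    return probs.max(axis=1)

def margin_sampling(probs):
    # Eq. (3): gap between the two largest class probabilities;
    # a smaller margin = more uncertain.
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs):
    # Eq. (4): Shannon entropy over all classes; larger = more uncertain.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_uncertain(probs, K, criterion="entropy"):
    """Return the indices of the K most uncertain samples under a criterion."""
    if criterion == "lc":
        order = np.argsort(least_confidence(probs))   # ascending: low max-prob first
    elif criterion == "ms":
        order = np.argsort(margin_sampling(probs))    # ascending: small margin first
    else:
        order = np.argsort(entropy(probs))[::-1]      # descending: high entropy first
    return order[:K]
```

Note that the sort direction differs: least confidence and margin sampling rank ascending, while entropy ranks descending, matching the definitions above.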
High Confidence Sample Pseudo-labeling: We select from D^U the high-confidence samples whose entropy is smaller than the threshold δ, and assign them their clearly predicted pseudo-labels. The pseudo-label y_i is defined as

    j* = argmax_j p(y_i = j | x_i; W),
    y_i = j* if en_i < δ, and y_i = 0 otherwise,    (5)

where a nonzero y_i denotes that x_i is regarded as a high-confidence sample. The selected samples are denoted as D^H. Note that, compared with the classification probability p(y_i = j* | x_i; W) of the j*-th category alone, the employed entropy en_i holistically considers the classification probabilities of the other categories as well, i.e., a selected sample should be clearly classified with high confidence. The threshold δ is set to guarantee a high reliability of pseudo-label assignment.

C. CNN fine-tuning

Fixing the labels of the self-labeled high-confidence samples D^H and of the samples D^L manually annotated by the active user, Eq. (1) simplifies to

    min_W  −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} 1{y_i = j} log p(y_i = j | x_i; W),    (6)

where N denotes the number of samples in D^H ∪ D^L. We employ standard back-propagation to update the CNN parameters W. Specifically, let L denote the loss function of Eq. (6); the partial derivative with respect to the network parameters W is

    ∂L/∂W = ∂[−(1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} 1{y_i = j} log p(y_i = j | x_i; W)]/∂W
          = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} 1{y_i = j} ∂log p(y_i = j | x_i; W)/∂W
          = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (1{y_i = j} − p(y_i = j | x_i; W)) ∂z_j(x_i; W)/∂W,    (7)

where {z_j(x_i; W)}_{j=1}^{m} denotes the activations of the last layer of the CNN model for the i-th sample, before they are fed into the softmax classifier, which is defined as

    p(y_i = j | x_i; W) = e^{z_j(x_i; W)} / Σ_{t=1}^{m} e^{z_t(x_i; W)}.    (8)

After fine-tuning, we put the high-confidence samples D^H back into D^U and erase their pseudo-labels.

D. Threshold updating

As the incremental learning process goes on, the classification capability of the classifier improves and more high-confidence samples are selected, which may increase the risk of incorrect automatic annotation. To guarantee the reliability of high-confidence sample selection, at the end of each iteration t we update the selection threshold by setting

    δ = δ_0 for t = 0,  and  δ = δ − dr · t for t > 0,    (9)

where δ_0 is the initial threshold and dr controls the threshold decay rate.

Algorithm 1 Learning Algorithm of CEAL
Input: Unlabeled samples D^U, initially labeled samples D^L, uncertain sample selection size K, high-confidence sample selection threshold δ, threshold decay rate dr, maximum iteration number T, fine-tuning interval t.
Output: CNN parameters W.
1: Initialize W with D^L.
2: while maximum iteration T not reached do
3:   Add K uncertain samples to D^L based on Eq. (2), (3), or (4).
4:   Obtain high-confidence samples D^H based on Eq. (5).
5:   Every t iterations:
       • Update W via fine-tuning according to Eq. (6) with D^H ∪ D^L.
       • Update δ according to Eq. (9).
6: end while
7: return W

The entire algorithm is summarized in Algorithm 1. It is easy to see that this alternating optimization strategy accords well with the pipeline of the proposed CEAL framework.
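To summarize the whole procedure in code, the sketch below mirrors Algorithm 1 under our reading of it. Here `predict_probs`, `finetune`, and `oracle_annotate` are hypothetical stand-ins for the CNN forward pass, the (Caffe) fine-tuning step, and the human annotator, and the per-iteration threshold decay follows one plausible reading of Eq. (9).

```python
import numpy as np

def ceal(unlabeled_ids, labeled, predict_probs, finetune, oracle_annotate,
         K=1000, delta0=0.05, dr=0.0033, T=20, t=1):
    """Sketch of Algorithm 1. `unlabeled_ids` is a list of sample indices and
    `labeled` a dict {index: label} holding the initially annotated D^L.
    All callables are hypothetical stand-ins, not the authors' API."""
    delta = delta0
    finetune(labeled)                                   # initialize W with D^L
    for it in range(1, T + 1):
        ids = list(unlabeled_ids)
        probs = predict_probs(ids)                      # (len(ids), m) softmax rows
        ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)

        # Informative sample annotating: K most uncertain go to the oracle.
        uncertain = set(np.argsort(ent)[::-1][:K].tolist())
        for r in uncertain:
            labeled[ids[r]] = oracle_annotate(ids[r])
            unlabeled_ids.remove(ids[r])

        # High-confidence pseudo-labeling (Eq. (5)): entropy below delta.
        pseudo = {ids[r]: int(probs[r].argmax())
                  for r in range(len(ids))
                  if ent[r] < delta and r not in uncertain}

        if it % t == 0:
            finetune({**labeled, **pseudo})             # Eq. (6) on D^H ∪ D^L
            delta -= dr * it                            # threshold decay, Eq. (9)
        # Pseudo-labels are discarded afterwards; those samples stay in D^U.
    return labeled
```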
IV. EXPERIMENTS

A. Datasets and Experiment Settings

1) Dataset Description: We evaluate our cost-effective active learning framework on two public challenging benchmarks: the Cross-Age Celebrity face recognition Dataset (CACD) [1] and the Caltech-256 object categorization dataset [2] (see Fig. 2). CACD is a large-scale, challenging dataset for face identification and retrieval problems. It contains more than 160,000 images of 2,000 celebrities, varying in age, pose, illumination, and occlusion. Since not all of the images are annotated, we adopt a subset of 580 individuals from the whole dataset in our experiments, in which 200 individuals were originally annotated and a further 380 were annotated by us. In particular, 6,336 images of 80 individuals are used for pre-training the network, and the remaining 500 persons are used in the experiments. Caltech-256 is a challenging object categorization dataset containing a total of 30,607 images in 256 categories collected from the Internet.

Fig. 2. We demonstrate the effectiveness of our proposed heuristic deep active learning framework on face recognition and object categorization. First and second rows: sample images from the Caltech-256 dataset [2]. Last row: sample images from the Cross-Age Celebrity Dataset [1].

TABLE I
THE DETAILED CONFIGURATION OF THE CNN ARCHITECTURE USED ON CACD [1]. IT TAKES 200×150×3 IMAGES AS INPUT AND GENERATES A 500-WAY SOFTMAX OUTPUT FOR CLASS PREDICTION. THE RELU [42] ACTIVATION FUNCTION IS NOT SHOWN FOR BREVITY.

layer type               kernel size/stride   output size
convolution              5×5/2                98×73×32
max pool                 3×3/2                48×36×32
LRN                      —                    48×36×32
convolution (padding 2)  5×5/1                48×36×64
max pool                 3×3/2                23×17×64
LRN                      —                    23×17×64
convolution (padding 1)  3×3/1                23×17×96
fc (dropout 50%)         —                    1×1×1536
fc (dropout 50%)         —                    1×1×1536
softmax                  —                    1×1×500

TABLE II
THE DETAILED CONFIGURATION OF THE CNN ARCHITECTURE USED ON CALTECH-256 [2]. IT TAKES 256×256×3 IMAGES AS INPUT, WHICH ARE RANDOMLY CROPPED TO 227×227 DURING TRAINING, AND GENERATES A 256-WAY SOFTMAX OUTPUT FOR CLASS PREDICTION. THE RELU ACTIVATION FUNCTION IS NOT SHOWN FOR BREVITY.

layer type               kernel size/stride   output size
convolution              11×11/4              55×55×96
max pool                 3×3/2                27×27×96
LRN                      —                    27×27×96
convolution (padding 2)  5×5/1                27×27×256
max pool                 3×3/2                13×13×256
LRN                      —                    13×13×256
convolution (padding 1)  3×3/1                13×13×384
convolution (padding 1)  3×3/1                13×13×384
convolution (padding 1)  3×3/1                13×13×256
max pool                 3×3/2                6×6×256
fc (dropout 50%)         —                    1×1×4096
fc (dropout 50%)         —                    1×1×4096
softmax                  —                    1×1×256

2) Experiment Setting: For CACD, we utilize the method proposed in [41] to detect the facial points and align the faces based on the eye locations. We resize all faces to 200×150 and set the parameters δ_0 = 0.05, dr = 0.0033, and K = 2000. For Caltech-256, we resize all images to 256×256 and set δ_0 = 0.005, dr = 0.00033, and K = 1000. Following the settings of the existing active learning method [11], we randomly select 80% of the images of each class to form the unlabeled training set; the rest form the testing set in our experiments. Within the unlabeled training set, we randomly select 10% of the samples of each class to initialize the network, and the rest are left for the incremental learning process. To remove the influence of randomness, we average the results of 5 executions as the final result.

We use different network architectures for the CACD [1] and Caltech-256 [2] datasets because the difference between faces and objects is relatively large. Table I shows the overall network architecture for the CACD experiments, and Table II shows the overall network architecture for the Caltech-256 experiments. For Caltech-256, we use AlexNet [25] as the network architecture, starting from a model pre-trained on the ImageNet ILSVRC dataset [43], following the setting of [44]; we then keep all layers fixed and only modify the last layer to be a 256-way softmax classifier. We employ Caffe [45] for the CNN implementation. For CACD, we set the learning rates of all layers to 0.01. For Caltech-256, we set the learning rates of all layers to 0.001, except for the softmax layer, which is set to 0.01. All experiments are conducted on a common desktop PC with an Intel 3.8 GHz CPU and a Titan X GPU. On average, 17 hours are needed to finish training on the CACD dataset with its 44,708 images.
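For readers who prefer code to tables, the following is a rough PyTorch transcription of the CACD architecture in Table I. The paper's implementation is in Caffe, and details such as the LRN hyper-parameters are not given in the text, so those values below are assumptions.

```python
import torch.nn as nn

# Approximate PyTorch transcription of Table I (CACD network).
# LRN settings are assumptions; the original Caffe prototxt is not reproduced here.
cacd_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2),              # 200x150x3 -> 98x73x32
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 48x36x32
    nn.LocalResponseNorm(size=5),
    nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),  # -> 48x36x64
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 23x17x64
    nn.LocalResponseNorm(size=5),
    nn.Conv2d(64, 96, kernel_size=3, stride=1, padding=1),  # -> 23x17x96
    nn.ReLU(inplace=True),
    nn.Flatten(),
    nn.Linear(23 * 17 * 96, 1536),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(1536, 1536),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(1536, 500),                                   # 500-way class scores
    # The softmax itself is folded into the cross-entropy loss during training.
)
```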
3) Comparison Methods: To demonstrate that our proposed CEAL framework can improve classification performance with less labeled data, we compare CEAL with a recent state-of-the-art active learning method (TCAL) and two baseline methods (AL_ALL and AL_RAND):

• AL_ALL: We manually label all training samples and use them to train the CNN. This method can be regarded as the upper bound (the best performance the CNN can reach when all training samples are labeled).

• AL_RAND: During training, we randomly select samples to be annotated for fine-tuning the CNN. This method discards all active learning techniques and can be considered the lower bound.

• Triple Criteria Active Learning (TCAL) [3]: TCAL is a comprehensive active learning approach that is well designed to jointly evaluate sample selection criteria (uncertainty, diversity, and density), and it has beaten state-of-the-art methods with far fewer annotations. TCAL represents the methods that aim to mine a minority of informative samples to improve performance. We therefore regard it as a relevant competitor.

Implementation Details: The compared methods share the same CNN architecture as our CEAL on both datasets; the only difference lies in the sample selection criteria. For the AL_ALL baseline, we select all training samples to fine-tune the CNN, i.e., all labels are used. For TCAL, we follow the pipeline of [3] by training an SVM classifier and then applying the uncertainty, diversity, and density criteria to select the most informative samples. Specifically, the uncertainty of a sample is assessed according to the margin sampling strategy. The diversity is calculated by clustering the most uncertain samples via k-means with a histogram intersection kernel. The density of a sample is measured by its average distance to the other samples within the cluster it belongs to. For each cluster, the sample with the highest density (i.e., the smallest average distance) is selected as the most informative sample. For CACD, we cluster the 2000 most uncertain samples and select the 500 most informative ones according to the above diversity and density measures. For Caltech-256, we select the 250 most informative samples from the 1000 most uncertain ones. To make a fair comparison, the samples selected in each iteration by TCAL are also used to fine-tune the CNN to learn the optimal feature representation, as in AL_RAND. Once the optimal features are learned, the SVM classifier of TCAL is updated accordingly.
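To make our TCAL re-implementation concrete, below is a simplified sketch of the diversity/density selection step as we read [3]: cluster the most uncertain samples, then keep the highest-density sample per cluster. For brevity the sketch uses plain Euclidean k-means where the actual pipeline clusters with a histogram intersection kernel; the function name and arguments are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def tcal_select(features, uncertain_idx, n_informative, seed=0):
    """Pick one representative (highest-density) sample per cluster.

    features:      (n, d) feature matrix over all unlabeled samples.
    uncertain_idx: (k,) indices of the most uncertain samples (margin sampling).
    n_informative: number of clusters, i.e., number of samples to return.
    Note: Euclidean k-means here; the paper clusters with a histogram
    intersection kernel, which we omit for brevity.
    """
    feats = features[uncertain_idx]
    km = KMeans(n_clusters=n_informative, n_init=10, random_state=seed).fit(feats)
    selected = []
    for c in range(n_informative):
        members = np.where(km.labels_ == c)[0]
        pts = feats[members]
        # Average distance of each member to the others: the inverse of density.
        dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1).mean(axis=1)
        selected.append(uncertain_idx[members[np.argmin(dists)]])
    return selected
```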
B. Comparison Results and Empirical Analysis

1) Comparison Results: To demonstrate the effectiveness of our proposed framework, we apply the margin sampling criterion to measure the uncertainty of samples and denote this method CEAL_MS. Fig. 3 plots classification accuracy against the percentage of annotated samples of the whole training set for AL_RAND, AL_ALL, TCAL, and the proposed CEAL_MS on both the CACD and Caltech-256 datasets.

Fig. 3. Classification accuracy under different percentages of annotated samples of the whole training set on (a) CACD and (b) Caltech-256. Our proposed method CEAL_MS performs consistently better than the compared TCAL and AL_RAND.

TABLE III
CLASSIFICATION ACCURACY AT SPECIFIC AL ITERATIONS ON THE CACD AND CALTECH-256 DATASETS.

(a) CACD
Training iteration             0      2      4      6      8      10     12     14     16     18     20
Percentage of labeled samples  0.1    0.18   0.27   0.36   0.45   0.54   0.63   0.72   0.81   0.90   1
CEAL_MS                        57.4%  77.3%  84.5%  88.2%  89.5%  90.5%  91.5%  91.6%  91.7%  91.7%  92.0%
TCAL                           57.4%  74.4%  81.6%  85.0%  87.9%  88.8%  89.8%  90.9%  91.5%  91.8%  91.9%
AL_RAND                        57.4%  70.9%  77.7%  80.9%  84.5%  86.9%  88.0%  89.0%  89.9%  90.6%  92.0%
AL_ALL                         -      -      -      -      -      -      -      -      -      -      92.0%

(b) Caltech-256
Training iteration             0      2      4      6      8      10     12     14     16     18     20     22
Percentage of labeled samples  0.10   0.18   0.26   0.34   0.43   0.51   0.59   0.68   0.76   0.84   0.93   1
CEAL_MS                        58.4%  64.8%  68.8%  70.4%  71.4%  72.3%  72.8%  73.3%  73.8%  74.2%  74.2%  74.2%
TCAL                           58.4%  62.4%  65.4%  67.5%  69.8%  70.9%  71.9%  72.5%  73.2%  73.6%  73.9%  74.2%
AL_RAND                        58.4%  62.4%  64.8%  66.6%  67.5%  69.1%  70.4%  71.2%  72.4%  72.9%  73.6%  74.2%
AL_ALL                         -      -      -      -      -      -      -      -      -      -      -      74.2%

As illustrated in Fig. 3, Table III(a), and Table III(b), our proposed CEAL framework surpasses the compared methods in terms of both recognition accuracy and user annotation amount. In terms of recognition accuracy, given the same percentage of annotated samples, our CEAL_MS outperforms the compared methods by a clear margin, especially when the percentage of annotated samples is low. In terms of user annotation amount, to achieve 91.5% recognition accuracy on the CACD dataset, AL_RAND and TCAL require 99% and 81% labeled training samples, respectively, whereas CEAL_MS needs only 63%, reducing user annotations by around 36% and 18% compared to AL_RAND and TCAL. To achieve 73.8% accuracy on the Caltech-256 dataset, AL_RAND and TCAL require 97% and 93% labeled samples, respectively; CEAL_MS needs only 78%, reducing user annotations by around 19% and 15%. This justifies that our proposed CEAL framework can effectively reduce the need for labeled samples.

From the above results, one can see that our proposed framework consistently outperforms the state-of-the-art method TCAL in both recognition accuracy and user annotation amount under fair comparisons. The reason is that TCAL mines only a minority of informative samples and cannot provide sufficient training data for feature learning in the deep image classification scenario. Hence, our CEAL has a competitive advantage in deep image classification tasks. To analyze CEAL more closely and justify the effectiveness of its components, we conducted several further experiments, discussed in the following subsections.

2) Component Analysis: To justify that the proposed CEAL works consistently well with the common informative sample selection criteria, we implement three variants of CEAL that use least confidence (LC), margin sampling (MS), and entropy (EN) to assess uncertain samples.
These three variants are denoted CEAL_LC, CEAL_MS, and CEAL_EN. Meanwhile, to show the raw performance of these criteria, we discard the cost-effective high-confidence sample selection from the above variants and denote the resulting versions AL_LC, AL_MS, and AL_EN. To clarify the contribution of our proposed strategy of pseudo-labeling the majority of high-confidence samples, we further introduce this strategy into AL_RAND and denote the variant CEAL_RAND. Since AL_RAND selects the samples to be annotated at random, CEAL_RAND reflects the genuine contribution of the pseudo-labeling strategy, i.e., CEAL_RAND denotes the method that relies only on the pseudo-labeled majority samples.

Fig. 4 illustrates the results of these variants on CACD (first row) and Caltech-256 (second row). The results demonstrate that, given the same percentage of labeled samples, CEAL_RAND, which simply exploits the pseudo-labeled majority samples, obtains a performance gain over AL_RAND similar to that of AL_LC, AL_MS, and AL_EN, which employ informative sample selection criteria. This justifies that our proposed strategy of pseudo-labeling the majority samples is as effective as the common informative sample selection criteria. Moreover, as one can see in Fig. 4, CEAL_LC, CEAL_MS, and CEAL_EN all consistently outperform the pure pseudo-labeling version CEAL_RAND, as well as their counterparts without pseudo-labeled samples, AL_LC, AL_MS, and AL_EN, by a clear margin on both the CACD and Caltech-256 datasets. This validates that our proposed majority pseudo-labeling strategy is complementary to the common informative sample selection criteria and can further significantly improve recognition performance.

Fig. 4. Extensive study of different informative sample selection criteria on the CACD (first row) and Caltech-256 (second row) datasets. The criteria include least confidence (LC, first column), margin sampling (MS, second column), and entropy (EN, third column). One can observe that our CEAL framework works consistently well with the common informative sample selection criteria.

To analyze the choice of informative sample selection criteria, we compare the three above-mentioned criteria.
We also make an attempt to simply combine them. Specifically, in each iteration we select the top K/2 samples according to each criterion. From the 3K/2 samples thus obtained, we remove the repeated ones (i.e., some samples may be selected by all three criteria at the same time) and then randomly select K samples from the remainder for user annotation. We denote this method CEAL_FUSION; a sketch of this combination step follows below.

Fig. 5 illustrates that CEAL_LC, CEAL_MS, and CEAL_EN have similar performance, while CEAL_FUSION performs better. This demonstrates that the informative sample selection criterion still plays an important role in improving recognition accuracy: though a minority, the informative samples have great potential impact on the classifier.

Fig. 5. Comparison between different informative sample selection criteria and their fusion (CEAL_FUSION) on the CACD (left) and Caltech-256 (right) datasets.
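A minimal sketch of the CEAL_FUSION selection step, under the assumptions stated (the helper logic and names are ours, not the authors'):

```python
import numpy as np

def fusion_select(probs, K, seed=0):
    """Combine the three criteria as in CEAL_FUSION: take the top K/2 of each,
    remove duplicates, then randomly keep K samples for user annotation."""
    rng = np.random.default_rng(seed)
    half = K // 2
    lc = np.argsort(probs.max(axis=1))[:half]            # least confidence, ascending
    part = np.sort(probs, axis=1)
    ms = np.argsort(part[:, -1] - part[:, -2])[:half]    # margin sampling, ascending
    en = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    ent = np.argsort(en)[::-1][:half]                    # entropy, descending
    pool = np.unique(np.concatenate([lc, ms, ent]))      # remove repeated samples
    take = min(K, pool.size)
    return rng.choice(pool, size=take, replace=False)
```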
Kanevski, and W. J. Emery, Fig.7.Notethatthetestrangeofδ anddr issettoensurethe “Correction to ”active learning methods for remote sensing image majority high confidence assumption of this paper. Though classification”,”IEEETransactionsonGeoscienceandRemoteSensing, vol.48,no.6,pp.2767–2767,2010. the range of {δ, dr} seems to be narrow from the value, it [18] S.TongandD.Koller,“Supportvectormachineactivelearningwithap- leads to the significant difference: about 10% 60% samples plicationstotextclassification,”JournalofMachineLearningResearch, arepseudo-labeledinhighconfidencesampleselection.Lower vol.2,pp.45–66,2001. [19] A. McCallum and K. Nigam, “Employing EM and pool-based active standard deviation of the accuracy in Fig. 7 proves that the learningfortextclassification,”inICML,1998. choice of these parameters does not significantly affect the [20] G. Schohn and D. Cohn, “Less is more: Active learning with support overall system performance. vectormachines,”inICML,2000. [21] S.VijayanarasimhanandK.Grauman,“Large-scaleliveactivelearning: Trainingobjectdetectorswithcrawleddataandcrowds,”inCVPR,2011. V. CONCLUSIONS [22] A. G. Hauptmann, W. Lin, R. Yan, J. Yang, and M. Chen, “Extreme video retrieval: joint maximization of human and computer perfor- In this paper, we propose a cost-effective active learning mance,”inACMMultimedia,2006. frameworkfordeepimageclassificationtasks,whichemploys [23] E. Elhamifar, G. Sapiro, A. Yang, and S. S. Sastry, “A convex opti- a complementary sample selection strategy: Progressively mization framework for active learning,” in International Conference onComputerVision(ICCV),2013. select minority most informative samples and pseudo-label [24] X. Li and Y. Guo, “Adaptive active learning for image classification,” majority high confidence samples for model updating. In inCVPR,2013. such a holistic manner, the minority labeled samples benefit [25] A.Krizhevsky,I.Sutskever,andG.E.Hinton,“Imagenetclassification withdeepconvolutionalneuralnetworks,”inNIPS,2012. the decision boundary of classifier and the majority pseudo- [26] D.C.Ciresan,U.Meier,andJ.Schmidhuber,“Multi-columndeepneural labeled samples provide sufficient training data for robust networksforimageclassification,”inCVPR,2012. feature learning. Extensive experiment results on two pub- [27] J.Deng,W.Dong,R.Socher,L.-J.Li,K.Li,andL.Fei-Fei,“Imagenet: Alarge-scalehierarchicalimagedatabase,”inCVPR,2009. lic challenging benchmarks justify the effectiveness of our [28] K.SimonyanandA.Zisserman,“Verydeepconvolutionalnetworksfor proposed CEAL framework. In future works, we plan to large-scaleimagerecognition,”inICLR,2015. IEEETRANSACTIONSONCIRCUITSANDSYSTEMSFORVIDEOTECHNOLOGY 10 [29] L.Jiang,D.Meng,Q.Zhao,S.Shan,andA.G.Hauptmann,“Self-paced Ya Li is a lecturer in School of Computer Science curriculumlearning,”inAAAI,2015. andEducationalSoftwareatGuangzhouUniversity, [30] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, China.ShereceivedtheB.E.degreefromZhengzhou “Self-pacedlearningformatrixfactorization,”inAAAI,2015. University, Zhengzhou, China, in 2002, M.E. de- [31] L.Jiang,D.Meng,T.Mitamura,andA.G.Hauptmann,“Easysamples greefromSouthwestJiaotongUniversity,Chengdu, first:Self-pacedrerankingforzero-examplemultimediasearch,”inACM China,in2006andPh.D.degreefromSunYat-sen Multimedia,2014. University,Guangzhou,China,in2015.Herresearch [32] L. Jiang, D. Meng, S. Yu, Z. Lan, S. Shan, and A. G. Hauptmann, focusesoncomputervisionandmachinelearning. “Self-pacedlearningwithdiversity,”inNIPS,2014. 
[33] B.Settles,“Activelearningliteraturesurvey,”UniversityofWisconsin– Madison,ComputerSciencesTechnicalReport1648,2009. [34] T.Scheffer,C.Decomain,andS.Wrobel,“Activehiddenmarkovmodels forinformationextraction,”inIDA,2001. [35] C. E. Shannon, “A mathematical theory of communication,” Mobile ComputingandCommunicationsReview,vol.5,2001. [36] M.P.Kumar,B.Packer,andD.Koller,“Self-pacedlearningforlatent variablemodels,”inNIPS,2010. [37] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,”inICML,2009. [38] Y.Freund,H.S.Seung,E.Shamir,andN.Tishby,“Selectivesampling using the query by committee algorithm,” Machine Learning, vol. 28, no.2-3,pp.133–168,1997. [39] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, “Scalable active learningformulticlassimageclassification,”IEEETrans.PatternAnal. Mach.Intell.,vol.34,no.11,pp.2259–2273,2012. Ruimao Zhang received the B.E. degree in the [40] K. Brinker, “Incorporating diversity in active learning with support School of Software from Sun Yat-Sen University vectormachines,”inICML,2003. (SYSU) in 2011. Now he is a Ph.D. candidate [41] X. Xiong and F. D. la Torre, “Supervised descent method and its of Computer Science in the School of Information applicationstofacealignment,”inCVPR,2013. Science and Technology, Sun Yat-Sen University, [42] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Guangzhou, China. From 2013 to 2014, he was a boltzmannmachines,”inICML,2010. visitingPh.D.studentwiththeDepartmentofCom- [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, puting,HongKongPolytechnicUniversity(PolyU). Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and His current research interests are computer vision, L. Fei-Fei, “Imagenet large scale visual recognition challenge,” CoRR, pattern recognition, machine learning and related vol.abs/1409.0575,2014. applications. [44] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visualrecognition,”inProceedingsofthe31thInternationalConference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 2014. [45] Y.Jia,E.Shelhamer,J.Donahue,S.Karayev,J.Long,R.B.Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fastfeatureembedding,”inACMMultimedia,2014. Liang Lin is a Professor with the School of KezeWangreceivedtheBSdegreeinsoftwareen- computerscience,SunYat-SenUniversity(SYSU), gineeringfromSunYat-SenUniversity,Guangzhou, China.HereceivedtheB.S.andPh.D.degreesfrom China, in 2012. He is currently pursuing a double the Beijing Institute of Technology (BIT), Beijing, Ph.D.degreeincomputerscienceandtechnologyat China, in 1999 and 2008, respectively. From 2008 SunYat-SenUniversityandHongKongPolytechnic to 2010, he was a Post-Doctoral Research Fellow University,advisedbyProfessorLiangLinandLei with the Department of Statistics, University of Zhang. His current research interests include com- California, Los Angeles UCLA. He worked as a putervisionandmachinelearning. VisitingScholarwiththeDepartmentofComputing, HongKongPolytechnicUniversity,HongKongand with the Department of Electronic Engineering at theChineseUniversityofHongKong.Hisresearchfocusesonnewmodels, algorithmsandsystemsforintelligentprocessingandunderstandingofvisual datasuchasimagesandvideos.Hehaspublishedmorethan100papersintop tier academic journals and conferences. He currently serves as an associate editor of IEEE Tran. Human-Machine Systems. 
He received the Best Paper Runners-Up Award in ACM NPAR 2010, Google Faculty Award in 2012, Dongyu Zhang isaresearchassociatewithschool Best Student Paper Award in IEEE ICME 2014, and Hong Kong Scholars of data and computer science, Sun-Yat-Sen Uni- Awardin2014. versity (SYSU). He received his B.S. degree and Ph.D.inHarbinInstituteofTechnologyUniversity, Heilongjiang,China,in2003and2010.Hiscurrent research interests include computer vision and ma- chinelearning.
