Regularized maximum correntropy machine

Jim Jing-Yan Wang^a, Yunji Wang^b, Bing-Yi Jing^c, Xin Gao^{a,*}

^a Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
^b Electrical and Computer Engineering Department, The University of Texas at San Antonio, San Antonio, TX 78249, USA
^c Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong

arXiv:1501.04282v1 [cs.LG] 18 Jan 2015

Abstract

In this paper we investigate the usage of the regularized correntropy framework for learning classifiers from noisy labels. The class label predictors learned by minimizing traditional loss functions are sensitive to noisy and outlying labels of the training samples, because the traditional loss functions are applied equally to all the samples. To solve this problem, we propose to learn the class label predictors by maximizing the correntropy between the predicted labels and the true labels of the training samples, under the regularized Maximum Correntropy Criteria (MCC) framework. Moreover, we regularize the predictor parameter to control the complexity of the predictor. The learning problem is formulated by an objective function considering the parameter regularization and MCC simultaneously. By optimizing the objective function alternately, we develop a novel predictor learning algorithm. The experiments on two challenging pattern classification tasks show that it significantly outperforms the machines with traditional loss functions.

Keywords: Pattern classification, Label noise, Maximum Correntropy criteria, Regularization

* Correspondence should be addressed to Xin Gao. Tel: +966-12-8080323.

Preprint submitted to Elsevier, January 20, 2015.

1. Introduction

The design of classification machines has been a basic problem in the pattern recognition field. It tries to learn an effective predictor to map the feature vector of a sample to its class label [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. We study the supervised multi-class learning problem with $L$ classes. Suppose we have a training set denoted as $\mathcal{D} = \{(x_i, y_i)\}, i = 1, \cdots, N$, where $x_i = [x_{i1}, \cdots, x_{iD}]^\top \in \mathbb{R}^D$ is the $D$-dimensional feature vector of the $i$-th training sample, and $y_i \in \{1, \cdots, L\}$ is the class label of the $i$-th training sample. Moreover, we also denote the label indicator matrix as $Y = [Y_{li}] \in \mathbb{R}^{L \times N}$, where $Y_{li} = 1$ if $y_i = l$, and $Y_{li} = -1$ otherwise. We try to learn $L$ class label predictors $\{f^l_\theta(x)\}, l = 1, \cdots, L$ for the multi-class learning problem, where $f^l_\theta(x)$ is the predictor for the $l$-th class and $\theta$ is its parameter. Given a sample $x_i$, the output of the $l$-th predictor is denoted as $f^l_\theta(x_i)$, and we further denote the prediction result matrix as $F_\theta = [F_{\theta li}] \in \mathbb{R}^{L \times N}$, where $F_{\theta li} = f^l_\theta(x_i)$. To make the prediction as precise as possible, the target of predictor learning is to learn the parameter $\theta$ so that the difference between the true class labels of the training samples in $Y$ and the prediction results in $F_\theta$ is minimized, while keeping the complexity of the predictor as low as possible. To measure how well the prediction results fit the true class label indicator, several loss functions $L(F_\theta, Y)$ could be considered to compare the prediction results in $F_\theta$ against the true class labels of the training samples in $Y$, such as the 0-1 loss function, the square loss function, the hinge loss function, and the logistic loss function. We summarize various loss functions in Table 1. These loss functions have been used widely in various learning problems.
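To make the notation concrete, the following is a minimal sketch in Python with NumPy (the function name is ours, chosen for illustration) that builds the label indicator matrix $Y \in \mathbb{R}^{L \times N}$ defined above from a vector of class labels $y_i \in \{1, \cdots, L\}$:

import numpy as np

def label_indicator_matrix(y, L):
    """Build Y in R^{L x N} with Y[l, i] = 1 if y[i] == l+1, else -1 (labels are 1-based)."""
    y = np.asarray(y)
    N = len(y)
    Y = -np.ones((L, N))
    Y[y - 1, np.arange(N)] = 1.0
    return Y

# Example: N = 4 samples, L = 3 classes
y = [1, 3, 2, 3]
Y = label_indicator_matrix(y, L=3)
# Y is a 3 x 4 matrix of +/-1 entries with exactly one +1 per column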
One common feature of these loss functions is that a sample-wise loss function is applied to each training sample equally, and then the losses of all the samples are summed up to obtain the final overall loss. The sample-wise loss functions are of exactly the same form with the same parameters (if they have parameters). The basic assumption behind this type of loss function is that all training samples are of the same importance.

Table 1: Various empirical loss functions for predictor learning. In the matrix forms, $\circ$ denotes the element-wise product of two matrices, $1_{N \times L}$ is an $N \times L$ matrix with all elements equal to one, and $1_N \in \mathbb{R}^N$ is a column vector of ones.

0-1 Loss
  Formula: $\sum_{i,l} I[F_{\theta li} Y_{li} < 0]$, where $I(\cdot)$ is the indicator function, equal to 1 if its argument is true and 0 otherwise.
  Notes: The 0-1 loss function is NP-hard to optimize, non-smooth and non-convex.

Square Loss
  Formula: $\sum_{i,l} [F_{\theta li} - Y_{li}]^2 = \|F_\theta - Y\|^2$.
  Notes: The square loss function is a convex upper bound on the 0-1 loss. It is smooth and convex, thus easy to optimize.

Hinge Loss
  Formula: $\sum_{i,l} [1 - F_{\theta li} Y_{li}]_+ = 1_N^\top \left[1_{N \times L} - F_\theta \circ Y\right]_+ 1_L$, where $[x]_+ = \max(0, x)$.
  Notes: The hinge loss function is not smooth, but subgradient descent can be used to optimize it. It is the most common loss function in SVM.

Logistic Loss
  Formula: $\sum_{i,l} \ln\left[1 + e^{-F_{\theta li} Y_{li}}\right] = 1_N^\top \ln\left[1_{N \times L} + e^{-F_\theta \circ Y}\right] 1_L$.
  Notes: This loss function is also smooth and convex, and is usually used in regression problems.

However, due to the limitations of the sampling technology and the noise that occurs during the sampling procedure, there are some noisy and outlying samples in real-world applications. If we use the traditional loss functions listed in Table 1, the noisy and outlying training samples will play even more important roles than the good samples. Thus the predictors learned by minimizing the traditional loss functions are not robust to the noisy and outlying training samples, and could bring a high error rate when applied to the prediction of test samples.

Recently, the regularized correntropy framework has been proposed for robust pattern recognition problems [11, 12, 13, 14]. In [15], He et al. argued that the classical mean square error (MSE) criterion is sensitive to outliers, and introduced correntropy to improve the robustness of the representation. Moreover, an $\ell_1$ regularization scheme is imposed on the correntropy to learn robust and sparse representations. Inspired by their work, we propose to use the regularized correntropy as a criterion to compare the prediction results and the true class labels. We use correntropy to compare the predicted labels and the true labels, instead of comparing the feature of a test sample and its reconstruction from the training samples as in He et al.'s work. Moreover, an $\ell_2$ norm regularization is introduced to control the complexity of the predictor. In this way, the predictor learned by maximizing the correntropy between the prediction results and the true labels will be robust to the noisy and outlying training samples. The proposed classification Machine Maximizing the Regularized CorrEntropy, called RegMaxCEM, is thus expected to be more insensitive to outlying samples than the machines with traditional loss functions. Yang et al. [16] also proposed to use correntropy to compare predicted class labels and true labels. However, in their framework, the target is to learn the class labels of the unlabeled samples in a transductive, semi-supervised manner, while we try to learn the parameters of the class label predictor in a supervised manner.
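Returning to Table 1 for a moment, a minimal NumPy sketch of those sample-wise losses follows (function names are ours). Every entry of $F_\theta$ and $Y$ is treated identically, which is exactly the equal-importance assumption that makes these losses sensitive to noisy labels:

import numpy as np

def zero_one_loss(F, Y):
    # sum_{i,l} I[F_{li} Y_{li} < 0]
    return np.sum(F * Y < 0)

def square_loss(F, Y):
    # sum_{i,l} (F_{li} - Y_{li})^2
    return np.sum((F - Y) ** 2)

def hinge_loss(F, Y):
    # sum_{i,l} max(0, 1 - F_{li} Y_{li})
    return np.sum(np.maximum(0.0, 1.0 - F * Y))

def logistic_loss(F, Y):
    # sum_{i,l} ln(1 + exp(-F_{li} Y_{li}))
    return np.sum(np.log1p(np.exp(-F * Y)))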
The rest of this paper is structured as follows: In Section 2, we propose the regularized maximum correntropy machine by constructing an objective function based on the maximum correntropy criterion (MCC) and developing an expectation-maximization (EM) based alternating algorithm for its optimization. In Section 3, the proposed method is validated by conducting extensive experiments on two challenging pattern classification tasks. Finally, we give the conclusion in Section 4.

2. Regularized Maximum Correntropy Machine

In this section we introduce the classification machine that maximizes the correntropy between the predicted class labels and the true class labels, while keeping the solution as simple as possible.

2.1. Objective Function

To design the predictors $f^l_\theta(x)$, we first represent the data sample $x$ as $\tilde{x}$ in the linear space or the kernel space as

$$\tilde{x} = \begin{cases} x, & \text{(linear)}, \\ K(\cdot, x), & \text{(kernel)}, \end{cases} \qquad (1)$$

where $K(\cdot, x) = [K(x_1, x), \cdots, K(x_N, x)]^\top \in \mathbb{R}^N$ and $K(x_i, x_j)$ is a kernel function between $x_i$ and $x_j$. Then a linear predictor $f^l_\theta(x)$ is designed to predict whether the sample belongs to the $l$-th class as

$$f^l_\theta(x) = w_l^\top \tilde{x} + b_l, \quad l = 1, \cdots, L, \qquad (2)$$

where $\theta = \{(w_l, b_l)\}_{l=1}^{L}$ is the set of predictor parameters, $w_l \in \mathbb{R}^D$ is the linear coefficient vector and $b_l \in \mathbb{R}$ is a bias term for the $l$-th predictor. The target of predictor design is to find the optimal parameters such that the prediction result $f^l_\theta(x_i)$ of the $i$-th sample fits its true class label indicator $Y_{li}$ as well as possible, while keeping the solution as simple as possible. To this end, we consider the following two problems simultaneously when designing the objective function.

Prediction Accuracy Criterion based on Correntropy. To consider the prediction accuracy, we could learn the predictor parameters by minimizing a loss function listed in Table 1 as

$$\min_\theta L(F_\theta, Y). \qquad (3)$$

However, as mentioned in Section 1, all these loss functions are applied to all the training samples equally, which is not robust to noisy and outlying samples. To handle this problem, instead of minimizing a loss function to learn the predictor, we use the MCC [11] framework to learn the predictor by maximizing the correntropy between the predicted results and the true labels.

Remark 1: In previous studies, it has been claimed that the MCC is insensitive to outliers. For example, in [11], it is claimed that "the maximum correntropy criterion, ... is much more insensitive to outliers." Based on this fact, we assume that the predictors developed based on MCC should also be insensitive to outliers.

Correntropy is a generalized similarity measure between two arbitrary random variables $A$ and $B$. However, the joint probability density function of $A$ and $B$ is usually unknown, and only a finite number of samples of them are available as $\{(a_i, b_i)\}_{i=1}^{d}$. This leads to the following sample estimator of correntropy:

$$V(A, B) = \frac{1}{d} \sum_{i=1}^{d} g_\sigma(a_i - b_i), \qquad (4)$$

where $g_\sigma(a_i - b_i) = \exp\left(-\frac{(a_i - b_i)^2}{2\sigma^2}\right)$ is a Gaussian kernel function and $\sigma$ is a kernel width parameter. For a learning system, MCC is defined as

$$\max_\vartheta \frac{1}{d} \sum_{i=1}^{d} g_\sigma(a_i - b_i), \qquad (5)$$

where $\vartheta$ is the parameter to be optimized in the criterion, so that $B$ becomes as correlated to $A$ as possible.

Remark 2: $\vartheta$ is usually a parameter that defines $B$, not the kernel width parameter $\sigma$. In the learning system, we try to learn $\vartheta$ so that, with the learned $\vartheta$, $B$ is correlated to $A$. For example, in this case, $A$ is the true class label matrix while $B$ is the predicted class label matrix, and $\vartheta$ is the predictor parameter that defines $B$.
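A minimal NumPy sketch of the sample estimator of correntropy in (4) follows (function names are ours). The small example at the end illustrates the robustness property quoted in Remark 1: a single large outlier barely changes the correntropy, whereas it dominates the mean square error.

import numpy as np

def gaussian_kernel(e, sigma):
    # g_sigma(e) = exp(-e^2 / (2 sigma^2)), applied element-wise
    return np.exp(-(e ** 2) / (2.0 * sigma ** 2))

def correntropy(a, b, sigma):
    # V(A, B) = (1/d) * sum_i g_sigma(a_i - b_i), cf. Eq. (4)
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.mean(gaussian_kernel(a - b, sigma))

a = np.array([1.0, 1.0, 1.0, 1.0])
b = np.array([1.1, 0.9, 1.0, 100.0])   # the last pair is an outlier
print(correntropy(a, b, sigma=1.0))    # close to 0.75: the outlier contributes almost nothing
print(np.mean((a - b) ** 2))           # the MSE is dominated by the outlier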
To adapt the MCC framework to the predictor learning problem, we let $A$ be the prediction result matrix $F_\theta$ parameterized by $\theta$, and $B$ be the true class label matrix $Y$, and we want to find the predictor parameter $\theta$ such that $F_\theta$ becomes as correlated to $Y$ as possible under the MCC framework. Then the following correntropy-based predictor learning model is obtained:

$$\max_\theta V(F_\theta, Y), \quad V(F_\theta, Y) = \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} g_\sigma(F_{\theta li} - Y_{li}). \qquad (6)$$

Please notice that in [11], MCC is used to measure the similarity between a test sample and its sparse linear representation over the training samples, while in this work it is used to measure the similarity between the predicted class labels and the true labels. Also note that the dependence on $\sigma$ in (6), and later in (8) and (11), comes from the kernel function $g_\sigma(\cdot)$. In our experiments, the $\sigma$ value is calculated as $\sigma = \frac{1}{2 \times L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} \|F_{\theta li} - Y_{li}\|_2^2$, following [11].

Predictor Regularization. To control the complexity of the $l$-th predictor independently, we introduce the $\ell_2$-based regularizer $\|w_l\|^2$ on the coefficient vector $w_l$ of the $l$-th predictor. We assume that the predictors of different classes are equally important, and the following regularizer is introduced for the multi-class learning problem:

$$\min_{\{w_l\}_{l=1}^{L}} \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2. \qquad (7)$$

Remark 3: The $\ell_2$ norm is also used by support vector regression as a measure of model complexity. However, in support vector classification, this regularization term is obtained either from a "maximal margin" regularization or from a "maximal robustness" regularization for certain types of feature noise [17]. Thus our $\ell_2$ norm regularization term can also be regarded as a term seeking maximal margin or robustness.

Remark 4: In our model, the $\ell_2$-regularization is used instead of the $\ell_1$-regularization. Using $\ell_1$-regularization we can seek sparsity of the predictor coefficient vector, but it cannot guarantee the minimal model complexity, maximal margin or maximal robustness like the $\ell_2$-regularization; thus we choose the $\ell_2$-regularization. In the future, we will explore the usage of $\ell_1$-regularization to see if the prediction results can be improved.

By substituting $\theta = \{(w_l, b_l)\}_{l=1}^{L}$ and $F_{\theta li} = f^l_{w_l, b_l}(x_i)$, and combining the predictor regularization term in (7) with the correntropy-based prediction accuracy criterion in (6), we obtain the following maximization problem for the maximum correntropy machine:

$$\max_{\{(w_l, b_l)\}_{l=1}^{L}} \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} g_\sigma\left(f^l_{w_l, b_l}(x_i) - Y_{li}\right) - \alpha \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2, \qquad (8)$$

where $\alpha$ is a tradeoff parameter. This optimization problem is based on correntropy with a Gaussian kernel function $g_\sigma(x)$. It treats the predictions of individual training samples of individual classes differently. In this way, we can give more emphasis to samples with correctly predicted class labels, while noisy or outlying training samples make only small contributions to the correntropy. In fact, once the regularizer term is introduced, (8) is a case of the regularized correntropy framework [15].

2.2. Optimization

Due to the nonlinearity of the kernel function $g_\sigma(x)$ in the objective function in (8), direct optimization is difficult. A property of the kernel function $g_\sigma(x)$ is that its derivative contains the same kernel function, so if we set the derivative to zero to seek the optimum of the objective, it is not easy to obtain a closed-form solution.
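Before developing the optimization, it may help to see the objective in (8) concretely. The following minimal sketch (NumPy; function and variable names are ours) only evaluates the regularized correntropy objective under the linear representation, using the $\sigma$ heuristic given above; it does not optimize it.

import numpy as np

def regmaxcem_objective(W, b, X, Y, alpha):
    """Evaluate the objective in Eq. (8).
    W: L x D coefficient matrix (rows are w_l), b: length-L bias vector,
    X: D x N data matrix (columns are x_i), Y: L x N label indicator matrix."""
    L, N = Y.shape
    F = W @ X + b[:, None]                      # F_{theta,li} = w_l^T x_i + b_l
    R = F - Y                                   # residuals
    sigma = np.sum(R ** 2) / (2.0 * L * N)      # sigma heuristic following [11]
    corr = np.sum(np.exp(-(R ** 2) / (2.0 * sigma ** 2))) / (L * N)
    reg = np.sum(W ** 2) / L
    return corr - alpha * reg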
However, according to the properties of convex conjugate functions, we have:

Proposition 1: There exists a convex conjugate function $\varphi$ of $g_\sigma(x)$ such that

$$g_\sigma(x) = \max_p \left(p \|x\|^2 - \varphi(p)\right), \qquad (9)$$

and for a fixed $x$, the maximum is reached at $p = -g_\sigma(x)$. This proposition is taken from [18] and is derived from the theory of convex conjugate functions. It is further discussed and used in many applications, such as [11, 15, 19, 20].

By substituting (9) into (8), we obtain the augmented optimization problem in an enlarged parameter space:

$$\begin{aligned}
\max_{\{(w_l, b_l)\}_{l=1}^{L},\, P} \; & \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left[ P_{li} \left\| f^l_{w_l, b_l}(x_i) - Y_{li} \right\|^2 - \varphi(P_{li}) \right] - \alpha \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2 \\
= \; & \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left[ P_{li} \left\| w_l^\top \tilde{x}_i + b_l - Y_{li} \right\|^2 - \varphi(P_{li}) \right] - \alpha \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2,
\end{aligned} \qquad (10)$$

where $P = [P_{li}] \in \mathbb{R}^{L \times N}$ is the auxiliary variable matrix. To optimize (10), we adopt the EM framework to solve for $P$ and $\{(w_l, b_l)\}_{l=1}^{L}$ alternately.

2.2.1. Expectation Step

In the expectation step of the EM algorithm, we calculate the auxiliary variable matrix $P$ while fixing $\theta$. According to Proposition 1, the maximum of (10) is reached at

$$P = -g_\sigma(F_\theta - Y), \quad \text{i.e.,} \quad P_{li} = -g_\sigma\left(w_l^\top \tilde{x}_i + b_l - Y_{li}\right), \qquad (11)$$

where $g_\sigma(\cdot)$ applied to a matrix denotes the element-wise Gaussian function. With fixed predictor parameters, the auxiliary variable $-P_{li}$ can be regarded as the confidence of the prediction result of the $i$-th training sample with regard to the $l$-th class: the better the $l$-th prediction result of the $i$-th sample fits the true label $Y_{li}$, the larger $-P_{li}$ will be.

Remark 5: It is interesting to ask whether there is any relation between the auxiliary variables in $P$ and the slack variables in SVM. Actually, both can be viewed as measures of classification loss. The slack variables in SVM are upper bounds of the hinge losses of the training samples, while the auxiliary variables in $P$ form a dissimilarity measure between the predicted labels and the true labels under the MCC framework, which is also a loss function. Meanwhile, the auxiliary variables in $P$ also play the role of weights of different training samples, as in (10), so that the learning can be robust to noisy labels; the slack variables in SVM do not have such a function.

Remark 6: In the expectation step, we actually perform an alternating optimization step that solves for $P$ while fixing $\{(w_l, b_l)\}_{l=1}^{L}$. However, according to Proposition 1, the solution of this optimization problem has the closed form (11), which can be calculated directly, making it an expectation step of the EM algorithm.

2.2.2. Maximization Step

In the maximization step of the EM algorithm, we solve for the predictor parameters $\{(w_l, b_l)\}_{l=1}^{L}$ while fixing $P$. The optimization problem in (10) turns into

$$\max_{\{(w_l, b_l)\}_{l=1}^{L}} \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left[ P_{li} \left\| w_l^\top \tilde{x}_i + b_l - Y_{li} \right\|^2 - \varphi(P_{li}) \right] - \alpha \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2. \qquad (12)$$

Noticing that $P_{li} < 0$ and removing terms irrelevant to $w_l$ and $b_l$, the maximization problem in (12) can be reformulated as the following dual minimization problem:

$$\begin{aligned}
& \min_{\{(w_l, b_l)\}_{l=1}^{L}} O(w_1, b_1, \cdots, w_L, b_L), \\
& O(w_1, b_1, \cdots, w_L, b_L) = \frac{1}{L \times N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left( -P_{li} \left\| w_l^\top \tilde{x}_i + b_l - Y_{li} \right\|^2 \right) + \alpha \frac{1}{L} \sum_{l=1}^{L} \|w_l\|^2.
\end{aligned} \qquad (13)$$

To simplify the notation, we define a vector $u_l = [u_{l1}, \cdots, u_{lN}]^\top \in \mathbb{R}^N$ such that $u_{li}^2 = -\frac{1}{N} P_{li}$.
With $u_l$, the objective function in (13) can be rewritten as

$$\begin{aligned}
O(w_1, b_1, \cdots, w_L, b_L) &= \frac{1}{L} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left\| u_{li} \left( w_l^\top \tilde{x}_i + b_l - Y_{li} \right) \right\|^2 + \alpha \|w_l\|^2 \right] \\
&= \frac{1}{L} \sum_{l=1}^{L} \left[ \left( w_l^\top \tilde{X}_l + b_l u_l^\top - \tilde{Y}_l \right) \left( w_l^\top \tilde{X}_l + b_l u_l^\top - \tilde{Y}_l \right)^\top + \alpha w_l^\top w_l \right],
\end{aligned} \qquad (14)$$

where $\tilde{X}_l = [u_{l1} \tilde{x}_1, \cdots, u_{lN} \tilde{x}_N] \in \mathbb{R}^{D \times N}$ is the matrix containing all the training sample feature vectors weighted by $u_l$, and $\tilde{Y}_l = [u_{l1} Y_{l1}, \cdots, u_{lN} Y_{lN}] \in \mathbb{R}^N$ is the vector of class label indicators of the $l$-th class weighted by $u_l$.
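To summarize the optimization procedure developed above, the following is a minimal sketch of the EM-style alternation under the linear representation. The M-step here solves each per-class weighted ridge regression in (13) through its normal equations; this solver choice is ours, not necessarily the exact closed-form update derived from (14), and all function and variable names are illustrative.

import numpy as np

def fit_regmaxcem(X, Y, alpha=0.1, n_iter=20):
    """EM-style alternation for the objective in Eq. (8), linear representation.
    X: D x N data matrix (columns are samples), Y: L x N label indicator matrix (+1/-1).
    Returns W (L x D, rows are w_l) and b (length-L bias vector)."""
    D, N = X.shape
    L = Y.shape[0]
    W = np.zeros((L, D))
    b = np.zeros(L)
    Xa = np.vstack([X, np.ones((1, N))])                     # append a constant row for the bias
    for _ in range(n_iter):
        # E-step, Eq. (11): P_{li} = -g_sigma(w_l^T x_i + b_l - Y_{li})
        R = W @ X + b[:, None] - Y
        sigma = max(np.sum(R ** 2) / (2.0 * L * N), 1e-12)   # sigma heuristic following [11]
        P = -np.exp(-(R ** 2) / (2.0 * sigma ** 2))
        # M-step, Eqs. (13)-(14): per class, weighted ridge regression with weights u_{li}^2 = -P_{li}/N
        for l in range(L):
            u2 = -P[l] / N                                   # positive sample weights for class l
            A = (Xa * u2) @ Xa.T                             # weighted Gram matrix, (D+1) x (D+1)
            reg = alpha * np.eye(D + 1)
            reg[-1, -1] = 0.0                                # leave the bias b_l unregularized
            z = (Xa * u2) @ Y[l]                             # weighted cross term
            wb = np.linalg.solve(A + reg, z)
            W[l], b[l] = wb[:D], wb[D]
    return W, b

# Usage sketch: W, b = fit_regmaxcem(X, Y, alpha=0.1); the scores of a new sample x are W @ x + b,
# and the predicted class is the one with the largest score.

In each iteration, the E-step down-weights the training samples whose predictions deviate strongly from their labels, which is how the learned predictors become robust to noisy and outlying labels.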
