Submitted to the Annals of Statistics

ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT

By Junlong Zhao†, Guan Yu‡ and Yufeng Liu§,∗
Beijing Normal University, China†; State University of New York at Buffalo, USA‡; University of North Carolina at Chapel Hill, USA§

Robustness is a desirable property for many statistical techniques. As an important measure of robustness, the breakdown point has been widely used for regression problems and many other settings. Despite the existing development, we observe that the standard breakdown point criterion is not directly applicable to many classification problems. In this paper, we propose a new breakdown point criterion, namely the angular breakdown point, to better quantify the robustness of different classification methods. Using this new breakdown point criterion, we study the robustness of binary large margin classification techniques, although the idea is applicable to general classification methods. Both bounded and unbounded loss functions with linear and kernel learning are considered. These studies provide useful insights on the robustness of different classification methods. Numerical results further confirm our theoretical findings.

∗The corresponding email: [email protected]. Supported by National Science Foundation of China Grants (No. 11471030, 61472475), the Fundamental Research Funds for the Central Universities and KLAS 130026507. Guan Yu and Yufeng Liu were supported in part by US NIH grants P01 CA-142538, R01 GM-126550, and NSF grants DMS-1407241 and IIS-1632951.

MSC 2010 subject classifications: Primary 62H30; secondary 62G35.
Keywords and phrases: breakdown point, classification, loss function, reproducing kernel Hilbert spaces, robustness.

1. Introduction. Classification problems are commonly seen in practice. There are numerous classification methods available in the literature; see Hastie et al. (2009) for a comprehensive review. Among the existing methods, large margin classification techniques, such as the Support Vector Machine (SVM) (Vapnik (1998)), have been extensively studied in recent years. Let $\mathcal{X}$ denote the domain of the $p$-dimensional vector of input variables $X$, and $\mathcal{Y}$ denote the class label set, which equals $\{-1, 1\}$ for binary classification. Assume that the training data $\{(X_i, Y_i), 1 \le i \le n\}$ are i.i.d. copies of $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ with the unknown distribution $P$.

Many classification methods can be formulated as solving an optimization problem. For binary classification, we aim to estimate a function $f(x): \mathbb{R}^p \to \mathbb{R}$ and use $\mathrm{sign}(f(x))$ as the classification rule. Typically, large margin techniques fit into the general regularization framework, which minimizes the objective function $n^{-1}\sum_{i=1}^{n} \ell(Y_i f(X_i)) + \lambda J(f)$, where $\ell(u)$ is the loss function, $Y_i \in \{1, -1\}$, $J(f)$ is the penalty function on $f$, and $\lambda$ is the tuning parameter. Many loss functions have been proposed and studied in the literature. In particular, the 0-1 loss $\ell(u) = 1(u \le 0)$ is the theoretical loss corresponding directly to the misclassification error. Due to the difficulty of minimizing the objective function with the 0-1 loss, one often uses surrogate loss functions in practice. The hinge loss $\ell(u) = (1-u)_+$ for the SVM (Vapnik (1998)), the exponential loss $\ell(u) = \exp(-u)$ for AdaBoost (Freund and Schapire (1997)), and the deviance loss $\ell(u) = \log(1 + \exp(-u))$ for penalized logistic regression (Lin et al. (2000)) are commonly used.
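To make these loss functions concrete, the following minimal Python sketch (the function names and the grid of margins are our own illustrative choices) evaluates the 0-1 loss and the three surrogate losses as functions of the functional margin $u = y f(x)$.

```python
import numpy as np

# Margin-based losses written as functions of the functional margin u = y * f(x).
def zero_one_loss(u):
    return (u <= 0).astype(float)      # 0-1 loss: 1(u <= 0)

def hinge_loss(u):
    return np.maximum(0.0, 1.0 - u)    # SVM: (1 - u)_+

def exponential_loss(u):
    return np.exp(-u)                  # AdaBoost: exp(-u)

def deviance_loss(u):
    return np.logaddexp(0.0, -u)       # logistic regression: log(1 + exp(-u)), computed stably

# Evaluate each loss on a small grid of margins.
margins = np.linspace(-2.0, 2.0, 5)
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                   ("exponential", exponential_loss), ("deviance", deviance_loss)]:
    print(f"{name:>12}:", np.round(loss(margins), 3))
```

Replacing the 0-1 loss by such convex surrogates is what makes the minimization of the regularized objective computationally tractable.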
In practice, outliers are often encountered, which can greatly reduce the effectiveness of various methods. Therefore, robustness is a very important consideration in statistical modeling. For classification problems, it has been observed numerically that classifiers with unbounded loss functions can be sensitive to outliers. For example, Biggio, Nelson and Laskov (2012) showed that a specifically selected outlier can significantly reduce the classification accuracy of the SVM. To overcome this problem, various techniques have been developed using bounded loss functions to achieve robustness (Shen et al. (2003); Liu and Shen (2006); Wu and Liu (2007)). Several other authors also proposed various robust variants of the SVM, such as Krause and Singer (2004); Xu, Crammer and Schuurmans (2006).

Robust classifiers are desirable for classification problems with potential outliers. However, a systematic comparison of different classification methods in terms of robustness is nontrivial. In the literature, there exist several robustness measures, such as qualitative robustness (Hampel (1971); Hable and Christmann (2011)), the influence function (Hampel (1974)), and the breakdown point (Hampel (1971)). As pointed out by Hable and Christmann (2011), qualitative robustness mainly concerns the equicontinuity of the estimator. The influence function describes the effects of small deviations (the local stability of a statistical procedure), whereas the breakdown point takes into account the global reliability and describes the effects of large deviations (Ronchetti (1997)). Despite being commonly used in various settings, these robustness measures are not sufficient for classification. For qualitative robustness, Hable and Christmann (2011) recently considered the qualitative robustness of the SVM. As to the influence function, Christmann and Steinwart (2004) considered the influence function in binary classification with convex losses. They showed that the influence function exists under some conditions; for example, when $\ell(u)$ is twice continuously differentiable and either $X$ or the kernel function $K(x, x)$ is bounded. For more general classification settings, such as $\ell(u)$ being non-differentiable or non-convex, not much work has been developed on the influence function.

For the breakdown point, since the introduction of this concept, it has been extended to various settings (Donoho and Huber (1983); Stromberg and Ruppert (1992); Sakata and White (1995); Genton and Lucas (2003); Hubert, Rousseeuw and Van Aelst (2008)). Among these works, the finite sample breakdown point (Donoho and Huber (1983)) is simple and has been widely used. Besides this popular criterion, Genton and Lucas (2003) introduced a more general definition of breakdown point for different settings, such as time series, nonlinear regression, etc. According to Genton and Lucas (2003), an estimator breaks down if the remaining uncontaminated observations no longer have any effect on the estimator. Despite the progress in different areas, research on the finite sample breakdown point in classification is limited. Kanamori, Fujiwara and Takeda (2014) developed a robust variant of the ν-SVM method (Schölkopf et al. (2000)) and considered the finite sample breakdown point of their method. In general, the breakdown point for classification problems has not yet been studied systematically.
To better understand the robustness of different classification methods, we consider the criterion of breakdown point in this paper. As will be shown in Section 2, the finite sample breakdown point, which is widely used in regression and other settings, is not suitable for classification problems in many cases. For classification, in contrast to the regression setting, the key effect of outliers is to change the classification boundary rather than the norm of the coefficients in the classification function. Motivated by this, we propose a new criterion, namely the angular breakdown point, to measure the robustness of classification methods. The proposed angular breakdown point, as an extension of the finite-sample breakdown point to classification problems, is also a measure of global reliability. We demonstrate that the proposed angular breakdown point provides new useful insights on robustness which cannot be obtained via the existing robustness measures. The angular breakdown point is studied for classification problems with bounded or unbounded loss functions. Our theoretical and numerical studies illustrate the robustness properties of different loss functions for both linear and kernel-based binary large margin classifiers. These results shed light on the potential advantages of bounded loss functions over unbounded ones.

The rest of this paper is organized as follows. In Section 2, we present the motivation for and definition of the angular breakdown point. In Section 3, we study the effect of outliers on linear classification and the theoretical properties of the angular breakdown point for binary classification with linear learning, where both bounded and unbounded loss functions are studied. In Section 4, the angular breakdown point for binary kernel learning with bounded or unbounded loss functions is considered. The simulation results and real data analysis are presented in Sections 5 and 6, respectively. In Section 7, we conclude this paper and discuss some potential applications of our proposed angular breakdown point criterion in data analysis. Selected proofs are shown in the Appendix. Other proofs are given in the online Supplementary Material (Zhao, Yu and Liu (2017)).

2. Motivation and definition of angular breakdown point. Let $Z_n = \{z_i : z_i = (x_i, y_i),\ i = 1, \dots, n\}$ denote $n$ i.i.d. samples of $Z = (X, Y)$. For motivation, we first consider linear classification with a classification function $f(x) = b + \beta^T x$, where $\beta \in \mathbb{R}^p$ and $b \in \mathbb{R}$. For a large margin classification method with a loss function $\ell(u)$, let $\tilde\beta_0 = (b_0, \beta_0^T)^T$ be the population optimizer, that is,
\[
\tilde\beta_0 = \arg\min_{b \in \mathbb{R},\, \beta \in \mathbb{R}^p} E_Z\big[\ell\big((b + \beta^T X) Y\big)\big].
\]
In practice, we estimate $b_0$ and $\beta_0$ by
\[
(2.1) \qquad (\hat b, \hat\beta) = \arg\min_{b, \beta}\ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i (b + \beta^T x_i)\big) + \lambda J(f),
\]
where $\lambda$ is a tuning parameter and $J(f)$ is a regularization term.

2.1. Motivation for angular breakdown point. One of the most popular measures for the robustness of an estimator is the replacement finite-sample breakdown point (FBP) (Donoho and Huber (1983)). If we use the FBP to measure the robustness of $\hat\beta$, the breakdown point is defined as
\[
(2.2) \qquad \epsilon^*(\hat\beta, Z_n) = \min\Big\{\frac{m}{n} : \sup_{\tilde Z_n} \|\hat\beta(\tilde Z_n) - \hat\beta(Z_n)\| = \infty\Big\},
\]
where $\tilde Z_n$ is the contaminated sample obtained by replacing $m$ of the original observations in $Z_n$ with arbitrary values, $\hat\beta(\tilde Z_n)$ is the estimate of $\beta$ using the contaminated sample $\tilde Z_n$, and $\|\cdot\|$ is the $\ell_2$ norm.
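As a concrete illustration of the estimator in (2.1) and of the quantity appearing in the FBP criterion (2.2), the following Python sketch fits $(\hat b, \hat\beta)$ by numerical optimization on a clean and on a contaminated sample and reports the norm difference $\|\hat\beta(\tilde Z_n) - \hat\beta(Z_n)\|$. The deviance loss, the ridge-type penalty $J(f) = \|\beta\|^2$, and the simulated data are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit_linear_classifier(x, y, loss, lam=0.01):
    """Sketch of the regularized estimator in (2.1), using a ridge-type
    penalty J(f) = ||beta||^2 as one illustrative choice."""
    n, p = x.shape

    def objective(theta):
        b, beta = theta[0], theta[1:]
        return np.mean(loss(y * (b + x @ beta))) + lam * beta @ beta

    res = minimize(objective, np.zeros(p + 1), method="BFGS")
    return res.x[0], res.x[1:]                     # (b_hat, beta_hat)

deviance = lambda u: np.logaddexp(0.0, -u)         # log(1 + exp(-u)), computed stably

# Clean sample Z_n: two Gaussian classes with means +/- (1, 0).
rng = np.random.default_rng(1)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * np.array([1.0, 0.0]) + rng.normal(size=(n, 2))
_, beta_clean = fit_linear_classifier(x, y, deviance)

# Contaminated sample Z~_n: replace one observation by an extreme outlier.
x_tilde, y_tilde = x.copy(), y.copy()
x_tilde[0], y_tilde[0] = np.array([-200.0, 0.0]), 1.0
_, beta_cont = fit_linear_classifier(x_tilde, y_tilde, deviance)

# The FBP in (2.2) asks whether this difference can be driven to infinity.
print("||beta_hat(Z~_n) - beta_hat(Z_n)|| =", np.linalg.norm(beta_cont - beta_clean))
```

As discussed next, this norm difference typically remains bounded for classification, which is why the FBP criterion is uninformative in this setting.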
Although the FBP is very effective for regression problems (Stromberg and Ruppert (1992); Sakata and White (1995)), the definition of breakdown point in (2.2) is not suitable for classification problems. For binary classification, when a large margin classifier as in (2.1) is used, a new observation $x$ is classified according to $\mathrm{sign}(\hat b + \hat\beta^T x)$. In contrast to regression, the scale of $\hat\beta$ (i.e., $\|\hat\beta\|$) does not directly reflect the classification performance. Compared with $\|\hat\beta\|$, the direction of $\hat\beta$ (i.e., $\hat\beta/\|\hat\beta\|$) plays a key role in classification. Even if $\|\hat\beta(\tilde Z_n)\|$ is very large, the decision boundary obtained from $\hat\beta(\tilde Z_n)$ can be close to the true boundary, and therefore the classification performance can still be excellent. Another major drawback of the FBP for classification is that $\|\hat\beta(\tilde Z_n) - \hat\beta(Z_n)\| = \infty$ is often unattainable for classification (see Section 3.1). In fact, for a large margin classifier, the main effect of outliers is to change the direction or angle of the estimate rather than its norm. As a result, an alternative criterion to quantify the breakdown point for classification problems is needed. We illustrate this with the following toy example.

Toy example. Consider a linear classification problem. We assume that the covariate vector $X \mid Y$ follows the normal distribution $N(\mathrm{sign}(Y)\, u_0, I_2)$, where $Y \in \{1, -1\}$ and $u_0 = (1, 0)$. We set the sample size for each group to be 100. For the positive class (i.e., $Y = 1$), we replace one observation by an outlier generated from the normal distribution $N((u_1, 0)^T, I_2)$ with $u_1 \in \{-90, -180, -240\}$. For this example, three loss functions are considered: the exponential loss $\ell(u) = \exp(-u)$ used in AdaBoost, the deviance loss $\ell(u) = \log(1 + \exp(-u))$ for logistic regression, and the hinge loss $\ell(u) = (1-u)_+$ for the SVM. We set the tuning parameter $\lambda = 0$ for both AdaBoost and logistic regression. For the SVM, we set the tuning parameter $\lambda = 1/200$. Denote by $\hat\beta_{\mathrm{ada}}$, $\hat\beta_{\mathrm{log}}$, and $\hat\beta_{\mathrm{svm}}$ the estimates obtained by these three methods.

Figure 1 shows the decision boundaries of the Bayes classifier, AdaBoost, logistic regression, and the SVM for four cases. When there is no outlier, the Bayes decision boundary is $x_1 = 0$. The decision boundaries of AdaBoost, logistic regression, and the SVM are close to the optimal Bayes boundary. As the effect of the outlier increases (i.e., $u_1$ decreases), the decision boundaries of these three methods change significantly and the corresponding classification errors increase. For the case with $u_1 = -240$, their decision boundaries are almost orthogonal to the optimal Bayes decision boundary. For that case, the classification errors of AdaBoost, logistic regression, and the SVM are 0.45, 0.45, and 0.445, respectively. Although these three methods tend to have very poor classification performance as the effect of the outlier increases, we observe that $\|\hat\beta_{\mathrm{ada}}\|$, $\|\hat\beta_{\mathrm{log}}\|$, and $\|\hat\beta_{\mathrm{svm}}\|$ are always bounded. Therefore, the definition of breakdown point in (2.2) is not effective for this problem. In addition, we check the inner product between these estimates ($\hat\beta_{\mathrm{ada}}$, $\hat\beta_{\mathrm{log}}$, and $\hat\beta_{\mathrm{svm}}$) and the theoretical best coefficient vector $\beta_0 = (1, 0)^T$. We found that the inner products decrease dramatically as the effect of the outlier increases.
In the case with $u_1 = -240$, all inner products are negative, which indicates that the angles between the estimates ($\hat\beta_{\mathrm{ada}}$, $\hat\beta_{\mathrm{log}}$, and $\hat\beta_{\mathrm{svm}}$) and $\beta_0$ are larger than $\pi/2$. In this case, classification by these methods fails completely due to one extreme outlier.

[Fig 1. Illustration of the effect of one outlier for the toy example: decision boundaries of the Bayes classifier, AdaBoost, logistic regression, and the SVM in four panels (no outlier; one outlier with $u_1 = -90$; $u_1 = -180$; $u_1 = -240$), plotted in the $(x_1, x_2)$ plane. As the outlier gets more extreme, the estimated decision boundaries become nearly orthogonal to the Bayes boundary.]

2.2. Definition of angular breakdown point. Motivated by the above toy example and the effect of outliers on classification, which will be theoretically studied in Section 3.1, we propose the following novel angular breakdown point to quantify the robustness of large margin classification methods.

Definition 1. (Population angular breakdown point) The angular breakdown point for large margin classification is defined by
\[
\epsilon(\beta_0, Z_n) = \min\Big\{\frac{m}{n} : \hat\beta(\tilde Z_n) \in S_0^-\Big\}, \quad \text{where } S_0^- = \{\beta : \beta^T \beta_0 \le 0\}.
\]

As a remark, we note that the angular breakdown point represents the minimum fraction of outliers needed such that the angle between the estimated coefficient $\hat\beta(\tilde Z_n)$ and the true coefficient $\beta_0$ is at least $\pi/2$, in which case the classification method can be equivalent to random guessing or have low discriminating power, depending on the distribution of the uncontaminated sample $Z_n$. In practice, the true coefficient $\beta_0$ is unknown and the angular breakdown point in Definition 1 is computationally intractable. To assess the robustness of the estimate, we define the following sample angular breakdown point, which considers the difference between the estimates with and without outliers. This is similar to the traditional breakdown point. Without loss of generality, we assume that the estimate of $\beta$ using the original sample $Z_n$, denoted by $\hat\beta(Z_n)$, is nonzero throughout this paper.

Definition 1′. (Sample angular breakdown point) The sample angular breakdown point for large margin classification is defined by
\[
\epsilon(\hat\beta, Z_n) = \min\Big\{\frac{m}{n} : \hat\beta(\tilde Z_n) \in \hat S_0^-\Big\}, \quad \text{where } \hat S_0^- = \{\beta : \beta^T \hat\beta(Z_n) \le 0\}.
\]

As we will see below, the sample angular breakdown point generally has the same properties as the population angular breakdown point. In Sections 3 and 4, we study the theoretical properties of our proposed angular breakdown point.
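To connect Definitions 1 and 1′ with the toy example, the following Python sketch uses scikit-learn classifiers as stand-ins for the classifiers above (the tuning, optimizer, and simulated data here are illustrative and not identical to the settings used for Figure 1) and reports the inner products $\hat\beta(\tilde Z_n)^T \beta_0$ and $\hat\beta(\tilde Z_n)^T \hat\beta(Z_n)$; a nonpositive value corresponds to the breakdown events defined by $S_0^-$ and $\hat S_0^-$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_per_class = 100
beta_0 = np.array([1.0, 0.0])          # theoretical best direction in the toy example

def make_toy_data(u1=None):
    """Two Gaussian classes as in the toy example; if u1 is given, one
    positive-class observation is replaced by an outlier centered at (u1, 0)."""
    y = np.repeat([1, -1], n_per_class)
    x = y[:, None] * beta_0 + rng.normal(size=(2 * n_per_class, 2))
    if u1 is not None:
        x[0] = np.array([u1, 0.0]) + rng.normal(size=2)
    return x, y

# Clean-data estimate beta_hat(Z_n): the reference direction in Definition 1'.
x_clean, y_clean = make_toy_data()
beta_ref = LogisticRegression(C=1e6, max_iter=5000).fit(x_clean, y_clean).coef_.ravel()

for u1 in [-90, -180, -240]:
    x, y = make_toy_data(u1)
    beta_log = LogisticRegression(C=1e6, max_iter=5000).fit(x, y).coef_.ravel()
    beta_svm = LinearSVC(C=1.0, max_iter=20000).fit(x, y).coef_.ravel()
    # Nonpositive inner products mean the angle with beta_0 (population version)
    # or with beta_hat(Z_n) (sample version) is at least pi/2.
    print(f"u1 = {u1}:",
          "logistic", np.round([beta_log @ beta_0, beta_log @ beta_ref], 3),
          "| SVM", np.round([beta_svm @ beta_0, beta_svm @ beta_ref], 3))
```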
3. Angular breakdown point in linear classification. In this section, we first study the effect of outliers on linear classification theoretically. Then, we study the theoretical properties of the proposed angular breakdown point for binary classification with linear learning, where both bounded and unbounded loss functions are studied. Before proceeding further, we introduce some notation. Let $Z_{n-m} = \{z_i = (x_i, y_i),\ i = 1, \dots, n-m\}$ and $Z_m^o = \{z_i^o = (x_i^o, y_i^o),\ i = 1, \dots, m\}$ denote the $n-m$ uncontaminated and $m$ contaminated observations, respectively, with $\tilde Z_n = Z_{n-m} \cup Z_m^o$ representing the whole sample. Denote $\tilde\beta = (b, \beta^T)^T$. Then, the objective function for linear binary classification with sample $\tilde Z_n$ can be formulated as
\[
(3.1) \qquad L_{\lambda,n}(\tilde\beta, \tilde Z_n) = \Big[\frac{1}{n}\sum_{i=1}^{n-m} \ell\big(y_i(b + \beta^T x_i)\big) + \lambda J(\beta)\Big] + \frac{1}{n}\sum_{i=1}^{m} \ell\big(y_i^o(b + \beta^T x_i^o)\big) := G_{\lambda,n}(\tilde\beta, Z_{n-m}) + F_n(\tilde\beta, Z_m^o),
\]
where $G_{\lambda,n}(\tilde\beta, Z_{n-m})$ and $F_n(\tilde\beta, Z_m^o)$ are the two terms involving only the uncontaminated and the contaminated observations, respectively. We assume that the penalty function $J(\beta)$ satisfies the following conditions: (1) $J(\beta) \ge 0$ and $J(\beta) = J(-\beta)$; (2) $J(\beta) = 0$ if and only if $\beta = 0$; and (3) $J(\beta) \to \infty$ as $\|\beta\| \to \infty$.

3.1. Effect of outliers on linear classification. To follow up on the toy example, we now study the effect of outliers on linear classification theoretically. To this end, we need to introduce linearly separable datasets. A dataset $D = \{(x_i, y_i),\ i = 1, \dots, n\} \subseteq \mathcal{X} \times \{-1, 1\}$ for binary classification is linearly separable if there exists a hyperplane $\alpha^T x + a = 0$ for some $\alpha \in \mathbb{R}^p$ and $a \in \mathbb{R}$ such that $(\alpha^T x_i + a)\, y_i > 0$ for any $(x_i, y_i) \in D$.
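Linear separability as defined above is a linear feasibility problem and can be checked directly; the sketch below (our own illustration) does so with an LP solver, using the standard rescaling of the strict inequalities to margins of at least one.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(x, y):
    """Check linear separability via LP feasibility: the dataset is separable
    iff there exist (alpha, a) with y_i * (alpha^T x_i + a) >= 1 for all i
    (requiring a margin of 1 is a harmless rescaling of the strict inequality)."""
    n, p = x.shape
    # Variables v = (alpha_1, ..., alpha_p, a); constraints -y_i * (x_i^T alpha + a) <= -1.
    A_ub = -y[:, None] * np.hstack([x, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(p + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (p + 1), method="highs")
    return res.status == 0  # a feasible point exists => separable

# A non-separable XOR-type labeling versus a separable labeling of the same points.
x = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
print(is_linearly_separable(x, np.array([1.0, 1.0, -1.0, -1.0])))   # expected: False (XOR labels)
print(is_linearly_separable(x, np.array([1.0, -1.0, 1.0, -1.0])))   # expected: True (split by sign of x1)
```

Whether the original or contaminated sample is linearly separable matters for the $\lambda = 0$ case in Proposition 1 below.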
Proposition 1. Suppose that the nonnegative loss function $\ell(u)$ satisfies the following conditions: (i) $\ell(0) < \infty$; (ii) $\lim_{u \to -\infty} \ell(u) = \infty$. Then the following two conclusions hold: (1) For the original observations $Z_n$ and any contaminated observations $Z_m^o$, we have $\|\hat\beta(Z_n)\| < \infty$ and $\|\hat\beta(\tilde Z_n)\| < \infty$ for any $\lambda > 0$; (2) If neither $Z_n$ nor $\tilde Z_n$ is linearly separable, we have $\|\hat\beta(Z_n)\| < \infty$ and $\|\hat\beta(\tilde Z_n)\| < \infty$ for $\lambda = 0$. Therefore, $\hat\beta(\tilde Z_n)$ does not break down in terms of the breakdown point defined in (2.2).

In general, $\tilde Z_n$ is not linearly separable when there exist outliers. Furthermore, the commonly used methods such as the SVM, penalized logistic regression, AdaBoost, and the least squares loss all satisfy the assumptions in Proposition 1. The estimates of these methods will not break down in terms of the breakdown point defined in (2.2). However, as shown in the toy example, these methods can be sensitive to outliers and break down if an extreme outlier exists. Thus, even if we move the outliers arbitrarily, the norm of $\hat\beta$ can still be finite, and therefore the traditional breakdown point (2.2) will not be effective for classification problems. In addition, we point out that the general definition of breakdown point proposed by Genton and Lucas (2003) can also be ineffective here. According to that general definition, an estimator breaks down if the uncontaminated sample no longer affects the estimator. Since (3.3) below shows that the estimates of the SVM, penalized logistic regression, AdaBoost, and the least squares loss are always affected by the remaining uncontaminated sample, these methods cannot be viewed as having broken down. However, we can see from Figure 1 that the classification boundaries of these methods are badly affected and the corresponding classification errors are close to 0.5, the case of random guessing.

From both Proposition 1 and the toy example, we see that $\|\hat\beta(\tilde Z_n) - \hat\beta(Z_n)\| = \infty$ is not attainable in general and thus the traditional breakdown point in (2.2) is not applicable to classification problems. Given $\|\hat\beta(\tilde Z_n)\| < \infty$, since there are at least two observations $z_{i_1}, z_{i_2}$ such that $y_{i_1} = 1$ and $y_{i_2} = -1$, one can check that $|\hat b| < \infty$ under the conditions of Proposition 1. Therefore, without loss of generality, we assume that the minimization of the objective function in (3.1) is taken over the set $\Delta_{BL} = \{(b, \beta) : |b| < \infty, \beta \in \mathbb{R}^p\}$ to simplify the analysis.

To further illustrate the effect of outliers, we first consider the case of a single outlier (that is, $m = 1$), denoted by $z_1^o = (x_1^o, y_1^o)$ with $y_1^o \in \{1, -1\}$. Then $\tilde Z_n = Z_{n-1} \cup \{z_1^o\}$ and
\[
(3.2) \qquad L_{\lambda,n}(\tilde\beta, \tilde Z_n) = \Big[\lambda J(\beta) + \frac{1}{n}\sum_{i=1}^{n-1} \ell\big((b + \beta^T x_i) y_i\big)\Big] + \frac{1}{n}\ell\big((b + \beta^T x_1^o) y_1^o\big) := G_{\lambda,n}(\tilde\beta, Z_{n-1}) + F_n(\tilde\beta, Z_1^o),
\]
where $Z_1^o = z_1^o$. Denote the minimizer of (3.2) by
\[
(\hat b, \hat\beta(\tilde Z_n)) = \arg\min_{(b, \beta) \in \Delta_{BL}} L_{\lambda,n}(\tilde\beta, \tilde Z_n).
\]

Assume that $\ell(u)$ is a nonnegative, unbounded, and continuous decreasing function with $\lim_{u \to -\infty} \ell(u) = \infty$. To better understand the effect of this outlier, we let $\|x_1^o\| \to \infty$. Note that for any $\beta$ with $\beta^T x_1^o y_1^o / \|x_1^o\| < 0$, we have $\ell((b + \beta^T x_1^o) y_1^o) \to \infty$ as $\|x_1^o\| \to \infty$ for any bounded $b$. As a result, the minimizer $(\hat b, \hat\beta(\tilde Z_n))$ of $L_{\lambda,n}(\tilde\beta, \tilde Z_n)$ must satisfy $\hat\beta(\tilde Z_n) \in S^+_{z_1^o} := \{\beta : \beta^T \bar x_1^o y_1^o \ge 0\}$, where $\bar x_1^o = x_1^o / \|x_1^o\|$. Therefore, the effect of the outlier $z_1^o$ is equivalent to imposing a constraint on the feasible solution. Specifically, the problem can be rewritten as
\[
(3.3) \qquad \min_{(b, \beta) \in \Delta_{BL}} G_{\lambda,n}(\tilde\beta, Z_{n-1}), \quad \text{s.t. } \beta \in S^+_{z_1^o}.
\]

To further study (3.3), we observe that the set $S^+_{z_1^o}$ is a cone, that is, if $\beta \in S^+_{z_1^o}$, then $c\beta \in S^+_{z_1^o}$ for any constant $c \ge 0$. For $\lambda > 0$, one can see that $\|\hat\beta_\lambda(\tilde Z_n)\|$ is still finite, based on the fact that $G_{\lambda,n}(0, Z_{n-1}) < \infty$ and $G_{\lambda,n}(\tilde\beta, Z_{n-1}) = \infty$ when $\|\beta\| = \infty$. For $\lambda = 0$, we reach the same conclusion by Proposition 1.

When $\|x_1^o\|$ is large, as shown in (3.3), the main effect of the contaminated observation $(x_1^o, y_1^o)$ for large margin classifiers is to impose a constraint on the feasible solution, equivalently, to change the direction of $\hat\beta(\tilde Z_n)$ rather than its norm. When $m = 1$, $\hat\beta(\tilde Z_n)$ belongs to the feasible set $S^+_{z_1^o}$ controlled by the outlier $(x_1^o, y_1^o)$, and it also depends on the uncontaminated data set $Z_{n-1}$. Since it is difficult to measure the exact deviation of $\hat\beta(\tilde Z_n)$ from the theoretical optimizer $\beta_0 \in \mathbb{R}^p$, we consider the worst outlier by maximizing the minimum angle between $\beta_0$ and $S^+_{z_1^o}$, i.e.,
\[
(3.4) \qquad \max_{z_1^o}\ \min_{\beta \in S^+_{z_1^o}} \angle(\beta_0, \beta).
\]
Note that (3.4) is equivalent to
\[
(3.5) \qquad \min_{z_1^o}\ \max_{\beta \in S^+_{z_1^o}} \beta_0^T \beta / (\|\beta_0\| \|\beta\|).
\]
We define $\beta_0^T \beta / (\|\beta_0\| \|\beta\|) = 0$ if $\beta_0 = 0$ or $\beta = 0$. When $z_1^o = (x_1^o, y_1^o)$ satisfies $x_1^o y_1^o = -c_1 \beta_0$ for any $c_1 > 0$, one can show that (3.4) attains the optimal value $\pi/2$, and equivalently (3.5) equals 0. The assumption that $\|x_1^o\| \to \infty$ is satisfied by letting $c_1 \to \infty$. For this worst outlier, since $\hat\beta(\tilde Z_n) \in S^+_{z_1^o}$, we have $(\hat\beta(\tilde Z_n))^T \beta_0 \le 0$, i.e., $\angle(\beta_0, \hat\beta(\tilde Z_n)) \ge \pi/2$.

In general, if there are $m$ outliers $Z_m^o$ such that $\|x_i^o\| \to \infty$ for every $i \in \{1, 2, \dots, m\}$, we define $S^+_{Z_m^o} = \bigcap_{i=1}^m S^+_{z_i^o}$, where the $S^+_{z_i^o}$'s are defined similarly to $S^+_{z_1^o}$. The optimal solution $\hat\beta(\tilde Z_n)$ is constrained to lie in $S^+_{Z_m^o}$, and it is also affected by the uncontaminated sample $Z_{n-m}$.
Thus, it is reasonable to consider the worst $\bar Z_m^o$ defined as
\[
(3.6) \qquad \bar Z_m^o = \arg\min_{Z_m^o} \Big[\sup_{\beta \in S^+_{Z_m^o}} \beta_0^T \beta / (\|\beta_0\| \|\beta\|)\Big] := \arg\min_{Z_m^o} A(Z_m^o).
\]
We can check that the optimal solution in (3.6) is achieved at $\bar Z_m^o := \{z_i^o = (x_i^o, y_i^o) : x_i^o y_i^o = -c_i \beta_0,\ 0 < c_i \to \infty,\ 1 \le i \le m\}$, and $A(\bar Z_m^o) \le 0$. Therefore, for any possible $\hat\beta(\tilde Z_n)$, we have $(\hat\beta(\tilde Z_n))^T \beta_0 \le 0$, that is, the angle between $\hat\beta(\tilde Z_n)$ and the true coefficient $\beta_0$ is at least $\pi/2$.

In summary, as shown in the above theoretical study, for binary linear classification, the main effect of outliers for large margin classifiers is to impose a constraint on the feasible solution, equivalently, to change the direction of $\hat\beta(\tilde Z_n)$ rather than its norm.

3.2. Large margin classifiers with unbounded loss functions. In this section, we evaluate the angular breakdown point for different loss functions. We make the following assumption.

(A1) Suppose that $\ell(u)$ is a decreasing and continuous function with $\lim_{u \to \infty} \ell(u) = 0$ and $\lim_{u \to -\infty} \ell(u) = C_\ell \le \infty$.

Assumption (A1) is a very weak assumption, which covers many commonly used loss functions such as the hinge loss for the SVM, the deviance loss for logistic regression, and the exponential loss for AdaBoost. For an unbounded loss with $C_\ell = \infty$, we have the following conclusion.

Theorem 1. (i) Assume that $\ell(u)$ satisfies (A1) with $C_\ell = \infty$. Then the population angular breakdown point $\epsilon(\beta_0, Z_n) = 1/n$ for binary classification, and the same is true for the sample angular breakdown point in Definition 1′. (ii) The same conclusion holds for the square loss $(1 - y(b + \beta^T x))^2$.

Theorem 1 indicates that for linear binary classification, the angular breakdown point for methods with an unbounded loss is $1/n$; that is, a single outlier is enough to cause angular breakdown.
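Part (ii) of Theorem 1 is easy to examine numerically: since $y \in \{-1, 1\}$ implies $(1 - y(b + \beta^T x))^2 = (y - b - \beta^T x)^2$, the square-loss fit with $\lambda = 0$ coincides with ordinary least squares of $y$ on $(1, x)$. The sketch below (with our own simulated data) replaces a single observation by a worst-case outlier of the form $x^o y^o = -c\,\beta_0$ from (3.6) and tracks $\hat\beta^T \beta_0$ as $c$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_0 = np.array([1.0, 0.0])              # assumed true direction
n = 200
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * beta_0 + rng.normal(size=(n, 2))

def square_loss_fit(x, y):
    """Minimizer of n^{-1} sum_i (1 - y_i(b + beta^T x_i))^2 with lambda = 0,
    which coincides with least squares of y on (1, x) because y_i^2 = 1."""
    design = np.hstack([np.ones((len(y), 1)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]               # (b_hat, beta_hat)

_, beta_clean = square_loss_fit(x, y)
print("clean fit: beta_hat^T beta_0 =", round(float(beta_clean @ beta_0), 4))

# Replace one observation by the worst-case outlier x^o y^o = -c * beta_0 (label +1).
for c in [10.0, 100.0, 1000.0]:
    x_t, y_t = x.copy(), y.copy()
    x_t[0], y_t[0] = -c * beta_0, 1.0
    _, beta_cont = square_loss_fit(x_t, y_t)
    print(f"c = {c:6.0f}: beta_hat^T beta_0 = {beta_cont @ beta_0:.4f}")
```

As $c$ grows, the inner product is driven to be nonpositive, consistent with the angular breakdown point of $1/n$ stated in Theorem 1.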