
The Annals of Applied Statistics, 2008, Vol. 2, No. 4, 1290–1306. DOI: 10.1214/08-AOAS198. © Institute of Mathematical Statistics, 2008.

NEW MULTICATEGORY BOOSTING ALGORITHMS BASED ON MULTICATEGORY FISHER-CONSISTENT LOSSES

By Hui Zou, Ji Zhu and Trevor Hastie
University of Minnesota, University of Michigan and Stanford University

Abstract. Fisher-consistent loss functions play a fundamental role in the construction of successful binary margin-based classifiers. In this paper we establish the Fisher-consistency condition for multicategory classification problems. Our approach uses the margin vector concept, which can be regarded as a multicategory generalization of the binary margin. We characterize a wide class of smooth convex loss functions that are Fisher-consistent for multicategory classification. We then consider using the margin-vector-based loss functions to derive multicategory boosting algorithms. In particular, we derive two new multicategory boosting algorithms by using the exponential and logistic regression losses.

Key words and phrases: Boosting, Fisher-consistent losses, multicategory classification.

1. Introduction. Margin-based classifiers, including the support vector machine (SVM) [Vapnik (1996)] and boosting [Freund and Schapire (1997)], have demonstrated excellent performance in binary classification problems. Recent statistical theory regards binary margin-based classifiers as regularized empirical risk minimizers with proper loss functions. Friedman, Hastie and Tibshirani (2000) showed that AdaBoost minimizes the exponential loss by fitting a forward stage-wise additive model. In the same spirit, Lin (2002) showed that the SVM solves a penalized hinge loss problem and that the population minimizer of the hinge loss is exactly the Bayes rule; thus, the SVM directly approximates the Bayes rule without estimating the conditional class probability. Furthermore, Lin (2004) introduced the concept of the Fisher-consistent loss in binary classification and showed that any Fisher-consistent loss can be used to construct a binary margin-based classifier. Buja, Stuetzle and Shen (2005) discussed proper scoring rules for binary classification and probability estimation, which are closely related to Fisher-consistent losses.

In the binary classification case, the Fisher-consistent loss function theory is often used to help us understand the successes of some margin-based classifiers, since the popular classifiers were proposed before the loss function theory. However, the important result in Lin (2004) suggests that it is possible to go in the other direction: we can first design a Fisher-consistent loss function and then derive the corresponding margin-based classifier. This viewpoint is particularly beneficial in the case of multicategory classification. There has been a considerable amount of work in the literature to extend binary margin-based classifiers to the multicategory case.
A widely used strategy for solving the multicategory classification problem is the one-versus-all method [Allwein, Schapire and Singer (2000)], in which an m-class problem is reduced to m binary classification problems. Rifkin and Klautau (2004) gave very provocative arguments to support the one-versus-all method. AdaBoost.MH [Schapire and Singer (1999)] is a successful example of the one-versus-all approach, which solves an m-class problem by applying AdaBoost to m binary classification problems. However, the one-versus-all approach can perform poorly with the SVM if there is no dominating class, as shown by Lee, Lin and Wahba (2004). To fix this problem, Lee, Lin and Wahba (2004) proposed the multicategory SVM. Their approach was further analyzed in Zhang (2004a). Liu and Shen (2006) and Liu, Shen and Doss (2005) proposed the multicategory psi-machine.

In this paper we extend Lin's Fisher-consistency result to multicategory classification problems. We define the Fisher-consistent loss in the context of multicategory classification. Our approach is based on the margin vector, which is the multicategory generalization of the margin in binary classification. We then characterize a family of convex losses which are Fisher-consistent. With a multicategory Fisher-consistent loss function, one can produce a multicategory boosting algorithm by employing gradient descent to minimize the empirical margin-vector-based loss. To demonstrate this idea, we derive two new multicategory boosting algorithms.

The rest of the paper is organized as follows. In Section 2 we briefly review binary margin-based classifiers. Section 3 contains the definition of multicategory Fisher-consistent losses. In Section 4 we characterize a class of convex multicategory Fisher-consistent losses. In Section 5 we introduce two new multicategory boosting algorithms that are tested on benchmark data sets. Technical proofs are relegated to the Appendix.

2. Review of binary margin-based losses and classifiers. In standard classification problems we want to predict the label using a set of features. Here y ∈ C is the label, where C is a discrete set of size m, and x denotes the feature vector. A classification rule δ is a mapping from x to C such that a label δ(x) is assigned to the data point x. Under the 0–1 loss, the misclassification error of δ is R(δ) = P(y ≠ δ(x)). The smallest classification error is achieved by the Bayes rule argmax_{c_i ∈ C} p(y = c_i | x). The conditional class probabilities p(y = c_i | x) are unknown, and so is the Bayes rule. One must construct a classifier δ based on n training samples (y_i, x_i), i = 1, 2, ..., n, which are independent and identically distributed (i.i.d.) samples from the underlying joint distribution p(y, x).

In the book by Hastie, Tibshirani and Friedman (2001) readers can find detailed explanations of the support vector machine and boosting. Here we briefly discuss a unified statistical view of the binary margin-based classifier. In the binary classification problem, C is conveniently coded as {1, −1}, which is important for binary margin-based classifiers. Consider a margin-based loss function φ(y, f) = φ(yf), where the quantity yf is called the margin. We define the empirical φ risk as

\mathrm{EMR}_n(\phi, f) = \frac{1}{n} \sum_{i=1}^n \phi(y_i f(x_i)).

Then a binary margin-based φ classifier is obtained by solving

\hat f^{(n)} = \arg\min_{f \in \mathcal{F}_n} \mathrm{EMR}_n(\phi, f),

where F_n denotes a regularized functional space. The margin-based classifier is sign(f̂^(n)(x)).
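As a toy numerical illustration of the empirical φ-risk and the resulting classifier (not from the paper), the following sketch evaluates EMR_n for the exponential loss with a linear score f(x) = xᵀβ; the simulated data and the value of β are made up for illustration.

```python
# Toy illustration of the empirical phi-risk and the sign classifier
# for the exponential loss with a linear score f(x) = x @ beta.
import numpy as np

def empirical_margin_risk(beta, X, y, phi):
    # EMR_n(phi, f) = (1/n) sum_i phi(y_i f(x_i)), with f(x) = x @ beta
    return np.mean(phi(y * (X @ beta)))

phi_exp = lambda t: np.exp(-t)               # exponential loss, as in AdaBoost
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels coded as {1, -1}

beta = np.array([1.0, 1.0])
print("empirical risk :", empirical_margin_risk(beta, X, y, phi_exp))
print("training error :", np.mean(np.sign(X @ beta) != y))   # classifier sign(f(x))
```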
For the SVM, φ is the hinge loss and F_n is the collection of penalized kernel estimators. AdaBoost amounts to using the exponential loss φ(y, f) = exp(−yf), and F_n is the space of decision trees. The loss function plays a fundamental role in margin-based classification. Friedman, Hastie and Tibshirani (2000) justified AdaBoost by showing that the population minimizer of the exponential loss is one-half the log-odds. Similarly, in the SVM case, Lin (2002) proved that the population minimizer of the hinge loss is exactly the Bayes rule. Lin (2004) further discussed a class of Fisher-consistent losses. A loss function φ is said to be Fisher-consistent if

\hat f(x) = \arg\min_{f(x)} \bigl[\phi(f(x))\, p(y=1 \mid x) + \phi(-f(x))\, p(y=-1 \mid x)\bigr]

has a unique solution f̂(x) and sign(f̂(x)) = sign(p(y=1|x) − 1/2). The Fisher-consistency condition basically says that with infinite samples, one can exactly recover the Bayes rule by minimizing the φ loss.

3. Multicategory Fisher-consistent losses. In this section we extend Lin's Fisher-consistent loss idea to the multicategory case. We let C = {1, 2, ..., m} (m ≥ 3). From the definition of the binary Fisher-consistent loss, we can regard the margin as an effective proxy for the conditional class probability, provided the decision boundary implied by the "optimal" margin is identical to the Bayes decision boundary. To better illustrate this interpretation of the margin, recall that sign(p(y=1|x) − 1/2) is the Bayes rule for binary classification and

\operatorname{sign}(p(y=1 \mid x) - 1/2) = \operatorname{sign}(p(y=1 \mid x) - p(y=-1 \mid x)),
\operatorname{sign}(\hat f(x)) = \operatorname{sign}(\hat f(x) - (-\hat f(x))).

The binary margin is defined as yf. Since yf = f or −f, an equivalent formulation is to assign margin f to class 1 and margin −f to class −1. We regard f as the proxy of p(y=1|x) and −f as the proxy of p(y=−1|x), for the purpose of comparison. Then the Fisher-consistent loss is nothing but an effective device to produce margins that are a legitimate proxy of the conditional class probabilities, in the sense that the class with the largest conditional probability always has the largest margin.

The proxy interpretation of the margin offers a graceful multicategory generalization of the margin. The multicategory margin is conceptually identical to the binary margin; we call it the margin vector. We define the margin vector together with the multicategory Fisher-consistent loss function.

Definition 1. An m-vector f is said to be a margin vector if

(3.1)   \sum_{j=1}^m f_j = 0.

Suppose φ(·) is a loss function and f(x) is a margin vector for all x. Let p_j = p(y=j|x), j = 1, 2, ..., m, be the conditional class probabilities and denote p = (··· p_j ···). Then we define the expected φ risk at x:

(3.2)   \phi(p, f(x)) = \sum_{j=1}^m \phi(f_j(x))\, p(y=j \mid x).

Given n i.i.d. samples, the empirical margin-vector-based φ risk is given by

(3.3)   \mathrm{EMR}_n(\phi) = \frac{1}{n} \sum_{i=1}^n \phi(f_{y_i}(x_i)).

A loss function φ(·) is said to be Fisher-consistent for m-class classification if, for all x in a set of full measure, the optimization problem

(3.4)   \hat f(x) = \arg\min_{f(x)} \phi(p, f(x)) \quad \text{subject to } \sum_{j=1}^m f_j(x) = 0

has a unique solution f̂, and

(3.5)   \arg\max_j \hat f_j(x) = \arg\max_j p(y=j \mid x).

Furthermore, a loss function φ is said to be universally Fisher-consistent if φ is Fisher-consistent for m-class classification for all m ≥ 2.
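Definition 1 can be checked numerically at a single point x. The minimal sketch below (with a made-up probability vector p and the exponential loss, which is studied in Section 4.1) minimizes the expected risk (3.2) over margin vectors subject to the sum-to-zero constraint (3.1), and verifies condition (3.5).

```python
# Numerical check of Definition 1 for a fixed x: minimize the expected
# phi-risk (3.2) over margin vectors f with sum(f) = 0, then verify that
# argmax_j f_j agrees with argmax_j p_j, i.e. condition (3.5).
import numpy as np
from scipy.optimize import minimize

def expected_risk(f, p, phi):
    # Equation (3.2): sum_j phi(f_j) * p_j
    return np.sum(phi(f) * p)

phi_exp = lambda t: np.exp(-t)          # exponential loss (Section 4.1)
p = np.array([0.5, 0.3, 0.2])           # assumed conditional class probabilities
m = len(p)

res = minimize(
    expected_risk, x0=np.zeros(m), args=(p, phi_exp),
    constraints=[{"type": "eq", "fun": lambda f: np.sum(f)}],  # constraint (3.1)
    method="SLSQP",
)
f_hat = res.x
print("f_hat =", f_hat)
print("argmax f_hat =", f_hat.argmax(), " argmax p =", p.argmax())  # should agree
```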
We have several remarks.

Remark 1. We assign a margin f_j to class j as the proxy of the conditional class probability p(y=j|x). The margin vector satisfies the sum-to-zero constraint, so that when m = 2 the margin vector becomes the usual binary margin. The sum-to-zero constraint also ensures the existence and uniqueness of the solution to (3.4). The same constraint was used in Lee, Lin and Wahba (2004).

Remark 2. We do not need any special coding scheme for y in our approach, which is very different from the proposal in Lee, Lin and Wahba (2004). The data point (y_i, x_i) belongs to class y_i; hence, its margin is f_{y_i}(x_i) and its margin-based risk is φ(f_{y_i}(x_i)). Thus, the empirical risk is defined as in (3.3). If we only know x, then y can be any class j with probability p(y=j|x); hence, we consider the expected risk defined in (3.2).

Remark 3. The Fisher-consistency condition is a direct generalization of the definition of the Fisher-consistent loss in binary classification. It serves the same purpose: to produce a margin vector that is a legitimate proxy of the conditional class probabilities, such that comparing the margins leads to the multicategory Bayes rule.

Remark 4. There are many nice Fisher-consistent loss functions for binary classification. It would be interesting to check whether these losses are also Fisher-consistent for multicategory problems. This question is investigated in Section 4, where we show that most of the popular loss functions for binary classification are universally Fisher-consistent.

Remark 5. Buja, Stuetzle and Shen (2005) showed the connection between Fisher-consistent losses and proper scoring rules, which estimate the class probabilities in a Fisher-consistent manner. Of course, in classification it is sufficient to estimate the Bayes rule consistently; the Fisher-consistency condition is weaker than the proper scoring rule requirement. However, we show in the next section that many Fisher-consistent losses do provide estimates of the class probabilities. Thus, they can be considered as multicategory proper scoring rules.

4. Convex multicategory Fisher-consistent losses. In this section we show that there are a number of Fisher-consistent loss functions for multicategory classification. In this work all loss functions are assumed to be non-negative. Without loss of generality, we assume argmax_{c_i ∈ C} p(y=c_i|x) is unique. We have the following sufficient condition for a differentiable convex function to be universally Fisher-consistent.

Theorem 1. Let φ(t) be a twice differentiable loss function. If φ′(0) < 0 and φ″(t) > 0 for all t, then φ is universally Fisher-consistent. Moreover, letting f̂ be the solution of (3.4), we have

(4.1)   p(y=j \mid x) = \frac{1/\phi'(\hat f_j(x))}{\sum_{k=1}^m 1/\phi'(\hat f_k(x))}.

Theorem 1 immediately implies that the two most popular smooth loss functions, namely, the exponential loss and the logistic regression loss (also called the logit loss hereafter), are universally Fisher-consistent for multicategory classification. The inversion formula (4.1) also shows that once the margin vector is obtained, one can easily construct estimates of the conditional class probabilities. This is remarkable because we can not only do classification but also estimate the conditional class probabilities without using the likelihood approach.

The conditions in Theorem 1 can be further relaxed without weakening the conclusion. Suppose φ satisfies the conditions in Theorem 1, and consider a linearized version of φ. Define the set A as given in the proof of Theorem 1 (see Section 6) and let t_1 = inf A. If A is empty, let t_1 = ∞. Choosing t_2 < 0, we define a new convex loss as follows:

\zeta(t) = \begin{cases} \phi'(t_2)(t - t_2) + \phi(t_2), & t \le t_2, \\ \phi(t), & t_2 < t < t_1, \\ \phi(t_1), & t_1 \le t. \end{cases}

As a modified version of φ, ζ is a decreasing convex function and approaches infinity linearly as t → −∞. We show that ζ is also universally Fisher-consistent.
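To make the linearization concrete, here is a small sketch (assuming a generic differentiable base loss φ with derivative φ′ and thresholds t_1 > 0 > t_2); with the least squares loss φ(t) = (1 − t)², t_1 = 1 and t_2 = −1, it reproduces the modified Huber loss discussed in Section 4.3.

```python
# Sketch of the linearization zeta defined above: linear for t <= t2,
# equal to phi on (t2, t1), constant for t >= t1.
import numpy as np

def linearize(phi, dphi, t1, t2):
    def zeta(t):
        t = np.asarray(t, dtype=float)
        return np.where(
            t <= t2, dphi(t2) * (t - t2) + phi(t2),   # linear extension below t2
            np.where(t < t1, phi(t), phi(t1)),        # phi in the middle, flat above t1
        )
    return zeta

# Least squares base loss with t1 = 1, t2 = -1 gives the modified Huber loss.
phi_ls = lambda t: (1.0 - t) ** 2
dphi_ls = lambda t: 2.0 * (t - 1.0)
mod_huber = linearize(phi_ls, dphi_ls, t1=1.0, t2=-1.0)
print(mod_huber(np.array([-2.0, 0.0, 2.0])))   # -> [8., 1., 0.], i.e. -4t, (t-1)^2, 0
```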
Theorem 2. ζ(t) is universally Fisher-consistent and (4.1) holds for ζ.

Theorem 2 covers the squared hinge loss and the modified Huber loss. Thus, Theorems 1 and 2 show that the popular smooth loss functions used in binary classification are universally Fisher-consistent for multicategory classification. In the remainder of this section we closely examine these loss functions.

4.1. Exponential loss. We consider φ_1(t) = e^{−t}, with φ_1′(t) = −e^{−t} and φ_1″(t) = e^{−t}. By Theorem 1, the exponential loss is universally Fisher-consistent. In addition, the inversion formula (4.1) in Theorem 1 tells us that

p_j = \frac{e^{\hat f_j}}{\sum_{k=1}^m e^{\hat f_k}}.

To express f̂ in terms of p, we write

\hat f_j = \log(p_j) + \log\Bigl(\sum_{k=1}^m e^{\hat f_k}\Bigr).

Since \sum_{j=1}^m \hat f_j = 0, we conclude that

0 = \sum_{j=1}^m \log(p_j) + m \log\Bigl(\sum_{k=1}^m e^{\hat f_k}\Bigr),

or equivalently,

\hat f_j = \log(p_j) - \frac{1}{m} \sum_{k=1}^m \log(p_k).

Thus, the exponential loss yields exactly the same estimates as the multinomial deviance function.

4.2. Logit loss. The logit loss function is φ_2(t) = log(1 + e^{−t}), which is essentially the negative binomial deviance. We compute φ_2′(t) = −1/(1 + e^t) and φ_2″(t) = e^t/(1 + e^t)². Theorem 1 then says that the logit loss is universally Fisher-consistent. By the inversion formula (4.1), we also obtain

p_j = \frac{1 + e^{\hat f_j}}{\sum_{k=1}^m (1 + e^{\hat f_k})}.

To better appreciate formula (4.1), let us try to express the margin vector in terms of the class probabilities. Let λ* = \sum_{k=1}^m (1 + e^{\hat f_k}). Then we have

\hat f_j = \log(-1 + p_j \lambda^*).

Note that \sum_j \hat f_j = 0; thus, λ* is the root of the equation

\sum_{j=1}^m \log(-1 + p_j \lambda) = 0.

When m = 2, it is not hard to check that λ* = 1/(p_1 p_2). Hence, f̂_1 = log(p_1/p_2) and f̂_2 = log(p_2/p_1), which are the familiar results for binary classification. When m > 2, f̂ depends on p in a much more complex way, but p is always easily computed from the margin vector f̂.

The logit loss is quite unique, for it is essentially the negative (conditional) log-likelihood in the binary classification problem. In the multicategory problem, from the likelihood point of view, the multinomial likelihood should be used, not the logit loss. From the viewpoint of the Fisher-consistent loss, the logit loss is also appropriate for the multicategory classification problem, because it is universally Fisher-consistent. We later demonstrate the usefulness of the logit loss in multicategory classification by deriving a multicategory logit boosting algorithm.

4.3. Least squares loss, squared hinge loss and modified Huber loss. The least squares loss is φ_3(t) = (1 − t)². We compute φ_3′(t) = 2(t − 1) and φ_3″(t) = 2. Since φ_3′(0) = −2, Theorem 1 shows that the least squares loss is universally Fisher-consistent. Moreover, the inversion formula (4.1) shows that

p_j = \frac{1/(1 - \hat f_j)}{\sum_{k=1}^m 1/(1 - \hat f_k)}.

We observe that \hat f_j = 1 - (p_j \lambda^*)^{-1}, where λ* = \sum_{k=1}^m 1/(1 - \hat f_k). The constraint \sum_{j=1}^m \hat f_j = 0 implies that λ* is the root of the equation \sum_{j=1}^m (1 - (\lambda p_j)^{-1}) = 0, which gives λ* = \frac{1}{m} \sum_{j=1}^m 1/p_j. Thus,

\hat f_j = 1 - \frac{1}{p_j} \Bigl(\frac{1}{m} \sum_{k=1}^m \frac{1}{p_k}\Bigr)^{-1}.

When m = 2, we have the familiar result f̂_1 = 2p_1 − 1, obtained by simply using 1/p_1 + 1/p_2 = 1/(p_1 p_2). In multicategory problems the above formula says that with the least squares loss, the margin vector is directly linked to the inverse of the conditional class probability.

We now consider φ_4(t) = (1 − t)_+², where "+" denotes the positive part. φ_4 is called the squared hinge loss. It can be seen as a linearized version of the least squares loss with t_1 = 1 and t_2 = −∞. By Theorem 2, the squared hinge loss is universally Fisher-consistent. Furthermore, it is interesting to note that the squared hinge loss shares the same population minimizer with the least squares loss.
The modified Huber loss is another linearized version of the least squares loss, with t_1 = 1 and t_2 = −1, and is given by

\phi_5(t) = \begin{cases} -4t, & t \le -1, \\ (t-1)^2, & -1 < t < 1, \\ 0, & 1 \le t. \end{cases}

By Theorem 2, the modified Huber loss is universally Fisher-consistent. The first derivative of φ_5 is

\phi_5'(t) = \begin{cases} -4, & t \le -1, \\ 2(t-1), & -1 < t < 1, \\ 0, & 1 \le t, \end{cases}

which is used to convert the margin vector to the conditional class probability.

5. Multicategory boosting algorithms. In this section we take advantage of the multicategory Fisher-consistent loss functions to construct multicategory classifiers that treat all classes simultaneously, without reducing the multicategory problem to a sequence of binary classification problems. We follow Friedman, Hastie and Tibshirani (2000) and Friedman (2001) in viewing boosting as a gradient descent algorithm that minimizes the exponential loss. This view was also adopted by Bühlmann and Yu (2003) to derive L_2-boosting. For a nice overview of boosting, we refer the readers to Bühlmann and Hothorn (2007). Borrowing the gradient descent idea, we show that new multicategory boosting algorithms naturally emerge when using multicategory Fisher-consistent losses.

5.1. GentleBoost. Friedman, Hastie and Tibshirani (2000) proposed the binary Gentle AdaBoost algorithm to minimize the exponential loss using regression trees as base learners. In the same spirit we can derive the multicategory GentleBoost algorithm, as outlined in Algorithm 5.1.

Algorithm 5.1 (Multicategory GentleBoost).
1. Start with w_i = 1, i = 1, 2, ..., n, and G_j(x) = 0, j = 1, ..., m.
2. For k = 1 to M, repeat:
   (a) For j = 1 to m, repeat:
       i. Let z_i = −1/m + I(y_i = j). Compute w_i* = w_i z_i² and re-normalize.
       ii. Fit the regression function g_j(x) by weighted least squares of the working response z_i^{−1} on x_i with weights w_i*.
       iii. Update G_j(x) = G_j(x) + g_j(x).
   (b) Compute f_j(x) = G_j(x) − (1/m) \sum_{k=1}^m G_k(x).
   (c) Compute w_i = exp(−f_{y_i}(x_i)).
3. Output the classifier argmax_j f_j(x).

5.1.1. Derivation of GentleBoost. By the symmetry constraint on f, we consider the following representation:

(5.1)   f_j(x) = G_j(x) - \frac{1}{m} \sum_{k=1}^m G_k(x), \qquad j = 1, \ldots, m.

No restriction is put on G. We write the empirical risk in terms of G:

(5.2)   \frac{1}{n} \sum_{i=1}^n \exp\Bigl(-G_{y_i}(x_i) + \frac{1}{m} \sum_{k=1}^m G_k(x_i)\Bigr) := L(G).

We want to find increments of G such that the empirical risk decreases most. Let g(x) denote the increments. Following the derivation of the Gentle AdaBoost algorithm in Friedman, Hastie and Tibshirani (2000), we consider the expansion of (5.2) to the second order and use a diagonal approximation to the Hessian, which gives

L(G+g) \approx L(G) - \frac{1}{n} \sum_{i=1}^n \Bigl(\sum_{k=1}^m g_k(x_i)\, z_{ik} \exp(-f_{y_i}(x_i))\Bigr) + \frac{1}{n} \sum_{i=1}^n \Bigl(\sum_{k=1}^m \frac{1}{2}\, g_k^2(x_i)\, z_{ik}^2 \exp(-f_{y_i}(x_i))\Bigr),

where z_{ik} = −1/m + I(y_i = k). For each j, we seek g_j(x) that minimizes

-\sum_{i=1}^n g_j(x_i)\, z_{ij} \exp(-f_{y_i}(x_i)) + \frac{1}{2} \sum_{i=1}^n g_j^2(x_i)\, z_{ij}^2 \exp(-f_{y_i}(x_i)).

A straightforward solution is to fit the regression function g_j(x) by weighted least squares of z_{ij}^{−1} on x_i with weights z_{ij}² exp(−f_{y_i}(x_i)). Then f is updated accordingly via (5.1). In the implementation of the multicategory GentleBoost algorithm we use regression trees to fit g_j(x).
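A compact implementation sketch of Algorithm 5.1 follows, using shallow regression trees as base learners; the weak learner, its depth and the number of rounds M are illustrative choices, not prescribed by the algorithm.

```python
# Sketch of multicategory GentleBoost (Algorithm 5.1) with regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_boost(X, y, m, M=100, max_depth=2):
    """X: (n, d) features; y: (n,) integer labels in {0, ..., m-1}; M: rounds."""
    n = X.shape[0]
    w = np.ones(n)                     # observation weights, step 1
    G = [[] for _ in range(m)]         # G_j kept as a list of fitted trees
    for _ in range(M):
        for j in range(m):
            z = -1.0 / m + (y == j)                     # z_i = -1/m + I(y_i = j)
            w_star = w * z ** 2
            w_star /= w_star.sum()                      # re-normalize
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, 1.0 / z, sample_weight=w_star)  # WLS of z_i^{-1} on x_i
            G[j].append(tree)
        # f_j(x) = G_j(x) - (1/m) sum_k G_k(x), evaluated on the training data
        Gx = np.column_stack([sum(t.predict(X) for t in G[j]) for j in range(m)])
        f = Gx - Gx.mean(axis=1, keepdims=True)
        w = np.exp(-f[np.arange(n), y])                 # w_i = exp(-f_{y_i}(x_i))
    return G

def gentle_boost_predict(G, X):
    m = len(G)
    Gx = np.column_stack([sum(t.predict(X) for t in G[j]) for j in range(m)])
    f = Gx - Gx.mean(axis=1, keepdims=True)
    return f.argmax(axis=1)                             # argmax_j f_j(x)
```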
5.2. AdaBoost.ML. We propose a new logit boosting algorithm (Algorithm 5.2) by minimizing the binary logit risk. Similar to AdaBoost, the new logit boosting algorithm aggregates multicategory decision trees; thus, we call it AdaBoost.ML.

Algorithm 5.2 (AdaBoost.ML).
1. Start with f_j(x) = 0, j = 1, ..., m.
2. For k = 1 to M:
   (a) Compute weights w_i = 1/(1 + exp(f_{y_i}(x_i))) and re-normalize.
   (b) Fit an m-class classifier T_k(x) to the training data using weights w_i. Define
       g_j(x) = \sqrt{(m-1)/m} if T_k(x) = j, and g_j(x) = -\sqrt{1/(m(m-1))} if T_k(x) ≠ j.
   (c) Compute \hat\gamma_k = \arg\min_\gamma \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-f_{y_i}(x_i) - \gamma g_{y_i}(x_i))).
   (d) Update f(x) ← f(x) + \hat\gamma_k g(x).
3. Output the classifier argmax_j f_j(x).

5.2.1. Derivation of AdaBoost.ML. We use the gradient descent algorithm to find f̂(x) in the space of margin vectors that minimizes

\mathrm{EMR}_n(f) = \frac{1}{n} \sum_{i=1}^n \log\bigl(1 + \exp(-f_{y_i}(x_i))\bigr).
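An illustrative sketch of Algorithm 5.2 is given below, using multicategory decision trees as base learners; the choice of weak learner, its depth, and the bounds of the one-dimensional line search are assumptions made for illustration.

```python
# Sketch of AdaBoost.ML (Algorithm 5.2) with multicategory decision trees.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeClassifier

def adaboost_ml(X, y, m, M=100, max_depth=3):
    """X: (n, d) features; y: (n,) integer labels in {0, ..., m-1}; M: rounds."""
    n = X.shape[0]
    f = np.zeros((n, m))                               # margin vectors, step 1
    trees, gammas = [], []
    for _ in range(M):
        w = 1.0 / (1.0 + np.exp(f[np.arange(n), y]))   # w_i = 1/(1+exp(f_{y_i}(x_i)))
        w /= w.sum()                                    # re-normalize
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)                 # fit an m-class classifier T_k
        pred = tree.predict(X)
        # g_j(x) = sqrt((m-1)/m) if T_k(x) = j, else -sqrt(1/(m(m-1)))
        g = np.full((n, m), -np.sqrt(1.0 / (m * (m - 1))))
        g[np.arange(n), pred] = np.sqrt((m - 1.0) / m)
        # Line search for gamma_k: minimize the empirical logit risk along g
        g_y, f_y = g[np.arange(n), y], f[np.arange(n), y]
        risk = lambda gam: np.mean(np.log1p(np.exp(-(f_y + gam * g_y))))
        gam_k = minimize_scalar(risk, bounds=(0.0, 10.0), method="bounded").x
        f += gam_k * g
        trees.append(tree); gammas.append(gam_k)
    return trees, gammas

def adaboost_ml_predict(trees, gammas, X, m):
    n = X.shape[0]
    f = np.zeros((n, m))
    for tree, gam in zip(trees, gammas):
        g = np.full((n, m), -np.sqrt(1.0 / (m * (m - 1))))
        g[np.arange(n), tree.predict(X)] = np.sqrt((m - 1.0) / m)
        f += gam * g
    return f.argmax(axis=1)                             # argmax_j f_j(x)
```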
