Journal of Machine Learning Research 13 (2012) 1007-1036    Submitted 2/11; Revised 1/12; Published 4/12

Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Chunhua Shen  [email protected]
The University of Adelaide, Adelaide, SA 5005, Australia

Junae Kim  [email protected]
NICTA, Canberra Research Laboratory, Locked Bag 8001, Canberra, ACT 2601, Australia

Lei Wang  [email protected]
University of Wollongong, Wollongong, NSW 2522, Australia

Anton van den Hengel  [email protected]
The University of Adelaide, Adelaide, SA 5005, Australia

Editors: Sören Sonnenburg, Francis Bach, Cheng Soon Ong

© 2012 Chunhua Shen, Junae Kim, Lei Wang and Anton van den Hengel.

Abstract

The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed BOOSTMETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. BOOSTMETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. BOOSTMETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that the weak learner is a positive semidefinite matrix with trace and rank equal to one, rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to state-of-the-art methods in terms of classification accuracy and running time.

Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

1. Introduction

The identification of an effective metric by which to measure distances between data points is an essential component of many machine learning algorithms, including k-nearest neighbor (kNN), k-means clustering, and kernel regression. These methods have been applied to a range of problems, including image classification and retrieval (Hastie and Tibshirani, 1996; Yu et al., 2008; Jian and Vemuri, 2007; Xing et al., 2002; Bar-Hillel et al., 2005; Boiman et al., 2008; Frome et al., 2007) amongst a host of others.

The Euclidean distance has been shown to be effective in a wide variety of circumstances. Boiman et al. (2008), for instance, showed that in generic object recognition with local features, kNN with a Euclidean metric can achieve comparable or better accuracy than more sophisticated classifiers such as support vector machines (SVMs). The Mahalanobis distance represents a generalization of the Euclidean distance, and offers the opportunity to learn a distance metric directly from the data.
This learned Mahalanobis distance approach has been shown to offer improved performance over Euclidean distance-based approaches, and was particularly shown by Wang et al. (2010b) to represent an improvement upon the method of Boiman et al. (2008). It is the prospect of a significant performance improvement from fundamental machine learning algorithms which inspires the approach presented here.

If we let $a_i$, $i = 1, 2, \dots$, represent a set of points in $\mathbb{R}^D$, then the Mahalanobis distance, or Gaussian quadratic distance, between two points is
$$\|a_i - a_j\|_X = \sqrt{(a_i - a_j)^\top X (a_i - a_j)},$$
where $X \succcurlyeq 0$ is a positive semidefinite (p.s.d.) matrix. The Mahalanobis distance is thus parameterized by a p.s.d. matrix, and methods for learning Mahalanobis distances are therefore often framed as constrained semidefinite programs. The approach we propose here, however, is based on boosting, which is more typically used for learning classifiers. The primary motivation for the boosting-based approach is that it scales well, but its efficiency in dealing with large data sets is also advantageous. The learning of Mahalanobis distance metrics represents a specific application of a more general method for matrix learning which we present below.

We are interested here in the case where the training data consist of a set of constraints upon the relative distances between data points,
$$\mathcal{I} = \{(a_i, a_j, a_k) \mid \mathrm{dist}_{ij} < \mathrm{dist}_{ik}\}, \qquad (1)$$
where $\mathrm{dist}_{ij}$ measures the distance between $a_i$ and $a_j$. Each such constraint implies that "$a_i$ is closer to $a_j$ than $a_i$ is to $a_k$". Constraints such as these often arise when it is known that $a_i$ and $a_j$ belong to the same class of data points while $a_i$, $a_k$ belong to different classes. These comparison constraints are thus often much easier to obtain than either the class labels or distances between data elements (Schultz and Joachims, 2003). For example, in video content retrieval, faces extracted from successive frames at close locations can be safely assumed to belong to the same person, without requiring the individual to be identified. In web search, the results returned by a search engine are ranked according to relevance, an ordering which allows a natural conversion into a set of constraints.

The problem of learning a p.s.d. matrix such as $X$ can be formulated in terms of estimating a projection matrix $L$ where $X = LL^\top$. This approach has the advantage that the p.s.d. constraint is enforced through the parameterization, but the disadvantage is that the relationship between the distance measure and the parameter matrix is less direct. In practice this approach has led to local, rather than globally optimal, solutions, however (see Goldberger et al., 2004 for example).

Methods such as Xing et al. (2002), Weinberger et al. (2005), Weinberger and Saul (2006) and Globerson and Roweis (2005) which seek $X$ directly are able to guarantee global optimality, but at the cost of a heavy computational burden and poor scalability, as it is not trivial to preserve the semidefiniteness of $X$ during the course of learning. Standard approaches such as interior-point (IP) Newton methods need to calculate the Hessian. This typically requires $O(D^4)$ storage and has worst-case computational complexity of approximately $O(D^{6.5})$, where $D$ is the size of the p.s.d. matrix. This is prohibitive for many real-world problems. An alternating projected (sub-)gradient approach is adopted in Weinberger et al. (2005), Xing et al. (2002) and Globerson and Roweis (2005). The disadvantages of this algorithm, however, are: 1) it is not easy to implement; 2) many parameters are involved; 3) usually it converges slowly.
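As a minimal illustration of the distance just defined and the proximity-comparison constraints in (1), the following Python sketch computes the Mahalanobis distance for an arbitrary p.s.d. matrix $X$ and checks one triplet; the random data and the particular choice of $X$ are purely illustrative assumptions, not part of the method described in this paper.

```python
import numpy as np

def mahalanobis_dist(a_i, a_j, X):
    """||a_i - a_j||_X = sqrt((a_i - a_j)^T X (a_i - a_j))."""
    d = a_i - a_j
    return np.sqrt(d @ X @ d)

rng = np.random.default_rng(0)
D = 5
L = rng.normal(size=(D, D))
X = L @ L.T                      # any X = L L^T is p.s.d., hence a valid metric parameter

a_i, a_j, a_k = rng.normal(size=(3, D))
# A triplet (a_i, a_j, a_k) belongs to the constraint set I when dist_ij < dist_ik,
# that is, "a_i is closer to a_j than a_i is to a_k" under the metric X.
print(mahalanobis_dist(a_i, a_j, X) < mahalanobis_dist(a_i, a_k, X))
```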
We propose here a method for learning a p.s.d. matrix labeled BOOSTMETRIC. The method is based on the observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. The weak learner in BOOSTMETRIC is thus a trace-one rank-one p.s.d. matrix. The proposed BOOSTMETRIC algorithm has the following desirable properties:

1. BOOSTMETRIC is efficient and scalable. Unlike most existing methods, no semidefinite programming is required. At each iteration, only the largest eigenvalue and its corresponding eigenvector are needed.

2. BOOSTMETRIC can accommodate various types of constraints. We demonstrate the use of the method to learn a Mahalanobis distance on the basis of a set of proximity comparison constraints.

3. Like AdaBoost, BOOSTMETRIC does not have any parameter to tune. The user only needs to know when to stop. Also like AdaBoost, it is easy to implement. No sophisticated optimization techniques are involved. The efficacy and efficiency of the proposed BOOSTMETRIC are demonstrated on various data sets.

4. We also propose a totally corrective version of BOOSTMETRIC. As in TotalBoost (Warmuth et al., 2006), the weights of all the selected weak learners (rank-one matrices) are updated at each iteration.

Both the stage-wise BOOSTMETRIC and totally corrective BOOSTMETRIC methods are very easy to implement.

The primary contributions of this work are therefore as follows: 1) We extend traditional boosting algorithms such that each weak learner is a positive semidefinite matrix with trace and rank equal to one, rather than a classifier or regressor; 2) The proposed algorithm can be used to solve many semidefinite optimization problems in machine learning and computer vision. We demonstrate the scalability and effectiveness of our algorithms on metric learning. Part of this work appeared in Shen et al. (2008, 2009). More theoretical analysis and experiments are included in this version.

Next, we review some relevant work before we present our algorithms.

1.1 Related Work

Distance metric learning is closely related to subspace methods. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two classical dimensionality reduction techniques. PCA finds the subspace that captures the maximum variance within the input data, while LDA tries to identify the projection which maximizes the between-class distance and minimizes the within-class variance. Locality preserving projection (LPP) finds a linear projection that preserves the neighborhood structure of the data set (He et al., 2005). Essentially, LPP linearly approximates the eigenfunctions of the Laplace-Beltrami operator on the underlying manifold. The connection between LPP and LDA is also revealed in He et al. (2005). Wang et al. (2010a) extended LPP to supervised multi-label classification. Relevant component analysis (RCA) (Bar-Hillel et al., 2005) learns a metric from equivalence constraints. RCA can be viewed as extending LDA by incorporating must-link constraints and cannot-link constraints into the learning procedure. Each of these methods may be seen as devising a linear projection from the input space to a lower-dimensional output space. If this projection is characterized by the matrix $L$, then these methods may be related to the problem of interest here by observing $X = LL^\top$. This typically implies that $X$ is rank-deficient.

Recently, there has been significant research interest in supervised distance metric learning using side information that is typically presented in a set of pairwise constraints.
Most of these methods, although appearing in different forms, share a similar essential idea: to learn an optimal distance metric by keeping training examples in equivalence constraints close and, at the same time, examples in inequivalence constraints well separated. Previous work of Xing et al. (2002), Weinberger et al. (2005), Jian and Vemuri (2007), Goldberger et al. (2004), Bar-Hillel et al. (2005) and Schultz and Joachims (2003) falls into this category. The requirement that $X$ must be p.s.d. has led to the development of a number of methods for learning a Mahalanobis distance which rely upon constrained semidefinite programming. This approach has a number of limitations, however, which we now discuss with reference to the problem of learning a p.s.d. matrix from a set of constraints upon pairwise-distance comparisons. Relevant work on this topic includes Bar-Hillel et al. (2005), Xing et al. (2002), Jian and Vemuri (2007), Goldberger et al. (2004), Weinberger et al. (2005) and Globerson and Roweis (2005) amongst others.

Xing et al. (2002) first proposed the idea of learning a Mahalanobis metric for clustering using convex optimization. The inputs are two sets: a similarity set and a dissimilarity set. The algorithm maximizes the distance between points in the dissimilarity set under the constraint that the distance between points in the similarity set is upper-bounded. Neighborhood component analysis (NCA) (Goldberger et al., 2004) and large margin nearest neighbor (LMNN) (Weinberger et al., 2005) learn a metric by maintaining consistency in the data's neighborhood and keeping a large margin at the boundaries of different classes. It has been shown in Weinberger and Saul (2009) and Weinberger et al. (2005) that LMNN delivers state-of-the-art performance among distance metric learning algorithms. Information-theoretic metric learning (ITML) learns a suitable metric based on information theory (Davis et al., 2007). To partially alleviate the heavy computation of standard IP Newton methods, Bregman's cyclic projection is used in Davis et al. (2007). This idea is extended in Wang and Jin (2009), which has a closed-form solution and is computationally efficient.

There have been a number of approaches developed which aim to improve the scalability of the process of learning a metric parameterized by a p.s.d. matrix $X$. For example, Rosales and Fung (2006) approximate the p.s.d. cone using a set of linear constraints based on the diagonal dominance theorem. The approximation is not accurate, however, in the sense that it imposes too strong a condition on the learned matrix: one may not want to learn a diagonally dominant matrix. Alternating optimization is used in Xing et al. (2002) and Weinberger et al. (2005) to solve the semidefinite problem iteratively. At each iteration, a full eigen-decomposition is applied to project the solution back onto the p.s.d. cone. BOOSTMETRIC is conceptually very different from this approach, and additionally only requires the calculation of the first eigenvector. Tsuda et al. (2005) proposed to use matrix logarithms and exponentials to preserve positive definiteness. For the application of semidefinite kernel learning, they designed a matrix exponentiated gradient method to optimize von Neumann divergence based objective functions. At each iteration of matrix exponentiated gradient, a full eigen-decomposition is needed. In contrast, we only need to find the leading eigenvector.

The approach proposed here is directly inspired by the LMNN of Weinberger and Saul (2009) and Weinberger et al. (2005).
Instead of using the hinge loss, however, we use the exponential loss and logistic loss functions in order to derive an AdaBoost-like (or LogitBoost-like) optimization procedure. In theory, any differentiable convex loss function can be applied here. Hence, despite similar purposes, our algorithm differs essentially in the optimization. While the formulation of LMNN looks more similar to SVMs, our algorithm, termed BOOSTMETRIC, largely draws upon AdaBoost (Schapire, 1999).

Column generation was first proposed by Dantzig and Wolfe (1960) for solving a particular form of structured linear program with an extremely large number of variables. The general idea of column generation is that, instead of solving the original large-scale problem (master problem), one works on a restricted master problem with a reasonably small subset of the variables at each step. The dual of the restricted master problem is solved by the simplex method, and the optimal dual solution is used to find the new column to be included into the restricted master problem. LPBoost (Demiriz et al., 2002) is a direct application of column generation in boosting. Significantly, LPBoost showed that in an LP framework, unknown weak hypotheses can be learned from the dual although the space of all weak hypotheses is infinitely large. Shen and Li (2010) applied column generation to boosting with general loss functions. It is these results that underpin BOOSTMETRIC.

The remaining content is organized as follows. In Section 2 we present some preliminary mathematics. In Section 3, we show the main results. Experimental results are provided in Section 4.

2. Preliminaries

We introduce some fundamental concepts that are necessary for setting up our problem. First, the notation used in this paper is as follows.

2.1 Notation

Throughout this paper, a matrix is denoted by a bold upper-case letter (X); a column vector is denoted by a bold lower-case letter (x). The ith row of X is denoted by $X_{i:}$ and the ith column by $X_{:i}$. $\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively; their size should be clear from the context. We denote the space of $D \times D$ symmetric matrices by $\mathcal{S}^D$, and positive semidefinite matrices by $\mathcal{S}^D_+$. $\mathrm{Tr}(\cdot)$ is the trace of a symmetric matrix and $\langle X, Z \rangle = \mathrm{Tr}(XZ^\top) = \sum_{ij} X_{ij} Z_{ij}$ calculates the inner product of two matrices. An element-wise inequality between two vectors like $u \le v$ means $u_i \le v_i$ for all $i$. We use $X \succcurlyeq 0$ to indicate that the matrix X is positive semidefinite. For a matrix $X \in \mathcal{S}^D$, the following statements are equivalent: 1) $X \succcurlyeq 0$ ($X \in \mathcal{S}^D_+$); 2) all eigenvalues of X are nonnegative ($\lambda_i(X) \ge 0$, $i = 1, \dots, D$); and 3) $\forall u \in \mathbb{R}^D$, $u^\top X u \ge 0$.

2.2 A Theorem on Trace-one Semidefinite Matrices

Before we present our main results, we introduce an important theorem that serves as the theoretical basis of BOOSTMETRIC.

Definition 1  For any positive integer m, given a set of points $\{x_1, \dots, x_m\}$ in a real vector or matrix space $\mathrm{Sp}$, the convex hull of $\mathrm{Sp}$ spanned by m elements in $\mathrm{Sp}$ is defined as
$$\mathrm{Conv}_m(\mathrm{Sp}) = \Big\{ \textstyle\sum_{i=1}^m w_i x_i \;\Big|\; w_i \ge 0, \ \textstyle\sum_{i=1}^m w_i = 1, \ x_i \in \mathrm{Sp} \Big\}.$$
Define the linear convex span of $\mathrm{Sp}$ as^1
$$\mathrm{Conv}(\mathrm{Sp}) = \bigcup_m \mathrm{Conv}_m(\mathrm{Sp}) = \Big\{ \textstyle\sum_{i=1}^m w_i x_i \;\Big|\; w_i \ge 0, \ \textstyle\sum_{i=1}^m w_i = 1, \ x_i \in \mathrm{Sp}, \ m \in \mathbb{Z}_+ \Big\}.$$
Here $\mathbb{Z}_+$ denotes the set of all positive integers.

^1 With slight abuse of notation, we also use the symbol $\mathrm{Conv}(\cdot)$ to denote the convex span. In general it is not a convex hull.

Definition 2  Let us define $\Gamma_1$ to be the space of all positive semidefinite matrices $X \in \mathcal{S}^D_+$ with trace equaling one:
$$\Gamma_1 = \{ X \mid X \succcurlyeq 0, \ \mathrm{Tr}(X) = 1 \};$$
and $\Psi_1$ to be the space of all positive semidefinite matrices with both trace and rank equaling one:
$$\Psi_1 = \{ Z \mid Z \succcurlyeq 0, \ \mathrm{Tr}(Z) = 1, \ \mathrm{Rank}(Z) = 1 \}.$$
We also define $\Gamma_2$ as the convex hull of $\Psi_1$, that is,
$$\Gamma_2 = \mathrm{Conv}(\Psi_1).$$
Lemma 3  Let $\Psi_2$ be the convex polytope defined as $\Psi_2 = \{ \lambda \in \mathbb{R}^D \mid \lambda_k \ge 0, \ \forall k = 1, \dots, D, \ \sum_{k=1}^D \lambda_k = 1 \}$. Then the points with only one element equaling one and all the others being zeros are the extreme points (vertices) of $\Psi_2$. All the other points cannot be extreme points.

Proof  Without loss of generality, let us consider such a point $\lambda' = \{1, 0, \dots, 0\}$. If $\lambda'$ is not an extreme point of $\Psi_2$, then it must be possible to express it as a convex combination of a set of other points in $\Psi_2$: $\lambda' = \sum_{i=1}^m w_i \lambda^i$, $w_i > 0$, $\sum_{i=1}^m w_i = 1$ and $\lambda^i \ne \lambda'$. Then we have the equations $\sum_{i=1}^m w_i \lambda^i_k = 0$, $\forall k = 2, \dots, D$. It follows that $\lambda^i_k = 0$, $\forall i$ and $k = 2, \dots, D$. That means $\lambda^i_1 = 1$, $\forall i$. This is inconsistent with $\lambda^i \ne \lambda'$. Therefore such a convex combination does not exist and $\lambda'$ must be an extreme point. It is trivial to see that any $\lambda$ that has more than one active element is a convex combination of the above-defined extreme points, so such points cannot be extreme points.

Theorem 4  $\Gamma_1$ equals $\Gamma_2$; that is, $\Gamma_1$ is also the convex hull of $\Psi_1$. In other words, all $Z \in \Psi_1$ form the set of extreme points of $\Gamma_1$.

Proof  It is easy to check that any convex combination $\sum_i w_i Z_i$, such that $Z_i \in \Psi_1$, resides in $\Gamma_1$, using the following two facts: 1) a convex combination of p.s.d. matrices is still a p.s.d. matrix; 2) $\mathrm{Tr}\big(\sum_i w_i Z_i\big) = \sum_i w_i \mathrm{Tr}(Z_i) = 1$.

Denoting by $\lambda_1 \ge \dots \ge \lambda_D \ge 0$ the eigenvalues of a $Z \in \Gamma_1$, we know that $\lambda_1 \le 1$ because $\sum_{i=1}^D \lambda_i = \mathrm{Tr}(Z) = 1$. Therefore, all eigenvalues of Z must satisfy $\lambda_i \in [0, 1]$, $\forall i = 1, \dots, D$, and $\sum_{i=1}^D \lambda_i = 1$. By looking at the eigenvalues of Z and using Lemma 3, it is immediate to see that a matrix Z such that $Z \succcurlyeq 0$, $\mathrm{Tr}(Z) = 1$ and $\mathrm{Rank}(Z) > 1$ cannot be an extreme point of $\Gamma_1$. The only candidates for extreme points are the rank-one matrices ($\lambda_1 = 1$ and $\lambda_{2, \dots, D} = 0$). Moreover, it is not possible that some rank-one matrices are extreme points and others are not, because the other two constraints $Z \succcurlyeq 0$ and $\mathrm{Tr}(Z) = 1$ do not distinguish between different rank-one matrices.

Hence, all $Z \in \Psi_1$ form the set of extreme points of $\Gamma_1$. Furthermore, $\Gamma_1$ is a convex and compact set, which must have extreme points. The Krein-Milman Theorem (Krein and Milman, 1940) tells us that a convex and compact set is equal to the convex hull of its extreme points.

This theorem is a special case of the results from Overton and Womersley (1992) in the context of eigenvalue optimization. A different proof of the above theorem's general version can also be found in Fillmore and Williams (1971).

In the context of semidefinite optimization, what is of interest about Theorem 4 is the following: it tells us that a bounded p.s.d. matrix constraint $X \in \Gamma_1$ can be equivalently replaced with a set of constraints which belong to $\Gamma_2$. At first glance, this is a highly counterintuitive proposition because $\Gamma_2$ involves many more complicated constraints: both $w_i$ and $Z_i$ ($\forall i = 1, \dots, m$) are unknown variables and, even worse, m could be extremely (or even infinitely) large. Nevertheless, this is the type of problem that boosting algorithms are designed to solve.
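As a quick numerical illustration of Theorem 4, note that the eigen-decomposition of any trace-one p.s.d. matrix already exhibits it as a convex combination of trace-one rank-one matrices, $X = \sum_i \lambda_i v_i v_i^\top$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$. The sketch below (with a randomly generated X, purely for illustration) verifies this decomposition numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A = rng.normal(size=(D, D))
X = A @ A.T
X /= np.trace(X)                 # a random element of Gamma_1: p.s.d. with trace one

lam, V = np.linalg.eigh(X)       # X = sum_i lam_i * v_i v_i^T
assert np.all(lam >= -1e-10) and np.isclose(lam.sum(), 1.0)

# Each v_i v_i^T is a trace-one rank-one p.s.d. matrix (an element of Psi_1),
# and the eigenvalues lam_i >= 0 summing to one are the convex-combination weights.
X_rebuilt = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))
print(np.allclose(X, X_rebuilt))   # True
```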
Let us give a brief overview of boosting algorithms.

2.3 Boosting

Boosting is an example of ensemble learning, where multiple learners are trained to solve the same problem. Typically a boosting algorithm (Schapire, 1999) creates a single strong learner by incrementally adding base (weak) learners to the final strong learner. The base learner has an important impact on the strong learner. In general, a boosting algorithm builds on a user-specified base learning procedure and runs it repeatedly on modified data that are outputs from the previous iterations.

The general form of the boosting algorithm is sketched in Algorithm 1. The inputs to a boosting algorithm are a set of training examples x and their corresponding class labels y. The final output is a strong classifier which takes the form
$$F_w(x) = \textstyle\sum_{j=1}^J w_j h_j(x). \qquad (2)$$
Here $h_j(\cdot)$ is a base learner. From Theorem 4, we know that a matrix $X \in \Gamma_1$ can be decomposed as
$$X = \textstyle\sum_{j=1}^J w_j Z_j, \quad Z_j \in \Psi_1. \qquad (3)$$
By observing the similarity between Equations (2) and (3), we may view $Z_j$ as a weak classifier and the matrix X as the strong classifier that we want to learn. This is exactly the problem that boosting methods have been designed to solve. This observation inspires us to solve a special type of semidefinite optimization problem using boosting techniques.

The sparse greedy approximation algorithm proposed by Zhang (2003) is an efficient method for solving a class of convex problems, and achieves fast convergence rates. It has also been shown that boosting algorithms can be interpreted within the general framework of Zhang (2003). The main idea of sequential greedy approximation is therefore as follows. Given an initialization $u_0$, which lies in a convex subset of a linear vector space, a matrix space or a functional space, the algorithm finds $u$ and $\lambda_i \in (0, 1)$ such that the objective function $F((1 - \lambda_i) u_{i-1} + \lambda_i u)$ is minimized. The solution is then updated as $u_i = (1 - \lambda_i) u_{i-1} + \lambda_i u$ and the iteration goes on. Clearly, $u_i$ must remain in the original space. As shown next, our first case, which learns a metric using the hinge loss, greatly resembles this idea.

Algorithm 1  The general framework of boosting.
Input: Training data.
1  Initialize a weight set u on the training examples;
2  for j = 1, 2, ..., do
3    Receive a weak hypothesis $h_j(\cdot)$;
4    Calculate $w_j > 0$;
5    Update u.
Output: A convex combination of the weak hypotheses: $F_w(x) = \sum_{j=1}^J w_j h_j(x)$.

2.4 Distance Metric Learning Using Proximity Comparison

The process of measuring distance using a Mahalanobis metric is equivalent to linearly transforming the data by a projection matrix $L \in \mathbb{R}^{D \times d}$ (usually $D \ge d$) before calculating the standard Euclidean distance:
$$\mathrm{dist}^2_{ij} = \| L^\top a_i - L^\top a_j \|_2^2 = (a_i - a_j)^\top L L^\top (a_i - a_j) = (a_i - a_j)^\top X (a_i - a_j).$$
As described above, the problem of learning a Mahalanobis metric can be approached in terms of learning the matrix L, or the p.s.d. matrix X. If $X = I$, the Mahalanobis distance reduces to the Euclidean distance. If X is diagonal, the problem corresponds to learning a metric in which different features are given different weights, a.k.a. feature weighting. Our approach, however, is to learn a full p.s.d. matrix X using BOOSTMETRIC.

In the framework of large-margin learning, we want to maximize the difference between $\mathrm{dist}_{ij}$ and $\mathrm{dist}_{ik}$. That is, we wish to make $\mathrm{dist}^2_{ik} - \mathrm{dist}^2_{ij} = (a_i - a_k)^\top X (a_i - a_k) - (a_i - a_j)^\top X (a_i - a_j)$ as large as possible under some regularization. To simplify notation, we rewrite this difference as $\mathrm{dist}^2_{ik} - \mathrm{dist}^2_{ij} = \langle A_r, X \rangle$, where
$$A_r = (a_i - a_k)(a_i - a_k)^\top - (a_i - a_j)(a_i - a_j)^\top, \qquad (4)$$
for $r = 1, \dots, |\mathcal{I}|$, and $|\mathcal{I}|$ is the size of the set of constraints $\mathcal{I}$ defined in Equation (1).
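The constraint matrices $A_r$ in (4) and the inner products $\langle A_r, X \rangle = \mathrm{dist}^2_{ik} - \mathrm{dist}^2_{ij}$ are straightforward to form, as the sketch below illustrates; the data points, the index triplets and the placeholder metric X used here are illustrative assumptions.

```python
import numpy as np

def triplet_matrix(a_i, a_j, a_k):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T, as in Equation (4)."""
    d_ik = a_i - a_k
    d_ij = a_i - a_j
    return np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)

rng = np.random.default_rng(2)
D = 5
a = rng.normal(size=(10, D))            # toy data points a_1, ..., a_10
triplets = [(0, 1, 7), (2, 3, 9)]       # hypothetical (i, j, k) index triplets from I
A_list = [triplet_matrix(a[i], a[j], a[k]) for i, j, k in triplets]

X = np.eye(D)                           # the Euclidean metric, standing in for a learned X
# <A_r, X> = dist_ik^2 - dist_ij^2; positive values mean the constraint is satisfied.
margins = [np.sum(A_r * X) for A_r in A_list]
print(margins)
```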
3. Algorithms

In this section, we define the optimization problems for metric learning. We mainly investigate the cases using the hinge loss, exponential loss and logistic loss functions. In order to derive an efficient optimization strategy, we look at their Lagrange dual problems and design boosting-like approaches for efficiency.

3.1 Learning with the Hinge Loss

Our goal is to derive a general algorithm for p.s.d. matrix learning with the hinge loss function. Assume that we want to find a p.s.d. matrix $X \succcurlyeq 0$ such that a set of constraints
$$\langle A_r, X \rangle > 0, \quad r = 1, 2, \dots,$$
are satisfied as well as possible. Here $A_r$ is as defined in (4). These constraints need not all be strictly satisfied, and thus we define the margin $\rho_r = \langle A_r, X \rangle$, $\forall r$.

Putting this into the maximum margin learning framework, we want to minimize the following trace-norm regularized objective function: $\sum_r F(\langle A_r, X \rangle) + v \mathrm{Tr}(X)$, with $F(\cdot)$ a convex loss function and v a regularization constant. Here we have used the trace-norm regularization. Of course a Frobenius norm regularization term could also be used. Minimizing the Frobenius norm $\|X\|_F^2$, which is equivalent to minimizing the $\ell_2$ norm of the eigenvalues of X, penalizes a solution that is far away from the identity matrix.

With the hinge loss, we can write the optimization problem as:
$$\max_{\rho, X, \xi} \ \rho - v \textstyle\sum_{r=1}^{|\mathcal{I}|} \xi_r, \quad \text{s.t.:} \ \langle A_r, X \rangle \ge \rho - \xi_r, \ \forall r; \ X \succcurlyeq 0, \ \mathrm{Tr}(X) = 1; \ \xi \ge 0. \qquad (5)$$
Here $\mathrm{Tr}(X) = 1$ removes the scale ambiguity because the distance inequalities are scale invariant.

We can decompose X into $X = \sum_{j=1}^J w_j Z_j$, with $w_j > 0$, $\mathrm{Rank}(Z_j) = 1$ and $\mathrm{Tr}(Z_j) = 1$, $\forall j$. So we have
$$\langle A_r, X \rangle = \big\langle A_r, \textstyle\sum_{j=1}^J w_j Z_j \big\rangle = \textstyle\sum_{j=1}^J w_j \langle A_r, Z_j \rangle = \textstyle\sum_{j=1}^J w_j H_{rj} = H_{r:} w, \ \forall r. \qquad (6)$$
Here $H_{rj}$ is a shorthand for $H_{rj} = \langle A_r, Z_j \rangle$. Clearly, $\mathrm{Tr}(X) = \mathbf{1}^\top w$. Using Theorem 4, we replace the p.s.d. conic constraint in the primal (5) with a linear convex combination of rank-one unitary matrices: $X = \sum_j w_j Z_j$ and $\mathbf{1}^\top w = 1$. Substituting X in (5), we have
$$\max_{\rho, w, \xi} \ \rho - v \textstyle\sum_{r=1}^{|\mathcal{I}|} \xi_r, \quad \text{s.t.:} \ H_{r:} w \ge \rho - \xi_r \ (r = 1, \dots, |\mathcal{I}|); \ w \ge 0, \ \mathbf{1}^\top w = 1; \ \xi \ge 0. \qquad (7)$$
The Lagrange dual problem of the above linear programming problem (7) is easily derived:
$$\min_{\pi, u} \ \pi \quad \text{s.t.:} \ \textstyle\sum_{r=1}^{|\mathcal{I}|} u_r H_{r:} \le \pi \mathbf{1}^\top; \ \mathbf{1}^\top u = 1, \ \mathbf{0} \le u \le v \mathbf{1}.$$
We can then use column generation to solve the original problem iteratively by looking at both the primal and dual problems. See Shen et al. (2008) for the algorithmic details. In this work we are more interested in smooth loss functions such as the exponential loss and logistic loss, as presented in the sequel.

3.2 Learning with the Exponential Loss

By employing the exponential loss, we want to optimize
$$\min_{X, \rho} \ \log\Big( \textstyle\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r) \Big) + v \mathrm{Tr}(X) \quad \text{s.t.:} \ \rho_r = \langle A_r, X \rangle, \ r = 1, \dots, |\mathcal{I}|; \ X \succcurlyeq 0. \qquad (8)$$
Note that: 1) We are proposing a logarithmic version of the sum-of-exponentials loss. This transform does not change the original optimization problem because the logarithmic function is strictly monotonically increasing. 2) A regularization term $\mathrm{Tr}(X)$ has been applied. Without this regularization, one could always multiply X by an arbitrarily large scale factor in order to make the exponential loss approach zero when all constraints are satisfied. This trace-norm regularization may also lead to low-rank solutions. 3) An auxiliary variable $\rho_r$, $r = 1, \dots$, must be introduced in order to derive a meaningful dual problem, as we show later.
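Although the column-generation details for the smooth losses are given later, the base-learner computation can already be anticipated from the hinge-loss dual above: given nonnegative weights $u_r$ on the constraints, the new rank-one weak learner $Z = v v^\top$ is obtained from the leading eigenvector $v$ of $\sum_r u_r A_r$, which is why only the largest eigenvalue and its eigenvector are needed at each iteration. The sketch below illustrates this step under the assumption that `A_list` holds the $A_r$ matrices and `u` the current weights; the random matrices and uniform weights are hypothetical inputs.

```python
import numpy as np

def base_learner(A_list, u):
    """Rank-one weak learner Z = v v^T, with v the leading eigenvector of
    sum_r u_r * A_r; this maximizes sum_r u_r * <A_r, Z> over trace-one
    rank-one p.s.d. matrices Z."""
    A_hat = sum(u_r * A_r for u_r, A_r in zip(u, A_list))
    eigvals, eigvecs = np.linalg.eigh(A_hat)   # symmetric eigendecomposition
    v = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
    return np.outer(v, v), eigvals[-1]

# Toy usage: random symmetric A_r and uniform weights u (both assumed inputs).
rng = np.random.default_rng(3)
D, R = 5, 8
A_list = [M + M.T for M in rng.normal(size=(R, D, D))]
u = np.full(R, 1.0 / R)
Z, gamma = base_learner(A_list, u)
print(np.trace(Z), np.linalg.matrix_rank(Z))   # 1.0 and 1
```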