Compressed Support Vector Machines

Zhixiang (Eddie) Xu          Jacob R. Gardner
[email protected]               [email protected]

Stephen Tyree                Kilian Q. Weinberger
[email protected]               [email protected]

Department of Computer Science & Engineering
Washington University in St. Louis
St. Louis, MO, USA

Abstract

Support vector machines (SVM) can classify data sets along highly non-linear decision boundaries because of the kernel-trick. This expressiveness comes at a price: during test-time, the SVM classifier needs to compute the kernel inner-product between a test sample and all support vectors. With large training data sets, the time required for this computation can be substantial. In this paper, we introduce a post-processing algorithm, which compresses the learned SVM model by reducing and optimizing support vectors. We evaluate our algorithm on several medium-scale real-world data sets, demonstrating that it maintains high test accuracy while reducing the test-time evaluation cost by several orders of magnitude, in some cases from hours to seconds. It is fair to say that most of the work in this paper was previously invented by Burges and Schölkopf almost 20 years ago. For most of the time during which we conducted this research, we were unaware of this prior work. However, in the past two decades, computing power has increased drastically, and we can therefore provide empirical insights that were not possible in their original paper.

1 Introduction

Support Vector Machines (SVM) are arguably one of the great success stories of machine learning and have been used in many real-world applications, including email spam classification [7], face recognition [12] and gene selection [11]. In real-world applications, the evaluation cost (in terms of memory and CPU) during test-time is of crucial importance. This is particularly prominent in settings with strong resource constraints (e.g. embedded devices, cell phones or tablets) or frequently repeated tasks (e.g. webmail spam classification, web-search ranking, face detection in uploaded images), which can be performed billions of times per day. Reducing the resource requirements to classify an input can reduce hardware costs, enable product improvements, and help curb power consumption.

Test-time cost is determined mainly by two components: classifier evaluation and feature extraction cost. Reducing feature extraction cost has recently obtained a significant amount of attention [3; 5; 15; 17; 19; 23; 25; 26]. These approaches reduce the test-time cost in scenarios where features are heterogeneous, extracted on-demand, and significantly more expensive to compute than the classifier evaluation.

In this paper, we focus on the other common scenario, where the classifier evaluation cost dominates the overall test-time cost. Specifically, we focus on the kernel support vector machine (SVM) [20]. Kernel computation can be expensive because it is linear in the number of support vectors and, in addition, often requires expensive exponentiation (e.g. for the radial basis or χ² kernels). Previous work has reduced the classifier complexity by selecting few support vectors through budgeted training [6; 24] or with heuristic selection prior to learning [13].

We describe an approach that does not select support vectors from the training set, but instead learns them to match a pre-defined SVM decision boundary. Given an existing SVM model with m support vectors, it learns r ≪ m "artificial support vectors", which are not originally part of the training set. The resulting model is a standard SVM classifier (and can thus be saved, for example, in a LibSVM [4] compatible file).
Relative to the original model, it has comparable accuracy, but it is up to several orders of magnitude smaller and faster to evaluate. We refer to our algorithm as Compressed Vector Machine (CVM) and demonstrate on eight real-world data sets of various size and complexity that it achieves unmatched accuracy vs. test-time cost tradeoffs.

2 Related Work

Burges and Schölkopf [2] invented Compressed Vector Machines long before us. While we conducted our research, we were not aware of their work until very late, during the final stages of paper writing. We still consider our perspective and additional experiments valuable and decided to post our results as a tech report. However, we do want to emphasize that all academic credit should go to them, as they were clearly ahead of us.

Reducing test-time cost has recently attracted much attention. Much work [3; 5; 10; 15; 17; 19; 23; 25] focuses on scenarios where features are extracted on-demand and the extraction cost dominates the overall test-time cost. Their objective is to minimize the feature extraction cost.

Model compression was pioneered by [1]. Our work was inspired by their vision; however, it differs substantially, as we do not focus on ensembles of classifiers and instead learn a model compressor explicitly for SVMs. More recently, [26] introduced an algorithm to reduce the test-time cost specifically for the SVM classifier. However, similar to the approaches mentioned above, they focus on learning a new representation consisting of cheap non-linear features for linear SVMs.

[6] propose an algorithm to limit the memory usage for kernel-based online classification. Different from our approach, their algorithm is not a post-processing procedure; instead, they modify the kernel function directly to limit the amount of memory the algorithm uses. Similar to [6], [24] also focuses on online kernel SVM, and attacks primarily the training-time complexity.

Of particular relevance is [13], which specifically reduces the SVM evaluation cost by reducing the number of support vectors. Heuristics are used to select a small subset of support vectors, up to a given budget, during training time, thus solving an approximate SVM optimization. In contrast, our method is a post-processing compression of the regular SVM. We begin from an exact SVM solution and compress the set of support vectors by choosing and optimizing over a small set of support vectors to approximate the optimal decision boundary. This post-processing optimization framework renders unmatched accuracy and cost performance. Similar approaches have successfully learned pseudo-inputs for compressed nearest neighbor classification sets [14] and sparse Gaussian process regression models [22].

3 Background

Let the data consist of input vectors {x_1, ..., x_n} ∈ R^d and corresponding labels {y_1, ..., y_n} ∈ {−1, +1}. For simplicity we assume binary classification in the following section, but our algorithm is easily extended to multi-class settings using one-vs-one [21], one-vs-all [18], or DAG [16] approaches, and results are included for several multi-class data sets.

Kernel support vector machines. SVMs are popular for their large-margin enforcement, which leads to good generalization to unseen test data, and for their formulation as a convex quadratic optimization problem, guaranteeing a globally optimal solution. Most importantly, the kernel-trick [20] may be employed to learn highly non-linear decision boundaries for data sets that are not linearly separable. Specifically, the kernel-trick maps the original feature space x_i into a higher (possibly infinite) dimensional space φ(x_i).
SVMs learn a hyperplane in this higher dimensional space by maximizing the margin 1/‖w‖ and penalizing training instances on the wrong side of the hyperplane,

    \min_{w,b} \|w\| + C \sum_{i=1}^{n} \max\big(1 - y_i(w^\top \phi(x_i) + b),\, 0\big)^2,    (1)

where b is the bias, and C trades off regularization/margin and training accuracy. Note that we use the quadratic hinge loss penalty, and thus (1) is differentiable. The power of the kernel trick is that the higher dimensional space φ(x_i) never needs to be expressed explicitly, because (1) can be formulated in terms of inner products between input vectors. Let a matrix K denote these inner products, where K_ij = φ(x_i)^⊤ φ(x_j); K is the training kernel matrix. The optimization in (1) can then be expressed in terms of the kernel matrix K in the dual form:

    \max_{\alpha_1,\dots,\alpha_n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K_{ij},
    \quad \text{s.t.}\ \sum_{i=1}^{n} \alpha_i y_i = 0 \ \text{and}\ \alpha_i \ge 0,    (2)

where the α_i are the Lagrange multipliers.

The classification rule f(·) for a test input x_t can also be expressed via a testing kernel K̃ that consists of inner products between test inputs E = {x_t} and support vectors S = {x_i | α_i ≠ 0}, with K̃_it = φ(x_i)^⊤ φ(x_t), where

    f(\phi(x_t)) = \sum_{i=1}^{n} \alpha_i y_i \tilde{K}_{it} + b.    (3)

Note that once the testing kernel K̃ is computed, generating the prediction is merely a linear combination, and thus the dominating cost is computing the testing kernel itself.

Least angle regression. LARS [8] is a widely used forward selection algorithm because of its simplicity and efficiency. Given input vectors x, target labels y, and the quadratic loss ℓ(β) = (xβ − y)², LARS learns to approximate the targets by building up the coefficient vector β in successive steps, starting from an all-zero vector. To minimize the loss function ℓ, LARS initially descends on the coordinate direction that has the largest gradient,

    \beta_t = \operatorname*{argmax}_{\beta_t} \frac{\partial \ell}{\partial \beta_t}.    (4)

The algorithm then incorporates this coordinate into its active set. After identifying the gradient direction, LARS selects the step size very carefully. Rather than stepping too greedily or too timidly, LARS computes the step size at which some new direction outside the active set attains the same maximum gradient as the directions inside the active set. LARS then includes this new direction in the active set.

In the following iterations, LARS descends along a direction that maintains the same gradient for all directions in the active set. In other words, LARS descends along the equiangular direction of all directions in the active set. The algorithm then repeats computing the step size, including new directions into the active set, and descending along an equiangular direction. This process makes LARS very efficient: after T iterations, the LARS solution has exactly T directions in the active set, or equivalently, only T non-zero coefficients in β.
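The sparsity-after-T-steps behavior is the property we rely on later, and it is easy to verify numerically. The sketch below (our own illustration, not part of the original paper) uses scikit-learn's LARS implementation on an arbitrary toy regression problem; the data sizes and parameter choices are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import Lars

# Toy regression problem (arbitrary sizes, for illustration only).
rng = np.random.RandomState(0)
X = rng.randn(200, 50)           # 200 samples, 50 candidate directions
beta_true = np.zeros(50)
beta_true[:5] = rng.randn(5)     # only 5 truly active coefficients
y = X @ beta_true + 0.01 * rng.randn(200)

# Run LARS for T steps: the active set then contains at most T directions,
# i.e. the coefficient vector has at most T non-zero entries.
T = 8
lars = Lars(n_nonzero_coefs=T).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(lars.coef_))  # <= T
```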
4 Method

In this section, we detail the CVM approach to reducing the test-time SVM evaluation cost. We regard CVM as a post-processing compression of the original SVM solution. After solving an SVM, we obtain a set of support vectors S = {x_i | α_i ≠ 0} and the corresponding Lagrange multipliers α_i. Given the original SVM solution, we can model the test-time evaluation cost explicitly.

Kernel SVM evaluation cost. Based on the prediction function (3), we can formulate the exact SVM classifier evaluation cost. Let e denote the cost of computing a test kernel entry K̃_it (i.e. the kernel function of a test input x_t and a support vector x_i). We assume the computation cost is identical across all test inputs and all support vectors. As shown in (3), generating a prediction for a test input requires computing the kernel entry between the test input and all support vectors. The total evaluation cost is therefore a function of the number of support vectors n_sv. After obtaining the kernel entries for a test point x_t, prediction is simply a linear combination of the kernel row K̃_t weighted by α. The cost of computing this linear combination is very low compared to the kernel computation, and therefore the total evaluation cost is c_e = n_sv · e. We aim to reduce the size of the support vector set n_sv without greatly affecting prediction accuracy.

Removing non-support vectors. Since the test-time evaluation cost is a function of the number of support vectors, the goal is to cherry-pick and optimize a subset of the optimal support vectors bounded in size by a user-specified compression ratio. We first note that all non-support vectors can be removed during this process without affecting the full SVM solution. We define a design matrix K̂ ∈ R^{n×n}, where K̂_ij = y_i K_ij. The squared-penalty SVM objective function in (1) can then be expressed with the Lagrange parameters α and the kernel matrix K:

    \min_{\alpha,b} \big(\max(1 - \hat{K}\alpha - yb,\, 0)\big)^2 + \alpha^\top K \alpha.    (5)

Since (5) is a strongly convex function, and all non-support vectors have a corresponding Lagrange multiplier α_i = 0, we can remove all non-support vectors from the optimization problem and the full SVM optimal solution stays the same.

To find an optimal subset of support vectors given the compression ratio, we re-train the SVM with only the support vectors and a constraint on the number of support vectors. Note that α are effectively the coefficients of the support vectors, and we can efficiently control the number of support vectors by adding an l_0 norm on α. The optimization problem becomes

    \min_{\alpha,b} \big(1 - \hat{K}\alpha - yb\big)^2 + \alpha^\top K \alpha
    \quad \text{s.t.}\ \|\alpha\|_0 \le \tfrac{1}{e} B_e,    (6)

where B_e is the evaluation cost budget and, consequently, B_e/e is the desired number of support vectors based on the budget. Note that after removing non-support vectors, we obtain a condensed matrix K̂ ∈ R^{n_sv × n_sv}.

Forming an ordinary least squares problem. The current form of equation (6) can be made more amenable to optimization by rewriting the objective function as an ordinary least squares problem. Expanding the squared term, simplifying, and fixing the bias term b (as it does not affect the solution dramatically), we re-format the objective function (6) into

    \min_{\alpha} (1 - yb)^\top (1 - yb) - 2\alpha^\top \hat{K}^\top (1 - yb) + \alpha^\top (\hat{K}^\top \hat{K} + K)\alpha.    (7)

We introduce two auxiliary variables Ω and β, where Ω^⊤Ω = K̂^⊤K̂ + K and Ω^⊤β = −K̂^⊤(1 − yb). Because K̂^⊤K̂ + K is a symmetric matrix, we can compute its eigen-decomposition

    \hat{K}^\top \hat{K} + K = S D S^\top,    (8)

where D is the diagonal matrix of eigenvalues and S is the orthonormal matrix of eigenvectors. Moreover, because the matrix K̂^⊤K̂ + K is positive semi-definite, we can further decompose SDS^⊤ into an inner product of two real matrices by taking the square root of D. Let Ω = √D S^⊤; we then obtain a matrix Ω that satisfies Ω^⊤Ω = K̂^⊤K̂ + K. After computing Ω, we can readily compute β = −(Ω^⊤)^{−1} K̂^⊤(1 − yb), where (Ω^⊤)^{−1} = (1/√D) S^⊤ (a numerical sketch of this construction is given after Figure 1).

With the help of the two auxiliary variables, we convert (7), plus a constant term¹, into least squares format. Together with a relaxation of the non-continuous l_0-norm constraint to an l_1-norm constraint, we obtain

    \min_{\alpha} (\Omega\alpha + \beta)^2, \quad \text{s.t.}\ \|\alpha\|_1 \le \tfrac{1}{e} B_e.    (9)

¹ The constant term is (1 − yb)^⊤ ( K̂ (K̂^⊤K̂ + K)^{−1} K̂^⊤ − I ) (1 − yb).

[Figure 1 depicts predictions P_1 and P_2 in R³ together with candidate two-dimensional subspaces V_1, V_2, and V*, and the projections of P_1 and P_2 onto each.]

Figure 1: Illustration of searching for a subspace V ∈ R² that best approximates predictions P_1 and P_2 of training instances in R³ space. Neither V_1 nor V_2, spanned by existing columns in the kernel matrix, is a good approximation. V*, spanned by kernel columns computed from two artificial support vectors, is the optimal solution.
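The construction of Ω and β above reduces to a few matrix operations. The following sketch is our own illustration (not code from the paper); it assumes a pre-computed kernel matrix K over the support vectors, their labels y, and a fixed bias b, uses numpy's symmetric eigen-solver, and clips tiny negative eigenvalues for numerical stability.

```python
import numpy as np

def form_ols_problem(K, y, b, eps=1e-10):
    """Build Omega and beta so that minimizing ||Omega @ alpha + beta||^2
    matches the expanded objective (7), up to a constant term."""
    K_hat = y[:, None] * K              # design matrix: K_hat_ij = y_i * K_ij
    M = K_hat.T @ K_hat + K             # symmetric PSD matrix from (7)
    evals, S = np.linalg.eigh(M)        # M = S diag(evals) S^T
    sqrt_D = np.sqrt(np.clip(evals, eps, None))
    Omega = sqrt_D[:, None] * S.T       # Omega = sqrt(D) S^T, so Omega^T Omega = M
    t = 1.0 - y * b                     # target vector (1 - y b)
    beta = -(S.T @ (K_hat.T @ t)) / sqrt_D   # beta = -(Omega^T)^{-1} K_hat^T (1 - y b)
    return Omega, beta

# Sanity check on random data (hypothetical sizes, RBF kernel with sigma^2 = 1).
rng = np.random.RandomState(0)
X = rng.randn(30, 2)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
y, b = np.sign(rng.randn(30)), 0.1
Omega, beta = form_ols_problem(K, y, b)
K_hat = y[:, None] * K
assert np.allclose(Omega.T @ Omega, K_hat.T @ K_hat + K, atol=1e-8)
```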
Compressing the support vector set. The squared loss and l_1 constraint in (9) lead naturally to the LARS algorithm. Given a budget B_e, we can determine the maximum size m of the compressed support vector set (m = B_e/e). Using LARS, we start from an empty support vector set and add m support vectors incrementally. Since adding a support vector is equivalent to activating a coefficient in α to a non-zero value, we can obtain m optimal support vectors by running the LARS optimization of (9) for exactly m steps, where each step activates one coefficient. The resulting solution gives the optimal set of m support vectors. We refer to this intermediate step as LARS-SVM. Note that this step is crucial, as the LARS-SVM solution serves as a very good initialization for the next step, which is a non-convex optimization problem.

Gradient support vectors. If we interpret α as coordinates and the corresponding columns in the kernel matrix K as basis vectors, then these basis vectors span an R^{n_sv} space in which the predictions of the original SVM model lie. In this compression algorithm, our goal is to find a lower dimensional subspace that supports good approximations of the original predictions. After running LARS for m iterations, we obtain m support vectors and their coefficients α, forming an R^m subspace of the space spanned by the full kernel matrix.

We illustrate this lower dimensional approximation in Figure 1. Vectors P_1 and P_2 are predictions of two training points made in the full SVM solution space (R³, spanned by three support vectors). We want to compress the model to two support vectors by looking for a subspace V ∈ R² that supports the best approximations of these two predictions. Using existing support vectors as a basis, we can find subspaces V_1 and V_2, each spanned by a pair of support vectors. The projections of P_1 and P_2 onto plane V_1 are closer to the original predictions P_1 and P_2, and thus V_1 is the better approximation. However, in this case, neither V_1 nor V_2 is a particularly good approximation.

Suppose we remove the restriction of selecting a subspace spanned by existing basis vectors in the kernel matrix, and instead optimize the basis vectors to yield a more suitable subspace. In Figure 1, this is illustrated by the optimal subspace V*, which produces a better approximation to the target predictions.

Note that the basis vectors (columns of the kernel matrix) are parameterized by support vectors. By optimizing these underlying support vectors, we can search for a better low-dimensional subspace. If we denote by K_m the training kernel matrix with only the m columns corresponding to the support vectors chosen by LARS, and by α_m the coefficients of these support vectors, we can formulate the search for artificial support vectors as an optimization problem. Specifically, we minimize a squared loss between the approximate and full SVM predictions over all support vectors, where the parameters are the support vectors themselves:

    \min_{(x_1,\dots,x_m)} L = \big(K_m \alpha_m - K\alpha\big)^2,    (10)

where K_ij is the kernel entry, and for simplicity we use the radial basis function (RBF) kernel, K_ij = exp(−‖x_i − x_j‖² / (2σ²)). However, other kernel functions are equally suitable. The unconstrained optimization problem (10) can be solved by conjugate gradient descent with respect to the chosen m support vectors. Since the α's are the coordinates with respect to the basis, we optimize α jointly with the support vectors, which is faster than alternately optimizing the basis and solving for the coordinates. The gradients can be computed very efficiently using matrix operations. Since gradient descent on support vectors is equivalent to moving these support vectors in a continuous space, thereby generating m new support vectors, we refer to these newly generated support vectors as gradient support vectors.
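As an illustrative sketch (ours, not the authors' code), the loss in (10) and its gradients for the RBF kernel can be written with a few matrix operations and handed to an off-the-shelf conjugate gradient routine such as scipy.optimize.minimize(method="CG"). The variable names, the joint flattening of (Z, α_m) into one parameter vector, and the random stand-ins for the full SVM solution are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma2):
    """RBF kernel matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def cvm_loss_grad(theta, X_sv, target, m, d, sigma2):
    """Squared loss ||K_m a - K alpha||^2 from (10), with gradients w.r.t. the
    m artificial support vectors Z and their coefficients a (optimized jointly)."""
    Z = theta[: m * d].reshape(m, d)
    a = theta[m * d:]
    Km = rbf(X_sv, Z, sigma2)                 # (n_sv, m)
    r = Km @ a - target                       # residual against full-model predictions
    loss = r @ r
    grad_a = 2.0 * Km.T @ r
    W = r[:, None] * Km                       # W_ij = r_i K_ij
    # dL/dz_j = (2 a_j / sigma^2) * sum_i r_i K_ij (x_i - z_j)
    grad_Z = (2.0 / sigma2) * a[:, None] * (W.T @ X_sv - W.sum(0)[:, None] * Z)
    return loss, np.concatenate([grad_Z.ravel(), grad_a])

# Hypothetical usage: X_sv, alpha come from the full SVM; Z0, a0 from LARS-SVM.
rng = np.random.RandomState(0)
X_sv, alpha, sigma2, m = rng.randn(50, 3), rng.randn(50), 1.0, 5
target = rbf(X_sv, X_sv, sigma2) @ alpha      # K alpha, the full-model predictions
Z0, a0 = X_sv[:m].copy(), alpha[:m].copy()    # stand-in for the LARS-SVM initialization
theta0 = np.concatenate([Z0.ravel(), a0])
res = minimize(cvm_loss_grad, theta0, args=(X_sv, target, m, 3, sigma2),
               jac=True, method="CG")
```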
[Figure 2 panels: "Simulation data", "LARS-SVM", and "CVM, iter 20/40/80/160/640/2560"; both feature axes span roughly −6 to 6, with annotations marking the full-model, LARS, and optimized decision boundaries and the support vectors selected by LARS.]

Figure 2: Illustration of each step of CVM on a synthetic data set. (a) Simulation inputs from two classes (red and blue). By design, the two classes are not linearly separable. (b) Decision boundary formed by the full SVM solution (black curve), a small subset of support vectors picked by LARS (gray points), and the compressed decision boundary formed by this subset of support vectors (gray curve). (c-h) Optimization iterations. The gradient support vectors are moved by the iterative optimization. The optimized decision boundary formed by the gradient support vectors (green curve) gradually approaches the one formed by the full SVM solution.

We denote this combined method of LARS-SVM and gradient support vectors as Compressed Vector Machine (CVM). Because the optimization problem in (10) is non-convex with respect to x_i, we initialize our algorithm with the basis K_m and coordinates α_m returned by the LARS-SVM solution.

In practice, it may be desirable to optimize both the SVM cost parameter C and any kernel parameters (e.g. σ² in the RBF kernel) for the final CVM model. Additionally, it may be preferable to optimize CVM constrained by the validation accuracy of the compressed model rather than by the size of the support vector budget. Constrained Bayesian optimization [9] supports efficient constrained joint hyperparameter optimizations of this type. Additionally, the L1-penalized support vector selection in the LARS-SVM step may benefit from recent work on highly parallel Elastic Net solvers [27].

5 Results

In this section, we first demonstrate Compressed Vector Machine (CVM) on a synthetic data set to graphically illustrate each step of the algorithm. We then evaluate CVM on several medium-scale real-world data sets.

Synthetic data set. The data set contains 600 sample inputs from two classes (red and blue), where each input contains two features. The blue inputs are sampled from a Gaussian distribution with mean at the origin and variance 1, and the red inputs are sampled from a noisy circle surrounding the blue inputs. As shown in Figure 2(a), by design the data set is not linearly separable. For simplicity, we treat all inputs as training inputs. To evaluate CVM, we first learn an SVM with the RBF kernel from the full training set. We plot the resulting optimal decision boundary in Figure 2(b) with a black curve. In total, the full model has 80 support vectors.

To compress the model, we first select a subset of support vectors by solving the LARS-SVM optimization (9). Specifically, we compress the model to 10% of its original size, 8 support vectors, by running LARS for 8 iterations.
The 8 LARS-SVM support vectors are shown in Figure 2(b) as solid gray points, and the approximate LARS-SVM decision boundary is shown by the gray curve. Since the subspace formed by the 8 support vectors is heavily restricted by the discrete training input space, the approximation is poor. To overcome this problem, we search for a better subspace, or basis, in a continuous space, and perform gradient descent on the support vectors by optimizing (10). In Figure 2(c-h), we illustrate the optimization with updated support vector locations and optimized decision boundaries as we gradually increase the number of iterations. The resulting gradient support vectors are shown as gray points, and the new optimized decision boundaries formed from these gradient support vectors are shown by green curves. After 2560 iterations, as shown in Figure 2(h), the optimized decision boundary (green) is very close to the boundary captured by the full model (black). These optimized decision boundaries demonstrate that moving a small subset of support vectors in a continuous space can efficiently approximate the optimal decision boundary formed by the full SVM solution, supporting effective SVM model compression.

Statistics          Pageblocks   Magic    Letters   20news   MNIST   DMOZ
#training exam.     4379         15216    16000     11269    60000   7184
#testing exam.      1094         3804     4000      7505     10000   1796
#features           10           10       16        200      784     16498
#classes            2            2        26        20       10      16

Table 1: Statistics of all six data sets.

Large real-world data sets. To evaluate the performance of CVM on real-world applications, we evaluate our algorithm on six data sets of varying size, dimensionality and complexity. Table 1 details the statistics of all six data sets. We use LibSVM [4] to train a regular RBF kernel SVM using a regularization parameter C and RBF kernel width σ selected on a 20% validation split (a sketch of this setup appears at the end of this section). For multi-class data sets, we use the one-vs-one multi-class scheme. The classification accuracy of test predictions from this SVM model serves as a baseline in Figure 3 (full SVM).

Given the full SVM solution, we run CVM in two steps. First, we use LARS to solve the optimization problem in (9) using all support vectors from the original SVM model. An initial compressed support vector set is selected with a target compressed size (e.g. 10 out of 500 support vectors). The selected support vectors serve as the second baseline in Figure 3 (LARS-SVM). Second, we shift these support vectors in a continuous space by optimizing (10) w.r.t. the input support vectors and the corresponding Lagrange multipliers α, generating gradient support vectors. This final set of gradient support vectors constitutes the CVM model. To show the trend of accuracy/cost performance, we plot the classification accuracy for CVM after adding every 10 support vectors. Figure 3 shows the performance of CVM and the baselines on all six data sets.

Comparison with prior work. Figure 3 also shows a comparison of CVM with Reduced-SVM [13]. This algorithm takes an iterative two-phase approach. First, a set of support vectors is heuristically selected from random samples of the training set and added to the existing set of support vectors (initially empty). Then, the model weights are optimized by an SVM with the quadratic hinge loss. The algorithm alternates these two steps until the target number of support vectors is reached.

As shown in Figure 3, CVM significantly improves over all baselines. Compared to the current state-of-the-art, Reduced-SVM, CVM has the capability of moving support vectors, generating a new basis, and learning a basis that closely matches the decision boundaries formed by the full SVM solution. It is this ability that distinguishes CVM from other algorithms when the evaluation budget is low. Across all data sets, CVM maintains close to the same accuracy as the full SVM with merely 10% of the support vectors.
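The baseline training setup described above can be sketched with scikit-learn's LibSVM-backed SVC (our illustration; the parameter grid, data loading, and variable names are assumptions, not the authors' exact protocol). The fitted model exposes the support vectors, signed dual coefficients, and bias that a post-processing compression step such as CVM would start from.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

# Hypothetical data arrays; in the paper each data set has its own source.
X, y = np.load("features.npy"), np.load("labels.npy")
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best, best_acc = None, -1.0
for p in ParameterGrid({"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}):
    # SVC wraps LibSVM; multi-class problems are handled one-vs-one internally.
    model = SVC(kernel="rbf", C=p["C"], gamma=p["gamma"]).fit(X_tr, y_tr)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best, best_acc = model, acc

# Quantities a compression step starts from: support vectors, signed dual
# coefficients (alpha_i * y_i), and the bias term b.
S, coef, b = best.support_vectors_, best.dual_coef_, best.intercept_
print(S.shape[0], "support vectors; validation accuracy", best_acc)
```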
[Figure 3: six panels (Pageblocks, Letters, Magic, 20news, DMOZ, MNIST) plotting classification accuracy against the number of support vectors for full SVM, LARS-SVM, Reduced-SVM (Keerthi et al. 2006), and Compressed SVM; the Pageblocks panel additionally shows Random-SVM.]

Figure 3: Accuracy versus number of support vectors (in log scale).

6 Acknowledgments

Most of this work was previously invented by Burges and Schölkopf [2], whose research was truly visionary at the time.

References

[1] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.
[2] Chris C. Burges and Bernhard Schölkopf. Improving the accuracy and speed of support vector machines. Advances in Neural Information Processing Systems, 9:375–381, 1997.
[3] R. Busa-Fekete, D. Benbouzid, B. Kégl, et al. Fast classification using sparse decision DAGs. In ICML, 2012.
[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[5] M. Chen, Z. Xu, K. Q. Weinberger, and O. Chapelle. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.
[6] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.
[7] Harris Drucker, Donghui Wu, and Vladimir N. Vapnik. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048–1054, 1999.
[8] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[9] Jacob Gardner, Matt Kusner, Kilian Q. Weinberger, John Cunningham, and Zhixiang Xu. Bayesian optimization with inequality constraints. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 937–945. JMLR Workshop and Conference Proceedings, 2014.
[10] A. Grubb and J. A. Bagnell. SpeedBoost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.
[11] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
[12] Bernd Heisele, Purdy Ho, and Tomaso Poggio. Face recognition with support vector machines: Global versus component-based approach. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 688–694. IEEE, 2001.
[13] S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. The Journal of Machine Learning Research, 7:1493–1515, 2006.
[14] Matt Kusner, Stephen Tyree, Kilian Q. Weinberger, and Kunal Agrawal. Stochastic neighbor compression. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 622–630. JMLR Workshop and Conference Proceedings, 2014.
[15] L. Lefakis and F. Fleuret. Joint cascade optimization using a product of boosted classifiers. In NIPS, pages 1315–1323, 2010.
[16] John C. Platt, Nello Cristianini, and John Shawe-Taylor.
Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12, 2000.
[17] J. Pujara, H. Daumé III, and L. Getoor. Using classifier cascades for scalable e-mail classification. In CEAS, 2011.
[18] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101–141, 2004.
[19] M. Saberian and N. Vasconcelos. Boosting classifier cascades. In NIPS, pages 2047–2055, 2010.
[20] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[21] Cristianini Shawe-Taylor and Smola Schölkopf. The support vector machine. 2000.
[22] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
[23] J. Wang and V. Saligrama. Local supervised learning through space partitioning. In NIPS, pages 91–99, 2012.
[24] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13:3103–3131, 2012.
[25] Zhixiang Xu, Matt Kusner, Minmin Chen, and Kilian Q. Weinberger. Cost-sensitive tree of classifiers. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 133–141. JMLR Workshop and Conference Proceedings, 2013.
[26] Zhixiang (Eddie) Xu, Matt J. Kusner, Gao Huang, and Kilian Q. Weinberger. Anytime representation learning. In Sanjoy Dasgupta and David McAllester, editors, ICML '13, pages 1076–1084, 2013.
[27] Quan Zhou, Wenlin Chen, Shiji Song, Jacob R. Gardner, Kilian Q. Weinberger, and Yixin Chen. A reduction of the elastic net to support vector machines with an application to GPU computing. In Association for the Advancement of Artificial Intelligence (AAAI-15), 2015.