A Tutorial on Support Vector Regression*

Alex J. Smola† and Bernhard Schölkopf‡

September 30, 2003

* An extended version of this paper is available as NeuroCOLT Technical Report TR-98-030.
† RSISE, Australian National University, Canberra, 0200, Australia; [email protected]
‡ Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany; [email protected]

Abstract

In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

1 Introduction

The purpose of this paper is twofold. It should serve as a self-contained introduction to Support Vector regression for readers new to this rapidly developing field of research.¹ On the other hand, it attempts to give an overview of recent developments in the field.

¹ Our use of the term "regression" is somewhat loose in that it also includes cases of function estimation where one minimizes errors other than the mean square loss. This is done mainly for historical reasons [Vapnik et al., 1997].

To this end, we decided to organize the essay as follows. We start by giving a brief overview of the basic techniques in sections 1, 2, and 3, plus a short summary with a number of figures and diagrams in section 4. Section 5 reviews current algorithmic techniques used for actually implementing SV machines. This may be of most interest for practitioners. The following section covers more advanced topics such as extensions of the basic SV algorithm, connections between SV machines and regularization, and briefly mentions methods for carrying out model selection. We conclude with a discussion of open questions and problems and current directions of SV research. Most of the results presented in this review paper already have been published elsewhere, but the comprehensive presentation and some details are new.

1.1 Historic Background

The SV algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in Russia in the sixties² [Vapnik and Lerner, 1963, Vapnik and Chervonenkis, 1964]. As such, it is firmly grounded in the framework of statistical learning theory, or VC theory, which has been developed over the last three decades by Vapnik and Chervonenkis [1974], Vapnik [1982, 1995]. In a nutshell, VC theory characterizes properties of learning machines which enable them to generalize well to unseen data.

² A similar approach, however using linear instead of quadratic programming, was taken at the same time in the USA, mainly by Mangasarian [1965, 1968, 1969].

In its present form, the SV machine was largely developed at AT&T Bell Laboratories by Vapnik and co-workers [Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995, Schölkopf et al., 1995, Schölkopf et al., 1996, Vapnik et al., 1997]. Due to this industrial context, SV research has up to date had a sound orientation towards real-world applications. Initial work focused on OCR (optical character recognition). Within a short period of time, SV classifiers became competitive with the best available systems for both OCR and object recognition tasks [Schölkopf et al., 1996, 1998a, Blanz et al., 1996, Schölkopf, 1997]. A comprehensive tutorial on SV classifiers has been published by Burges [1998]. But also in regression and time series prediction applications, excellent performances were soon obtained [Müller et al., 1997, Drucker et al., 1997, Stitson et al., 1999, Mattera and Haykin, 1999]. A snapshot of the state of the art in SV learning was recently taken at the annual Neural Information Processing Systems conference [Schölkopf et al., 1999a]. SV learning has now evolved into an active area of research. Moreover, it is in the process of entering the standard methods toolbox of machine learning [Haykin, 1998, Cherkassky and Mulier, 1998, Hearst et al., 1998]. [Schölkopf and Smola, 2002] contains a more in-depth overview of SVM regression. Additionally, [Cristianini and Shawe-Taylor, 2000, Herbrich, 2002] provide further details on kernels in the context of classification.

1.2 The Basic Idea

Suppose we are given training data {(x_1, y_1), ..., (x_ℓ, y_ℓ)} ⊂ X × ℝ, where X denotes the space of the input patterns (e.g. X = ℝ^d).
These might be, for instance, exchange rates for some currency measured at subsequent days together with corresponding econometric indicators. In ε-SV regression [Vapnik, 1995], our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets y_i for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this. This may be important if you want to be sure not to lose more than ε money when dealing with exchange rates, for instance.

For pedagogical reasons, we begin by describing the case of linear functions f, taking the form

\[ f(x) = \langle w, x\rangle + b \quad \text{with } w \in X,\; b \in \mathbb{R} \tag{1} \]

where ⟨·,·⟩ denotes the dot product in X. Flatness in the case of (1) means that one seeks a small w. One way to ensure this is to minimize the norm,³ i.e. ‖w‖² = ⟨w, w⟩. We can write this problem as a convex optimization problem:

\[ \begin{array}{ll} \text{minimize} & \tfrac{1}{2}\|w\|^2 \\ \text{subject to} & \begin{cases} y_i - \langle w, x_i\rangle - b \le \varepsilon \\ \langle w, x_i\rangle + b - y_i \le \varepsilon \end{cases} \end{array} \tag{2} \]

³ See [Smola, 1998] for an overview over other ways of specifying flatness of such functions.

The tacit assumption in (2) was that such a function f actually exists that approximates all pairs (x_i, y_i) with ε precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we also may want to allow for some errors. Analogously to the "soft margin" loss function [Bennett and Mangasarian, 1992], which was adapted to SV machines by Cortes and Vapnik [1995], one can introduce slack variables ξ_i, ξ_i* to cope with otherwise infeasible constraints of the optimization problem (2). Hence we arrive at the formulation stated in [Vapnik, 1995]:

\[ \begin{array}{ll} \text{minimize} & \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*) \\ \text{subject to} & \begin{cases} y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases} \end{array} \tag{3} \]

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with a so called ε-insensitive loss function |ξ|_ε described by

\[ |\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise.} \end{cases} \tag{4} \]

Fig. 1 depicts the situation graphically. Only the points outside the shaded region contribute to the cost insofar as the deviations are penalized in a linear fashion.

[Figure 1: The soft margin loss setting for a linear SVM.]

It turns out that in most cases the optimization problem (3) can be solved more easily in its dual formulation.⁴ Moreover, as we will see in Sec. 2, the dual formulation provides the key for extending the SV machine to nonlinear functions. Hence we will use a standard dualization method utilizing Lagrange multipliers, as described in e.g. [Fletcher, 1989].

⁴ This is true as long as the dimensionality of w is much higher than the number of observations. If this is not the case, specialized methods can offer considerable computational savings [Lee and Mangasarian, 2001].
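As a concrete illustration of the setting above, the following sketch fits an ε-SV regressor on synthetic data and evaluates the ε-insensitive loss (4) on the training residuals. It assumes NumPy and scikit-learn are available; scikit-learn's SVR solves a formulation of the type (3), and the helper name eps_insensitive and the toy data are our own additions, not part of the tutorial.

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive(residuals, eps):
    """|xi|_eps from (4): zero inside the tube, linear outside."""
    return np.maximum(np.abs(residuals) - eps, 0.0)

# Synthetic 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1.0, 1.0, size=(50, 1)), axis=0)
y = np.sinc(3 * X).ravel() + 0.05 * rng.standard_normal(50)

# C trades off flatness against tube violations, as in (3).
model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
residuals = y - model.predict(X)
print("mean eps-insensitive loss:", eps_insensitive(residuals, 0.1).mean())
print("number of support vectors:", len(model.support_))
```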
1.3 Dual Problem and Quadratic Programs

The key idea is to construct a Lagrange function from the objective function (it will be called the primal objective function in the rest of this article) and the corresponding constraints, by introducing a dual set of variables. It can be shown that this function has a saddle point with respect to the primal and dual variables at the solution. For details see e.g. [Mangasarian, 1969, McCormick, 1983, Vanderbei, 1997] and the explanations in section 5.2. We proceed as follows:

\[ L := \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*) - \sum_{i=1}^{\ell}(\eta_i\xi_i + \eta_i^*\xi_i^*) - \sum_{i=1}^{\ell}\alpha_i\bigl(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b\bigr) - \sum_{i=1}^{\ell}\alpha_i^*\bigl(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b\bigr) \tag{5} \]

Here L is the Lagrangian and η_i, η_i*, α_i, α_i* are Lagrange multipliers. Hence the dual variables in (5) have to satisfy positivity constraints, i.e.

\[ \alpha_i^{(*)}, \eta_i^{(*)} \ge 0. \tag{6} \]

Note that by α_i^{(*)} we refer to α_i and α_i*.

It follows from the saddle point condition that the partial derivatives of L with respect to the primal variables (w, b, ξ_i, ξ_i*) have to vanish for optimality:

\[ \partial_b L = \sum_{i=1}^{\ell}(\alpha_i^* - \alpha_i) = 0 \tag{7} \]
\[ \partial_w L = w - \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)x_i = 0 \tag{8} \]
\[ \partial_{\xi_i^{(*)}} L = C - \alpha_i^{(*)} - \eta_i^{(*)} = 0 \tag{9} \]

Substituting (7), (8), and (9) into (5) yields the dual optimization problem:

\[ \begin{array}{ll} \text{maximize} & -\tfrac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x_i, x_j\rangle - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*) \\ \text{subject to} & \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C] \end{array} \tag{10} \]

In deriving (10) we already eliminated the dual variables η_i, η_i* through condition (9), which can be reformulated as η_i^{(*)} = C − α_i^{(*)}. Eq. (8) can be rewritten as follows:

\[ w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)x_i, \quad\text{thus}\quad f(x) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\langle x_i, x\rangle + b. \tag{11} \]

This is the so-called Support Vector expansion, i.e. w can be completely described as a linear combination of the training patterns x_i. In a sense, the complexity of a function's representation by SVs is independent of the dimensionality of the input space X, and depends only on the number of SVs.

Moreover, note that the complete algorithm can be described in terms of dot products between the data. Even when evaluating f(x) we need not compute w explicitly. These observations will come in handy for the formulation of a nonlinear extension.
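To make the dual (10) and the expansion (11) concrete, here is a small numerical sketch that solves (10) for a toy linear problem with a general-purpose solver (SciPy's SLSQP) and recovers w and b. This is only an illustration under our own choice of toy data and tolerances; a serious implementation would use the specialized methods of Sec. 5, and b is recovered from a multiplier strictly inside (0, C), as discussed in Sec. 1.4 below.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 2))                       # toy inputs
y = X @ np.array([1.5, -0.5]) + 0.3 + 0.05 * rng.standard_normal(30)
n, C, eps = len(y), 10.0, 0.1
K = X @ X.T                                            # linear kernel <x_i, x_j>

def neg_dual(z):
    """Negative of the dual objective (10), variables z = (alpha, alpha*)."""
    a, a_star = z[:n], z[n:]
    beta = a - a_star
    return 0.5 * beta @ K @ beta + eps * z.sum() - y @ beta

cons = {"type": "eq", "fun": lambda z: np.sum(z[:n] - z[n:])}
res = minimize(neg_dual, np.zeros(2 * n), method="SLSQP",
               bounds=[(0, C)] * (2 * n), constraints=[cons])
a, a_star = res.x[:n], res.x[n:]
w = X.T @ (a - a_star)                                 # Support Vector expansion (11)

# Offset b from any free multiplier (0 < alpha < C), cf. the KKT conditions.
free = (a > 1e-6) & (a < C - 1e-6)
free_star = (a_star > 1e-6) & (a_star < C - 1e-6)
b_vals = np.concatenate([y[free] - X[free] @ w - eps,
                         y[free_star] - X[free_star] @ w + eps])
b = b_vals.mean() if len(b_vals) else 0.0
print("w =", w, " b =", b)   # should roughly recover the generating coefficients
```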
1.4 Computing b

So far we neglected the issue of computing b. The latter can be done by exploiting the so called Karush-Kuhn-Tucker (KKT) conditions [Karush, 1939, Kuhn and Tucker, 1951]. These state that at the point of the solution the product between dual variables and constraints has to vanish:

\[ \alpha_i\bigl(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b\bigr) = 0 \qquad \alpha_i^*\bigl(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b\bigr) = 0 \tag{12} \]
\[ (C - \alpha_i)\xi_i = 0 \quad\text{and}\quad (C - \alpha_i^*)\xi_i^* = 0. \tag{13} \]

This allows us to make several useful conclusions. Firstly, only samples (x_i, y_i) with corresponding α_i^{(*)} = C lie outside the ε-insensitive tube. Secondly, α_i α_i* = 0, i.e. there can never be a set of dual variables α_i, α_i* which are both simultaneously nonzero. This allows us to conclude that

\[ \varepsilon - y_i + \langle w, x_i\rangle + b \ge 0 \ \text{ and } \ \xi_i = 0 \quad \text{if } \alpha_i < C \tag{14} \]
\[ \varepsilon - y_i + \langle w, x_i\rangle + b \le 0 \quad \text{if } \alpha_i > 0 \tag{15} \]

In conjunction with an analogous analysis on α_i* we have

\[ \max\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i < C \text{ or } \alpha_i^* > 0\} \;\le\; b \;\le\; \min\{-\varepsilon + y_i - \langle w, x_i\rangle \mid \alpha_i > 0 \text{ or } \alpha_i^* < C\} \tag{16} \]

If some α_i^{(*)} ∈ (0, C), the inequalities become equalities. See also [Keerthi et al., 2001] for further means of choosing b.

Another way of computing b will be discussed in the context of interior point optimization (cf. Sec. 5). There b turns out to be a by-product of the optimization process. Further considerations shall be deferred to the corresponding section. See also [Keerthi et al., 1999] for further methods to compute the constant offset.

A final note has to be made regarding the sparsity of the SV expansion. From (12) it follows that only for |f(x_i) − y_i| ≥ ε the Lagrange multipliers may be nonzero, or in other words, for all samples inside the ε-tube (i.e. the shaded region in Fig. 1) the α_i, α_i* vanish: for |f(x_i) − y_i| < ε the second factor in (12) is nonzero, hence α_i, α_i* has to be zero such that the KKT conditions are satisfied. Therefore we have a sparse expansion of w in terms of x_i (i.e. we do not need all x_i to describe w). The examples that come with nonvanishing coefficients are called Support Vectors.

2 Kernels

2.1 Nonlinearity by Preprocessing

The next step is to make the SV algorithm nonlinear. This, for instance, could be achieved by simply preprocessing the training patterns x_i by a map Φ : X → F into some feature space F, as described in [Aizerman et al., 1964, Nilsson, 1965], and then applying the standard SV regression algorithm. Let us have a brief look at an example given in [Vapnik, 1995].

Example 1 (Quadratic features in ℝ²) Consider the map Φ : ℝ² → ℝ³ with Φ(x_1, x_2) = (x_1², √2 x_1 x_2, x_2²). It is understood that the subscripts in this case refer to the components of x ∈ ℝ². Training a linear SV machine on the preprocessed features would yield a quadratic function.

While this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality, as the number of different monomial features of degree p is \(\binom{d+p-1}{p}\), where d = dim(X). Typical values for OCR tasks (with good performance) [Schölkopf et al., 1995, Schölkopf et al., 1997, Vapnik, 1995] are p = 7, d = 28 · 28 = 784, corresponding to approximately 3.7 · 10^16 features.
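The combinatorial blow-up quoted above is easy to reproduce with the standard library; the snippet below checks the binomial count for p = 7 and d = 784.

```python
from math import comb

d, p = 28 * 28, 7
n_features = comb(d + p - 1, p)   # number of monomials of degree p in d variables
print(f"{n_features:.2e}")        # ~3.7e+16, matching the figure quoted in the text
```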
2.2 Implicit Mapping via Kernels

Clearly this approach is not feasible and we have to find a computationally cheaper way. The key observation [Boser et al., 1992] is that for the feature map of Example 1 we have

\[ \Bigl\langle \bigl(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\bigr), \bigl(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2\bigr) \Bigr\rangle = \langle x, x'\rangle^2. \tag{17} \]

As noted in the previous section, the SV algorithm only depends on dot products between patterns x_i. Hence it suffices to know k(x, x') := ⟨Φ(x), Φ(x')⟩ rather than Φ explicitly, which allows us to restate the SV optimization problem:

\[ \begin{array}{ll} \text{maximize} & -\tfrac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)k(x_i, x_j) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*) \\ \text{subject to} & \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \ \text{ and } \ \alpha_i, \alpha_i^* \in [0, C] \end{array} \tag{18} \]

Likewise the expansion of f (11) may be written as

\[ w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\Phi(x_i) \quad\text{and}\quad f(x) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)k(x_i, x) + b. \tag{19} \]

The difference to the linear case is that w is no longer given explicitly. Also note that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in feature space, not in input space.
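A quick numerical check of the key identity (17), the explicit quadratic feature map of Example 1 against the squared dot product, can be done as follows (a minimal sketch; the function name phi and the random test points are ours).

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map of Example 1, R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(2)
x, x_prime = rng.standard_normal(2), rng.standard_normal(2)

lhs = phi(x) @ phi(x_prime)      # <Phi(x), Phi(x')>
rhs = (x @ x_prime) ** 2         # <x, x'>^2, i.e. the kernel value in (17)
print(np.isclose(lhs, rhs))      # True
```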
2.3 Conditions for Kernels

The question that arises now is, which functions k(x, x') correspond to a dot product in some feature space F. The following theorem characterizes these functions (defined on X).

Theorem 2 (Mercer [1909]) Suppose k ∈ L_∞(X²) such that the integral operator T_k : L_2(X) → L_2(X),

\[ T_k f(\cdot) := \int_X k(\cdot, x) f(x)\, d\mu(x) \tag{20} \]

is positive (here µ denotes a measure on X with µ(X) finite and supp(µ) = X). Let ψ_j ∈ L_2(X) be the eigenfunction of T_k associated with the eigenvalue λ_j ≠ 0 and normalized such that ‖ψ_j‖_{L_2} = 1, and let ψ_j* denote its complex conjugate. Then

1. (λ_j(T))_j ∈ ℓ_1.
2. ψ_j ∈ L_∞(X) and sup_j ‖ψ_j‖_{L_∞} < ∞.
3. k(x, x') = Σ_{j∈ℕ} λ_j ψ_j(x) ψ_j(x') holds for almost all (x, x'), where the series converges absolutely and uniformly for almost all (x, x').

Less formally speaking, this theorem means that if

\[ \int_{X\times X} k(x, x') f(x) f(x')\, dx\, dx' \ge 0 \quad\text{for all } f \in L_2(X) \tag{21} \]

holds we can write k(x, x') as a dot product in some feature space. From this condition we can conclude some simple rules for compositions of kernels, which then also satisfy Mercer's condition [Schölkopf et al., 1999a]. In the following we will call such functions k admissible SV kernels.

Corollary 3 (Positive Linear Combinations of Kernels) Denote by k_1, k_2 admissible SV kernels and c_1, c_2 ≥ 0, then

\[ k(x, x') := c_1 k_1(x, x') + c_2 k_2(x, x') \tag{22} \]

is an admissible kernel. This follows directly from (21) by virtue of the linearity of integrals.

More generally, one can show that the set of admissible kernels forms a convex cone, closed in the topology of pointwise convergence [Berg et al., 1984].

Corollary 4 (Integrals of Kernels) Let s(x, x') be a symmetric function on X × X such that

\[ k(x, x') := \int_X s(x, z)\, s(x', z)\, dz \tag{23} \]

exists. Then k is an admissible SV kernel. This can be shown directly from (21) and (23) by rearranging the order of integration.

Theorem 5 (Products of Kernels) Denote by k_1 and k_2 admissible SV kernels, then

\[ k(x, x') := k_1(x, x')\, k_2(x, x') \tag{24} \]

is an admissible kernel. This can be seen by an application of the "expansion part" of Mercer's theorem to the kernels k_1 and k_2 and observing that each term in the double sum Σ_{i,j} λ_i^1 λ_j^2 ψ_i^1(x) ψ_i^1(x') ψ_j^2(x) ψ_j^2(x') gives rise to a positive coefficient when checking (21).

We now state a necessary and sufficient condition for translation invariant kernels, i.e. k(x, x') := k(x − x'), as derived in [Smola et al., 1998c].

Theorem 6 (Smola, Schölkopf, and Müller [1998c]) A translation invariant kernel k(x, x') = k(x − x') is an admissible SV kernel if and only if the Fourier transform

\[ F[k](\omega) = (2\pi)^{-\frac{d}{2}} \int_X e^{-i\langle \omega, x\rangle} k(x)\, dx \tag{25} \]

is nonnegative.

We will give a proof and some additional explanations to this theorem in section 7. It follows from interpolation theory [Micchelli, 1986] and the theory of regularization networks [Girosi et al., 1993]. For kernels of the dot-product type, i.e. k(x, x') = k(⟨x, x'⟩), there exist sufficient conditions for being admissible.

Theorem 7 (Burges [1999]) Any kernel of dot-product type k(x, x') = k(⟨x, x'⟩) has to satisfy

\[ k(\xi) \ge 0, \quad \partial_\xi k(\xi) \ge 0 \quad\text{and}\quad \partial_\xi k(\xi) + \xi\, \partial_\xi^2 k(\xi) \ge 0 \tag{26} \]

for any ξ ≥ 0 in order to be an admissible SV kernel.

Note that the conditions in theorem 7 are only necessary but not sufficient. The rules stated above can be useful tools for practitioners both for checking whether a kernel is an admissible SV kernel and for actually constructing new kernels. The general case is given by the following theorem.

Theorem 8 (Schoenberg [1942]) A kernel of dot-product type k(x, x') = k(⟨x, x'⟩) defined on an infinite dimensional Hilbert space, with a power series expansion

\[ k(t) = \sum_{n=0}^{\infty} a_n t^n \tag{27} \]

is admissible if and only if all a_n ≥ 0.

A slightly weaker condition applies for finite dimensional spaces. For further details see [Berg et al., 1984, Smola et al., 2001].

2.4 Examples

In [Schölkopf et al., 1998b] it has been shown, by explicitly computing the mapping, that homogeneous polynomial kernels k with p ∈ ℕ and

\[ k(x, x') = \langle x, x'\rangle^p \tag{28} \]

are suitable SV kernels (cf. Poggio [1975]). From this observation one can conclude immediately [Boser et al., 1992, Vapnik, 1995] that kernels of the type

\[ k(x, x') = \bigl(\langle x, x'\rangle + c\bigr)^p \tag{29} \]

i.e. inhomogeneous polynomial kernels with p ∈ ℕ, c ≥ 0, are admissible, too: rewrite k as a sum of homogeneous kernels and apply corollary 3. Another kernel, that might seem appealing due to its resemblance to Neural Networks, is the hyperbolic tangent kernel

\[ k(x, x') = \tanh\bigl(\vartheta + \varphi\langle x, x'\rangle\bigr). \tag{30} \]

By applying theorem 8 one can check that this kernel does not actually satisfy Mercer's condition [Ovari, 2000]. Curiously, the kernel has been successfully used in practice; cf. Schölkopf [1997] for a discussion of the reasons.
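Mercer's condition (21) is rarely verified analytically in practice; a common sanity check is to confirm that Gram matrices built from random samples have no significantly negative eigenvalues. The sketch below does this for the inhomogeneous polynomial kernel (29) and the hyperbolic tangent kernel (30); the parameter values are our own choice, and whether the tanh Gram matrix exhibits negative eigenvalues depends on them and on the data.

```python
import numpy as np

def poly_kernel(X, c=1.0, p=3):
    """Inhomogeneous polynomial kernel (29): (<x, x'> + c)^p."""
    return (X @ X.T + c) ** p

def tanh_kernel(X, theta=-1.0, phi=1.0):
    """Hyperbolic tangent 'kernel' (30); not admissible in general."""
    return np.tanh(theta + phi * (X @ X.T))

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))

for name, K in [("polynomial", poly_kernel(X)), ("tanh", tanh_kernel(X))]:
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: smallest Gram eigenvalue = {lam_min:.3e}")
# The polynomial Gram matrix is positive semidefinite up to round-off;
# the tanh Gram matrix may show clearly negative eigenvalues.
```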
Translation invariant kernels k(x, x') = k(x − x') are quite widespread. It was shown in [Aizerman et al., 1964, Micchelli, 1986, Boser et al., 1992] that

\[ k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \tag{31} \]

is an admissible SV kernel. Moreover one can show [Smola, 1996, Vapnik et al., 1997] that (1_X denotes the indicator function on the set X and ⊗ the convolution operation)

\[ k(x, x') = B_{2n+1}\bigl(\|x - x'\|\bigr) \quad\text{with}\quad B_k := \bigotimes_{i=1}^{k} 1_{[-\frac{1}{2}, \frac{1}{2}]} \tag{32} \]

B-splines of order 2n+1, defined by the 2n+1 convolution of the unit interval, are also admissible. We shall postpone further considerations to section 7 where the connection to regularization operators will be pointed out in more detail.

3 Cost Functions

So far the SV algorithm for regression may seem rather strange and hardly related to other existing methods of function estimation (e.g. [Huber, 1981, Stone, 1985, Härdle, 1990, Hastie and Tibshirani, 1990, Wahba, 1990]). However, once cast into a more standard mathematical notation, we will observe the connections to previous work. For the sake of simplicity we will, again, only consider the linear case, as extensions to the nonlinear one are straightforward by using the kernel method described in the previous chapter.

3.1 The Risk Functional

Let us for a moment go back to the case of section 1.2. There, we had some training data X := {(x_1, y_1), ..., (x_ℓ, y_ℓ)} ⊂ X × ℝ. We will assume now that this training set has been drawn iid (independent and identically distributed) from some probability distribution P(x, y). Our goal will be to find a function f minimizing the expected risk (cf. [Vapnik, 1982])

\[ R[f] = \int c(x, y, f(x))\, dP(x, y) \tag{33} \]

(c(x, y, f(x)) denotes a cost function determining how we will penalize estimation errors) based on the empirical data X. Given that we do not know the distribution P(x, y) we can only use X for estimating a function f that minimizes R[f]. A possible approximation consists in replacing the integration by the empirical estimate, to get the so called empirical risk functional

\[ R_{\text{emp}}[f] := \frac{1}{\ell} \sum_{i=1}^{\ell} c(x_i, y_i, f(x_i)). \tag{34} \]

A first attempt would be to find the empirical risk minimizer f_0 := argmin_{f∈H} R_emp[f] for some function class H. However, if H is very rich, i.e. its "capacity" is very high, as for instance when dealing with few data in very high-dimensional spaces, this may not be a good idea, as it will lead to overfitting and thus bad generalization properties. Hence one should add a capacity control term, in the SV case ‖w‖², which leads to the regularized risk functional [Tikhonov and Arsenin, 1977, Morozov, 1984, Vapnik, 1982]

\[ R_{\text{reg}}[f] := R_{\text{emp}}[f] + \frac{\lambda}{2}\|w\|^2 \tag{35} \]

where λ > 0 is a so called regularization constant. Many algorithms like regularization networks [Girosi et al., 1993] or neural networks with weight decay [e.g. Bishop, 1995] minimize an expression similar to (35).
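As a small illustration of (34) and (35), the following sketch evaluates the regularized risk of a fixed linear model under the ε-insensitive cost; the data, the value of λ and the function names are our own.

```python
import numpy as np

def regularized_risk(w, X, y, cost, lam):
    """R_reg[f] = R_emp[f] + (lambda/2) * ||w||^2 for f(x) = <w, x>, cf. (34)-(35)."""
    residuals = y - X @ w
    return cost(residuals).mean() + 0.5 * lam * (w @ w)

rng = np.random.default_rng(7)
X, w = rng.standard_normal((40, 3)), np.array([1.0, 0.0, -0.5])
y = X @ w + 0.1 * rng.standard_normal(40)
eps_cost = lambda r: np.maximum(np.abs(r) - 0.1, 0.0)   # eps-insensitive cost (4)
print(regularized_risk(w, X, y, eps_cost, lam=0.01))
```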
3.2 Maximum Likelihood and Density Models

The standard setting in the SV case is, as already mentioned in section 1.2, the ε-insensitive loss

\[ c(x, y, f(x)) = |y - f(x)|_\varepsilon. \tag{36} \]

It is straightforward to show that minimizing (35) with the particular loss function of (36) is equivalent to minimizing (3), the only difference being that C = 1/(λℓ).

Loss functions such as |y − f(x)|^p_ε with p > 1 may not be desirable, as the superlinear increase leads to a loss of the robustness properties of the estimator [Huber, 1981]: in those cases the derivative of the cost function grows without bound. For p < 1, on the other hand, c becomes nonconvex.

For the case of c(x, y, f(x)) = (y − f(x))² we recover the least mean squares fit approach, which, unlike the standard SV loss function, leads to a matrix inversion instead of a quadratic programming problem.

The question is which cost function should be used in (35). On the one hand we will want to avoid a very complicated function c, as this may lead to difficult optimization problems. On the other hand one should use that particular cost function that suits the problem best. Moreover, under the assumption that the samples were generated by an underlying functional dependency plus additive noise, i.e. y_i = f_true(x_i) + ξ_i with density p(ξ), the optimal cost function in a maximum likelihood sense is

\[ c(x, y, f(x)) = -\log p\bigl(y - f(x)\bigr). \tag{37} \]

This can be seen as follows. The likelihood of an estimate

\[ X_f := \{(x_1, f(x_1)), \ldots, (x_\ell, f(x_\ell))\} \tag{38} \]

for additive noise and iid data is

\[ p(X_f \mid X) = \prod_{i=1}^{\ell} p\bigl(f(x_i) \mid (x_i, y_i)\bigr) = \prod_{i=1}^{\ell} p\bigl(y_i - f(x_i)\bigr). \tag{39} \]

Maximizing P(X_f | X) is equivalent to minimizing −log P(X_f | X). By using (37) we get

\[ -\log P(X_f \mid X) = \sum_{i=1}^{\ell} c\bigl(x_i, y_i, f(x_i)\bigr). \tag{40} \]

However, the cost function resulting from this reasoning might be nonconvex. In this case one would have to find a convex proxy in order to deal with the situation efficiently (i.e. to find an efficient implementation of the corresponding optimization problem).

If, on the other hand, we are given a specific cost function from a real world problem, one should try to find as close a proxy to this particular cost function as possible, as it is the performance wrt. this particular cost function that matters ultimately.

Table 1 contains an overview over some common density models and the corresponding loss functions as defined by (37).

The only requirement we will impose on c(x, y, f(x)) in the following is that for fixed x and y we have convexity in f(x). This requirement is made, as we want to ensure the existence and uniqueness (for strict convexity) of a minimum of optimization problems [Fletcher, 1989].
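The correspondence (37) between noise densities and cost functions is easy to check numerically: up to an additive constant, the negative log-likelihood (39)-(40) of an estimate under Gaussian noise equals the summed squared-error cost. The residual values below are arbitrary and the normalization constant is kept explicit.

```python
import numpy as np

# Residuals y_i - f(x_i) of some estimate on a toy sample (arbitrary numbers).
residuals = np.array([0.3, -1.2, 0.05, 2.0, -0.4])

def gaussian_density(xi):
    return np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi)

def gaussian_cost(xi):
    # c(xi) = -log p(xi) up to the additive constant log sqrt(2*pi), cf. (37)
    return 0.5 * xi**2

neg_log_lik = -np.log(gaussian_density(residuals)).sum()       # -log P(X_f | X), cf. (39)-(40)
summed_cost = gaussian_cost(residuals).sum() + len(residuals) * np.log(np.sqrt(2 * np.pi))
print(np.isclose(neg_log_lik, summed_cost))                     # True
```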
3.3 Solving the Equations

For the sake of simplicity we will additionally assume c to be symmetric and to have (at most) two (for symmetry) discontinuities at ±ε, ε ≥ 0, in the first derivative, and to be zero in the interval [−ε, ε]. All loss functions from table 1 belong to this class. Hence c will take on the following form:

\[ c(x, y, f(x)) = \tilde{c}\bigl(|y - f(x)|_\varepsilon\bigr) \tag{41} \]

Note the similarity to Vapnik's ε-insensitive loss. It is rather straightforward to extend this special choice to more general convex cost functions. For nonzero cost functions in the interval [−ε, ε] use an additional pair of slack variables. Moreover we might choose different cost functions c̃_i, c̃_i* and different values of ε_i, ε_i* for each sample. At the expense of additional Lagrange multipliers in the dual formulation additional discontinuities also can be taken care of. Analogously to (3) we arrive at a convex minimization problem [Smola and Schölkopf, 1998a]. To simplify notation we will stick to the one of (3) and use C instead of normalizing by λ and ℓ.

\[ \begin{array}{ll} \text{minimize} & \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}\bigl(\tilde{c}(\xi_i) + \tilde{c}(\xi_i^*)\bigr) \\ \text{subject to} & \begin{cases} y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases} \end{array} \tag{42} \]

Again, by standard Lagrange multiplier techniques, exactly in the same manner as in the |·|_ε case, one can compute the dual optimization problem (the main difference is that the slack variable terms c̃(ξ_i^{(*)}) now have nonvanishing derivatives). We will omit the indices i and *, where applicable, to avoid tedious notation. This yields

\[ \begin{array}{ll} \text{maximize} & -\tfrac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x_i, x_j\rangle + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + C\sum_{i=1}^{\ell}\bigl(T(\xi_i) + T(\xi_i^*)\bigr) \\[4pt] \text{where} & w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)x_i, \qquad T(\xi) := \tilde{c}(\xi) - \xi\,\partial_\xi \tilde{c}(\xi) \\[4pt] \text{subject to} & \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0, \quad \alpha \le C\,\partial_\xi \tilde{c}(\xi), \quad \xi = \inf\{\xi \mid C\,\partial_\xi \tilde{c}(\xi) \ge \alpha\}, \quad \alpha, \xi \ge 0 \end{array} \tag{43} \]

3.4 Examples

Let us consider the examples of table 1. We will show explicitly for two examples how (43) can be further simplified to bring it into a form that is practically useful. In the ε-insensitive case, i.e. c̃(ξ) = |ξ|, we get

\[ T(\xi) = \xi - \xi \cdot 1 = 0. \tag{44} \]

Moreover one can conclude from ∂_ξ c̃(ξ) = 1 that

\[ \xi = \inf\{\xi \mid C \ge \alpha\} = 0 \quad\text{and}\quad \alpha \in [0, C]. \tag{45} \]

For the case of piecewise polynomial loss we have to distinguish two different cases: ξ ≤ σ and ξ > σ. In the first case we get

\[ T(\xi) = \frac{1}{p\sigma^{p-1}}\xi^p - \frac{1}{\sigma^{p-1}}\xi^p = -\frac{p-1}{p}\sigma^{1-p}\xi^p \tag{46} \]

and ξ = inf{ξ | Cσ^{1−p}ξ^{p−1} ≥ α} = σ C^{−1/(p−1)} α^{1/(p−1)}, and thus

\[ T(\xi) = -\frac{p-1}{p}\,\sigma\, C^{-\frac{p}{p-1}} \alpha^{\frac{p}{p-1}}. \tag{47} \]

In the second case (ξ ≥ σ) we have

\[ T(\xi) = \xi - \sigma\frac{p-1}{p} - \xi = -\sigma\frac{p-1}{p} \tag{48} \]

and ξ = inf{ξ | C ≥ α} = σ, which, in turn, yields α ∈ [0, C]. Combining both cases we have

\[ \alpha \in [0, C] \quad\text{and}\quad T(\alpha) = -\frac{p-1}{p}\,\sigma\, C^{-\frac{p}{p-1}} \alpha^{\frac{p}{p-1}}. \tag{49} \]

Table 2 contains a summary of the various conditions on α and formulas for T(α) (strictly speaking T(ξ(α))) for different cost functions.⁵ Note that the maximum slope of c̃ determines the region of feasibility of α, i.e. s := sup_{ξ∈ℝ⁺} ∂_ξ c̃(ξ) < ∞ leads to compact intervals [0, Cs] for α. This means that the influence of a single pattern is bounded, leading to robust estimators [Huber, 1972]. One can also observe experimentally that the performance of a SV machine depends significantly on the cost function used [Müller et al., 1997, Smola et al., 1998b].

⁵ The table displays CT(α) instead of T(α) since the former can be plugged directly into the corresponding optimization equations.

A cautionary remark is necessary regarding the use of cost functions other than the ε-insensitive one. Unless ε ≠ 0 we will lose the advantage of a sparse decomposition. This may be acceptable in the case of few data, but will render the prediction step extremely slow otherwise. Hence one will have to trade off a potential loss in prediction accuracy with faster predictions. Note, however, that also a reduced set algorithm like in [Burges, 1996, Burges and Schölkopf, 1997, Schölkopf et al., 1999b] or sparse decomposition techniques [Smola and Schölkopf, 2000] could be applied to address this issue. In a Bayesian setting, Tipping [2000] has recently shown how an L2 cost function can be used without sacrificing sparsity.

Table 1: Common loss functions and corresponding density models

  loss function          c(ξ)                                                    density model p(ξ)
  ε-insensitive          |ξ|_ε                                                   1/(2(1+ε)) · exp(−|ξ|_ε)
  Laplacian              |ξ|                                                     (1/2) · exp(−|ξ|)
  Gaussian               (1/2) ξ²                                                1/√(2π) · exp(−ξ²/2)
  Huber's robust loss    ξ²/(2σ) if |ξ| ≤ σ;  |ξ| − σ/2 otherwise                ∝ exp(−ξ²/(2σ)) if |ξ| ≤ σ;  ∝ exp(σ/2 − |ξ|) otherwise
  Polynomial             (1/p) |ξ|^p                                             p/(2Γ(1/p)) · exp(−|ξ|^p)
  Piecewise polynomial   |ξ|^p/(pσ^{p−1}) if |ξ| ≤ σ;  |ξ| − σ(p−1)/p otherwise  ∝ exp(−ξ^p/(pσ^{p−1})) if |ξ| ≤ σ;  ∝ exp(σ(p−1)/p − |ξ|) otherwise

Table 2: Terms of the convex optimization problem depending on the choice of the loss function

  loss function          ε        α             CT(α)
  ε-insensitive          ε ≠ 0    α ∈ [0, C]    0
  Laplacian              ε = 0    α ∈ [0, C]    0
  Gaussian               ε = 0    α ∈ [0, ∞)    −(1/2) C^{−1} α²
  Huber's robust loss    ε = 0    α ∈ [0, C]    −(1/2) σ C^{−1} α²
  Polynomial             ε = 0    α ∈ [0, ∞)    −((p−1)/p) C^{−1/(p−1)} α^{p/(p−1)}
  Piecewise polynomial   ε = 0    α ∈ [0, C]    −((p−1)/p) σ C^{−1/(p−1)} α^{p/(p−1)}
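For reference, several of the loss functions of Table 1 translate directly into code; the vectorized helpers below follow the definitions in the table, with naming of our own.

```python
import numpy as np

def eps_insensitive(xi, eps):
    return np.maximum(np.abs(xi) - eps, 0.0)

def laplacian(xi):
    return np.abs(xi)

def gaussian(xi):
    return 0.5 * xi**2

def huber(xi, sigma):
    quadratic = np.abs(xi) <= sigma
    return np.where(quadratic, xi**2 / (2 * sigma), np.abs(xi) - sigma / 2)

def piecewise_polynomial(xi, sigma, p):
    poly = np.abs(xi) <= sigma
    return np.where(poly,
                    np.abs(xi)**p / (p * sigma**(p - 1)),
                    np.abs(xi) - sigma * (p - 1) / p)

xi = np.linspace(-2.0, 2.0, 5)
print(huber(xi, sigma=1.0))
```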
4 The Bigger Picture

Before delving into algorithmic details of the implementation let us briefly review the basic properties of the SV algorithm for regression as described so far. Figure 2 contains a graphical overview over the different steps in the regression stage. The input pattern (for which a prediction is to be made) is mapped into feature space by a map Φ. Then dot products are computed with the images of the training patterns under the map Φ. This corresponds to evaluating kernel functions k(x_i, x). Finally the dot products are added up using the weights ν_i = α_i − α_i*. This, plus the constant term b, yields the final prediction output. The process described here is very similar to regression in a neural network, with the difference that in the SV case the weights in the input layer are a subset of the training patterns.

[Figure 2: Architecture of a regression machine constructed by the SV algorithm. The test vector x and the support vectors x_1, ..., x_ℓ are mapped to Φ(x), Φ(x_i); the dot products (Φ(x) · Φ(x_i)) = k(x, x_i) are formed, combined with the weights ν_i = α_i − α_i*, and added to b to give the output Σ_i ν_i k(x, x_i) + b.]

Figure 3 demonstrates how the SV algorithm chooses the flattest function among those approximating the original data with a given precision. Although requiring flatness only in feature space, one can observe that the functions also are very flat in input space. This is due to the fact that kernels can be associated with flatness properties via regularization operators. This will be explained in more detail in section 7.

Finally, Fig. 4 shows the relation between approximation quality and sparsity of representation in the SV case. The lower the precision required for approximating the original data, the fewer SVs are needed to encode that. The non-SVs are redundant, i.e. even without these patterns in the training set, the SV machine would have constructed exactly the same function f. One might think that this could be an efficient way of data compression, namely by storing only the support patterns, from which the estimate can be reconstructed completely. However, this simple analogy turns out to fail in the case of high-dimensional data, and even more drastically in the presence of noise. In [Vapnik et al., 1997] one can see that even for moderate approximation quality, the number of SVs can be considerably high, yielding rates worse than the Nyquist rate [Nyquist, 1928, Shannon, 1948].

[Figure 3: Left to right: approximation of the function sinc x with precisions ε = 0.1, 0.2, and 0.5. The solid top and bottom lines indicate the size of the ε-tube, the dotted line in between is the regression.]

[Figure 4: Left to right: regression (solid line), data points (small dots) and SVs (big dots) for an approximation with ε = 0.1, 0.2, and 0.5. Note the decrease in the number of SVs.]

5 Optimization Algorithms

While there has been a large number of implementations of SV algorithms in the past years, we focus on a few algorithms which will be presented in greater detail. This selection is somewhat biased, as it contains these algorithms the authors are most familiar with. However, we think that this overview contains some of the most effective ones and will be useful for practitioners who would like to actually code a SV machine by themselves. But before doing so we will briefly cover major optimization packages and strategies.

5.1 Implementations

Most commercially available packages for quadratic programming can also be used to train SV machines. These are usually numerically very stable general purpose codes, with special enhancements for large sparse systems. While the latter is a feature that is not needed at all in SV problems (there the dot product matrix is dense and huge) they still can be used with good success.⁶

⁶ The high price tag usually is the major deterrent for not using them. Moreover one has to bear in mind that in SV regression, one may speed up the solution considerably by exploiting the fact that the quadratic form has a special structure or that there may exist rank degeneracies in the kernel matrix itself.

OSL This package was written by [IBM Corporation, 1992]. It uses a two phase algorithm. The first step consists of solving a linear approximation of the QP problem by the simplex algorithm [Dantzig, 1962]. Next a related very simple QP problem is dealt with. When successive approximations are close enough together, the second sub-algorithm, which permits a quadratic objective and converges very rapidly from a good starting value, is used. Recently an interior point algorithm was added to the software suite.

CPLEX by CPLEX Optimization Inc. [1994] uses a primal-dual logarithmic barrier algorithm [Megiddo, 1989] instead, with a predictor-corrector step (see e.g. [Lustig et al., 1992, Mehrotra and Sun, 1992]).

MINOS by the Stanford Optimization Laboratory [Murtagh and Saunders, 1983] uses a reduced gradient algorithm in conjunction with a quasi-Newton algorithm. The constraints are handled by an active set strategy. Feasibility is maintained throughout the process. On the active constraint manifold, a quasi-Newton approximation is used.

MATLAB Until recently the MATLAB QP optimizer delivered only agreeable, although below average, performance on classification tasks and was not all too useful for regression tasks (for problems much larger than 100 samples) due to the fact that one is effectively dealing with an optimization problem of size 2ℓ where at least half of the eigenvalues of the Hessian vanish. These problems seem to have been addressed in version 5.3 / R11. MATLAB now uses interior point codes.

LOQO by Vanderbei [1994] is another example of an interior point code. Section 5.3 discusses the underlying strategies in detail and shows how they can be adapted to SV algorithms.

Maximum Margin Perceptron by Kowalczyk [2000] is an algorithm specifically tailored to SVs. Unlike most other techniques it works directly in primal space and thus does not have to take the equality constraint on the Lagrange multipliers into account explicitly.

Iterative Free Set Methods The algorithm by Kaufman [Bunch et al., 1976, Bunch and Kaufman, 1977, 1980, Drucker et al., 1997, Kaufman, 1999] uses such a technique, starting with all variables on the boundary and adding them as the Karush-Kuhn-Tucker conditions become more violated. This approach has the advantage of not having to compute the full dot product matrix from the beginning. Instead it is evaluated on the fly, yielding a performance improvement in comparison to tackling the whole optimization problem at once. However, also other algorithms can be modified by subset selection techniques (see section 5.5) to address this problem.
5.2 Basic Notions

Most algorithms rely on results from the duality theory in convex optimization. Although we already happened to mention some basic ideas in section 1.2 we will, for the sake of convenience, briefly review without proof the core results. These are needed in particular to derive an interior point algorithm. For details and proofs see e.g. [Fletcher, 1989].

Uniqueness Every convex constrained optimization problem has a unique minimum. If the problem is strictly convex then the solution is unique. This means that SVs are not plagued with the problem of local minima as Neural Networks are.⁷

⁷ For large and noisy problems (e.g. 100,000 patterns and more with a substantial fraction of nonbound Lagrange multipliers) it is impossible to solve the problem exactly: due to the size one has to use subset selection algorithms, hence joint optimization over the training set is impossible. However, unlike in Neural Networks, we can determine the closeness to the optimum. Note that this reasoning only holds for convex cost functions.

Lagrange Function The Lagrange function is given by the primal objective function plus the sum of all products between constraints and corresponding Lagrange multipliers (cf. e.g. [Fletcher, 1989, Bertsekas, 1995]). Optimization can be seen as minimization of the Lagrangian wrt. the primal variables and simultaneous maximization wrt. the Lagrange multipliers, i.e. dual variables. It has a saddle point at the solution. Usually the Lagrange function is only a theoretical device to derive the dual objective function (cf. Sec. 1.2).

Dual Objective Function It is derived by minimizing the Lagrange function with respect to the primal variables and subsequent elimination of the latter. Hence it can be written solely in terms of the dual variables.

Duality Gap For both feasible primal and dual variables the primal objective function (of a convex minimization problem) is always greater or equal than the dual objective function. Since SVMs have only linear constraints the constraint qualifications of the strong duality theorem [Bazaraa et al., 1993, Theorem 6.2.4] are satisfied and it follows that the gap vanishes at optimality. Thus the duality gap is a measure of how close (in terms of the objective function) the current set of variables is to the solution.

Karush-Kuhn-Tucker (KKT) conditions A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the solution (i.e. constraint · dual variable = 0). The sum of the violated KKT terms determines exactly the size of the duality gap (that is, we simply compute the constraint · Lagrange multiplier part as done in (55)). This allows us to compute the latter quite easily. A simple intuition is that for violated constraints the dual variable could be increased arbitrarily, thus rendering the Lagrange function arbitrarily large. This, however, is in contradiction to the saddle point property.

5.3 Interior Point Algorithms

In a nutshell the idea of an interior point algorithm is to compute the dual of the optimization problem (in our case the dual of R_reg[f]) and solve both primal and dual simultaneously. This is done by only gradually enforcing the KKT conditions to iteratively find a feasible solution and to use the duality gap between primal and dual objective function to determine the quality of the current set of variables. The special flavour of algorithm we will describe is primal-dual path-following [Vanderbei, 1994].

In order to avoid tedious notation we will consider the slightly more general problem and specialize the result to the SVM later. It is understood that unless stated otherwise, variables like α denote vectors and α_i denotes its i-th component.

\[ \begin{array}{ll} \text{minimize} & \tfrac{1}{2} q(\alpha) + \langle c, \alpha\rangle \\ \text{subject to} & A\alpha = b \ \text{ and } \ l \le \alpha \le u \end{array} \tag{50} \]

with c, α, l, u ∈ ℝⁿ, A ∈ ℝ^{n·m}, b ∈ ℝ^m, the inequalities between vectors holding componentwise and q(α) being a convex function of α. Now we will add slack variables to get rid of all inequalities but the positivity constraints. This yields:

\[ \begin{array}{ll} \text{minimize} & \tfrac{1}{2} q(\alpha) + \langle c, \alpha\rangle \\ \text{subject to} & A\alpha = b,\ \ \alpha - g = l,\ \ \alpha + t = u,\ \ g, t \ge 0,\ \ \alpha\ \text{free} \end{array} \tag{51} \]

The dual of (51) is

\[ \begin{array}{ll} \text{maximize} & \tfrac{1}{2}\bigl(q(\alpha) - \langle \partial q(\alpha), \alpha\rangle\bigr) + \langle b, y\rangle + \langle l, z\rangle - \langle u, s\rangle \\ \text{subject to} & \tfrac{1}{2}\partial q(\alpha) + c - (Ay)^{\top} + s = z, \quad s, z \ge 0, \ y\ \text{free} \end{array} \tag{52} \]

Moreover we get the KKT conditions, namely

\[ g_i z_i = 0 \quad\text{and}\quad s_i t_i = 0 \quad\text{for all } i \in [1 \ldots n]. \tag{53} \]

A necessary and sufficient condition for the optimal solution is that the primal / dual variables satisfy both the feasibility conditions of (51) and (52) and the KKT conditions (53). We proceed to solve (51)-(53) iteratively. The details can be found in appendix A.
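To connect the general form (50)/(51) with the SV regression dual (18), the sketch below assembles the quadratic matrix, linear term, equality constraint and box constraints for a toy linear problem; the variable names and toy data are ours, and no interior point iteration is performed here.

```python
import numpy as np

# Cast the SVR dual into the generic form (50): variables z = (alpha, alpha*),
# quadratic objective 0.5 * z^T Q z + <c, z>, one equality constraint A z = b
# and box constraints low <= z <= up.
rng = np.random.default_rng(6)
X = rng.standard_normal((20, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(20)
m, C, eps = len(y), 1.0, 0.1

K = X @ X.T                                             # kernel matrix k(x_i, x_j)
Q = np.block([[K, -K], [-K, K]])                        # quadratic part, beta^T K beta
c = eps * np.ones(2 * m) + np.concatenate([-y, y])      # linear part
A = np.concatenate([np.ones(m), -np.ones(m)])[None, :]  # sum(alpha - alpha*) = 0
b = np.zeros(1)
low, up = np.zeros(2 * m), C * np.ones(2 * m)           # box constraints [l, u]

def primal_objective(z):
    """0.5 * z^T Q z + <c, z>, the objective of (50) for the SVR dual."""
    return 0.5 * z @ Q @ z + c @ z

z0 = np.zeros(2 * m)                                    # a feasible starting point
print(primal_objective(z0), A @ z0, bool((low <= z0).all() and (z0 <= up).all()))
```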
5.4 Useful Tricks

Before proceeding to further algorithms for quadratic optimization let us briefly mention some useful tricks which can be applied to all algorithms described subsequently and may have significant impact despite their simplicity. They are in part derived from ideas of the interior-point approach.

Training with Different Regularization Parameters For several reasons (model selection, controlling the number of support vectors, etc.) it may happen that one has to train a SV machine with different regularization parameters C, but otherwise rather identical settings. If the parameter C_new = τ C_old is not too different it is advantageous to use the rescaled values of the Lagrange multipliers (i.e. α_i, α_i*) as a starting point for the new optimization problem. Rescaling is necessary to satisfy the modified constraints. One gets

\[ \alpha_{\text{new}} = \tau\,\alpha_{\text{old}} \quad\text{and likewise}\quad b_{\text{new}} = \tau\, b_{\text{old}}. \tag{54} \]

Assuming that the (dominant) convex part q(α) of the primal objective is quadratic, q scales with τ², whereas the linear part scales with τ. However, since the linear term dominates the objective function, the rescaled values are still a better starting point than α = 0. In practice a speedup of approximately 95% of the overall training time can be observed when using the sequential minimization algorithm, cf. [Smola, 1998]. A similar reasoning can be applied when retraining with the same regularization parameter but different (yet similar) width parameters of the kernel function. See [Cristianini et al., 1998] for details thereon in a different context.

Monitoring Convergence via the Feasibility Gap In the case of both primal and dual feasible variables the following connection between primal and dual objective function holds:

\[ \text{Dual Obj.} = \text{Primal Obj.} - \sum_i (g_i z_i + s_i t_i) \tag{55} \]

This can be seen immediately by the construction of the Lagrange function. In Regression Estimation (with the ε-insensitive loss function) one obtains for Σ_i (g_i z_i + s_i t_i)

\[ \sum_{i=1}^{\ell} \Bigl[ -\min\bigl(0, f(x_i) - (y_i + \varepsilon_i)\bigr)\,\alpha_i \;-\; \min\bigl(0, (y_i - \varepsilon_i^*) - f(x_i)\bigr)\,\alpha_i^* \;+\; \max\bigl(0, f(x_i) - (y_i + \varepsilon_i)\bigr)\,(C - \alpha_i) \;+\; \max\bigl(0, (y_i - \varepsilon_i^*) - f(x_i)\bigr)\,(C - \alpha_i^*) \Bigr]. \tag{56} \]

Thus convergence with respect to the point of the solution can be expressed in terms of the duality gap. An effective stopping rule is to require

\[ \frac{\sum_i g_i z_i + s_i t_i}{|\text{Primal Objective}| + 1} \le \varepsilon_{\text{tol}} \tag{57} \]

for some precision ε_tol. This condition is much in the spirit of primal dual interior point path following algorithms, where convergence is measured in terms of the number of significant figures (which would be the decimal logarithm of (57)), a convention that will also be adopted in the subsequent parts of this exposition.
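The stopping rule (57), together with (55), is straightforward to code; the helper names below are our own, and the gap is computed from the primal and dual objective values rather than from the individual g_i z_i and s_i t_i terms.

```python
import math

def converged(primal_obj, dual_obj, tol=1e-6):
    """Stopping rule (57): normalized duality gap below tol.
    By (55) the gap primal - dual equals the sum of g_i z_i + s_i t_i."""
    return abs(primal_obj - dual_obj) / (abs(primal_obj) + 1.0) <= tol

def significant_figures(primal_obj, dual_obj):
    """Roughly the number of significant figures shared by primal and dual
    objective, i.e. the negative decimal logarithm of the quantity in (57)."""
    gap = abs(primal_obj - dual_obj)
    return -math.log10(gap / (abs(primal_obj) + 1.0))

print(converged(10.0, 9.999), significant_figures(10.0, 9.999))   # False, ~4
```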
5.5 Subset Selection Algorithms

The convex programming algorithms described so far can be used directly on moderately sized (up to 3000 samples) datasets without any further modifications. On large datasets, however, it is difficult, due to memory and cpu limitations, to compute the dot product matrix k(x_i, x_j) and keep it in memory. A simple calculation shows that for instance storing the dot product matrix of the NIST OCR database (60,000 samples) at single precision would consume 0.7 GBytes. A Cholesky decomposition thereof, which would additionally require roughly the same amount of memory and 64 Teraflops (counting multiplies and adds separately), seems unrealistic, at least at current processor speeds.

A first solution, which was introduced in [Vapnik, 1982], relies on the observation that the solution can be reconstructed from the SVs alone. Hence, if we knew the SV set beforehand, and it fitted into memory, then we could directly solve the reduced problem. The catch is that we do not know the SV set before solving the problem. The solution is to start with an arbitrary subset, a first chunk that fits into memory, train the SV algorithm on it, keep the SVs and fill the chunk up with data the current estimator would make errors on (i.e. data lying outside the ε-tube of the current regression). Then retrain the system and keep on iterating until after training all KKT-conditions are satisfied.

The basic chunking algorithm just postponed the underlying problem of dealing with large datasets whose dot-product matrix cannot be kept in memory: it will occur for larger training set sizes than originally, but it is not completely avoided. Hence the solution is [Osuna et al., 1997] to use only a subset of the variables as a working set and optimize the problem with respect to them while freezing the other variables. This method is described in detail in [Osuna et al., 1997, Joachims, 1999, Saunders et al., 1998] for the case of pattern recognition.⁸

⁸ A similar technique was employed by Bradley and Mangasarian [1998] in the context of linear programming in order to deal with large datasets.

An adaptation of these techniques to the case of regression with convex cost functions can be found in appendix B. The basic structure of the method is described by algorithm 1.

Algorithm 1 Basic structure of a working set algorithm.
  Initialize α_i, α_i* = 0
  Choose arbitrary working set S_w
  repeat
    Compute coupling terms (linear and constant) for S_w (see Appendix B)
    Solve reduced optimization problem
    Choose new S_w from variables α_i, α_i* not satisfying the KKT conditions
  until working set S_w = ∅

5.6 Sequential Minimal Optimization

Recently an algorithm, Sequential Minimal Optimization (SMO), was proposed [Platt, 1999] that puts chunking to the extreme by iteratively selecting subsets of size two and optimizing the target function with respect to them.
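Algorithm 1 can be sketched compactly in code. The version below is a simplified illustration rather than the authors' implementation: it freezes all variables outside the working set at zero, re-solves the reduced dual with a generic solver, grows the working set by the points violating the ε-tube, and for brevity ignores the offset b when checking violations; the selection heuristic, chunk sizes and helper names are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_subset(K, y, C, eps, idx):
    """Solve the SVR dual restricted to the variables in `idx`
    (all other alpha, alpha* frozen at zero), via a generic solver."""
    Ks, ys, m = K[np.ix_(idx, idx)], y[idx], len(idx)

    def neg_dual(z):
        beta = z[:m] - z[m:]
        return 0.5 * beta @ Ks @ beta + eps * z.sum() - ys @ beta

    cons = {"type": "eq", "fun": lambda z: np.sum(z[:m] - z[m:])}
    res = minimize(neg_dual, np.zeros(2 * m), method="SLSQP",
                   bounds=[(0, C)] * (2 * m), constraints=[cons])
    return res.x[:m], res.x[m:]

rng = np.random.default_rng(5)
X = rng.standard_normal((80, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(80)
C, eps, tol = 10.0, 0.1, 1e-3
K = X @ X.T
beta = np.zeros(len(y))                 # alpha_i - alpha_i*, all zero initially
work = list(range(10))                  # arbitrary first chunk

for it in range(20):
    a, a_star = solve_dual_subset(K, y, C, eps, work)
    beta[:] = 0.0
    beta[work] = a - a_star
    f = K @ beta                        # predictions, offset b omitted for brevity
    # Chunking-style selection: points outside the eps-tube whose
    # multipliers are still frozen at zero join the working set.
    viol = np.where((np.abs(f - y) > eps + tol) & (beta == 0.0))[0]
    viol = [i for i in viol if i not in work]
    if not viol:
        break
    work = sorted(set(work) | set(viol[:10]))

print("iterations:", it + 1, " support vectors:", int(np.sum(np.abs(beta) > tol)))
```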
