ebook img

Large-scale Kernel-based Feature Extraction via Budgeted Nonlinear Subspace Tracking PDF

0.69 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Large-scale Kernel-based Feature Extraction via Budgeted Nonlinear Subspace Tracking

IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 1 Large-scale Kernel-based Feature Extraction via Budgeted Nonlinear Subspace Tracking Fatemeh Sheikholeslami, Student Member, IEEE, Dimitris Berberidis, Student Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE Abstract—Kernel-basedmethodsenjoypowerfulgeneralizationcapabilitiesinhandlingavarietyoflearningtasks.Whensuchmethods areprovidedwithsufficienttrainingdata,broadly-applicableclassesofnonlinearfunctionscanbeapproximatedwithdesiredaccuracy. 6 Nevertheless, inherent to the nonparametric nature of kernel-based estimators are computational and memory requirements that 1 becomeprohibitivewithlarge-scaledatasets.Inresponsetothisformidablechallenge,thepresentworkputsforwardalow-rank,kernel- 0 based,featureextractionapproachthatisparticularlytailoredforonlineoperation,wheredatastreamsneednotbestoredinmemory.A 2 novelgenerativemodelisintroducedtoapproximatehigh-dimensional(possiblyinfinite)featuresviaalow-ranknonlinearsubspace,the learningofwhichleadstoadirectkernelfunctionapproximation.Offlineandonlinesolversaredevelopedforthesubspacelearningtask, n alongwithaffordableversions,inwhichthenumberofstoreddatavectorsisconfinedtoapredefinedbudget.Analyticalresultsprovide a performanceboundsonhowwellthekernelmatrixaswellaskernel-basedclassificationandregressiontaskscanbeapproximatedby J leveragingbudgetedonlinesubspacelearningandfeatureextractionschemes.Testsonsyntheticandrealdatasetsdemonstrateand 8 benchmarktheefficiencyoftheproposedmethodwhenlinearclassificationandregressionisappliedtotheextractedfeatures. 2 IndexTerms—Onlinefeatureextraction,kernelmethods,classification,regression,budgetedlearning,subspacetracking. ] L ✦ M . t 1 INTRODUCTION a t s KERNEL-BASEDexpansionscanboostthegeneralization selected data vectors [20]. Nystrom approximation further [ capability of learning tasks by powerfully modeling improvesthisapproachbyapproximatingthekernelmatrix nonlinearfunctions,whenlinearfunctionsfallshortinprac- as K K⊤W†K , where W is an r r submatrix of 1 ≃ S S × v tice.Infact,theupshotofkernelmachinesisthatwhenpro- K corresponding to the kernel values computed between 7 videdwithsufficienttrainingdata,arbitrarynonlinearfunc- the selected r data vectors [38]. In [11], [19], [40], further 4 tionscanbeapproximatedwithdesiredaccuracy.Although attemptsto improve the Nystrom approximationare made 9 “datadeluge”setsthestagebyprovidingthe“data-hungry” for ensembledatasets,while [1], [7], [41]analyzehowsuch 7 kernel methods with huge datasets, limited memory and low-rank approximations influence the accuracy of the tar- 0 computational capabilities prevent such tools from fully gettasks. . 1 exploiting their learning capabilities. In particular, given a 0 dataset with N vectors of dimension D, a standard kernel Feature approximation is a second possibility where 6 regressionorclassificationtooltakes (N2D)operationsto instead of approximating the kernel matrix, one directly v:1 ifto,ramndthe(NN3×)NopekreartnioenlsmtaotrfiinxdKth,emoOepmtiomryalOp(aNtte2r)nt.o store adpimpreonxsiimonaatelsfethartuouregshgainvern×by1 ftehaetumreavpepcitnogr zφth:exhigh- Xi In tOhis context, several efforts have been made in dif- φ(x), from which the kernel is induced as κ(xi,xj) =→< r ferentfieldsofstochasticandnumericaloptimization,func- φ(xi),φ(xj)>[21],[22],[27],[39].Basedonz,thenonlinear a tional analysis, and numerical algebra to speed up kernel kernel κ(xi,xj) is approximated by the linear one, mean- machines for “big data” applications [8], [11], [18], [22], ing κ(xi,xj) ≃ z⊤i zj. Exploiting the fast linear learning [27], [32], [38]. A common approach for scaling up kernel machines [12], [32], the kernel-based task then reduces to methods is to approximate the kernel matrix K by a low- learningalinearfunction,whichcanbeachievedin (Nr) O rank factorization; that is K Z⊤Z, where Z Rr×N operations. A third approach to alleviate the prohibitive ≃ ∈ with r ( N) is the reduced rank, through which storage requirements of kernel methods is offered by online algo- ≪ and computational requirements go down to (Nr) and rithms. Instead of loading the entire datasets in memory, (Nr2), respectively. Such a factorization canObe effected onlinemethodsiterativelypassoverthesetfromanexternal O by randomly selecting r training vectors to approximate memory[4],[16],[17],[18],[31],[32],[33].Thisisalsocritical the kernel matrix as K K⊤K , where K RN×r is when the entire dataset is not available beforehand, but ≃ S S S ∈ formedbyasubset ofcolumnsofKcorrespondingtothe is acquired one datum at a time. For large data streams S however, as the number of data increases with time, the numberofsupportvectors(SV)throughwhichthefunction • This work was supported by NSF grants 1343248, 1500713, and the is estimated, i.e., the set in the approximation f(x) NIHgrantno.1R01GM104975-01. Preliminarypartsofthisworkwere fˆ(x) = α κ(x ,x), aSlso increases. Thus, the functio≃n presentedattheProc.ofGlobalsipConf.,Orlando,FL,Dec.2015. i∈S i i AuthorsarewiththeDept.ofElectricalandComp.Engr.andtheDigital evaluationdelayaswellastherequiredmemoryforstoring P Tech.Center,UniversityofMinnesota,Minneapolis,MN55455,USA. the SV set eventually become unaffordable. Efforts have E-mails:{sheik081,bermp001,georgios}@umn.edu beendevotedtoreducingthenumberofSVswhiletryingto IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 2 maintainperformanceontheunseendata(a.k.a.generaliza- the rank regularization is replaced with orthonormality tion capability)[10]. Inmore recent attempts,byrestricting constraintsonthecolumnsofL.Undoubtedly,theaccuracy themaximumnumberof SVstoapredefinedbudgetB,the ofanylineardimensionalityreductionmethodisdictatedby growthofalgorithmcomplexityisrestrictedtoanaffordable howwellthemodel(1)fitsagivendataset,whichisrelated limit,andismaintainedthroughouttheonlineclassification to how well the corresponding data covariance matrix can [9],[35],[36]orregression[34]task. beapproximatedbyalow-rankmatrix[15,p.534]. The present work introduces a generative model ac- In practice however, low-rank linear models often fail cording to which the high-dimensional (possibly infinite) to accurately capture the underlying characteristics of the features are approximated by their projection onto a low- datasets. A means to deal with nonlinearities in pattern rank nonlinear subspace, thus providing a direct kernel recognitiontasks,istofirstmapvectors x N toahigher functionapproximation(Section2). Incontrastto[11],[21], D¯-dimensionalspaceusingafunctionφ{:Rν}Dν=1 RD¯ (possi- [27], [40] where, due to the nature of randomization, the blywithD¯ = ),andsubsequentlyseekaline→arfunctionis number of required features for providing an accurate ker- ∞ soughtoverthelifteddataφ(x).Correspondingtothemap- nel function approximation is often large, systematically ping,aso-termedkernelfunctionκ(x ,x )= φ⊤(x )φ(x ) learning the ambient nonlinear subspace can provide an i j i j isinduced,whichchosentohaveaclosedform expression, accurate approximation through a smaller number of ex- circumventstheneedtoexplicitlyknow φ(x ) N -what tractedfeatures.Offlineandonlinesolversforthesubspace isreferredtoasthe“kerneltrick”.Upond{efiniνng}νth=e1D¯ N learningtaskaredevelopedinSections3and4respectively, × andtheirconvergenceisanalyzed.Furthermore,tokeepthe matrixΦN := [φ(x1),...,φ(xN)], theN ×N kernelmatrix related to the covariance of the lifted data, is expressed complexity and memory requirements affordable, budgeted versionsoftheproposedalgorithmsaredevisedinSection5, with (i,j) entry κ(xi,xj) as the K(x1:N,x1:N) = Φ⊤NΦN, in which the number of stored data vectors is confined to wherex1:N := [x1,x2,...,xN].Itscomputationandstorage are (N2D) and (N2) respectively, which is often not a predefined budget B. Budget maintenance is performed O O affordableinlargedatasets(N ). throughagreedyapproach,whoseeffectivenessiscorrobo- ≫ ratedthroughsimulatedtests.AnalyticalresultsinSection6 In practical large datasets however, K is approximately provideperformanceboundsonhowwellthekernelmatrix low rank. This fact is exploited in e.g., [11], [39] and [38] aswellaskernelizedclassificationandregressiontaskscan to approximate K via a low-rank factorization, hence alle- be approximated by leveraging the novel budgeted online viating the evaluation and memory requirements of offline subspace-learningand feature-extraction approach. Finally, kernel-based learning tasks from (N2) down to (Nr). O O Section7presentsclassificationandregressionexperiments Here, we further build on this observation to deduce that onsyntheticandrealdatasets,demonstratingtheefficiency thelow-rankpropertyofK = Φ⊤Φ impliesthatΦ can N N N oftheproposedmethodsintermsofaccuracyandruntime. alsobeapproximatedbyalow-rankmatrix,thusmotivating ourpursuitofonlinelow-rankfactorizationofΦ .Tothisend, N instead of projecting x s onto the columns of L as in (2), 2 PRELIMINARIES AND PROBLEM STATEMENT wewillproject φ(x){s}onL¯ RD¯×r whosecolumnsspan Consider a set of N real data vectors of size D 1. As whatwe referto{as“v}irtual”c∈olumnsubspacesinceD¯ can × large values of D and N hinder storage and processing of beinfinite.Specifically,weconsider[cf.(2)] suchdatasets,extractinginformativefeaturesfromthedata N (a.k.a. dimensionality reduction) results in huge savings on min 1 φ(x ) L¯q 2+ λ L¯ 2 + Q 2 mtalelmyobruyildansdoncotmheppurtaetmioinsealthreaqtuthireeminefonrtsm.aTthivisefpuanrdtaomf tehne- L¯,{qν}Nν=1 2N νX=1k ν − νk2 2N(cid:16)k kF k NkF(3(cid:17)) daraetawieslolfrelpowresdeinmteednsbiyonthre<geDne,raantidvealmsoodtheel data{xν}Nν=1 wherekL¯kF := rc=1klck2Hwithk.kHdenotingthenorm definedinthereqproducingkernelHilbertspace(RKHS)in- P xν =Lqν +vν , ν =1,...,n (1) ducedbythekernelκ(x,.),andlcdenotingthec-thcolumn ofL¯.ForafixedQ ,thecriterionin(3)isminimizedbythe N wherethetallD r matrixLhasrankr < D;vectorq is × ν subspace the r 1 projectionof x onto the column space of L; and ν vν de×noteszero-meannoise. L¯ =Φ Q⊤ Q Q⊤ +λI −1 :=Φ A (4) N N N N N N PursuitofsubspaceLandthelow-dimensionalfeatures {qν}Nν=1ispossibleusingablindleast-squares(LS)criterion wheretheN ×r factor(cid:16)Acanbeview(cid:17)edas“morphing’the regularizedbyarank-promotingtermusinge.g.,thenuclear columns of Φ to offer a flexible basis for the lifted data. N normofLQN,whereQN :=[q1,...,qN][28].Toarriveatan Substituting(4)backinto(3)andexploitingthekerneltrick, efficient subspace tracker, the nuclear norm can be further wearriveat surrogatedbytheFrobenious-normsofLandQN,yielding 1 N L,{mqνi}nNν=1 21N νX=n1kxν −Lqνk22+ 2λN(cid:16)kLk2F +kQNk2F(cid:17)(2) A,{mqνin}Nν=1 2N νX=1+(cid:16)κq(⊤νxAν,⊤xKν)(x−1:2Nk,⊤x(1x:N1:)NA,xqνν)Aqν (5) where λ controls the tradeoff between LS fit and rank reg- N(cid:17) λ ularization [24]. Principal component analysis (PCA) - the + 2N tr{A⊤K(x1:N,x1:N)A}+ kqνk22 “workhorse” of dimensionality reduction- solves (2) when (cid:16) νX=1 (cid:17) IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 3 where the N 1 vector k(x1:N,xn) in (5) is the n-th Proposition 1. For positive definite kernels and λ > 0, the × column of K(x1:N,x1:N), and since A has size N r, the sequence A[k],QN[k] generated byAlgorithm 1 convergesto minimizationin(5)doesnotdependonD¯. × astationa{rypointofthe}minimizationin(5). Our goal is to develop and analyze offline as well as Proof: Since the minimizations in (6a) and (8) are strictly online solvers for (5). By pre-specifying an affordable com- convexwithuniquesolutions,theresultfollowsform [2,p. plexity for the online solver, we aim at a low-complexity 272]. (cid:4) algorithm where subspace learning and feature extraction can be performed on-the-fly for streaming applications. Algorithm1BKFE:BatchKernel-basedFeatureExtraction Furthermore, we will introduce a novel approach to ex- Input x N ,λ tracting features on which the kernel-based learning tasks Initiali{zeνA}ν[1=]1atrandom of complexity O(N3) can be well approximated by linear Fork =1,...do counterparts of complexity (rN), hence realizing great −1 savings in memory and comOputation while maintaining S[k+1]= A⊤[k]K(x1:N,x1:N)A[k]+λIr A⊤[k] performance. Q[k+1]=(cid:16)S[k+1]K(x1:N,x1:N) (cid:17) −1 A[k+1]=Q⊤[k+1] Q [k+1]Q⊤[k+1]+λI N N N r 3 OFFLINE KERNEL-BASED FEATURE EXTRAC- (cid:16) (cid:17) RepeatUntilConvergence TION VIA LOW-RANK SUBSPACE TRACKING ReturnA[k], q [k] N { ν }ν=1 Given a dataset x N and leveraging the bi-convexity { ν}ν=1 of the minimization in (5), we introduce in this section Since matrix inversions in (7) and (9) cost (r3), and an offline solver, where two blocks of variables (A and Q andAhavesizer N andN r,respectivOely,theper N {qν}Nν=1) are updated alternately. The following two up- iterationcostisO(N2r×+Nr2+r3×).Althoughthenumber datesarecarriedoutiterativelyuntilconvergence. ofiterationsneededinpracticeforAlgorithm1toconverge Update 1. With A[k] given from iteration k, the projec- is effectively small, this per iteration complexity can be tionvectors{qν}Nν=1 initerationk+1areupdatedas unaffordableforlargedatasets.Inaddition,datasetsarenot λ always available offline, or due to their massive volume, qν[k+1]=argmqin ℓ(xν;A[k],q;x1:N)+ 2kqk22 (6a) they can not be uploaded into memory at once. To cope withtheseissues,anonlinesolverfor(5)isdevelopednext, wherethefittingcostℓ(.)isgivenby[cf.(3)-(5)] wheretheupdatesarecarriedoutbyiterativelypassingover 1 thedatasetonedatumatatime. ℓ(xν;A[k],q;x1:N):= 2kφ(xν)−ΦNA[k]qk2H (6b) =κ(xν,xν) 2k⊤(x1:N,xν)A[k]q − 4 ONLINE KERNEL-BASED FEATURE-EXTRACTION +q⊤A⊤[k]K(x1:N,x1:N)A[k]q. This section deals with low-cost, on-the-fly updates of the The minimizer of (6a) yields the features as regularized ‘virtual’ subspace L¯, or equivalently its factor A as well projection coefficients of the lifted data vectors onto the as the features q that are desirable to keep up with virtual subspace L¯ [k] = Φ A[k], and is given in closed { ν} N N streaming data. For such online updates, stochastic gradi- formby ent descent (SGD) offers a practical approach, especially qν[k+1]=(A⊤[k]K(x1:N,x1:N)A[k]+λIr)−1 for parametric settings. However, upon processing n data vectors, A has size n r, which obviously grows with ×A⊤[k]k(x1:N,xν), ν =1,...,N . (7) n. Hence, as the numb×er of subspace factor parameters Update2.With q [k+1] N fixedandafterdropping increases with the number of data, the task of interest is { ν }ν=1 irrelevantterms,thesubspacefactorisobtainedas[cf.(5)] a nonparametric one. Unfortunately, performance of SGD on nonparametric learning such as the one at hand is a 1 N relativelyunchartedterritory.Nevertheless,SGDcanstillbe A[k+1]=argmAinN ℓ(xν;A,qν[k+1];x1:N) performed on the initial formulation (2), where solving for νX=1 the virtual L¯ constitutes a parametric task, not dependent λ + tr A⊤K(x1:N,x1:N)A . (8) onn. 2N { } Starting with an update for L¯, an update for A will be Since K is strictly positive definite in practice, the task in derivedfirst,asanalternativetothosein[8],[32],and[35]. (8) involves a strictly convex minimization. Equating the Next,anSGDiterationforAwillbedevelopedinsubsection gradienttozero,yieldsthewantedsubspacefactorinclosed 4.2, while in subsection 4.3 a connection between the two form update rules will be drawn, suggesting how SGD can be −1 broadenedtolearningnonparametricmodelsaswell. A[k+1]=Q⊤[k+1] Q [k+1]Q⊤[k+1]+λI . (9) N N N r (cid:16) (cid:17) Algorithm 1 provides the pseudocode for the update rules 4.1 SGDon“parametric”nonlinearsubspacetracking (7) and (9) of the offline solver, and the following proposi- tion gives a guarantee on the convergence of the proposed Suppose that x is acquired at time n, posing the overall n solvertoalocalstationarypoint. joint subspace tracking and feature extraction problem as IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 4 [cf.(3)] factor n λ L¯,{mqνi}nnν=1 21nνX=1kφ(xν)−L¯qνk2H+2λn(cid:16)kL¯k2F +kQnk2F(1(cid:17)0). A[n]=A[n−1]−µn,LAµ[nn,L−q⊤1][(cid:16)nq][n]q⊤[n]+ nIr(cid:17) .  (16) Usinganalternatingminimizationapproach,weupdate Even though (16) is not the only iteration satisfying (15), featuresandthesubspaceperdatavectorasfollows. it offers an efficient update of the factor A. The update steps for the proposed parametric tracker are summarized Update1.Fixingthesubspaceestimateatitsrecentvalue L¯[n 1] := Φn−1A[n 1] from time n 1, the projection asAlgorithm2. − − − vectorofthenewdatavectorx isfoundas[cf.(6a)] Note that the multiplication and inversion in (9) are n avoided. However, per data vector processed, the kernel λ q[n]=argmqinℓ(xν;A[n−1],q;x1:n−1)+ 2kqk22 (11a) msuabtsrpixaciseefaxcptaonrdAedgrboywosnaeccroorwdianngdlyobnyeocnoelurmown,. while the whichthroughthekerneltrickreadilyyields q[n]=(A⊤[n 1]K(x1:n−1,x1:n−1)A[n 1]+λIr)−1 Algorithm2Onlinekernel-basedfeatureextraction(OK-FE) − − A⊤[n 1]k(x1:n−1,xn). (11b) Input{xν}nν=1,λ × − InitializeA[1]=11×r ,K(x1,x1)=κ(x1,x1) Although this update can be done for all the previous Forn=2,...do features q n−1aswell,itisskippedinpracticetoprevent explodin{g cνo}mν=p1lexity. In the proposed algorithm, feature q[n]=(A⊤[n 1]K(x1:n−1,x1:n−1)A[n 1]+λIr)−1 − − extractionisperformedonlyforthemostrecentdatavector A⊤[n 1]k(x1:n−1,xn) × − x . n Update 2.Havingobtainedq[n],thesubspaceupdateis K(x1:n,x1:n)= Kk(⊤x(1x:n1−:n1−,1x,1x:nn−)1) k(xκ1(:xn−n,1x,xnn)) givenbysolving (cid:20) (cid:21) λ 1 n A[n 1] µ A[n 1] q[n]q⊤[n]+ I min ¯(x ;L¯,q[ν]) (12a) A[n]= − − n,L − n r L¯ n L ν  µ q⊤[(cid:16)n] (cid:17) ν=1 n,L X   where ReturnA[n], q[ν] n { }ν=2 1 λ ¯(x ;L¯,q[ν]):= φ(x ) L¯q[ν] 2 + L¯ 2 . (12b) L ν 2k ν − kH 2nk kF Solving (12a) as time evolves, becomes increasingly com- plex, and eventually unaffordable. If data x n sat- { ν}ν=1 isfy the law of large numbers, then (12a) approximates 4.2 SGDfornonparametricsubspacetracking tmhienuL¯nEk[nL¯o(wxνn;pL¯r,oqbνa)b],ilwityhedriestreixbpuetciotantioofnthisewdaittha.rTeospreedctuctoe In this subsection, the feature extraction rule in (11b) is complexity of the minimization, one typically resorts to retained, while the update rule (16) is replaced by directly stochastic approximation solvers, where by dropping the acquiringtheSGDdirectionalongthegradientoftheinstan- expectation (equivalently, the sample averaging operator), taneousobjective term withrespectto A. Since, in contrast the‘virtual’subspaceupdateis to the fixed-size matrix L¯, the number of parameters in A grows withn,we refertothe solverdevelopedinthis sub- L¯[n]=L¯[n 1] µn,LG¯n (13) section as nonparametric subspacetracker. Furthermore, the − − with µ denoting a preselected stepsize, and G¯ the connection betweenthe two solvers is drawn insubsection n,L n 4.3,andconvergenceoftheproposedalgorithmisanalyzed gradientofthen-thsummandin(12a)givenby[cf.(12b)] insubsection4.4. G¯n :=∇L¯L¯(xn;L¯[n−1],q[n]) At time instance n, subproblem (12a) can be expanded = φ(x ) L¯[n 1]q[n] q⊤[n]+ λL¯[n 1] usingthekerneltrickas n − − − n − n =Φ(cid:16)n A[n−1q]⊤q[[nn]]q⊤[n] (cid:17)+ nλΦn A0[n1×−r1] . (14) A∈mRinn×r n1 ν=1L(xν;A,q[ν];x1:n)} (17) (cid:20) − (cid:21) (cid:20) (cid:21) X Using (4) to rewrite L¯[n] = Φ A[n], and substituting into where n (13),yields (xν;A,q[ν];x1:n):= ℓ(xν;A,q[ν];x1:n) L A[n 1] λ ΦnA[n]=Φn 01×−r + 2ntr{A⊤K(x1:n,x1:n)A} (18) (cid:20) (cid:21) λ withℓ(.)givenby(6b).Stochasticapproximationsolversof A[n 1] q[n]q⊤[n]+ I µn,LΦn − n r (15) (17)suggesttheupdate −  (cid:16) q⊤[n] (cid:17) − A[n 1] which suggests thefollowing update rule for thesubspace A[n]=(cid:20) 0⊤r×−1 (cid:21)−µn,AGn (19a) IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 5 where µ denotes the user-selected step size, and G parametric SGD solver in Algorithm 2 will allow us to n,A n denotes the gradient of the n-th summand in (17) with analyzeconvergenceofthetwoalgorithmsnext. respecttoAthatisgivenby Gn := A (xn;[A⊤[n 1],0r×1]⊤,q[n];x1:n) (19b) 4.4 Convergenceanalysis ∇ L − A[n 1] Thecostin(10)canbewrittenas =K(x1:n,x1:n) 0⊤− q[n]q⊤[n] (cid:20) r×1 (cid:21) 1 n −k(x1:n,xn)q⊤[n]+ nλK(x1:n,x1:n)(cid:20)A0[n⊤r×−11](cid:21) . Fn(L¯):= nνX=1mqinfν(xν;L¯,q) (22) with f (x ,L¯,q) := ¯(x ;L¯,q) + (λ/2) q 2, and ¯ as Substituting (19b) into (19a) yields the desired update of ν ν L ν k k2 L A which together with (11b) constitute our nonparametroc in (12b). Thus, the minimization in (10) is equivalent to solver,tabulatedunderAlgorithm3. minL¯Fn(L¯). To ensure convergence of the proposed algo- rithms,thefollowingassumptionsareadopted. Algorithm3Onlinekernel-basedfeatureextraction(OK-FE) (A1)Datavectorsaredrawnindependentlyfromanidentical Input x n ,λ distribution;and { ν}ν=1 (A2)Thesequence L¯[ν] ∞ isbounded. InitializeA[1]=11×r ,K(x1,x1)=κ(x1,x1) {k kF}ν=1 Theindependenceof x acrosstimeisstandardwhen Forn=2,...do { ν} studying the performance of online algorithms [24], while q[n]=(A⊤[n−1]K(x1:n−1,x1:n−1)A[n−1]+λIr)−1 boundednessoftheiterates{kL¯[ν]kF}∞ν=1isatechnicalcon- A⊤[n 1]k(x1:n−1,xn) ditionthatsimplifiestheanalysis,andinthepresentsetting × − is guaranteed due to the Frobenious-norm regularization K(x1:n,x1:n)=(cid:20)Kk(⊤x(1x:n1−:n1−,x1,1x:nn−)1) k(xκ1(:xn−n,1x,xnn))(cid:21) (Pbryopporosiptieornch2o.iUcenodferth(eA1st)e-(pAs2iz),e)i.f µn,L = 1/γ¯n with γ¯n := n γ and γ 2¯(x ;L¯,q[ν]) n, then the sub- Gn =K(x1:n,x1:n) A0[n⊤−1] q[n]q⊤[n] sPpaνc=e1iteνrates inν(1≥3)ks∇atisLfy lνimn→∞∇kFHn(L¯∀[n]) = 0 almost (cid:20) r×1 (cid:21) surely;thatis,theyfallintothestationarypointof (10). −k(x1:n,xn)q⊤[n]+ nλK(x1:n,x1:n)(cid:20)A0[n⊤r×−11](cid:21) P[2r3o]o,fa:nTdhecapnroboefskofettchheedpraolponogsittihoenfoisllionwspinirgedstebpys.[24] and Step1.First,wejudiciouslyintroduceasurrogateforF (L¯) n A[n 1] A[n]=(cid:20) 0⊤r×−1 (cid:21)−µn,AGn whoTsoemthinisimeiznedr,cowinecidheasvweiththtahteSmGiDnqufpνd(axtνe;sL¯in,q(1)3). ReturnA[n],{q[ν]}nν=2 ufνp(pxeνr;L¯b,oqu[νn]d)s;hetnhcee,cFˆons(tL¯)fu:=nc(ti1o/nn,) nanνm=1elfyν(xFν;(L¯L¯,)q[ν≤]) n Fˆ (L¯), L¯. Further approximating fPthrough a second≤- n n ∀ order Taylor’s expansion at the previous subspace update 4.3 Linking parametric with nonparametric SGD up- dates L¯[n 1],wearriveat − Considering that L¯[n] = ΦnA[n] holds for all n, it is f˜n(xn;L¯,q[n])=fn(xn;L¯[n 1],q[n]) (23) aimpppalireesntthfartotmhe(1u9pbd)aatnedru(l1e4i)nth(1a9taG)anm=ouΦnt⊤nsG¯tonp.eTrhfoermlatitnegr +tγr{∇L¯fn(xn−;L¯[n−1],q[n])(L¯ −L¯[n−1])⊤} SGDonL¯ withamatrixstepsizeDn =ΦnΦ⊤n;thatis, + 2nkL¯ −L¯[n−1]k2F . L¯[n]=L¯[n−1]−µn,ADnG¯n . (20) By choosing γn ≥ k∇2L¯fn(xn;L¯[n − 1],qn)kH = Consequently, it is important to note that such a choice k(q[n]q⊤[n])⊗ID¯ +(λ/n)IrD¯kHandusingthenormprop- ertiesintheHilbertspace,itisimportanttorecognizethat: constitutes a valid descent direction, which is guaranteed (i) f˜ is locally tight; i.e., f˜ (x ;L¯[n 1],q[n]) = since f (x ;L¯n[n 1],q[n]); n n − n n whGe¯re⊤nDnG¯n =H⊤nK⊤(x1:n,x1:n)K(x1:n,x1:n)Hn <0(21) 1],q((ii[iin)i)])gf˜r=adg∇−ileoL¯nbfatlnloy(fxmfn˜n;aLj¯ois[rnizlo−ecsa1tl]hl,yeq[tonirg]ih)g;ti;annai.dle.i,n∇staL¯nf˜tna(nxeno;uL¯s[nco−st λ n Hn :=A[n−1](qqnq⊤⊤n + nIr) . fn;Sthelaetcitsin,fgnn(oxwn;tL¯h,eqt[anr]g)e≤tsfu˜nrr(oxgna;tL¯e,cqo[snt]a),s∀L¯. − n   1 n Moreover, for positive definite kernel matrices, formed by F˜ (L¯)= f˜(x ;L¯,q[ν]) e.g., Gaussian kernels, we have G¯⊤D G¯ 0, which n n ν ν strongly guarantees that D G¯ isna denscennt≻direction [2, νX=1 − n n we have F (L¯) Fˆ (L¯) F˜ (L¯), L¯. Minimizing p. 35]. Leveraging this link, Algorithm 3 will be shown n n n next to enjoy the same convergence guarantee as that of the cost F˜n(L¯) am≤ounts to n≤ullifying th∀e gradient, i.e., Algorithm2. ∇L¯F˜n(L¯[n])=0,whichyields[24] Remark 1. Although SGD solver in Algorithm 3 can be 1 viewed as a special case of Algorithm 2, developing the L¯[n]=L¯[n 1] G¯n (24) − − γ¯ n IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 6 whereγ¯ := n γ .Bysettingµ =1/γ¯ ,theSGD-based 5.1 Censoringuninformativedata n ν=1 ν n n updateofL¯[n]nowcoincideswiththeminimizerofF˜ (L¯); n thatis,L¯[n]=PargminL¯F˜n(L¯). IntheLScostthatAlgorithms2and3relyon,smallvalues Step 2. The second step establishes that the surrogate of the fitting error can be tolerated in practice without no- costs F˜ (L¯) form a quasi-martingale sequence, and ticeableperformance degradation.Thissuggestsmodifying n { } using tightness of the surrogate cost we deduce that theLScostsothatsmallfittingerrors(sayupto ǫ)induce lim (F (L¯[n]) F˜ (L¯[n]))=0.Thus,thesurrogatecost no penalty, e.g., by invoking the ǫ insensitive c±ost that is asynm→p∞toticnallycon−vergnestotheoriginalcostF (L¯). popularinsupportvectorregressio−n(SVR)settings[15]. n Step 3. Leveraging the regularity of ¯(xν;L¯,qν), con- Consider henceforth positive-definite kernels for which L vergence of the cost sequence implies convergence of low-rank factors offer an approximation to the full-rank ∇{kL¯∇F˜L¯nF(Ln¯([nL¯][)n=])−0,∇yLi¯eFl˜dns(L¯{k[n∇])L¯kFHn}(Lt¯o[nz]e)rkoH,w}→hich0.alongwit(cid:4)h kkΦernne−lL¯mQantrkiHx,.TahnedselceoandsidtoeraatiognesnseuraglglyestnmonozdeirfyoinLgSt-hfiet So far, we have asserted convergence of the SGD-based LScostℓ(xn;A[n 1],q;x1:n−1)as algorithm for the “virtual” L¯ provided by Algorithm 2. A − relatedconvergenceresultforAlgorithm3isguaranteedby ℓˇ(xn;A[n 1],q;x1:n−1) − tPhreopfoolsloitwioinng3.aUrgnudmere(nAt1.)-(A2)andforpositivedefinitekernels, := 0 ifℓ(xn;A[n−1],q;x1:n−1)<ǫ if µ = 1/ξ¯ with ξ¯ := n ξ and ξ nγ , then (ℓ(xn;A[n−1],q;x1:n−1)−ǫ otherwise. the snu,Abspace iternates in (n19a) satisνfy=1limn nC≥(L¯[nn]) = 0 (25) n→∞ n P ∇ almost surely; that is, the subspace iterates will converge to the By proper choice of ǫ, the cost ℓˇ(.) implies that if stationarypointof (10). ℓ(xn;A[n 1],qn;x1:n−1) < ǫ, the virtual datum φ(xn) Proof: The proof follows the steps in Proposition2, with an − is captured well enough by the virtual current subspace extra step in the construction of the appropriate surrogate L¯[n 1] = Φn−1A[n 1], and the solver will not attempt cost in Step 1. In particular, using that n the optimal − − subspace is of the form L¯n = ΦnA, the o∀bjective f˜ν can tmoednetactrieoanseoiftsthLeSbearsroisr,Φwnh−ic1h, wsuitghgetshtes nskeiwppliinftgedthdeaatuugm- be further majorized over the subset of virtual subspaces φ(x )[3]. L¯ =Φ A,by n n In short, if the upper branch of (25) is in effect, φ(x ) n fˇ (x ;Φ ,A,q[n]):=f (x ;L¯[n 1],q[n]) is considered uninformative, and it is censored for the n n n n n − +tr{∇L¯fn(xn;L¯[n−1],q[n])(ΦnA−L¯[n−1])⊤} sduebemspsacφe(xup)daintefosrtmepa;tivweh,earneadsahuagvminegnttshethelowbaesrisbsreatncohf + ξn A A[n−1] 2 the virtual snubspace. The latter case gives rise to what we 2 k −(cid:20) 01×r (cid:21)kF term online support vectors (OSV), which must be stored, forwhichwehave unlike ‘censored’ data that are discarded from subsequent subspaceupdates. f˜ (x ;L¯,q[n]) fˇ(x ;Φ ,A,q ) n n ν ν n ν − In order to keep track of the OSVs, let n−1 denote the = γn L¯ L¯[n 1] 2 ξn A A[n−1] s2et.ofindices correspondingtotheSVsrevSealeduptotime TheCauchy-Schwa2rzkine−quality−imkpFlie−sth2akt −(cid:20) 01×r (cid:21)knmF.oAdcificoerddLinSgclyo,srteawsrℓˇi(txenL¯;[An[−n−1]1=],qΦ;SxnS−n1−A1)[,nd−ep1e]n,danindgtohne whichofthefollowingtwocasesemerges. A[n 1] kL¯ −L¯[n−1]k2F =kΦnA−Φn 01×−r k2F C1. If ℓˇ(xn;A[n−1],q;xSn−1) ≤ ǫ, the OSV set will not (cid:20) (cid:21) grow,andwewillhave n = n−1;or, A[n 1] S S ≤kΦnk2FkA− 01×−r k2F C2.Ifℓˇ(xn;A[n−1],q;xSn−1)>ǫ,theOSVsetwillgrow, (cid:20) (cid:21) andwewillhave n = n−1 n . fa˜nn(dxnb;yL¯c,hqo[no]s)ing≤ξfnˇν(≥xν;kΦΦnn,kA2F,γqnν)=. Snelγenc,tinwgenwoiwll fhˇνa(v.e) formThLe¯[nsu]b=spaΦcSeSnmAa[tSnri]x, wp∪ehre{riet}erΦatSinon:=wil[lφtnh1u,s...,taφkne|Snth|]e, auPsprodtphaoetesnitreiuowlnes2iu.nrr(o1g9aat)e, twhheorseestmoifnitmheizperrocoofinfcoildloewsswtihthattho(cid:4)ef rweipthlacSinng:x=1:n{inn1,Anl2g,o.r.i.t,hnm|Sn3|}w, iathndxSAn, A∈lgRor|Sitnh|×mr.4Ugipvoens the pseudocode for our reduced-complexity OKFE, which also includes a budget maintenance module that will be 5 REDUCED-COMPLEXITY OK-FE ON A BUDGET presentedintheensuingSection5.2. Per data vector processed, the iterative solvers of the pre- ModifyingtheLS-fitin(25)anddiscardingthecensored vious section have one column of Φ and one row of A data, certainly reduce the rate at which the memory and n added, which implies growing memory and complexity complexity requirements increase. Albeit at a slower rate, requirementsasngrows.Thepresentsectioncombinestwo maystillgrowunboundedastimeproceeds.Thus,one n |S | means of coping withthis formidable challenge: one based ismotivatedtorestrictthenumberof OSVstoa prescribed on censoring uninformative data, and the second based on affordable budget, B, and introduce a solver which n |S | ≤ “budget maintenance.” By modifying Algorithms 2 and maintains such a budget throughout the iterations. To this 3 accordingly, memory and complexity requirements are end, we introduce next a greedy ‘budget maintenance’ renderedaffordable. scheme. IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 7 5.2 Budgetmaintenance Algorithm4OK-FEBAlgorithm Input x n ,λ When inclusion of a new data vector into the OSV set { ν}ν=1 pushes its cardinality beyond the prescribed budget InitializeA[1]atrandomand 1 = 1 n S { } B, the budget mainten|San|ce module will discard the SV Forn=2,...do whose exclusion distorts minimally L¯[n]. Specifically, with q[n]=(A⊤[n 1]K(x ,x )A[n 1]+λI )−1 − Sn−1 Sn−1 − r Φn\i and A\i[n] denoting Φn and A[n] devoid of their i- A⊤[n 1]k(x ,x ) th column and row, respectively, our rule for selecting the × − Sn−1 n indextobeexcludedis ℓ =k(x ,x ) 2k⊤(x ,x )A[n 1]q[n] n n n − Sn−1 n − +q⊤A⊤[n 1]K(x ,x )A[n 1]q[n] n − Sn−1 Sn−1 − i =argmin Φ A[n] Φ A [n] 2 ∗ i∈Snk n − n\i \i kF ifℓn <ǫ then =argim∈Sinntr{A⊤[n]K(xSn,xSn)A[n] (26) elseSn =Sn−1 −2A⊤\i[n]K(xSn\i,xSn)A[n] Sn =Sn−1∪{n} A[n 1] +A⊤\i[n]K(xSn\i,xSn\i)A\i[n]}. Gˇn =K(xSn,xSn) 0⊤− q[n]q⊤[n] (cid:20) r×1 (cid:21) Enumeration over n and evaluation of the cost incurs λ A[n 1] cgoamteptlheexictoymOp(uBta3)tiofoSnralsoclovminpgle(x2i6t)y.,Haegnrceee,diynsocrhdeemretoismpitui-t −k(xSn,xn)q⊤[n]+ nK(xSn,xSn)(cid:20) 0⊤r×−1 (cid:21) A[n 1] fcoorrtrhe.spSoinncdeinegxcrlouwsiofnromoftahneSsVubwspilalcreefsauclttoirn,driesmcaorvdiinnggtthhee A[n]=(cid:20) 0⊤r×−1 (cid:21)−µn,AGˇn if >B then SV corresponding to the row with the smallest ℓ2 norm |Sn| − Runbudgetmaintenancemodule suggests a reasonable heuristic greedy policy. To this end, endif oneneedstofindtheindex endif ˆi∗ =arg min ai[n] 2 (27) EndFor i=1,2,...,B+1k k ReturnA[n], , q[ν] n Sn { }ν=1 where a⊤[n] denotes the i th row of A[n]. Subsequently, ˆi∗ as weill as the correspon−ding SV are discarded from n Algorithm5Budgetmaintenacemodule S and the SV set respectively, and an OSV set of cardinality Input ,A t|Shen|up=datBesiasndmtahinetpairnoepdo.seAdlggorerietdhymbsu4dgaentdma5inttaebnualnactee ˆi∗ =ar{gSmˆiin}i∈Ska⊤i k2 ∗ scheme,respectively. RSe←moSve\t{he}ˆi -throwofA ∗ Remark 2. In principle, methods related to those in [35], Return ,A including replacement of two SVs by a linear combination {S } ofthetwo,orprojectinganSVontheSVsetanddiscarding the projected SV, are also viable alternatives. In practice related operations that take place regardless of censoring. however,theirimprovedperformancerelativeto(27)isneg- Indeed, forming A⊤[n 1]k(x ,x ) requires Br mul- ligible and along with their increased complexity, renders − Sn−1 n tiplications for the matrix-vector product, and (BD) for suchalternativeslessattractiveforlarge-scaledatasets. O the evaluation of B kernels in k(x ,x ); the matrix- Sn−1 n vector product that remains for obtaining q[n] requires r2 5.3 Complexityanalysis additional multiplications. Overall, running OK-FEB on N Computational complexity of the proposed OK-FEB algo- dataandwithavalueofǫsuchthatN˜ N dataareusedfor rithm is evaluated in the present section. The computa- updatesrequires (N˜(Br(B+r)+r3)≤+N(B(D+r)+r2)). tions required by the n th iteration of Algorithm 4 for Alternatively,tunOingǫsuchthatPr ℓn >ǫ =E[N˜/N]=γ feature extraction and pa−rameter update depend on B,r, yields an expected complexity ({N(Br(γ}(B + r) + 1)+ andD, aswellas thecensoring processoutlinedinSection (γr + 1)r2 + BD)). As simulaOtion tests will corroborate, 5.1. Specifically, computing Gˇ and performing the first- the budget parameter B can be chosen as B = cr with n order stochastic update that yields A[n] requires O(B2r) c [1.5,5]. Thus, we can simplify the derived complexity multiplications, a cost that is saved for iterations that are or∈deras (Nr2(γr+1)+NDr). O skipped when ℓ < ǫ. Regarding the computation of q[n], n Br(B + r) multiplications are needed to form A⊤[n 1]K(x ,x )A[n 1], and (r3) multiplications fo−r 6 STABILITY ANALYSIS OF KERNEL APPROXIMA- theinvSenr−s1ionSonf−A1 ⊤[n −1]K(x O,x )A[n 1]+λI . TION − Sn−1 Sn−1 − r Fortunately, the aforementioned computations can also be In this section, the effect of low-rank approximation of avoided for iteration n, if the previous iteration n 1 the lifted vectors on kernel-matrix approximation as well − performs no update on A[n 1]; in this case, (A⊤[n as kernel-based classification and regression is analytically a1n]Kd(cxaSnn−s1im,xpSlny−b1e)Aa[cnce−sse1d] +fr−oλmIrm)−e1mroermy.aNinesveurnthchelaensgs,e−da squubasnptiaficeedo.Tbotatihniesdenbdy,rreucnanllitnhgatOgKiv-eFnEB{xiνs}NνL¯=1=,thΦevSiArtual “baseline”ofcomputationsisrequiredforfeatureextraction RD¯×r,andthecorrespondingprojectioncoefficientsareQ ∈. N IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 8 Bydefiningtherandomvariablese := φ(x ) φˆ(x ) 2 = features. Since linear pattern recognition tasks incur com- φ(x ) L¯q 2 capturingtheLSeirror,kwehiav−ethefiokllHow- plexity (NB2),theyscaleextremelywellforlargedatasets k i − ikH O ingresult. (withN ),comparedtokernelSVMthatincurscomplex- Proposition 4.Iftherandomvariablese [0 1]arei.i.d.with ity (N3≫). Furthermore,inthetestingphase,evaluationof mmeaatrnixe¯K:==E[eΦi]⊤,tΦhencafnorbkeeranpeplrsoxsaimtisaftyeiding∈by|κK(ˆxi:,=xjΦ)ˆ|⊤≤Φˆ1,,atnhde afutncOcotmiopnlefx(ixty)re(quirDes),κw(xhνe,rxei) foirst∀hie∈nuSmtobebreoefvSaVlusathteadt O |S| |S| withprobabilityatleast1 2e−2Nt2,itholdsthat typically grows with N. In contrast, if approximated by − the linear g(z), function evaluation requires (BD +Br) 1 O K Kˆ 2 √e¯+t(√e¯+t+2). (28) operations including the feature extraction and function Nk − kF ≤ evaluation. Setting the budget B to 1.5 to 5 times the rank parameterr,ourcomplexityisoforder (rD+r2),which Proof:UpondefiningE¯ :=Φˆ Φ,onecanwrite representsaconsiderabledecreaseover O( D). − O |S| K Kˆ = Φ⊤Φ Φˆ⊤Φˆ Subsequently,wewishtoquantifyhowtheperformance k − kF k − kF of linear classificationand regressionbasedon the features = Φ⊤Φ (Φ+E¯)⊤(Φ+E¯) F K1/2AQ comparestotheoneobtainedwhentrainingwith k − k S = 2E¯⊤Φ+E¯⊤E¯ theexactkernelmatrixK. F k k 2 E¯ Φ + E¯ 2 (29a) ≤ k kFk kF k kF 2√N E¯ + E¯ 2 (29b) 6.1 Stabilityanalysisforkernel-basedclassification ≤ k kF k kF Kernel-based classifiers using SVM solve the minimization wherein(29a)weusedthetriangleinequalityfortheFrobe- [29,p.205] niousnormalongwiththeproperty BC B C , F F F k k ≤k k k k and (29b) holds because for, e.g., radial kernels satisfying 1 α∗ =argmin α⊤YKYα 1⊤α (33) κ(xi,xj) 1,wehave α 2 − | |≤ s.t.y⊤α=0 N C kΦkF :=qtr(Φ⊤Φ)=vuuXi=1κ(xi,xi)≤√N . 0≤α≤ N1N t where Y is the diagonal matrix with the i-th label y as i Furthermore, since kE¯kF := Ni=1ei, and e¯N := its i-th diagonal entry, y⊤ := [y1,y2,...,yN], and 1N is a (1/N) N e withe [0,1],Hoeffqding’sinequalityyields n 1vectorof1’s.Thesolution(33)correspondstothedual i=1 i i ∈ P va×riablesoftheprimaloptimizationproblem,whichyields P Pr e¯ e¯ t e−2Nt2 N − ≥ ≤ 1 C N (cid:16) (cid:17) w¯∗ =arg min w¯ 2 + max 0,1 y w¯⊤φ(x ) . whichinturnimplies w¯∈RD¯ 2k kH N i=1 { − i i } X Pr 1 E¯ 2 e¯+t =Pr e¯ e¯+t e−2Nt2 . (30) (34) Nk kF ≥ N ≥ ≤ Here,parameterCcontrolsthetrade-offbetweenmaximiza- (cid:16) (cid:17) (cid:16) (cid:17) Finally,takingintoaccount(29b),it follows thatwithprob- tion of the margin 1/ w , and the minimization of the H abilityatleast1 2e−2Nt2,wehave misclassification penalkty,kwhile the solution of (34) can be − expressedasw¯∗ = N α∗y φ(x )[29, p.187]. K Kˆ N(2√e¯+t+(e¯+t)). (31) i=1 i i i k − kF ≤ Exploiting the reduced memory requirement offered P (cid:4) through the low-rank approximation of the kernel matrix Proposition 4 essentially bounds the kernel approxima- viaOK-FEB,thedualproblemcanbeapproximatedas tion mismatch based on how well the projection onto the 1 subspaceapproximatesthelifteddataφ(x). αˆ∗ =argmαin 2α⊤YKˆYα−1⊤α (35) Remark3.Considernowdecomposingthekernelmatrixas s.t.y⊤α=0 Kˆ :=Φˆ⊤Φˆ =(L¯Q)⊤(L¯Q)=Q⊤A⊤Φ⊤Φ AQ C S S 0 α 1N . =Q⊤A⊤K AQ=Z⊤Z (32) ≤ ≤ N S Viewing Kˆ as a linear kernel matrix over φˆ(x ) s (cf. twhheebreumdgaettreixdZSV:=sKet1S./T2AhiQs fhaactsosriizzeat|iSo|n×oNf,KaˆndcoSulddenhoatvees Rwermittaernkin3),thseimpirliamratolf(o3r3m),athse minimization(3{5) cain}be re- resultedfrom a linear kernel over the 1 training data |S|× N vectorsformingtheN columnsofZ.Thus,forkernel-based wˆ¯∗ =argmin 1 w¯ 2 + C max 0,1 y w¯⊤φˆ(x ) taskssuchaskernelclassification,regression,andclustering w¯ 2k kH N { − i i } i=1 appliedtolargedatasets,wecansimplymaptheD N data X (36) × XtothecorrespondingfeaturesZtrainedviatheproposed solvers,andthensimplyrelyonfastlinearlearningmethods for which we have wˆ¯∗ = Ni=1αˆ∗iyiφˆ(xi). Upon defining toapproximatetheoriginalkernel-basedlearningtask;that therandomvariableξ := φ(x ) φˆ(x ) withexpected is to approximate the function f(x) = i∈Sciκ(x,xi) by value ξ¯ := E[ξi], theifollokwPingi p−roposiitikoHn quantifies the thelinearfunctiong(z) = w⊤zexpressedviatheextracted gapbetweenw¯∗ andwˆ¯∗. P IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 9 Proposition 5. If variables ξ [0 1] are i.i.d., the mismatch 6.2 Stabilityanalysisforkernel-basedregression i ∈ between the linear classifiers given by (34) and (36) can be Consider the kernel-based ridge regression task on the bounded,andwithprobabilityatleast1 e−2Nt2,wehave dataset x ,y N ,namely − { i i}i=1 k∆wk2H :=kw¯∗−wˆ¯∗k2H ≤2C3/2 ξ¯+t . (37) min 1 y Kβ 22+λβ⊤Kβ (40) (cid:16) (cid:17) β Nk − k whichadmitstheclosed-formsolutionβ∗ =(K+λNI)−1y Proof:Itclearlyholdsthat [29, p.251]. Alleviating the memory requirement of order N (N2) through low-rank approximation of the kernel ma- ∆w 2 C( w¯∗ + wˆ¯ ) φ(x ) φˆ(x ) tOrix,thekernel-basedridgeregressioncanbeapproximated k kH ≤ N k k k k k i − i kH i=1 by X 1 2C3/2 N φ(xi) φˆ(xi) H mβinNky−Kˆβk22+λβ⊤Kˆβ (41) ≤ N k − k 2C3/2(Xiξ¯=+1 t) (38) whose solution is given as βˆ∗ = (Kˆ + λNI)−1y. The ≤ following proposition bounds the mismatch between β∗ where the first inequality relies on the strong convexity of andβˆ∗. (34), (36), and the fact that w¯∗ H √C and wˆ¯∗ H Proposition 7. If the random variables ei [0 , 1] are i.i.d., k k ≤ k k ≤ ∈ √C [32];whilethesecondinequalityholdswithprobability and yi By for i = 1,2,...,N, with probability at least 1 at least 1 e−2Nt2 using Hoeffding’s inequality for ξ¯ := 2e−2|Nt|2≤wehave − N (1/N) Ni=−1ξi. (cid:4) β∗ βˆ∗ 2 M√e¯+t(√e¯+t+2). (42) NotPethatunderthei.i.d.assumptionone := φ(x ) k − k ≤ λ2 i i φˆ(x ) 2 , random variables ξ are also i.i.d., rendkering th−e Proof:Following[7],wecanwrite i kH i conditionsofPropositions4and5equivalent. β∗ βˆ∗ =(K+λNI)−1y (Kˆ +λNI)−1y − − Next, we study the performance of linear SVM trained = (Kˆ +λNI)−1(K Kˆ)(K+λNI)−1 y on the set z ,y N , where z := K1/2Aq ; that is, the − − linearfunct{ioni g(iz})i==1 w⊤zisleiarnedbSyfindiing wherewehaveu(cid:16)sedtheidentityPˆ−1 P−1 = P−1(cid:17)(Pˆ N P)Pˆ−1, which holds for any invertibl−e matrices−P and Pˆ−. 1 C w∗ =argwm∈iRnr 2kwk2+ N max{0,1−yiw⊤zi}. Taking the ℓ2-norm of both sides and using the Cauchy- i=1 Schwarzinequality,wearriveat X (39) β∗ βˆ∗ (K+λNI)−1 K Kˆ (Kˆ +λNI)−1 y The following result asserts that the classifiers learned k − k≤k kk − kk kk k K Kˆ NB through (39) and (36) can afford identical generalization y k − k capabilities. ≤ λmin(K+λNI)λmin(Kˆ +λNI) Proposition 6. The generalization capability of classifiers (36) By K Kˆ 2 k − k . (43) and(39)isidentical,inthesensethatwˆ¯∗⊤φˆ(x)=w∗⊤z. ≤ λ2N Pmraotorfi:xSwinecehafvoerKˆthe=lZow⊤-Zra,nthkenap(3p5r)oaxnimda(3ti9o)naroefetqhueivkaelrennetl, tionU4s,inygieltdhsetihneeqbuoaulnitdywkPithk2p≤robkaPbkilFityal1on−g2we−it2hNPt2r.opos(cid:4)i- andconsequentlyw∗ = N αˆ∗y z .Now,onecanfurther i=1 i i i expandwˆ¯∗⊤φˆ(x)andw∗⊤ztoobtain 7 NUMERICAL TESTS P N N Performance of our proposed algorithm is assessed in this wˆ¯∗⊤φˆ(x)= αˆ∗iyiφˆ(xi)⊤φˆ(x)= αˆ∗iyiqiAΦˆ⊤SΦˆSAq section using synthetic and real datasets on kernel-matrix Xi=1 Xi=1 approximationaswellasclassificationandregressiontasks. and The synthetic test involves generating two equiprobable classes of 3 1 data vectors x , with each data vector N N ν × { } w∗⊤z= αˆ∗y z⊤z= αˆ∗y q AΦˆ⊤Φˆ Aq uniformly drawn from the surface of a sphere centered at i i i i i i S S i=1 i=1 the origin with radius Rc1 = 1 or Rc2 = 2, depending X X on whether its label y equals 1 or 1 respectively. A wheretheequivalencefollowsreadily. (cid:4) ν − noise component drawn from the Gaussian distribution In addition to markedly reduced computational cost (0D×1,σ2ID×D) is added to each xν, with the variance when utilizing linear (L)SVM, our novel classifier can also σN2 controlling the overlap between the two classes. Linear beefficientlytrainedonline[32]asnewdatabecomesavail- classifiers can not correctly classify data generated in this able(oriterativelywhentheentiredatasetscannotbestored manner; hence, a kernel-based classifier is well motivated. in memory which necessitates one-by-one acquisition). In Thisdatasetisusedtovalidateperformanceoftheproposed this case, the proposed OK-FEB in Algorithm 4 can be run OK-FEBinkernel-matrixapproximationandgeneralization in parallel with the online classifier training, an attribute capabilityofthetrainedlinearclassifier.Thesizeandspecifi- mostsuitableforbigdataapplications. cationsofthedatasetusedarelistedinTable1.Thedatasets IEEETRANSACTIONSONPATTERNANALYSISANDMACHINEINTELLIGENCE,VOL.XX,NO.XX,2016 10 are accessible in the LIBSVM website1 or the UCI machine learningrepository2. 100 Online KFE (Non−parametric) r Online KFE (parametric) o TABLE1 rr Batch KFE e Specificationsofdatasets. ng 10−1 dataset task D N Fitti − S synthetic classification 3 70K L IJCNN classification 22 140K ge 10−2 a Adult classification 123 32K r e Cifar classification 3072 50K Av MNIST classification 786 60K COD-RNA classification 8 59K 10−3 100 101 102 103 104 CADATA regression 8 20.6K Iteration index Slice regression 384 53.5K Year regression 90 463.7K Fig.1.AverageLS-fitversusiterationindexforthesyntheticdataset. 7.1 Kernel-based feature extraction: Batch versus on- 0.4 line (r,B)=(2,10) 0.35 (r,B)=(5,10) PerformanceoftheproposedAlgorithms1,2and3onsolv- (r,B)=(5,20) r ingthe minimization(10)is comparedempirically.The test ro 0.3 (r,B)=(10,20) r is carried out for the synthetically generated data arriving e (r,B)=(10,40) g in streaming mode with ν = 1,2,...,5,000. The online n0.25 sAclhgeomriethsmcan1 sisolavlesotheempprloobyleedmtoons-tohlve-efly(1,0w)hfoilre athllenb.aWtche S−fitti 0.2 L cL¯o[mn]puasriengthteheovtherreaelldLifSfefiretngtisvoelnvebrsyatchreosssutbimspeac(ieteuraptdioante) age 0.15 r index n. Gaussian kernel κ(xi,xj) = exp(−kxi−xjk22/γ) Ave 0.1 isusedwithγ =100settothestandarddeviationofaverage 0.05 pairwisesquareddistancesof the data vectors. The param- eters for the OK-FE solvers are chosen as µ = 10/n, n,L 0 µ = 1000/n2, λ = 10−3, and the maximum number of 0 200 400 600 800 1000 1200 1400 1600 1800 2000 n,A Iteration index n iterationsinthebatchsolverissettoKmax =50. Figure1depictshowstochasticlow-complexityupdates ofAintheonlinesolversensureconvergenceoftheaverage Fig.2.AverageLS-fitfordifferentchoicesof(r,B)versusiterationindex forthesyntheticdataset. LS cost to the high-complexity batch solution. In addition, Fig. 2 plots the evolution of the average LS fit cost across iterations for different choices of parameters (r,B) in the onasubmatrixofsize104 104,averagedoverseveralselec- proposedOK-FEBsolver. × tions. While the proposed OK-FEB only requires acquiring one data vector at a time, the other comparable methods 7.2 Kernel-matrixapproximation are developed for batch settings, where the entire dataset must be loaded in memory, a requirement not always met To quantify the kernel matrix approximation accuracy, N−1 K Kˆ providedbythenovelapproachismeasured in practice. However, to meet the memory requirement of F k − k such algorithms, the experiments are performed here on a here on real datasets, and is compared with that of kernel- 16-core node machine with 2.50 GHz CPU and 225GB of matrixapproximationusingfourdifferentschemes:i)plain RAM. Nystrom approximation (random selection of “landscape” data points) [38]; ii) Nystrom approximation with nonuni- The budgeted subspace learning in the OK-FEB algo- form (probabilistic) data point selection [11]; iii) Improved rithmishighlyefficient,andthankstoitsfastconvergence, Nystrom (where landscape points are chosen via k-means) only one pass over a subset of training data vectors is [40];and,iv)rank-rapproximationofKviasingular-value- sufficient for stationary or batch datasets. Thus, subspace decomposition (SVD) as the benchmark. Since SVD incurs learningiscarriedover2 5%oftheset,whiletherestofthe − complexity (N3), it is computationally prohibitive for features are simply acquired by projecting onto the identi- O suchlargedatasets.Forthisreason,wehaveapproximated fiedsubspace.Thissuggeststhatinstationarybatchsettings, N−1 K Kˆ ofthefullN N kernelmatrixbythat the complexityof the proposedalgorithm is comparable to SVD F k − k × that of the Nystrom approximation, which is minimal in 1.http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/ such scenaria, while exhibiting considerable improvement 2.http://www.ics.uci.edu/∼mlearn/ inperformance.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.