Adaptive Subgradient Methods for Online AUC Maximization
Yi Ding, Peilin Zhao, Steven C.H. Hoi, Yew-Soon Ong

Abstract—Learning to maximize AUC performance is an important research problem in machine learning and artificial intelligence. Unlike traditional batch learning methods for maximizing AUC, which often suffer from poor scalability, recent years have witnessed some emerging studies that attempt to maximize AUC with single-pass online learning approaches. Despite the encouraging results reported, the existing online AUC maximization algorithms often adopt simple online gradient descent approaches that fail to exploit the geometrical knowledge of the data observed during the online learning process, and thus could suffer from relatively large regret. To address this limitation, in this work we explore a novel algorithm, Adaptive Online AUC Maximization (AdaOAM), which employs an adaptive gradient method that exploits the knowledge of historical gradients to perform more informative online learning. The new adaptive updating strategy of AdaOAM is less sensitive to parameter settings and maintains the same time complexity as previous non-adaptive counterparts. Additionally, we extend the algorithm to handle high-dimensional sparse data (SAdaOAM) and address sparsity in the solution by performing lazy gradient updating. We analyze the theoretical bounds and evaluate the empirical performance of the proposed algorithms on various types of datasets. The encouraging empirical results clearly highlight the effectiveness and efficiency of the proposed algorithms.

Index Terms—AUC maximization, second-order online learning, adaptive gradient, high-dimensional, sparsity.

• Yi Ding is with the Department of Computer Science, The University of Chicago, Chicago, IL, USA, 60637. E-mail: [email protected].
• Corresponding author: Steven C.H. Hoi is with the School of Information Systems, Singapore Management University, Singapore 178902. E-mail: [email protected].
• Peilin Zhao is with the Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore 138632. E-mail: [email protected].
• Yew-Soon Ong is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798. E-mail: [email protected].

1 INTRODUCTION

AUC (Area Under the ROC curve) [1] is an important measure for characterizing machine learning performance in many real-world applications, such as ranking and anomaly detection tasks, especially when misclassification costs are unknown. In general, AUC measures the probability for a randomly drawn positive instance to have a higher decision value than a randomly drawn negative instance. Many efforts have been devoted recently to developing efficient AUC optimization algorithms for both batch and online learning tasks [2], [3], [4], [5], [6], [7].

Due to its high efficiency and scalability in real-world applications, online AUC optimization for streaming data has been actively studied in the research community in recent years. The key challenge for AUC optimization in the online setting is that AUC is a metric represented by a sum of pairwise losses between instances from different classes, which makes conventional online learning algorithms unsuitable for direct use in many real-world scenarios.
To address this challenge, two core types of Online AUC Maximization (OAM) frameworks have been proposed recently. The first framework is based on the idea of buffer sampling [6], [8], which stores some randomly sampled historical examples in a buffer to represent the observed data for calculating the pairwise loss functions. The other framework focuses on one-pass AUC optimization [7], where the algorithm scans through the training data only once. The benefit of one-pass AUC optimization lies in the use of the squared loss to represent the AUC loss function while providing proofs of its consistency with the AUC measure [9].

Although these algorithms have been shown to be capable of achieving fairly good AUC performance, they share a common trait of employing the online gradient descent technique, which fails to take advantage of the geometrical property of the data observed during the online learning process, while recent studies have shown the importance of exploiting this information for online optimization [10]. To overcome this limitation of the existing works, we propose a novel framework of Adaptive Online AUC Maximization (AdaOAM), which adopts adaptive gradient optimization to exploit the geometric property of the observed data and thereby accelerate online AUC maximization. Specifically, the technique is motivated by a simple intuition: frequently occurring features in the online learning process should be assigned low learning rates, while rarely occurring features should be given high learning rates. To achieve this, we propose the AdaOAM algorithm by adopting the adaptive gradient updating framework of [10] to control the learning rates for different features. We theoretically prove that the regret bound of the proposed algorithm is better than those of the existing non-adaptive algorithms. We also empirically compare the proposed algorithm with several state-of-the-art online AUC optimization algorithms on both benchmark datasets and real-world online anomaly detection datasets. The promising results validate the effectiveness and efficiency of the proposed AdaOAM.

To further handle high-dimensional sparse tasks in practice, we investigate an extension of the AdaOAM method, referred to here as the Sparse AdaOAM method (SAdaOAM). The motivation is that the regular AdaOAM algorithm assumes every feature is relevant, so most of the corresponding weights are often non-zero, which leads to redundancy and low efficiency when only rare features are informative in high-dimensional tasks in practice. To make AdaOAM more suitable for such cases, the SAdaOAM algorithm induces sparsity in the learned weights using adaptive proximal online gradient descent. To the best of our knowledge, this is the first effort to address the problem of keeping the online model sparse in the online AUC maximization task. Moreover, we have theoretically analyzed this algorithm and empirically evaluated it on an extensive set of real-world public datasets against several state-of-the-art online AUC maximization algorithms. Promising results have been obtained that validate the effectiveness and efficacy of the proposed SAdaOAM.
The rest of this paper is organized as follows. We first review the related work from three core areas: online learning, AUC maximization, and sparse online learning. Then, we present the formulations of the proposed approaches for handling both regular and high-dimensional sparse data, together with their theoretical analysis; we further show and discuss the comprehensive experimental results, the sensitivity of the parameters, and the tradeoffs between the level of sparsity and AUC performance. Finally, we conclude the paper with a brief summary of the present work.

2 RELATED WORK

Our work is closely related to three topics in the context of machine learning, namely, online learning, AUC maximization, and sparse online learning. Below we briefly review some of the important related work in these areas.

Online Learning. Online learning has been extensively studied in the machine learning communities [11], [12], [13], [14], [15], mainly due to its high efficiency and scalability for large-scale learning tasks. Different from conventional batch learning methods that assume all training instances are available prior to the learning phase, online learning considers one instance at a time to update the model sequentially and iteratively. Therefore, online learning is ideally suited to tasks in which data arrive sequentially. A number of first-order algorithms have been proposed, including the well-known Perceptron algorithm [16] and the Passive-Aggressive (PA) algorithm [12]. Although the PA introduces the concept of "maximum margin" for classification, it fails to control the direction and scale of parameter updates during the online learning phase.
In order to address this issue, recent years have witnessed some second-order online learning algorithms [17], [18], [19], [20], which apply parameter confidence information to improve online learning performance. Further, in order to solve cost-sensitive classification tasks on-the-fly, online learning researchers have also proposed a few novel online learning algorithms to directly optimize more meaningful cost-sensitive metrics [21], [22], [23].

AUC Maximization. AUC (Area Under the ROC curve) is an important performance measure that has been widely used for classification over imbalanced data distributions. The ROC curve plots the true positive rate against the false positive rate over a range of thresholds. Thus, AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Recently, many algorithms have been developed to optimize AUC directly [2], [3], [4], [6], [7]. In [4], the author first presented a general framework for optimizing multivariate nonlinear performance measures such as the AUC, F1, etc., in a batch mode. Online learning algorithms for AUC maximization in large-scale applications have also been studied. Among the online AUC maximization approaches, two core online AUC optimization frameworks have been proposed very recently. The first framework is based on the idea of buffer sampling [6], [8], which employs a fixed-size buffer to represent the observed data for calculating the pairwise loss functions. A representative study is available in [6], which leveraged the reservoir sampling technique to represent the observed data instances by a fixed-size buffer, and for which notable theoretical and empirical results have been reported. Then, [8] studied the improved generalization capability of online learning algorithms for pairwise loss functions within the framework of buffer sampling. The main contribution of their work is the introduction of stream subsampling with replacement as the buffer update strategy. The other framework, which takes a different perspective, was presented by [7]. They extended the previous online AUC maximization framework with a regression-based one-pass learning mode, and achieved solid regret bounds by considering the square loss for the AUC optimization task due to its theoretical consistency with AUC.

Sparse Online Learning. High dimensionality and high sparsity are two important issues for large-scale machine learning tasks. Many previous efforts have been devoted to tackling these issues in the batch setting, but they usually suffer from poor scalability when dealing with big data. Recent years have witnessed extensive research studies on sparse online learning [24], [25], [26], [27], which aim to learn sparse classifiers by limiting the number of active features. There are two core categories of methods for sparse online learning. Representative work of the first type follows the general framework of subgradient descent with truncation. The FOBOS algorithm [25], for example, is based on the Forward-Backward Splitting method and solves the sparse online learning problem by alternating between two phases: (i) an unconstrained stochastic subgradient descent step with respect to the loss function, and (ii) an instantaneous optimization that trades off keeping close proximity to the result of the first step against minimizing the ℓ1 regularization term. Following this strategy, [24] argues that the truncation at each step is too aggressive and thus proposes the Truncated Gradient (TG) method, which alleviates the updates by truncating the coefficients only every K steps when they are lower than a predefined threshold. The second category of methods is mainly motivated by the dual averaging method [28]. The most popular method in this category is the Regularized Dual Averaging (RDA) method [26], which solves the optimization problem by using the running average of all past subgradients of the loss functions and the whole regularization term, instead of only the current subgradient. In this manner, the RDA method has been shown to exploit the regularization structure more easily in the online phase and obtain the desired regularization effects more efficiently.

Despite the extensive works in these different fields of machine learning, to the best of our knowledge, our current work represents the first effort to explore adaptive gradient optimization and second-order learning techniques for online AUC maximization in both regular and sparse online learning settings.

3 ADAPTIVE SUBGRADIENT METHODS FOR OAM

3.1 Problem Setting

We aim to learn a linear classification model that maximizes AUC for a binary classification problem. Without loss of generality, we assume the positive class is the minority class, i.e., there are fewer positive than negative instances. Denote by (x_t, y_t) the training instance received at the t-th trial, where x_t ∈ R^d and y_t ∈ {−1, +1}, and let w_t ∈ R^d be the weight vector learned so far.

Given this setting, let us define the AUC measurement [1] for the binary classification task. Given a dataset D = {(x_i, y_i) ∈ R^d × {−1, +1} | i ∈ [n]}, where [n] = {1, 2, ..., n}, we divide it naturally into two sets: the set of positive instances D^+ = {(x_i^+, +1) | i ∈ [n_+]} and the set of negative instances D^- = {(x_j^-, −1) | j ∈ [n_-]}, where n_+ and n_- are the numbers of positive and negative instances, respectively. For a linear classifier w ∈ R^d, its AUC measurement on D is defined as follows:

    AUC(w) = ( Σ_{i=1}^{n_+} Σ_{j=1}^{n_-} [ I_(w·x_i^+ > w·x_j^-) + (1/2) I_(w·x_i^+ = w·x_j^-) ] ) / (n_+ n_-),

where I_π is the indicator function that outputs 1 if the predicate π holds and 0 otherwise.
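To make the pairwise nature of this measure concrete, the following minimal sketch (our illustration, not the implementation used in the experiments; all function and variable names are ours) evaluates AUC(w) for a linear scorer directly from the definition above.

```python
import numpy as np

def auc_linear(w, X_pos, X_neg):
    """AUC of the linear scorer x -> w.x, computed from its definition:
    the fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    s_pos = X_pos @ w                             # scores of the n+ positive instances
    s_neg = X_neg @ w                             # scores of the n- negative instances
    diff = s_pos[:, None] - s_neg[None, :]        # all n+ x n- pairwise score gaps
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(s_pos) * len(s_neg))

# toy usage with synthetic data
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X_pos = rng.normal(1.0, 1.0, size=(8, 5))
X_neg = rng.normal(0.0, 1.0, size=(12, 5))
print(auc_linear(w, X_pos, X_neg))
```

The quadratic number of pairs in this direct computation is exactly what makes naive online AUC optimization expensive, motivating the surrogate-loss formulation that follows.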
We replace the indicator function with the following convex surrogate, i.e., the square loss from [7], due to its consistency with AUC [9]:

    ℓ(w, x_i^+ − x_j^-) = (1 − w·(x_i^+ − x_j^-))^2,

and find the optimal classifier by minimizing the following objective function:

    L(w) = (λ/2)‖w‖_2^2 + ( Σ_{i=1}^{n_+} Σ_{j=1}^{n_-} ℓ(w, x_i^+ − x_j^-) ) / (2 n_+ n_-),   (1)

where (λ/2)‖w‖_2^2 is introduced to regularize the complexity of the linear classifier. Note that the optimal w_* satisfies ‖w_*‖_2 ≤ 1/√λ according to the strong duality theorem.

3.2 Adaptive Online AUC Maximization

Here, we introduce the proposed Adaptive Online AUC Maximization (AdaOAM) algorithm. Following a similar approach to [7], we modify the loss function L(w) in (1) into a sum of losses for individual training instances, Σ_{t=1}^T L_t(w), where

    L_t(w) = (λ/2)‖w‖_2^2 + ( Σ_{i=1}^{t−1} I_[y_i ≠ y_t] (1 − y_t (x_t − x_i)^T w)^2 ) / ( 2 |{i ∈ [t−1] : y_i y_t = −1}| ),   (2)

for an i.i.d. sequence S_t = {(x_i, y_i) | i ∈ [t]}; this is an unbiased estimate of L(w). We denote by X_t^+ and X_t^- the sets of positive and negative instances of S_t, respectively, and by T_t^+ and T_t^- their respective cardinalities. Besides, L_t(w) is set to 0 when T_t^+ T_t^- = 0. If y_t = 1, the gradient of L_t is

    ∇L_t(w) = λw + x_t x_t^T w − x_t + ( Σ_{i: y_i = −1} [ x_i + (x_i x_i^T − x_i x_t^T − x_t x_i^T) w ] ) / T_t^-.

If we use c_t^- = (1/T_t^-) Σ_{i: y_i = −1} x_i and S_t^- = (1/T_t^-) Σ_{i: y_i = −1} (x_i x_i^T − c_t^-[c_t^-]^T) to denote the mean and covariance matrix of the negative class, respectively, the gradient of L_t can be simplified as

    ∇L_t(w) = λw − x_t + c_t^- + (x_t − c_t^-)(x_t − c_t^-)^T w + S_t^- w.   (3)

Similarly, if y_t = −1,

    ∇L_t(w) = λw + x_t − c_t^+ + (x_t − c_t^+)(x_t − c_t^+)^T w + S_t^+ w,   (4)

where c_t^+ = (1/T_t^+) Σ_{i: y_i = 1} x_i and S_t^+ = (1/T_t^+) Σ_{i: y_i = 1} (x_i x_i^T − c_t^+[c_t^+]^T) are the mean and covariance matrix of the positive class, respectively.

Upon obtaining the gradient g_t = ∇L_t(w_t), a simple solution is to move the weight w_t in the opposite direction of g_t while keeping ‖w_{t+1}‖ ≤ 1/√λ via the projected gradient update [29]

    w_{t+1} = Π_{1/√λ}(w_t − η g_t) = argmin_{‖w‖ ≤ 1/√λ} ‖w − (w_t − η g_t)‖_2^2,

which is valid due to ‖w_*‖ ≤ 1/√λ.

However, the above scheme is clearly insufficient, since it simply assigns the same learning rate to all features. In order to perform feature-wise gradient updating, we adopt a second-order gradient optimization method, i.e., an adaptive gradient updating strategy, as inspired by [10]. Specifically, we denote by g_{1:t} = [g_1 ... g_t] the matrix obtained by concatenating the gradient sequence. The i-th row of this matrix is g_{1:t,i}, the concatenation of the i-th component of each gradient. In addition, we define the outer product matrix G_t = Σ_{τ=1}^t g_τ g_τ^T. Using these notations, the generalization of the standard adaptive gradient descent leads to the following weight update:

    w_{t+1} = Π^{G_t^{1/2}}_{1/√λ}(w_t − η G_t^{−1/2} g_t),

where Π^A_{1/√λ}(u) = argmin_{‖w‖ ≤ 1/√λ} ‖w − u‖_A = argmin_{‖w‖ ≤ 1/√λ} ⟨w − u, A(w − u)⟩ denotes the projection of a point u onto the set {w | ‖w‖ ≤ 1/√λ} with respect to the Mahalanobis norm ‖·‖_A.

However, an obvious drawback of the above update lies in the significant computational effort needed to handle high-dimensional data, since it requires calculating the root and inverse root of the outer product matrix G_t. In order to make the algorithm more efficient, we use the diagonal proxy of G_t, and thus the update becomes

    w_{t+1} = Π^{diag(G_t)^{1/2}}_{1/√λ}(w_t − η diag(G_t)^{−1/2} g_t).   (5)

In this way, both the root and the inverse root of diag(G_t) can be computed in linear time. Furthermore, as we discuss later, when the gradient vectors are sparse, the update above can be conducted more efficiently, in time proportional to the support of the gradient.

Another issue with the updating rule (5) is that diag(G_t) may not be invertible in all coordinates. To address this issue, we replace it with H_t = δI + diag(G_t)^{1/2}, where δ > 0 is a smoothing parameter. The parameter δ is introduced to make the diagonal matrix invertible and the algorithm robust; it is usually set to a very small value so that it has little influence on the learning results. Given H_t, the feature-wise adaptive update can be computed as

    w_{t+1} = Π^{H_t}_{1/√λ}(w_t − η H_t^{−1} g_t).   (6)
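For concreteness, the snippet below sketches one step of the feature-wise update (6): it accumulates the squared gradients that form diag(G_t), rescales the gradient by H_t^{−1} = (δI + diag(G_t)^{1/2})^{−1}, and then projects back onto {w : ‖w‖ ≤ 1/√λ}. This is an illustrative sketch only (the names are ours), and for simplicity it uses a Euclidean projection rather than the exact ‖·‖_{H_t}-norm projection assumed in the analysis.

```python
import numpy as np

def adaptive_step(w, g, sq_grad_sum, eta, lam, delta=1e-8):
    """One feature-wise adaptive gradient step in the spirit of update (6).

    sq_grad_sum accumulates the per-coordinate sum of squared gradients
    (the diagonal of G_t); H = delta + sqrt(sq_grad_sum) scales each
    coordinate's effective learning rate.  The projection back onto
    {w : ||w|| <= 1/sqrt(lam)} is done in the Euclidean norm here,
    which is a simplification of the ||.||_{H_t} projection in the paper.
    """
    sq_grad_sum = sq_grad_sum + g * g            # diag(G_t) update
    H = delta + np.sqrt(sq_grad_sum)             # per-coordinate H_{t,ii}
    w = w - eta * g / H                          # feature-wise scaled step
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    if norm > radius:                            # project back onto the ball
        w = w * (radius / norm)
    return w, sq_grad_sum
```

Coordinates whose gradients have historically been small end up with a small H value and hence a large effective step size, which is exactly the behavior motivated next.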
The intuition behind the update rule (6) is very natural: it treats rarely occurring features as more informative and discriminative than frequently occurring ones. Thus, these informative, rarely occurring features should be updated with higher learning rates, by incorporating the geometrical property of the data observed in earlier stages. Besides, by using the previously observed gradients, the update process can mitigate the effects of noise and, intuitively, speed up the convergence rate.

So far, we have presented the key framework of the basic update rule for model learning, except for the details of the gradient calculation. From the gradient equations (3) and (4), we need to maintain and update the mean vectors and covariance matrices of the incoming instance sequence. The mean vectors are easy to compute and store, while the covariance matrices are more difficult to update in the online setting. Therefore, we provide a simplified update scheme for the covariance matrix computation. For an incoming instance sequence x_1, x_2, ..., x_t, the covariance matrix S_t is given by

    S_t = (1/t) Σ_{i=1}^t x_i x_i^T − c_t [c_t]^T
        = ((t−1)/t) · (1/(t−1)) Σ_{i=1}^{t−1} x_i x_i^T + (1/t) x_t x_t^T − c_t [c_t]^T
        = ((t−1)/t) ( (1/(t−1)) Σ_{i=1}^{t−1} x_i x_i^T − c_{t−1}[c_{t−1}]^T + c_{t−1}[c_{t−1}]^T ) + (1/t) x_t x_t^T − c_t [c_t]^T
        = ((t−1)/t) ( S_{t−1} + c_{t−1}[c_{t−1}]^T ) + (1/t) x_t x_t^T − c_t [c_t]^T
        = S_{t−1} − (1/t) ( S_{t−1} + c_{t−1}[c_{t−1}]^T − x_t x_t^T ) + c_{t−1}[c_{t−1}]^T − c_t [c_t]^T.

Then, in our gradient update, setting Γ_t^+ = S_t^+ and Γ_t^- = S_t^-, the covariance matrices are updated as follows:

    Γ_t^+ = Γ_{t−1}^+ + c_{t−1}^+[c_{t−1}^+]^T − c_t^+[c_t^+]^T + ( x_t x_t^T − Γ_{t−1}^+ − c_{t−1}^+[c_{t−1}^+]^T ) / T_t^+,

    Γ_t^- = Γ_{t−1}^- + c_{t−1}^-[c_{t−1}^-]^T − c_t^-[c_t^-]^T + ( x_t x_t^T − Γ_{t−1}^- − c_{t−1}^-[c_{t−1}^-]^T ) / T_t^-.

It can be observed that the above updates fit the online setting well. Finally, Algorithm 1 summarizes the proposed AdaOAM method.

Algorithm 1 The Adaptive OAM Algorithm (AdaOAM)
Input: the regularization parameter λ, the learning rates {η_t}_{t=1}^T, the smoothing parameter δ ≥ 0.
Output: updated classifier w_t.
Variables: s ∈ R^d, H ∈ R^{d×d}, g_{1:t,i} ∈ R^t for i ∈ {1, ..., d}.
Initialize w_0 = 0, c_0^+ = c_0^- = 0, Γ_0^+ = Γ_0^- = [0]_{d×d}, T_0^+ = T_0^- = 0, g_{1:0} = [].
for t = 1, 2, ..., T do
    Receive an incoming instance (x_t, y_t);
    if y_t = +1 then
        T_t^+ = T_{t−1}^+ + 1, T_t^- = T_{t−1}^-;
        c_t^+ = c_{t−1}^+ + (1/T_t^+)(x_t − c_{t−1}^+) and c_t^- = c_{t−1}^-;
        Update Γ_t^+ and set Γ_t^- = Γ_{t−1}^-;
    else
        T_t^- = T_{t−1}^- + 1, T_t^+ = T_{t−1}^+;
        c_t^- = c_{t−1}^- + (1/T_t^-)(x_t − c_{t−1}^-) and c_t^+ = c_{t−1}^+;
        Update Γ_t^- and set Γ_t^+ = Γ_{t−1}^+;
    end if
    Receive gradient g_t = ∇L_t(w);
    Update g_{1:t} = [g_{1:t−1} g_t], s_{t,i} = ‖g_{1:t,i}‖_2;
    H_t = δI + diag(s_t);
    ĝ_t = H_t^{−1} g_t;
    w_t = Π^{H_t}_{1/√λ}(w_{t−1} − η_t ĝ_t);
end for
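As a quick sanity check of the streaming mean/covariance recursion used in Algorithm 1, the following sketch (ours, not the authors' code) applies the update instance by instance and verifies that it reproduces the batch definition S_t = (1/t) Σ_i x_i x_i^T − c_t c_t^T.

```python
import numpy as np

def update_mean_cov(c, gamma, n, x):
    """Streaming update of the class mean c and covariance gamma after the
    n-th instance x of that class, following the recursion above:
        c_n     = c_{n-1} + (x - c_{n-1}) / n
        Gamma_n = Gamma_{n-1} + c_{n-1} c_{n-1}^T - c_n c_n^T
                  + (x x^T - Gamma_{n-1} - c_{n-1} c_{n-1}^T) / n
    """
    c_new = c + (x - c) / n
    gamma_new = (gamma + np.outer(c, c) - np.outer(c_new, c_new)
                 + (np.outer(x, x) - gamma - np.outer(c, c)) / n)
    return c_new, gamma_new

# verify against the batch definition S = (1/n) sum_i x_i x_i^T - c c^T
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
c, gamma = np.zeros(4), np.zeros((4, 4))
for n, x in enumerate(X, start=1):
    c, gamma = update_mean_cov(c, gamma, n, x)
batch = X.T @ X / len(X) - np.outer(X.mean(axis=0), X.mean(axis=0))
assert np.allclose(gamma, batch)
```

Only the statistics of the class matching y_t are touched at step t, so the per-iteration cost of this bookkeeping is one rank-one update per class.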
3.3 Fast AdaOAM Algorithm for High-dimensional Sparse Data

A characteristic of the proposed AdaOAM algorithm described above is that it exploits the full set of features for weight learning, which may not be suitable or scalable for high-dimensional sparse data tasks. For example, in spam email detection tasks, the length of the vocabulary list can reach the million scale. Although the number of features is large, many feature inputs are zero and do not provide any information to the detection task. The research work in [30] has shown that classification performance saturates with dozens of features out of tens of thousands of features.

Taking the cue, in order to improve the efficiency and scalability of the AdaOAM algorithm when working with high-dimensional sparse data, we propose the Sparse AdaOAM algorithm (SAdaOAM), which learns a sparse linear classifier that contains a limited number of active features. In particular, SAdaOAM addresses the issue of sparsity in the learned model while maintaining the efficacy of the original AdaOAM. To summarize, SAdaOAM has two benefits over the original AdaOAM algorithm: a simple covariance matrix update and a sparse model update. Next, we introduce these properties separately.

First, we employ a simpler covariance matrix update rule for handling high-dimensional sparse data than in AdaOAM. The motivation for a different update scheme is that using the original covariance update rule of the AdaOAM method on high-dimensional data would lead to extremely high computational and storage costs, i.e., several matrix operations among multiple variables would be necessary in the Γ update formulations. Therefore, we fall back to the standard definition of the covariance matrix and consider a simpler update method. Since the standard definition of the covariance matrix is S_t = (1/t) Σ_{i=1}^t x_i x_i^T − c_t [c_t]^T, we only need to maintain the mean vector and the sum of outer products of the instances at each iteration for the covariance update. In this case, we denote Z_t^+ = Σ_{i=1}^t x_i^+ [x_i^+]^T and Z_t^- = Σ_{i=1}^t x_i^- [x_i^-]^T. Then, the covariance matrices S_t^+ and S_t^- can be formulated as

    S_t^+ = Z_t^+ / T_t^+ − c_t^+ [c_t^+]^T  and  S_t^- = Z_t^- / T_t^- − c_t^- [c_t^-]^T.

At each iteration, one only needs to update Z_t^{y_t} with Z_t^{y_t} = Z_{t−1}^{y_t} + x_t [x_t]^T, together with the mean vector of the corresponding class. With the above update scheme, lower computational and storage costs are attained, since most of the elements in the covariance matrices are zero for high-dimensional sparse data.
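A minimal sketch of this bookkeeping is shown below (a hypothetical helper class of ours, assuming instances arrive as 1×d scipy.sparse rows): only the mean and the raw second-moment matrix Z are touched per step, and the covariance is materialized only when it is actually needed.

```python
import numpy as np
from scipy import sparse

class SparseClassStats:
    """Per-class statistics kept as in the SAdaOAM bookkeeping: the mean
    vector c and the raw second-moment matrix Z = sum_i x_i x_i^T, so that
    the covariance S = Z/T - c c^T never has to be updated directly."""

    def __init__(self, d):
        self.T = 0
        self.c = np.zeros(d)
        self.Z = sparse.csr_matrix((d, d))       # stays sparse for sparse data

    def add(self, x_row):
        """x_row is a 1 x d scipy.sparse row for the incoming instance."""
        self.T += 1
        self.c += (x_row.toarray().ravel() - self.c) / self.T
        self.Z = self.Z + x_row.T @ x_row        # rank-one sparse outer product

    def covariance(self):
        # materialized densely here only for illustration; in practice it is
        # used implicitly inside the gradient computation
        return self.Z.toarray() / self.T - np.outer(self.c, self.c)

# example: a sparse instance with two non-zero features out of five
x = sparse.csr_matrix(([1.0, 2.0], ([0, 0], [1, 3])), shape=(1, 5))
stats = SparseClassStats(5)
stats.add(x)
```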
After presenting the efficient scheme for updating the covariance matrices on high-dimensional sparse data, we proceed to present the method for addressing sparsity in the learned model and second-order adaptive gradient updates simultaneously. Here we consider imposing the soft-constraint ℓ1 norm regularization ϕ(w) = θ‖w‖_1 on the objective function (2). The new objective function is

    L_t(w) = (λ/2)‖w‖_2^2 + ( Σ_{i=1}^{t−1} I_[y_i ≠ y_t] (1 − y_t (x_t − x_i)^T w)^2 ) / ( 2 |{i ∈ [t−1] : y_i y_t = −1}| ) + θ‖w‖_1.   (7)

In order to optimize this objective function, we apply the composite mirror descent method [31], which achieves a trade-off between the immediate adaptive gradient term g_t and the regularizer ϕ(w). We denote the i-th diagonal element of the matrix H_t by H_{t,ii} = δ + ‖g_{1:t,i}‖_2. We now give the derivation of the composite mirror descent gradient update with ℓ1 regularization.

Following the framework of the composite mirror descent update in [10], the update to solve is

    w_{t+1} = argmin_{‖w‖ ≤ 1/√λ} { η⟨g_t, w⟩ + ηϕ(w) + B_{ψ_t}(w, w_t) },   (8)

where g_t = ∇L_t(w_t) and B_{ψ_t}(w, w_t) is the Bregman divergence associated with ψ_t (see the details in the proof of Theorem 1). After expansion, this update amounts to

    min_w η⟨g_t, w⟩ + ηϕ(w) + (1/2)⟨w − w_t, H_t (w − w_t)⟩.   (9)

For easier derivation, we rearrange and rewrite the above objective function as

    min_w ⟨ηg_t − H_t w_t, w⟩ + (1/2)⟨w, H_t w⟩ + (1/2)⟨w_t, H_t w_t⟩ + ηθ‖w‖_1.

Let ŵ denote the optimal solution of the above optimization problem. Standard subgradient calculus indicates that when |w_{t,i} − (η/H_{t,ii}) g_{t,i}| ≤ ηθ/H_{t,ii}, the solution is ŵ_i = 0. When w_{t,i} − (η/H_{t,ii}) g_{t,i} < −ηθ/H_{t,ii}, we have ŵ_i < 0, the objective is differentiable, and the solution is obtained by setting the gradient to zero:

    η g_{t,i} + H_{t,ii}(ŵ_i − w_{t,i}) − ηθ = 0,

so that

    ŵ_i = w_{t,i} − (η/H_{t,ii}) g_{t,i} + ηθ/H_{t,ii}.

Similarly, when w_{t,i} − (η/H_{t,ii}) g_{t,i} > ηθ/H_{t,ii}, then ŵ_i > 0 and the solution is

    ŵ_i = w_{t,i} − (η/H_{t,ii}) g_{t,i} − ηθ/H_{t,ii}.

Combining these three cases, we obtain the coordinate-wise update for w_{t+1,i}:

    w_{t+1,i} = sign(w_{t,i} − (η/H_{t,ii}) g_{t,i}) [ |w_{t,i} − (η/H_{t,ii}) g_{t,i}| − ηθ/H_{t,ii} ]_+.

The complete sparse online AUC maximization approach using the adaptive gradient updating algorithm (SAdaOAM) is summarized in Algorithm 2.

Algorithm 2 The Sparse AdaOAM Algorithm (SAdaOAM)
Input: the regularization parameters λ and θ, the learning rates {η_t}_{t=1}^T, the smoothing parameter δ ≥ 0.
Output: updated classifier w_{t+1}.
Variables: s ∈ R^d, H ∈ R^{d×d}, g_{1:t,i} ∈ R^t for i ∈ {1, ..., d}.
Initialize w_0 = 0, c_0^+ = c_0^- = 0, Z_0^+ = Z_0^- = [0]_{d×d}, T_0^+ = T_0^- = 0, g_{1:0} = [].
for t = 1, 2, ..., T do
    Receive an incoming instance (x_t, y_t);
    if y_t = +1 then
        T_t^+ = T_{t−1}^+ + 1, T_t^- = T_{t−1}^-;
        c_t^+ = c_{t−1}^+ + (1/T_t^+)(x_t − c_{t−1}^+) and c_t^- = c_{t−1}^-;
        Update Z_t^+ = Z_{t−1}^+ + x_t [x_t]^T and Z_t^- = Z_{t−1}^-;
    else
        T_t^- = T_{t−1}^- + 1, T_t^+ = T_{t−1}^+;
        c_t^- = c_{t−1}^- + (1/T_t^-)(x_t − c_{t−1}^-) and c_t^+ = c_{t−1}^+;
        Update Z_t^- = Z_{t−1}^- + x_t [x_t]^T and Z_t^+ = Z_{t−1}^+;
    end if
    Receive gradient g_t = ∇L_t(w);
    Update g_{1:t} = [g_{1:t−1} g_t], s_{t,i} = ‖g_{1:t,i}‖_2;
    H_{t,ii} = δ + ‖g_{1:t,i}‖_2;
    w_{t+1,i} = sign(w_{t,i} − (η/H_{t,ii}) g_{t,i}) [ |w_{t,i} − (η/H_{t,ii}) g_{t,i}| − ηθ/H_{t,ii} ]_+;
end for

From Algorithm 2, it can be observed that we perform "lazy" computation when the gradient vectors are sparse [10]. Suppose that, from iteration step t_0 to t, the i-th component of the gradient is 0. Then, we can evaluate the updates on demand, since H_{t,ii} remains intact. Therefore, at iteration step t, when w_{t,i} is needed, the update is

    w_{t,i} = sign(w_{t_0,i}) [ |w_{t_0,i}| − (ηθ/H_{t_0,ii})(t − t_0) ]_+,

where [a]_+ means max(0, a). Obviously, this type of "lazy" update enjoys high efficiency.
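The coordinate-wise proximal step and the lazy catch-up can be written compactly as follows (an illustrative sketch with our own function names; the projection onto {w : ‖w‖ ≤ 1/√λ} and the gradient computation of L_t are omitted for brevity).

```python
import numpy as np

def sadaoam_prox_step(w, g, grad_norm_sq, eta, theta, delta=1e-8):
    """Coordinate-wise composite mirror descent step of the SAdaOAM form:
    an adaptive gradient move followed by soft-thresholding with the
    per-coordinate threshold eta*theta / H_{t,ii}."""
    grad_norm_sq = grad_norm_sq + g * g          # running ||g_{1:t,i}||_2^2
    H = delta + np.sqrt(grad_norm_sq)            # H_{t,ii}
    z = w - eta * g / H                          # adaptive gradient move
    w_new = np.sign(z) * np.maximum(np.abs(z) - eta * theta / H, 0.0)
    return w_new, grad_norm_sq

def lazy_catch_up(w_i, H_ii, eta, theta, steps_skipped):
    """'Lazy' update for a coordinate whose gradient has been zero for
    `steps_skipped` rounds: the skipped soft-thresholding steps collapse
    into a single shrinkage, since H_{t,ii} has not changed meanwhile."""
    return np.sign(w_i) * max(abs(w_i) - (eta * theta / H_ii) * steps_skipped, 0.0)
```

In this form it is easy to see where the sparsity comes from: any coordinate whose adaptive step stays within the threshold band is set exactly to zero, and inactive coordinates only shrink further under the lazy rule.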
4 THEORETICAL ANALYSIS

In this section, we provide regret bounds for the proposed AdaOAM algorithms for handling regular and high-dimensional sparse data, respectively.

4.1 Regret Bounds with Regular Data

Firstly, we introduce two lemmas, which will be used to facilitate our subsequent analyses.

Lemma 1. Let g_t, g_{1:t}, and s_t be defined as in Algorithm 1. Then

    Σ_{t=1}^T ⟨g_t, diag(s_t)^{−1} g_t⟩ ≤ 2 Σ_{i=1}^d ‖g_{1:T,i}‖_2.

Lemma 2. Let the sequence {w_t, g_t} ⊂ R^d be generated by the composite mirror descent update in Equation (12) and assume that sup_{w,u∈χ} ‖w − u‖_∞ ≤ D_∞. Using the learning rate η = D_∞/√2, for any optimal w_*, the following bound holds:

    Σ_{t=1}^T [L_t(w_t) − L_t(w_*)] ≤ √2 D_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2.

These two lemmas are Lemma 4 and Corollary 1 in [10].

Using these two lemmas, we can derive the following theorem for the proposed AdaOAM algorithm.

Theorem 1. Assume ‖w_t‖ ≤ 1/√λ (∀t ∈ [T]) and that the diameter of χ = {w | ‖w‖ ≤ 1/√λ} is bounded via sup_{w,u∈χ} ‖w − u‖_∞ ≤ D_∞. Then

    Σ_{t=1}^T [L_t(w_t) − L_t(w_*)] ≤ 2 D_∞ Σ_{i=1}^d √( Σ_{t=1}^T [(λ w_{t,i})^2 + C (r_{t,i})^2] ),

where C ≤ (1 + 2/√λ)^2 and r_{t,i} = max_{j<t} |x_{j,i} − x_{t,i}|.

Proof. We first define w_* as w_* = argmin_w Σ_t L_t(w). Based on the regularizer (λ/2)‖w‖_2^2, it is easy to obtain ‖w_*‖^2 ≤ 1/λ due to the strong convexity property, and it is also reasonable to restrict w_t with ‖w_t‖^2 ≤ 1/λ. Denoting by Π^{H_t}_{1/√λ}(w) = argmin_{‖u‖ ≤ 1/√λ} ‖u − w‖_{H_t} the projection of a point w onto {u | ‖u‖_2 ≤ 1/√λ} with respect to the norm ‖·‖_{H_t}, AdaOAM actually employs the following update rule:

    w_{t+1} = Π^{H_t}_{1/√λ}(w_t − η H_t^{−1} g_t),   (10)

where H_t = δI + diag(s_t) and δ ≥ 0.

If we denote ψ_t(g_t) = ⟨g_t, H_t g_t⟩ and the dual norm of ‖·‖_{ψ_t} by ‖·‖_{ψ_t^*}, in which case ‖g_t‖_{ψ_t^*} = ‖g_t‖_{H_t^{−1}}, then it is easy to check that the update rule (10) is the same as the following composite mirror descent method:

    w_{t+1} = argmin_{‖w‖ ≤ 1/√λ} { η⟨g_t, w⟩ + ηϕ(w) + B_{ψ_t}(w, w_t) },   (11)

where the regularization function ϕ ≡ 0, and B_{ψ_t}(w, w_t) is the Bregman divergence associated with the strongly convex and differentiable function ψ_t:

    B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩.

Since ϕ ≡ 0 in the case of regular data, the regret bound is R(T) = Σ_{t=1}^T [L_t(w_t) − L_t(w_*)].
Then, we follow the derivation results of [10] and attain the following regret bound:

    Σ_{t=1}^T [L_t(w_t) − L_t(w_*)] ≤ √2 D_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2,

where χ = {w | ‖w‖ ≤ 1/√λ} is bounded via sup_{w,u∈χ} ‖w − u‖_∞ ≤ D_∞. Next, we analyze how the gradients depend on the features of the data. Since

    (g_{t,i})^2 ≤ [ λ w_{t,i} + ( Σ_{j=1}^{t−1} (1 − y_t⟨x_t − x_j, w⟩) y_t (x_{j,i} − x_{t,i}) ) / T_t^- ]^2 ≤ 2(λ w_{t,i})^2 + 2C (x_{j,i} − x_{t,i})^2 = 2(λ w_{t,i})^2 + 2C (r_{t,i})^2,

where C ≤ (1 + 2/√λ)^2 is a constant that bounds the scalar factor of the second term on the right-hand side, and r_{t,i} = max_{j<t} |x_{j,i} − x_{t,i}|, we have

    Σ_{i=1}^d ‖g_{1:T,i}‖_2 = Σ_{i=1}^d √( Σ_{t=1}^T (g_{t,i})^2 ) ≤ √2 Σ_{i=1}^d √( Σ_{t=1}^T [(λ w_{t,i})^2 + C (r_{t,i})^2] ).

Finally, combining the above inequalities, we arrive at

    Σ_{t=1}^T [L_t(w_t) − L_t(w_*)] ≤ 2 D_∞ Σ_{i=1}^d √( Σ_{t=1}^T [(λ w_{t,i})^2 + C (r_{t,i})^2] ).

From the proof above, we can conclude that Algorithm 1 should have a lower regret than non-adaptive algorithms, due to its dependence on the geometry of the underlying data space. If the features are normalized and sparse, the gradient terms in the bound Σ_{i=1}^d ‖g_{1:T,i}‖_2 should be much smaller than √T, which leads to lower regret and faster convergence. If the feature space is relatively dense, then the convergence rate will be O(1/√T) in the general case, as for the OPAUC and OAM methods.

4.2 Regret Bounds with High-dimensional Sparse Data

Theorem 2. Assume ‖w_t‖ ≤ 1/√λ (∀t ∈ [T]) and that the diameter of χ = {w | ‖w‖ ≤ 1/√λ} is bounded via sup_{w,u∈χ} ‖w − u‖_∞ ≤ D_∞. The regret bound with respect to the ℓ1 regularization term is

    Σ_{t=1}^T [L_t(w_t) + θ‖w_t‖_1 − L_t(w_*) − θ‖w_*‖_1] ≤ 2 D_∞ Σ_{i=1}^d √( Σ_{t=1}^T [(λ w_{t,i})^2 + C (r_{t,i})^2] ),

where w_* = argmin_w Σ_{t=1}^T [L_t(w) + θ‖w‖_1], C ≤ (1 + 2/√λ)^2, and r_{t,i} = max_{j<t} |x_{j,i} − x_{t,i}|.

Proof. In the case of high-dimensional sparse adaptive online AUC maximization, the regret we aim to bound with respect to the optimal weight w_* is formulated as

    R(T) = Σ_{t=1}^T [L_t(w_t) + ϕ(w_t) − L_t(w_*) − ϕ(w_*)],

where ϕ(w) = θ‖w‖_1 is the ℓ1 regularization term imposing sparsity on the solution. Similarly, if we denote ψ_t(g_t) = ⟨g_t, H_t g_t⟩ and the dual norm of ‖·‖_{ψ_t} by ‖·‖_{ψ_t^*}, in which case ‖g_t‖_{ψ_t^*} = ‖g_t‖_{H_t^{−1}}, it is easy to check that the updating rule of SAdaOAM,

    w_{t+1,i} = sign(w_{t,i} − (η/H_{t,ii}) g_{t,i}) [ |w_{t,i} − (η/H_{t,ii}) g_{t,i}| − ηθ/H_{t,ii} ]_+,

is the same as

    w_{t+1} = argmin_{‖w‖ ≤ 1/√λ} { η⟨g_t, w⟩ + ηϕ(w) + B_{ψ_t}(w, w_t) },   (12)

where ϕ(w) = θ‖w‖_1. From [10], we have

    R(T) ≤ (1/(2η)) max_{t≤T} ‖w_* − w_t‖_∞^2 Σ_{i=1}^d ‖g_{1:T,i}‖_2 + η Σ_{i=1}^d ‖g_{1:T,i}‖_2.

Furthermore, assuming sup_{w,u∈χ} ‖w − u‖_∞ ≤ D_∞ and setting η = D_∞/√2, the final regret bound is

    R(T) ≤ √2 D_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2.

This theoretical result shows that the regret bound for the sparse solution is the same as that in the case when ϕ ≡ 0.

As discussed above, the SAdaOAM algorithm should have a lower regret bound than non-adaptive algorithms on high-dimensional sparse data, though this depends on the geometric property of the underlying feature distribution. If some features appear much more frequently than others, then the term Σ_{i=1}^d ‖g_{1:T,i}‖_2 indicates that we could obtain remarkably lower regret by using higher learning rates for infrequent features and lower learning rates for frequently occurring features.
5 EXPERIMENTAL RESULTS

In this section, we evaluate the proposed set of AdaOAM algorithms in terms of AUC performance and convergence rate, and examine their parameter sensitivity. The main framework of the experiments is based on LIBOL, an open-source library for online learning algorithms [32] (http://libol.stevenhoi.org/).

5.1 Comparison Algorithms

We conduct comprehensive empirical studies by comparing the proposed algorithms with various AUC optimization algorithms for both online and batch scenarios. Specifically, the algorithms considered in our experiments include:

• Online Uni-Exp: an online learning algorithm which optimizes the (weighted) univariate exponential loss [33];
• Online Uni-Log: an online learning algorithm which optimizes the (weighted) univariate logistic loss [33];
• OAMseq: the OAM algorithm with reservoir sampling and sequential updating [6];
• OAMgra: the OAM algorithm with reservoir sampling and online gradient updating [6];
• OPAUC: the one-pass AUC optimization algorithm with square loss function [7];
• SVM-perf: a batch algorithm which directly optimizes AUC [4];
• CAPO: a batch algorithm which trains nonlinear auxiliary classifiers first and then adapts the auxiliary classifiers for specific performance measures [34];
• Batch Uni-Log: a batch algorithm which optimizes the (weighted) univariate logistic loss [33];
• Batch Uni-Squ: a batch algorithm which optimizes the (weighted) univariate square loss;
• AdaOAM: the proposed adaptive gradient method for online AUC maximization;
• SAdaOAM: the proposed sparse adaptive subgradient method for online AUC maximization.

It is noted that OAMseq, OAMgra, and OPAUC are the state-of-the-art methods for AUC maximization in online settings. For batch learning scenarios, CAPO and SVM-perf are both strong baselines to compare against.

5.2 Experimental Testbed and Setup

To examine the performance of the proposed AdaOAM in comparison to the existing state-of-the-art methods, we conduct extensive experiments on sixteen benchmark datasets, maintaining consistency with the previous studies on online AUC maximization [6], [7]. Table 1 shows the details of the 16 binary-class datasets used in our experiments. All of these datasets can be downloaded from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). Note that several datasets (svmguide4, vehicle) are originally multi-class and were converted into class-imbalanced binary datasets for the purpose of our experimental studies.

In the experiments, the features have been normalized, i.e., x_t ← x_t/‖x_t‖, which is reasonable since instances are received sequentially in the online learning setting. Each dataset has been randomly divided into 5 folds, of which 4 folds are used for training and the remaining fold for testing. We also generate 4 independent 5-fold partitions per dataset to further reduce the effect of random partitioning on the algorithms; therefore, the reported AUC values are the averages of 20 runs for each dataset. 5-fold cross validation is conducted on the training sets to decide the learning rate η ∈ 2^[−10:10] and the regularization parameter λ ∈ 2^[−10:6]. For OAMgra and OAMseq, the buffer size is fixed at 100 as suggested in [6]. All experiments for the online setting comparisons were conducted with MATLAB on a computer workstation with 16GB memory and a 3.20GHz CPU. On the other hand, for fair comparisons in the batch setting, the core steps of the algorithms were implemented in C++, since we directly use the respective toolboxes provided by the authors of the SVM-perf (http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) and CAPO (http://lamda.nju.edu.cn/code_CAPO.ashx) algorithms.
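For reference, the preprocessing and hyper-parameter grids described above can be expressed as in the short sketch below (ours, not the actual experiment scripts; the fold-splitting helper is a hypothetical illustration).

```python
import numpy as np

def normalize_instance(x):
    """Per-instance normalization x_t <- x_t / ||x_t||, applied as data arrive."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else x

eta_grid = 2.0 ** np.arange(-10, 11)   # learning rate candidates, eta in 2^[-10:10]
lam_grid = 2.0 ** np.arange(-10, 7)    # regularization candidates, lambda in 2^[-10:6]

def five_fold_indices(n, seed=0):
    """Random permutation split into 5 folds (4 for training, 1 for testing)."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, 5)
```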
TABLE 1
Details of the benchmark machine learning datasets.

datasets         #inst      #dim  T−/T+
glass            214        9     2.057
heart            270        13    1.250
svmguide4        300        10    5.818
liver-disorders  345        6     1.379
balance          625        4     11.755
breast           683        10    1.857
australian       690        14    1.247
diabetes         768        8     1.865
vehicle          846        18    3.251
german           1,000      24    2.333
svmguide3        1,243      22    3.199
a2a              2,265      123   2.959
magic04          19,020     10    1.843
cod-rna          59,535     8     2.000
acoustic         78,823     50    3.316
poker            1,025,010  11    10.000

5.3 Evaluation of AdaOAM on Benchmark Datasets

Table 2 summarizes the average AUC performance of the algorithms under study over the 16 datasets in the online setting. In this table, we use •/◦ to indicate that AdaOAM is significantly better/worse than the corresponding method (pairwise t-tests at 95% significance level).

TABLE 2
AUC performance evaluation (mean±std.) of AdaOAM against other online algorithms on the benchmark datasets. •/◦ indicates that AdaOAM is significantly better/worse than the corresponding method (pairwise t-tests at 95% significance level).

datasets         AdaOAM     OPAUC       OAMseq      OAMgra      online Uni-Log  online Uni-Exp
glass            .816±.058  .804±.059•  .808±.086•  .817±.057◦  .797±.074•      .795±.074•
heart            .912±.029  .910±.029•  .907±.027•  .892±.030•  .906±.030•      .908±.029•
svmguide4        .819±.089  .740±.077•  .782±.051•  .781±.069•  .588±.106•      .583±.101•
liver-disorders  .719±.034  .711±.036•  .704±.047•  .693±.060•  .695±.042•      .698±.040•
balance          .579±.106  .551±.114•  .524±.098•  .508±.078•  .385±.091•      .385±.090•
breast           .992±.005  .992±.006   .989±.008   .992±.006   .992±.006       .992±.006
australian       .927±.016  .926±.016   .915±.024•  .911±.021•  .924±.016       .923±.017
diabetes         .826±.031  .825±.031   .820±.030•  .808±.046•  .824±.032       .824±.032
vehicle          .818±.026  .816±.025   .815±.027   .792±.035•  .770±.031•      .774±.031•
german           .771±.031  .730±.052•  .751±.044•  .730±.033•  .662±.037•      .663±.037•
svmguide3        .734±.038  .724±.038•  .719±.041•  .696±.047•  .674±.041•      .678±.042•
a2a              .873±.019  .872±.021•  .840±.022•  .839±.019•  .862±.019•      .866±.020•
magic04          .798±.007  .765±.009•  .777±.012•  .752±.023•  .754±.008•      .752±.008•
cod-rna          .962±.002  .919±.003•  .942±.005•  .936±.004•  .919±.003•      .919±.003•
acoustic         .894±.002  .888±.002•  .882±.048•  .871±.054•  .768±.069•      .796±.196•
poker            .521±.007  .520±.007   .507±.008•  .508±.016•  .488±.006•      .488±.006•
win/tie/loss                11/5/0      14/2/0      14/1/1      13/3/0          13/3/0

From the results in Table 2, several interesting observations can be drawn. Firstly, the win/tie/loss counts show that AdaOAM is clearly superior to the counterpart algorithms considered for comparison, as it wins in most cases and has zero losses in terms of AUC performance. This indicates that the proposed AdaOAM is the most effective online AUC optimization algorithm among all those considered. Secondly, AdaOAM outperforms the first-order online AUC maximization algorithms, including OPAUC, OAMseq, and OAMgra, thus demonstrating that second-order information can significantly improve the learning efficacy of existing online AUC optimization algorithms. In addition, on the svmguide4, balance scale, and poker hand datasets, the optimization methods based on pairwise loss functions, including AdaOAM, OPAUC, OAMseq, and OAMgra, perform far better than the methods based on univariate loss functions, including Uni-Log and Uni-Exp. This highlights the significance and effectiveness of methods based on pairwise loss function optimization over univariate loss function ones.

To study the efficiency of the proposed AdaOAM algorithm, Figure 1 depicts the running time (in milliseconds) of AdaOAM versus the other online learning algorithms on all 16 benchmark datasets.

Fig. 1. Running time (in milliseconds) of AdaOAM and the other online learning algorithms on all 16 benchmark datasets. Note that the y-axis is presented in log scale.

From the results in Figure 1, it can be observed that the empirical computational cost of AdaOAM is in general comparable to that of the other online learning algorithms, while being more efficient than OAMseq and OAMgra on some of the datasets, such as glass and heart. This indicates that the proposed algorithm is scalable and efficient, making it more attractive for large-scale real-world applications.
Next we move from the online setting to a batch learning mode. In particular, Table 3 summarizes the average AUC performance of the algorithms under comparison over the 16 datasets in a batch setting. Note that here •/◦ is again used to indicate that AdaOAM is significantly better/worse than the corresponding method.

From Table 3, the following observations can be made. Firstly, the win/tie/loss counts show that AdaOAM is superior to batch Uni-Log and batch Uni-Squ in many cases. Since batch Uni-Log and batch Uni-Squ optimize univariate loss functions, the results demonstrate the significance of adopting a pairwise loss function for AUC maximization. Secondly, the performance of AdaOAM is competitive with CAPO, but it underperforms SVM-perf, which is expected since AdaOAM trades efficacy for efficiency.

To analyze the efficiency of the proposed AdaOAM algorithm, we summarize the running time (in milliseconds) of AdaOAM and the other batch learning algorithms on all the benchmark datasets in Figure 2. Since the core steps of SVM-perf and CAPO are implemented in C++ in the toolboxes developed by their respective authors (http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html and http://lamda.nju.edu.cn/code_CAPO.ashx), we have also implemented the core steps of AdaOAM, batch Uni-Log, and batch Uni-Squ in C++ in our work to obtain the time cost comparisons reported here.

Several core observations can be drawn from Figure 2. First of all, the results on time costs highlight the higher efficiency of AdaOAM against the other batch learning algorithms in general. Second, the empirical computational time cost of AdaOAM is significantly lower than that of batch Uni-Log and batch Uni-Squ, which is attributed to the difference between an online setting and a batch setting. Third, CAPO is less efficient than SVM-perf in most cases; this is because CAPO, being an ensemble learning method for AUC optimization, is expected to incur higher training time. Besides, SVM-perf and CAPO require more time than AdaOAM, especially on large datasets, although both of them are designed to be fast algorithms for performance measure optimization tasks. One exception is the "a2a" dataset, on which the time cost of AdaOAM is observed to be higher than that of the CAPO method. The reason is that "a2a" is a highly sparse dataset, which is a clear advantage for CAPO, since it operates under the framework of the Core Vector Machine method [35], a very fast batch algorithm for training SVM models on sparse data. On the other hand, AdaOAM is not optimally designed for sparse data, since it needs to compute the covariance matrices when updating the model, and these matrix computations are not handled efficiently in our C++ implementation.
TABLE 3
AUC performance evaluation (mean±std.) of AdaOAM against other batch algorithms on the benchmark datasets. •/◦ indicates that AdaOAM is significantly better/worse than the corresponding method (pairwise t-tests at 95% significance level).

datasets         AdaOAM     SVM-perf    CAPO        batch Uni-Log  batch Uni-Squ
glass            .816±.058  .822±.060◦  .839±.057◦  .807±.071•     .819±.060
heart            .912±.029  .921±.016◦  .923±.013◦  .900±.034•     .902±.034•
svmguide4        .819±.089  .871±.048◦  .841±.022◦  .596±.100•     .695±.091•
liver-disorders  .719±.034  .722±.084   .729±.045◦  .703±.052•     .693±.057•
balance          .579±.106  .592±.025◦  .620±.010◦  .355±.081•     .517±.206•
breast           .992±.005  .998±.069   .995±.044   .995±.004      .994±.005
australian       .927±.016  .939±.052◦  .915±.030•  .924±.016      .925±.016
diabetes         .826±.031  .836±.020◦  .852±.024◦  .828±.031      .828±.032
vehicle          .818±.026  .820±.034   .782±.029•  .823±.029◦     .798±.029•
german           .771±.031  .777±.043   .783±.033◦  .766±.035•     .734±.034•
svmguide3        .734±.038  .753±.056◦  .752±.055◦  .743±.032◦     .729±.037•
a2a              .873±.019  .881±.012◦  .865±.049•  .878±.017◦     .873±.019
magic04          .798±.007  .792±.031•  .780±.045•  .758±.008•     .757±.008•
cod-rna          .962±.002  .954±.027•  .986±.073◦  .949±.003•     .924±.003•
acoustic         .894±.002  .897±.038   .892±.075   .878±.007•     .865±.006•
poker            .521±.007  .524±.070   .517±.010•  .497±.024•     .496±.006•
win/tie/loss                2/6/8       5/2/9       10/3/3         11/5/0

Fig. 2. Running time (in milliseconds) of AdaOAM and the other batch learning algorithms on all 16 benchmark datasets. Note that the y-axis is presented in log scale.

5.4 Evaluation of Online Performance

Next, we study the online performance of AdaOAM versus the other online learning algorithms and highlight 6 datasets for illustration. Specifically, Figure 3(a)-(h) reports the average AUC performance (across 20 independent runs) of the online model on the testing datasets. From the results, AdaOAM is once again shown to significantly outperform all the other three counterparts in the online learning process, which is consistent with our theoretical analysis that AdaOAM can more effectively exploit second-order information to achieve improved regret bounds and robust performance.

5.5 Evaluation of Parameter Sensitivity

In this subsection, we proceed to examine the parameter sensitivity of the AdaOAM algorithm. In our study, we experimented with AdaOAM using a set of learning rates lying in the wide range η ∈ 2^[−8:4]. The average test AUC results of AdaOAM across this range of learning rates, after a single pass through the training data of the respective benchmark datasets, are summarized in Figure 4(a)-(h). Due to space constraints, the results of 8 datasets are reported here for illustration. Since the AdaOAM algorithm provides a per-feature adaptive learning rate at each iteration, it is less sensitive to the learning rate η than standard SGD.
tion optimization algorithms i.e., Uni-Log and Uni-Exp, which In [7], the authors claimed that OPAUC was insensitive to stresses the benefits and high efficacy of pairwise loss function the parameter settings. From Figure 4, it can be observed that optimizationforAUCmaximizationtask.Inaddition,itisevident AdaOAM is clearly more robust or insensitive to the learning fromtheresultsthatourproposedalgorithmSAdaOAMissuperior rate than the OPAUC. The updating strategy by OPAUC is based to the non-adaptive or non-sparse methods in most cases, which on simple SGD, which usually requires quite some efforts of indicatesourproposedalgorithmwithsparsityispotentiallymore tuningthelearningrateparametersufficiently.Ontheotherhand, effective than existing online AUC maximization algorithms that the adaptive gradient strategy of AdaOAM is theoretically sound donotexploitthesparsityinthedata. for learning rate adaptation since it takes full advantage of the To bring deeper insights on the mechanisms of the proposed historical gradient information available in the learning process. algorithm, we take a further to analyze the sparsity level of the As such, AdaOAM is less sensitive to the parameter settings. finallearnedmodelgeneratedbySAdaOAMandtheotheronline Moreover, AdaOAM exhibits a natural phenomena of decreasing learning algorithms. The sparsity of the learned model plays a learningratewithincreasingiterations. significant role in large-scale machine learning tasks in terms of bothstoragecostandefficiency.Asparselearnedmodelnotonly speeds up the training process, but also reduces the storage cost 5.6 Evaluation of SAdaOAM on High-dimensional requirementsoflarge-scalesystems.Here,wemeasurethesparsity SparseDatasets levelofalearnedmodelbasedontheratioofzeroelementsinthe model and the results of the corresponding online algorithms are In this subsection, we move on to evaluate the empirical per- summarizedinTable6. formance of various online AUC maximization algorithms on Several observations can be drawn from Table 6. To begin the publicly available high-dimensional sparse datasets as sum- with, we found that all the algorithms failed to produce high marized in Table 4. The pcmac dataset is downloaded from SVMLin 8. The farm ads dataset is generated based on the text sparsitylevelsolutionsforonlineAUCmaximizationtaskexcept the proposed SAdaOAM algorithm. In contrast, the SAdaOAM ads found on twelve websites that deal with various farm animal approach gives consideration to both performance and sparsity related topics [36] and downloaded from UCI Machine Learning Repository. The Reuters dataset is the ModApte version 9 upon level for all the cases. In particular, it is found that the higher the feature dimension is, the larger the sparsity is achieved by removing documents with multiple category labels, and contains the proposed algorithm. To sum up, the SAdaOAM algorithm 8293documentsin65categories.Thesector,rcv1m,andnews20 is an ideal method of online AUC optimization for those high- datasets are taken from the LIBSVM dataset website. Note that dimensionalsparsedata. the original sector, Reuters, rcv1m, and news20 are multi-class datasets; in our experiments, we randomly group the multiple classes into two meta-class in which each contains the same 5.7 EvaluationofSAdaOAMonSparsity-AUCTradeoffs numberofclasses. Inthissetofexperiments,westudythetradeoffsbetweenthespar- TABLE4 sitylevelandtheAUCperformancefortheSAdaOAMalgorithm. Detailsofhighsparsedatasets. 
Several observations can be drawn from Table 6. To begin with, we found that all the algorithms failed to produce solutions with a high sparsity level for the online AUC maximization task, except the proposed SAdaOAM algorithm. In contrast, the SAdaOAM approach gives consideration to both performance and sparsity level in all cases. In particular, it is found that the higher the feature dimension, the larger the sparsity achieved by the proposed algorithm. To sum up, the SAdaOAM algorithm is an ideal method of online AUC optimization for high-dimensional sparse data.

5.7 Evaluation of SAdaOAM on Sparsity-AUC Tradeoffs

In this set of experiments, we study the tradeoffs between the sparsity level and the AUC performance for the SAdaOAM algorithm. To achieve this, we set the ℓ1 regularization parameter of the SAdaOAM method over the range θ ∈ 10^[−8:0]. We apply the same experimental setting as above and record the average test AUC performance versus the sparsity level, designated by the proportion of non-zeros in the final weight solution after a single pass through the training data. These settings make the final model solutions range from an almost dense weight vector to a nearly all-zero one. We randomly choose three datasets for this set of experiments and show the results in Figure 5.