Stochastic Optimization with Importance Sampling

Peilin Zhao
Department of Statistics, Rutgers University
Piscataway, NJ 08854, USA
[email protected]

Tong Zhang
Department of Statistics, Rutgers University
Piscataway, NJ 08854, USA
[email protected]

Abstract

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling can guarantee that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve a higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions, both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.

1 Introduction

Stochastic optimization has been extensively studied in the machine learning community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In general, at every step, a traditional stochastic optimization method samples one training example or one dual coordinate uniformly at random from the training data, and then updates the model parameter using the sampled example or dual coordinate. Although uniform sampling simplifies the analysis, it is insufficient because it may introduce a very high variance of the sampled quantity, which will negatively affect the convergence rate of the resulting optimization procedure. In this paper we study stochastic optimization with importance sampling, which reduces the stochastic variance to significantly improve the convergence rate. Specifically, this paper focuses on importance sampling techniques for Proximal Stochastic Gradient Descent (prox-SGD) (actually, the more general proximal stochastic mirror descent) [4, 14] and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA) [13].

For prox-SGD, traditional algorithms such as Stochastic Gradient Descent (SGD) sample training examples uniformly at random during the entire learning process, so that the stochastic gradient is an unbiased estimate of the true gradient [1, 2, 3, 4]. However, the variance of the resulting stochastic gradient estimator may be very high, since the stochastic gradient can vary significantly over different examples. In order to improve convergence, this paper proposes a sampling distribution and the corresponding unbiased importance-weighted gradient estimator that achieves minimal variance. To this end, we analyze the relation between the variance of the stochastic gradient and the sampling distribution. We show that to minimize the variance, the optimal sampling distribution should be roughly proportional to the norm of the stochastic gradient.
To simplify computation, we also consider the use of upper bounds on these norms. Our theoretical analysis shows that under certain conditions the proposed sampling method can significantly improve the convergence rate, and our results include the existing theoretical results for uniformly sampled prox-SGD and SGD as special cases.

Similarly for prox-SDCA, traditional approaches such as Stochastic Dual Coordinate Ascent (SDCA) [12] pick a coordinate to update by sampling the training data uniformly at random [5, 6, 7, 8, 9, 10, 11, 12, 13]. It was shown recently that the SDCA and prox-SDCA algorithms with uniform random sampling converge much faster than with a fixed cyclic ordering [12, 13]. However, this paper shows that if we employ an appropriately defined importance sampling strategy, the convergence can be further improved. To find the optimal sampling distribution, we analyze the connection between the expected increase of the dual objective and the sampling distribution, and obtain the optimal solution, which depends on the smoothness parameters of the loss functions. Our analysis shows that under certain conditions the proposed sampling method can significantly improve the convergence rate. In addition, our theoretical results include the existing results for uniformly sampled prox-SDCA and SDCA as special cases.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents some preliminaries. Section 4 studies stochastic optimization with importance sampling. Section 5 lists several applications of the proposed algorithms. Section 6 gives our empirical evaluations. Section 7 concludes the paper.

2 Related Work

We review some related work on Proximal Stochastic Gradient Descent (including the more general proximal stochastic mirror descent) and Proximal Stochastic Dual Coordinate Ascent.

In recent years Proximal Stochastic Gradient Descent has been extensively studied [4, 14]. As a special case of prox-SGD, Stochastic Gradient Descent has been extensively studied in stochastic approximation theory [15]; however, these results are often asymptotic, so there is no explicit bound in terms of $T$. Later on, the finite-sample convergence rate of SGD for solving linear prediction problems was studied by a number of authors [1, 16]. In general, prox-SGD can achieve a convergence rate of $O(1/\sqrt{T})$ for convex loss functions, and a convergence rate of $O(\log T/T)$ for strongly convex loss functions, where $T$ is the number of iterations of the algorithm. More recently, researchers have improved the previous bound to $O(1/T)$ by $\alpha$-suffix averaging [2]: instead of returning the average of the entire sequence of classifiers, the algorithm averages and returns just an $\alpha$-suffix, i.e., the average of the last $\alpha$ fraction of the whole sequence of classifiers. In practice it may be difficult for users to decide when to start computing the $\alpha$-suffix. To solve this issue, a polynomial decay averaging strategy was proposed by [3], which decays the weights of old individual classifiers polynomially and also guarantees an $O(1/T)$ convergence bound.

For Proximal Stochastic Dual Coordinate Ascent [13], Shalev-Shwartz and Zhang recently proved that the algorithm achieves a convergence rate of $O(1/T)$ for Lipschitz loss functions, and enjoys a linear convergence rate of $O(\exp(-O(T)))$ for smooth loss functions. For structural SVM, a similar result was also obtained in [9]. Several other researchers [6, 7] have studied the convergence behavior of the related non-randomized DCA (dual coordinate ascent) algorithm for SVM, but could only obtain weaker convergence results.
The related randomized coordinate descent method has been investigated by some other authors [8, 10, 17]. However, when applied to SDCA, that analysis only yields a convergence rate for the dual objective value, while in machine learning applications we are mainly interested in the convergence of the primal objective. Recently, Shalev-Shwartz and Zhang resolved this issue by providing a primal-dual analysis that showed a linear convergence rate $O(\exp(-O(T)))$ of the duality gap for SDCA with smooth loss functions [12].

Although both prox-SGD and prox-SDCA have been extensively studied, most of the existing work only considered the uniform sampling scheme during the entire learning process. Recently, we noticed that Needell et al. [18] considered importance sampling for stochastic gradient descent, where they suggested similar or identical sampling distributions. Strohmer and Vershynin [19] proposed a variant of the Kaczmarz method (an iterative method for solving systems of linear equations) which selects rows with probability proportional to their squared norm; it has been pointed out that this algorithm is actually an SGD algorithm with importance sampling [18]. However, we study importance sampling for more general composite objectives and a more general proximal stochastic gradient descent, i.e., proximal stochastic mirror descent, which covers their algorithms as special cases. Furthermore, we also study prox-SDCA with importance sampling, which is not covered by their work. In addition, Xiao and Zhang [20] have proposed a proximal stochastic gradient method with progressive variance reduction, where they also provide an importance sampling strategy for smooth loss functions only, which is the same as ours. Because our analysis is based on the basic version of stochastic gradient (mirror) descent, the convergence rate is worse than the linear rates of SAG [21] and SVRG [20] for smooth strongly convex objective functions. However, our main concern is the effectiveness of importance sampling, which could be applied to many other gradient-based algorithms.

We shall mention that for coordinate descent, some researchers have recently considered non-uniform sampling strategies [22, 23], but their results cannot be directly applied to the proximal SDCA in which we are interested here. The reasons are several-fold. First, the primal-dual analysis of prox-SDCA in this paper is analogous to that of [12], which directly implies a convergence rate for the duality gap; the proof techniques rely on the structure of regularized loss minimization and cannot be applied to general primal coordinate descent. Second, the suggested distribution for primal coordinate descent is proportional to the smoothness constant of each coordinate, while the distribution for prox-SDCA is proportional to a constant plus the smoothness constant of the individual primal loss function, which is the inverse of the strong convexity constant of the corresponding dual coordinate. These two distributions are quite different. In addition, we also provide an importance sampling distribution for the case where the individual loss functions are Lipschitz.

We also noticed that a mini-batch SDCA [24] and an accelerated version of prox-SDCA [25] were studied recently by Shalev-Shwartz and Zhang. The accelerated version in [25] uses an inner-outer-iteration strategy, where the inner iteration is the standard prox-SDCA procedure. Therefore the importance sampling results of this paper can be directly applied to the accelerated prox-SDCA, because the convergence of the inner iteration with importance sampling is faster than that with uniform sampling.
Therefore in this paper we will only focus on showing the effectiveness of importance sampling for the unaccelerated prox-SDCA.

Related to this paper, non-uniform sampling in the online setting is connected to selective sampling, which can be regarded as a form of online active learning and has been extensively studied in the literature [26, 27, 28, 29, 30]. Similar to importance sampling in stochastic optimization, selective sampling also works in iterations. However, the purposes are quite different. Specifically, selective sampling draws unlabeled instances uniformly at random from a fixed distribution and decides which samples to label; the goal is to reduce the number of labels needed to achieve a certain accuracy. The importance sampling considered in this paper does not reduce the number of labels needed; its goal is to reduce the training time.

3 Preliminaries

Here, we briefly introduce some key definitions and propositions that are useful throughout the paper (for details, please refer to [31]). We consider vector functions $\phi: \mathbb{R}^d \to \mathbb{R}$.

Definition 1. For $\sigma \ge 0$, a function $\phi: \mathbb{R}^d \to \mathbb{R}$ is $\sigma$-strongly convex with respect to (w.r.t.) a norm $\|\cdot\|$ if, for all $u, v \in \mathbb{R}^d$,
$$\phi(u) \ge \phi(v) + \nabla\phi(v)^\top (u - v) + \frac{\sigma}{2}\|u - v\|^2,$$
or equivalently, for all $s \in [0, 1]$,
$$\phi(su + (1-s)v) \le s\phi(u) + (1-s)\phi(v) - \frac{\sigma s(1-s)}{2}\|u - v\|^2.$$
For example, $\phi(w) = \frac{1}{2}\|w\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$.

Definition 2. A function $\phi: \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz w.r.t. a norm $\|\cdot\|$ if, for all $u, v \in \mathbb{R}^d$,
$$|\phi(u) - \phi(v)| \le L\|u - v\|.$$

Definition 3. A function $\phi: \mathbb{R}^d \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient is $(1/\gamma)$-Lipschitz, or, equivalently, for all $u, v \in \mathbb{R}^d$,
$$\phi(u) \le \phi(v) + \nabla\phi(v)^\top (u - v) + \frac{1}{2\gamma}\|u - v\|^2.$$
For example, $\phi(w) = \frac{1}{2}\|w\|_2^2$ is 1-smooth w.r.t. $\|\cdot\|_2$.

Proposition 1. If $\phi$ is $(1/\gamma)$-smooth with respect to a norm $\|\cdot\|_P$, then its dual function $\phi^*$ is $\gamma$-strongly convex with respect to the dual norm $\|\cdot\|_D$, where
$$\phi^*(v) = \sup_w \left(v^\top w - \phi(w)\right),$$
and the dual norm is defined as $\|v\|_D = \sup_{\|w\|_P = 1} v^\top w$.
For example, the dual norm of $\|\cdot\|_2$ is itself; the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$; and the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$, where $1/q + 1/p = 1$.

Definition 4. Let $\psi: \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable, real-valued, and strictly convex function. Then the Bregman divergence associated with $\psi$ is
$$B_\psi(u, v) = \psi(u) - \psi(v) - \langle\nabla\psi(v), u - v\rangle,$$
which is the difference between the value of $\psi$ at $u$ and the value of the first-order Taylor expansion of $\psi$ around $v$ evaluated at $u$. Throughout, $\psi$ denotes a continuously differentiable function that is $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$, so that $B_\psi(u, v) \ge \frac{\sigma}{2}\|u - v\|^2$.

Definition 5. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex with respect to a differentiable function $\psi$ if, for any $u, v$,
$$f(u) \ge f(v) + \langle\nabla f(v), u - v\rangle + \mu B_\psi(u, v).$$
For example, when $\psi(w) = \frac{1}{2}\|w\|_2^2$, we recover the usual definition of strong convexity.

Definition 6. A function $f: \mathbb{R}^d \to \mathbb{R}$ is $(1/\gamma)$-smooth with respect to a differentiable function $\psi$ if, for any $u, v$,
$$f(u) \le f(v) + \langle\nabla f(v), u - v\rangle + (1/\gamma) B_\psi(u, v).$$

4 Stochastic Optimization with Importance Sampling

We consider the following generic optimization problem associated with regularized loss minimization of linear predictors. Let $\phi_1, \phi_2, \ldots, \phi_n$ be $n$ vector functions from $\mathbb{R}^d$ to $\mathbb{R}$. Our goal is to find an approximate solution of the following optimization problem:
$$\min_{w \in \mathbb{R}^d} P(w) := \underbrace{\frac{1}{n}\sum_{i=1}^n \phi_i(w)}_{f(w)} + \lambda r(w), \quad (1)$$
where $\lambda > 0$ is a regularization parameter and $r$ is a regularizer.
For example, given examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the Support Vector Machine problem is obtained by setting $\phi_i(w) = [1 - y_i x_i^\top w]_+$, where $[z]_+ = \max(0, z)$, and $r(w) = \frac{1}{2}\|w\|_2^2$. Regression problems also fall into the above formulation. For example, ridge regression is obtained by setting $\phi_i(w) = (y_i - x_i^\top w)^2$ and $r(w) = \frac{1}{2}\|w\|_2^2$; lasso is obtained by setting $\phi_i(w) = (y_i - x_i^\top w)^2$ and $r(w) = \|w\|_1$.

Let $w_*$ be the optimum of (1). We say that a solution $w$ is $\epsilon_P$-sub-optimal if $P(w) - P(w_*) \le \epsilon_P$. We analyze the convergence rates of the proposed algorithms with respect to the number of iterations.

4.1 Proximal Stochastic Gradient Descent with Importance Sampling

In this subsection, we consider proximal stochastic mirror descent with importance sampling. Because proximal stochastic mirror descent is a general version of proximal stochastic gradient descent (prox-SGD), we will abuse the term SGD to refer to stochastic mirror descent.

If we directly apply full or stochastic gradient descent to the optimization problem (1), the solution may not satisfy some desirable properties. For example, when $r(w) = \|w\|_1$, the optimal solution of the problem (1) should be sparse, and we would like the approximate solution to be sparse as well. However, if we directly use stochastic (sub-)gradient descent, then the resulting solution will not achieve sparsity [4].

To effectively and efficiently solve the optimization problem (1), a well-known method is proximal stochastic (sub-)gradient descent. Specifically, Proximal Stochastic Gradient Descent works in iterations. At each iteration $t = 1, 2, \ldots$, an index $i_t$ is drawn uniformly at random from $\{1, 2, \ldots, n\}$, and the iterative solution is updated according to the formula
$$w^{t+1} = \arg\min_w \left[ \langle \nabla\phi_{i_t}(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t} B_\psi(w, w^t) \right], \quad (2)$$
where $B_\psi$ is a Bregman divergence and $\nabla\phi_{i_t}(w^t)$ denotes an arbitrary (sub-)gradient of $\phi_{i_t}$. Intuitively, this method works by minimizing a first-order approximation of the function $\phi_{i_t}$ at the current iterate $w^t$ plus the regularizer $\lambda r(w)$, while forcing the next iterate $w^{t+1}$ to lie close to $w^t$. The step size $\eta_t$ controls the trade-off between these two objectives. Because the expectation of $\nabla\phi_{i_t}(w^t)$ equals $\nabla f(w^t)$, i.e., $\mathbb{E}[\nabla\phi_{i_t}(w^t) \mid w^t] = \frac{1}{n}\sum_{i=1}^n \nabla\phi_i(w^t) = \nabla f(w^t)$, the optimization problem (2) is an unbiased estimate of that for proximal gradient descent.

We assume that the exact solution of the above optimization (2) can be computed efficiently. For example, when $\psi(w) = \frac{1}{2}\|w\|_2^2$, we have $B_\psi(u, v) = \frac{1}{2}\|u - v\|_2^2$, and the above optimization produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\left(w^t - \eta_t \nabla\phi_{i_t}(w^t)\right),$$
where $\mathrm{prox}_h(x) = \arg\min_w \left( h(w) + \frac{1}{2}\|w - x\|_2^2 \right)$. Furthermore, it is also assumed that the proximal mapping of $\eta_t\lambda r(w)$, i.e., $\mathrm{prox}_{\eta_t\lambda r}(x)$, is easy to compute. For example, when $r(w) = \|w\|_1$, the proximal mapping of $\lambda r(w)$ is the following shrinkage operation:
$$\mathrm{prox}_{\lambda r}(x) = \mathrm{sign}(x) \odot [|x| - \lambda]_+,$$
where $\odot$ is the element-wise product; this can be computed in time $O(d)$.

The advantage of proximal stochastic gradient descent is that each step relies only on a single derivative $\nabla\phi_{i_t}(\cdot)$, and thus the computational cost is $1/n$ of that of standard proximal gradient descent. However, a disadvantage of the method is that the randomness introduces variance: $\nabla\phi_{i_t}(w^t)$ equals the gradient $\nabla f(w^t)$ in expectation, but $\nabla\phi_i(w^t)$ varies with $i$. In particular, if the stochastic gradient has a large variance, then the convergence will become slow.
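To make the above concrete, here is a minimal sketch of one uniform-sampling prox-SGD step for the lasso instance above, with $\psi(w) = \frac{1}{2}\|w\|_2^2$ so that the update reduces to a gradient step followed by soft-thresholding. The function names, step sizes, and synthetic data below are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prox_l1(x, tau):
    # Soft-thresholding: the proximal mapping of tau * ||.||_1, computed in O(d).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_sgd_step(w, X, y, lam, eta, rng):
    # One uniform-sampling prox-SGD step (2) with psi(w) = 0.5 * ||w||_2^2,
    # squared loss phi_i(w) = (y_i - x_i^T w)^2, and r(w) = ||w||_1 (lasso).
    i = rng.integers(len(y))                     # uniform sampling
    grad_i = -2.0 * (y[i] - X[i] @ w) * X[i]     # gradient of phi_i at w^t
    return prox_l1(w - eta * grad_i, eta * lam)  # gradient step, then prox

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = 1.0
y = X @ w_true
w = np.zeros(20)
for t in range(1, 1001):
    w = prox_sgd_step(w, X, y, lam=0.1, eta=0.1 / t, rng=rng)
```

Note that the prox step, rather than plain subgradient descent on the composite objective, is what keeps the iterates sparse.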
Now, we would like to study prox-SGD with importance sampling to reduce the variance of the stochastic gradient. The idea of importance sampling is, at the $t$-th step, to assign each $i \in \{1, \ldots, n\}$ a probability $p_i^t \ge 0$ such that $\sum_{i=1}^n p_i^t = 1$. We then sample $i_t$ from $\{1, \ldots, n\}$ based on the probability vector $p^t = (p_1^t, \ldots, p_n^t)^\top$. If we adopt this distribution, then proximal SGD with importance sampling works as follows:
$$w^{t+1} = \arg\min_w \left[ \langle (np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t} B_\psi(w, w^t) \right], \quad (3)$$
which is another unbiased estimate of the optimization problem for proximal gradient descent, because $\mathbb{E}[(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) \mid w^t] = \sum_{i=1}^n p_i^t (np_i^t)^{-1} \nabla\phi_i(w^t) = \nabla f(w^t)$.

Similarly, if $\psi(w) = \frac{1}{2}\|w\|_2^2$, proximal SGD with importance sampling produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\left(w^t - \eta_t (np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\right).$$
In addition, setting the derivative of the objective in (3) to zero, we obtain the following implicit update rule for the iterative solution:
$$\nabla\psi(w^{t+1}) = \nabla\psi(w^t) - \eta_t (np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \eta_t\lambda\,\partial r(w^{t+1}),$$
where $\partial r(w^{t+1})$ is a subgradient.

Now the key question is which $p^t$ can optimally reduce the variance of the stochastic gradient. To answer this question, we first prove a lemma that indicates the relationship between $p^t$ and the convergence rate of prox-SGD with importance sampling.

Lemma 1. Let $w^{t+1}$ be defined by the update (3). Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, and that $f$ is $\mu$-strongly convex and $(1/\gamma)$-smooth with respect to $\psi$. If $r(w)$ is convex and $\eta_t \in (0, \gamma]$, then $w^{t+1}$ satisfies the following inequality for any $t \ge 1$:
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*, w^t) - B_\psi(w_*, w^{t+1})] - \mu\,\mathbb{E}\,B_\psi(w_*, w^t) + \frac{\eta_t}{\sigma}\mathbb{E}\,\mathbb{V}\left((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\right),$$
where the variance is defined as $\mathbb{V}((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)) = \mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2$, and the expectation is taken with respect to the distribution $p^t$.

Proof. To simplify notation, we denote $g_t = (np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)$. Because $f(w)$ is $\mu$-strongly convex w.r.t. $\psi$ and $r(w)$ is convex, we can derive
$$P(w_*) \ge f(w^t) + \langle\nabla f(w^t), w_* - w^t\rangle + \mu B_\psi(w_*, w^t) + \lambda r(w^{t+1}) + \lambda\langle\partial r(w^{t+1}), w_* - w^{t+1}\rangle.$$
Using the fact that $f$ is $(1/\gamma)$-smooth w.r.t. $\psi$, we can further lower bound $f(w^t)$ by
$$f(w^t) \ge f(w^{t+1}) - \langle\nabla f(w^t), w^{t+1} - w^t\rangle - (1/\gamma)B_\psi(w^{t+1}, w^t).$$
Combining the above two inequalities, we have
$$P(w_*) \ge P(w^{t+1}) + \langle\nabla f(w^t) + \lambda\partial r(w^{t+1}), w_* - w^{t+1}\rangle + \mu B_\psi(w_*, w^t) - (1/\gamma)B_\psi(w^{t+1}, w^t).$$
Considering the second term on the right-hand side, we have
$$\langle\nabla f(w^t) + \lambda\partial r(w^{t+1}), w_* - w^{t+1}\rangle = \langle\nabla f(w^t) + [\nabla\psi(w^t) - \nabla\psi(w^{t+1})]/\eta_t - g_t, w_* - w^{t+1}\rangle = \frac{1}{\eta_t}\langle\nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1}\rangle + \langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle.$$
Combining the above two inequalities, we get
$$P(w_*) - P(w^{t+1}) - \mu B_\psi(w_*, w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle \ge \langle\nabla f(w^t) + \lambda\partial r(w^{t+1}), w_* - w^{t+1}\rangle - (1/\gamma)B_\psi(w^{t+1}, w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle = \frac{1}{\eta_t}\langle\nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1}\rangle - (1/\gamma)B_\psi(w^{t+1}, w^t).$$
Plugging the following equality (Lemma 11.1 from [32])
$$B_\psi(w_*, w^{t+1}) + B_\psi(w^{t+1}, w^t) - B_\psi(w_*, w^t) = \langle\nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1}\rangle$$
into the previous inequality gives
$$P(w_*) - P(w^{t+1}) - \mu B_\psi(w_*, w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle \ge \frac{1}{\eta_t}\left[B_\psi(w_*, w^{t+1}) + B_\psi(w^{t+1}, w^t) - B_\psi(w_*, w^t)\right] - (1/\gamma)B_\psi(w^{t+1}, w^t) \ge \frac{1}{\eta_t}\left[B_\psi(w_*, w^{t+1}) - B_\psi(w_*, w^t)\right],$$
where $\eta_t \in (0, \gamma]$ is used for the final inequality. Re-arranging the above inequality and taking expectations on both sides results in
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*, w^t) - B_\psi(w_*, w^{t+1})] - \mu\,\mathbb{E}\,B_\psi(w_*, w^t) - \mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle.$$
To upper bound the last inner product term on the right-hand side, we define the proximal full gradient update as $\hat{w}^{t+1} = \arg\min_w \left[\langle\nabla f(w^t), w\rangle + \lambda r(w) + \frac{1}{\eta_t}B_\psi(w, w^t)\right]$, which is independent of $g_t$. Then we can bound $-\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle$ as follows:
$$-\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_*\rangle = -\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - \hat{w}^{t+1}\rangle - \mathbb{E}\langle g_t - \nabla f(w^t), \hat{w}^{t+1} - w_*\rangle \le \mathbb{E}\|g_t - \nabla f(w^t)\|_*\,\|w^{t+1} - \hat{w}^{t+1}\| - \mathbb{E}\langle g_t - \nabla f(w^t), \hat{w}^{t+1} - w_*\rangle \le \mathbb{E}\frac{\eta_t}{\sigma}\|g_t - \nabla f(w^t)\|_*^2 - \mathbb{E}\langle g_t - \nabla f(w^t), \hat{w}^{t+1} - w_*\rangle = \mathbb{E}\frac{\eta_t}{\sigma}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2 = \frac{\eta_t}{\sigma}\mathbb{V}\left((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\right),$$
where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality is due to Lemma 3, and the last equality holds because $\mathbb{E}[\langle g_t - \nabla f(w^t), \hat{w}^{t+1} - w_*\rangle \mid w^t] = 0$. Finally, plugging the above inequality into the previous one concludes the proof of this lemma.

From the above analysis, we observe that the smaller the variance, the greater the reduction in the objective value. In the next subsection, we study how to adopt importance sampling to reduce the variance; this observation will be made more rigorous below.

4.1.1 Algorithm

According to the result in Lemma 1, to maximize the reduction in the objective value, we should choose $p^t$ as the solution of the following optimization problem:
$$\min_{p^t:\, p_i^t\in[0,1],\ \sum_{i=1}^n p_i^t = 1} \mathbb{V}\left((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\right) \iff \min_{p^t:\, p_i^t\in[0,1],\ \sum_{i=1}^n p_i^t = 1} \frac{1}{n^2}\sum_{i=1}^n (p_i^t)^{-1}\|\nabla\phi_i(w^t)\|_*^2. \quad (4)$$
It is easy to verify that the solution of the above optimization problem is
$$p_i^t = \frac{\|\nabla\phi_i(w^t)\|_*}{\sum_{j=1}^n \|\nabla\phi_j(w^t)\|_*}, \quad \forall i \in \{1, 2, \ldots, n\}. \quad (5)$$
Although this distribution minimizes the variance of the $t$-th stochastic gradient, it requires calculating $n$ derivatives at each step, which is clearly inefficient. To address this issue, a potential solution is to calculate the $n$ derivatives only occasionally and then reuse them for many steps; in addition, since the true derivatives change at every step, it may be better to add a smoothing parameter to the sampling distribution. However, this solution can still be inefficient. Another, more practical solution is to relax the previous optimization (4) as follows:
$$\min_{p^t:\, p_i^t\in[0,1],\ \sum_{i=1}^n p_i^t = 1} \frac{1}{n^2}\sum_{i=1}^n (p_i^t)^{-1}\|\nabla\phi_i(w^t)\|_*^2 \le \min_{p^t:\, p_i^t\in[0,1],\ \sum_{i=1}^n p_i^t = 1} \frac{1}{n^2}\sum_{i=1}^n (p_i^t)^{-1}G_i^2 \quad (6)$$
by introducing $G_i \ge \|\nabla\phi_i(w^t)\|_*$ for all $t$. Then we can approximate the distribution in equation (5) by solving the right-hand side of inequality (6), which gives
$$p_i^t = \frac{G_i}{\sum_{j=1}^n G_j}, \quad \forall i \in \{1, 2, \ldots, n\},$$
which is independent of $t$.

Based on the above solution, we suggest distributions for two kinds of loss functions: Lipschitz functions and smooth functions. First, if $\phi_i(w)$ is $L_i$-Lipschitz w.r.t. $\|\cdot\|$, then $\|\nabla\phi_i(w)\|_* \le L_i$ for any $w \in \mathbb{R}^d$, and the suggested distribution is
$$p_i^t = \frac{L_i}{\sum_{j=1}^n L_j}, \quad \forall i \in \{1, 2, \ldots, n\}.$$
Second, if $\phi_i(w)$ is $(1/\gamma_i)$-smooth and $\|w^t\| \le R$ for all $t$, then $\|\nabla\phi_i(w^t)\|_* \le R/\gamma_i$, and the advised distribution is
$$p_i^t = \frac{1/\gamma_i}{\sum_{j=1}^n 1/\gamma_j}, \quad \forall i \in \{1, 2, \ldots, n\}.$$
Finally, we summarize the proposed proximal SGD with importance sampling in Algorithm 1.

Algorithm 1 Proximal Stochastic Gradient Descent with Importance Sampling (Iprox-SGD)
  Input: $\lambda \ge 0$, learning rates $\eta_1, \ldots, \eta_T > 0$.
  Initialize: $w^1 = 0$, $p^1 = (1/n, \ldots, 1/n)^\top$.
  for $t = 1, \ldots, T$ do
    Update $p^t$;
    Sample $i_t$ from $\{1, \ldots, n\}$ based on $p^t$;
    Update $w^{t+1} = \arg\min_w \left[\langle(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t), w\rangle + \lambda r(w) + \frac{1}{\eta_t}B_\psi(w, w^t)\right]$;
  end for
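As a concrete instance of Algorithm 1, the sketch below instantiates the Lipschitz case for the $\ell_1$-regularized hinge loss with $\psi(w) = \frac{1}{2}\|w\|_2^2$: each $\phi_i(w) = [1 - y_i x_i^\top w]_+$ is $\|x_i\|_2$-Lipschitz, so we may take $G_i = \|x_i\|_2$ and $p_i = G_i / \sum_j G_j$, fixed over all iterations. The helper names, step-size schedule, and data are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def prox_l1(x, tau):
    # Soft-thresholding: the proximal mapping of tau * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def iprox_sgd(X, y, lam, T, rng):
    # Iprox-SGD (Algorithm 1 sketch) for the L1-regularized hinge loss.
    # phi_i is ||x_i||_2-Lipschitz, so G_i = ||x_i||_2 and p_i = G_i / sum_j G_j.
    n, d = X.shape
    G = np.linalg.norm(X, axis=1)
    p = G / G.sum()                        # importance sampling distribution
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.choice(n, p=p)             # sample i_t from p
        # A subgradient of the hinge loss phi_i at w.
        g = -y[i] * X[i] if y[i] * (X[i] @ w) < 1 else np.zeros(d)
        g = g / (n * p[i])                 # importance weight keeps the estimate unbiased
        eta = 1.0 / np.sqrt(t)             # illustrative step-size schedule
        w = prox_l1(w - eta * g, eta * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(X @ rng.standard_normal(10))
w = iprox_sgd(X, y, lam=0.01, T=2000, rng=rng)
```

Since the $G_i$ are fixed upper bounds, the distribution is computed once; the per-iteration cost then matches uniform prox-SGD up to the cost of sampling from a fixed non-uniform discrete distribution, which can be made $O(1)$ per draw with a precomputed alias table.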
4.1.2 Analysis

This section provides a convergence analysis of the proposed algorithm. Before presenting the results, we make some general assumptions: $r(0) = 0$ and $r(w) \ge 0$ for all $w$. It is easy to see that these two assumptions are generally satisfied by all the well-known regularizers. Under the above assumptions, we first prove a convergence result for proximal SGD with importance sampling using the previous Lemma 1.

Theorem 1. Let $w^t$ be generated by the proposed algorithm. Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, and that $f$ is $\mu$-strongly convex and $(1/\gamma)$-smooth with respect to $\psi$. If $r(w)$ is convex and $\eta_t = \frac{1}{\alpha + \mu t}$ with $\alpha \ge 1/\gamma - \mu$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le \frac{1}{T}\left[\alpha B_\psi(w_*, w^1) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)}\right], \quad (7)$$
where the variance is defined as $\mathbb{V}_t = \mathbb{V}[(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)] = \mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2$, and the expectation is taken with respect to the distribution $p^t$.

Proof. First, it is easy to check that $\eta_t \in (0, \gamma]$. Because the functions $\psi$, $f$, $r$ satisfy the assumptions of Lemma 1, we have
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*, w^t) - B_\psi(w_*, w^{t+1})] - \mu\,\mathbb{E}\,B_\psi(w_*, w^t) + \frac{\eta_t}{\sigma}\mathbb{E}\,\mathbb{V}\left((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\right).$$
Summing the above inequality over $t = 1, \ldots, T$ and using $\eta_t = 1/(\alpha + \mu t)$, we get
$$\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - \sum_{t=1}^T P(w_*) \le \sum_{t=1}^T (\alpha + \mu t)\,\mathbb{E}\left[B_\psi(w_*, w^t) - B_\psi(w_*, w^{t+1})\right] - \mu\sum_{t=1}^T \mathbb{E}\,B_\psi(w_*, w^t) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} = \alpha B_\psi(w_*, w^1) - (\alpha + \mu T)B_\psi(w_*, w^{T+1}) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} \le \alpha B_\psi(w_*, w^1) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)}.$$
Dividing both sides of the above inequality by $T$ concludes the proof.

Corollary 1. Under the same assumptions as in Theorem 1, if we further assume that $\phi_i(w)$ is $(1/\gamma_i)$-smooth, $\|w^t\| \le R$ for all $t$, and the distribution is set as $p_i^t = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le \frac{1}{T}\left[\alpha B_\psi(w_*, w^1) + \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sigma\mu n^2}\left(\frac{\mu}{\alpha + \mu} + \ln(\alpha + \mu T) - \ln(\alpha + \mu)\right)\right] = O\!\left(\frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sigma\mu n^2}\,\frac{\ln(\alpha + \mu T)}{T}\right).$$
In addition, if $\mu = 0$, the above bound is invalid; however, if $\eta_t$ is set as $\sqrt{\sigma B_\psi(w_*, w^1)}\big/\big(\sqrt{T}\,\tfrac{\sum_{i=1}^n R/\gamma_i}{n}\big)$, we can prove the following inequality for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le 2\sqrt{\frac{B_\psi(w_*, w^1)}{\sigma}}\,\frac{\sum_{i=1}^n R/\gamma_i}{n}\,\frac{1}{\sqrt{T}}.$$

Remark: If $\psi(w) = \frac{1}{2}\|w\|_2^2$ and $r(w) = 0$, then $B_\psi(u, v) = \frac{1}{2}\|u - v\|_2^2$, and the proposed algorithm becomes SGD with importance sampling. Under these assumptions, it is achievable to remove the $\ln T$ factor in the convergence bound when the objective function is strongly convex; however, we will not provide the details, for concision. For more general Bregman divergences, it is difficult to remove this $\ln T$ factor, because many properties of $\frac{1}{2}\|u - v\|_2^2$, such as symmetry, are not satisfied by the Bregman divergence. In addition, it is easy to derive a high-probability bound using existing work, such as Theorem 8 of [14]. In that theorem, the high-probability bound depends on the variance of the stochastic gradient, so our sampling strategy can improve that bound, since we are minimizing the variance. However, we do not explicitly provide the resulting bounds, because the derivation is relatively straightforward.

Proof. First, the fact that $\phi_i(w)$ is $(1/\gamma_i)$-smooth and $\|w^t\| \le R$ for all $t$ implies $\|\nabla\phi_i(w^t)\|_* \le R/\gamma_i$. Using this result and the distribution $p_i^t = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, we get
$$\mathbb{V}_t = \mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2 \le \mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|_*^2 = \frac{1}{n^2}\sum_{i=1}^n \frac{1}{p_i^t}\|\nabla\phi_i(w^t)\|_*^2 \le \left(\frac{\sum_{i=1}^n R/\gamma_i}{n}\right)^2,$$
where the first inequality is due to $\mathbb{E}\|z - \mathbb{E}z\|^2 = \mathbb{E}\|z\|^2 - \|\mathbb{E}z\|^2$.
Using the above inequality gives
$$\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} \le \left(\frac{\sum_{i=1}^n R/\gamma_i}{n}\right)^2 \sum_{t=1}^T \frac{1}{\sigma(\alpha + \mu t)} \le \left(\frac{\sum_{i=1}^n R/\gamma_i}{n}\right)^2 \frac{1}{\sigma}\left[\frac{1}{\alpha + \mu} + \int_{t=1}^T \frac{dt}{\alpha + \mu t}\right] \le \left(\frac{\sum_{i=1}^n R/\gamma_i}{n}\right)^2 \frac{1}{\sigma\mu}\left[\frac{\mu}{\alpha + \mu} + \ln(\alpha + \mu T) - \ln(\alpha + \mu)\right].$$
Plugging the above inequality into inequality (7) concludes the proof of the first part. To prove the second part, we plug the bound on $\mathbb{V}_t$ and the equality $\eta_t = \sqrt{\sigma B_\psi(w_*, w^1)}\big/\big(\sqrt{T}\,\tfrac{\sum_{i=1}^n R/\gamma_i}{n}\big)$ into inequality (7).

Remark. If the uniform distribution is adopted, it is easy to observe that $\mathbb{V}_t$ is bounded by $\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n}$, and Theorem 1 results in $\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le O\!\left(\frac{\sum_{i=1}^n (R/\gamma_i)^2}{\sigma\mu n}\frac{\ln(\alpha + \mu T)}{T}\right)$ for strongly convex $f$, and $\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le 2\sqrt{\frac{B_\psi(w_*, w^1)}{\sigma}\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n}}\frac{1}{\sqrt{T}}$ for general convex $f$. However, according to the Cauchy-Schwarz inequality,
$$\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n} \bigg/ \left(\frac{\sum_{i=1}^n R/\gamma_i}{n}\right)^2 = \frac{n\sum_{i=1}^n (R/\gamma_i)^2}{(\sum_{i=1}^n R/\gamma_i)^2} \ge 1,$$
which implies that importance sampling does improve the convergence rate, especially when $\frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sum_{i=1}^n (R/\gamma_i)^2} \ll n$.

Theorem 2. Let $w^t$ be generated by the proposed algorithm. Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, $f$ is convex, and $r(w)$ is 1-strongly convex. If $\eta_t$ is set as $1/(\lambda t)$ for all $t$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^t) - P(w_*) \le \frac{1}{T}\left[\lambda B_\psi(w_*, w^1) + \frac{1}{\lambda\sigma}\sum_{t=1}^T \frac{1}{t}\mathbb{E}\left\|\frac{\nabla\phi_{i_t}(w^t)}{np_{i_t}^t}\right\|_*^2\right], \quad (8)$$
where the expectation is taken with respect to the distribution $p^t$.

Proof. The fact that $r(w)$ is 1-strongly convex implies that $\lambda r(w)$ is $\lambda$-strongly convex. Then all the assumptions of Corollary 6 of [14] are satisfied, so we have the following inequality:
$$\sum_{t=1}^T \left[\frac{1}{np_{i_t}^t}\phi_{i_t}(w^t) + \lambda r(w^{t+1}) - \frac{1}{np_{i_t}^t}\phi_{i_t}(w_*) - \lambda r(w_*)\right] \le \lambda B_\psi(w_*, w^1) + \frac{1}{\lambda\sigma}\sum_{t=1}^T \frac{1}{t}\left\|\frac{1}{np_{i_t}^t}\nabla\phi_{i_t}(w^t)\right\|_*^2,$$
which is actually the same as the last display on page 5 of [14]. Taking expectations on both sides of the above inequality and using $r(w^1) = 0$ concludes the proof.

We will use the above theorem to derive two logarithmic convergence bounds.

Corollary 2. Under the same assumptions as in Theorem 2, if we further assume that $\phi_i(w)$ is $(1/\gamma_i)$-smooth, $\|w^t\| \le R$ for all $t$, and the distribution is set as $p_i^t = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^t) - P(w_*) \le \frac{1}{T}\left[\lambda B_\psi(w_*, w^1) + \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\lambda\sigma n^2}(\ln T + 1)\right] = O\!\left(\frac{(\sum_{i=1}^n R/\gamma_i)^2}{\lambda\sigma n^2}\,\frac{\ln T}{T}\right).$$
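As a quick numerical illustration of the remark above, the following snippet (with synthetic per-example gradients; all names and data are illustrative assumptions) computes the exact variance of the importance-weighted estimator $(np_i)^{-1}\nabla\phi_i$ under the uniform distribution and under the distribution (5) proportional to the gradient norms. The estimator is unbiased in both cases, but the latter achieves the minimal variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
# Synthetic per-example gradients with widely varying norms.
grads = rng.standard_normal((n, d)) * rng.exponential(2.0, size=(n, 1))
full = grads.mean(axis=0)  # the full gradient

def estimator_variance(p):
    # Exact E || grads[i] / (n p_i) - full ||_2^2 when i ~ p.
    weighted = grads / (n * p[:, None])
    assert np.allclose(p @ weighted, full)  # unbiasedness holds for any p > 0
    return float(np.sum(p[:, None] * (weighted - full) ** 2))

norms = np.linalg.norm(grads, axis=1)
p_uniform = np.full(n, 1.0 / n)
p_importance = norms / norms.sum()          # distribution (5)
print(estimator_variance(p_uniform))        # larger
print(estimator_variance(p_importance))     # smaller: minimal over all p
```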