Stochastic Optimization with Importance Sampling

Peilin Zhao (Department of Statistics, Rutgers University, Piscataway, NJ, 08854, USA; peilinzhao@hotmail.com)
Tong Zhang (Department of Statistics, Rutgers University, Piscataway, NJ, 08854, USA; tzhang@stat.rutgers.edu)
Abstract

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling guarantees that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve a higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions, both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.
1 Introduction

Stochastic optimization has been extensively studied in the machine learning community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In general, at every step, a traditional stochastic optimization method will sample one training example or one dual coordinate uniformly at random from the training data, and then update the model parameter using the sampled example or dual coordinate. Although uniform sampling simplifies the analysis, it is insufficient because it may introduce a very high variance of the sampled quantity, which will negatively affect the convergence rate of the resulting optimization procedure. In this paper we study stochastic optimization with importance sampling, which reduces the stochastic variance to significantly improve the convergence rate. Specifically, this paper focuses on importance sampling techniques for Proximal Stochastic Gradient Descent (prox-SGD) (more generally, proximal stochastic mirror descent) [4, 14] and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA) [13].
For prox-SGD, the traditional algorithms such as Stochastic Gradient Descent (SGD) sample training examples uniformly at random during the entire learning process, so that the stochastic gradient is an unbiased estimate of the true gradient [1, 2, 3, 4]. However, the variance of the resulting stochastic gradient estimator may be very high, since the stochastic gradient can vary significantly over different examples. In order to improve convergence, this paper proposes a sampling distribution and a corresponding unbiased importance-weighted gradient estimator that achieves minimal variance. To this end, we analyze the relation between the variance of the stochastic gradient and the sampling distribution. We show that to minimize the variance, the optimal sampling distribution should be roughly proportional to the norm of the stochastic gradient. To simplify computation, we also consider the use of upper bounds on these norms. Our theoretical analysis shows that under certain conditions, the proposed sampling method can significantly improve the convergence rate, and our results include the existing theoretical results for uniformly sampled prox-SGD and SGD as special cases.
Similarly for prox-SDCA, the traditional approach such as Stochastic Dual Coordinate Ascent (SDCA) [12] picks a coordinate to update by sampling the training data uniformly at random [5, 6, 7, 8, 9, 10, 11, 12, 13]. It was shown recently that the SDCA and prox-SDCA algorithms with uniform random sampling converge much faster than with a fixed cyclic ordering [12, 13]. However, this paper shows that if we employ an appropriately defined importance sampling strategy, the convergence can be further improved. To find the optimal sampling distribution, we analyze the connection between the expected increase of the dual objective and the sampling distribution, and obtain the optimal solution, which depends on the smoothness parameters of the loss functions. Our analysis shows that under certain conditions, the proposed sampling method can significantly improve the convergence rate. In addition, our theoretical results include the existing results for uniformly sampled prox-SDCA and SDCA as special cases.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents some preliminaries. In Section 4, we study stochastic optimization with importance sampling. Section 5 lists several applications of the proposed algorithms. Section 6 gives our empirical evaluations. Section 7 concludes the paper.
2 Related Work
We review some related work on Proximal Stochastic Gradient Descent (including the more general proximal stochastic mirror descent) and Proximal Stochastic Dual Coordinate Ascent.
In recent years, Proximal Stochastic Gradient Descent has been extensively studied [4, 14]. As a special case of prox-SGD, Stochastic Gradient Descent has been extensively studied in stochastic approximation theory [15]; however, these results are often asymptotic, so there is no explicit bound in terms of $T$. Later on, the finite-sample convergence rate of SGD for solving linear prediction problems was studied by a number of authors [1, 16]. In general, prox-SGD can achieve a convergence rate of $O(1/\sqrt{T})$ for convex loss functions, and a convergence rate of $O(\log T/T)$ for strongly convex loss functions, where $T$ is the number of iterations of the algorithm. More recently, researchers have improved the previous bound to $O(1/T)$ by $\alpha$-suffix averaging [2], which means that instead of returning the average of the entire sequence of classifiers, the algorithm averages and returns just an $\alpha$-suffix: the average of the last $\alpha$ fraction of the whole sequence of classifiers. In practice it may be difficult for users to decide when to compute the $\alpha$-suffix. To solve this issue, a polynomial decay averaging strategy was proposed by [3], which decays the weights of old individual classifiers polynomially and also guarantees an $O(1/T)$ convergence bound.
For Proximal Stochastic Dual Coordinate Ascent [13], Shalev-Shwartz and Zhang recently proved that the algorithm achieves a convergence rate of $O(1/T)$ for Lipschitz loss functions, and enjoys a linear convergence rate of $O(\exp(-O(T)))$ for smooth loss functions. For structural SVM, a similar result was also obtained in [9]. Several other researchers [6, 7] have studied the convergence behavior of the related non-randomized DCA (dual coordinate ascent) algorithm for SVM, but could only obtain weaker convergence results. The related randomized coordinate descent method has been investigated by some other authors [8, 10, 17]. However, when applied to SDCA, that analysis can only lead to a convergence rate for the dual objective value, while we are mainly interested in the convergence of the primal objective in machine learning applications. Recently, Shalev-Shwartz and Zhang resolved this issue by providing a primal-dual analysis that showed a linear convergence rate $O(\exp(-O(T)))$ of the duality gap for SDCA with smooth loss functions [12].
Although both prox-SGD and prox-SDCA have been extensively studied, most of the existing work only considered the uniform sampling scheme during the entire learning process. Recently, we noticed that Needell et al. [18] considered importance sampling for stochastic gradient descent, where they suggested similar or identical sampling distributions. Strohmer and Vershynin [19] proposed a variant of the Kaczmarz method (an iterative method for solving systems of linear equations) which selects rows with probability proportional to their squared norm; it has been pointed out that this algorithm is actually an SGD algorithm with importance sampling [18]. However, we study importance sampling for more general composite objectives and the more general proximal stochastic gradient descent, i.e., proximal stochastic mirror descent, which covers their algorithms as special cases. Furthermore, we also study prox-SDCA with importance sampling, which is not covered by their study. In addition, Xiao and Zhang [20] have also proposed a proximal stochastic gradient method with progressive variance reduction, where they also provide an importance sampling strategy for smooth loss functions only, which is the same as ours. Because our analysis is based on the basic version of stochastic gradient (mirror) descent, the convergence rate is worse than the linear rates of SAG [21] and SVRG [20] for smooth strongly convex objective functions. However, our main concern is the effectiveness of importance sampling, which can be applied to many other gradient-based algorithms.
We shall mention that for coordinate descent, some researchers have recently considered non-uniform sampling strategies [22, 23], but their results cannot be directly applied to the proximal SDCA in which we are interested here, for several reasons. The primal-dual analysis of prox-SDCA in this paper is analogous to that of [12], which directly implies a convergence rate for the duality gap; the proof techniques rely on the structure of regularized loss minimization and cannot be applied to general primal coordinate descent. Moreover, the suggested distribution for primal coordinate descent is proportional to the smoothness constant of each coordinate, while the distribution for prox-SDCA is proportional to a constant plus the smoothness constant of the individual primal loss function, which is the inverse of the strong convexity constant of the dual coordinate; these two distributions are quite different. In addition, we also provide an importance sampling distribution for the case where the individual loss functions are Lipschitz. We also noticed that a mini-batch SDCA [24] and an accelerated version of prox-SDCA [25] were studied recently by Shalev-Shwartz and Zhang. The accelerated version in [25] uses an inner-outer-iteration strategy, where the inner iteration is the standard prox-SDCA procedure. Therefore, the importance sampling results of this paper can be directly applied to the accelerated prox-SDCA, because the convergence of the importance-sampled inner iteration is faster than that under uniform sampling. In this paper we will therefore focus only on showing the effectiveness of importance sampling for the unaccelerated prox-SDCA.
Related to this paper, non-uniform sampling in the online setting is related to selective sampling, which can be regarded as a form of online active learning and has been extensively studied in the literature [26, 27, 28, 29, 30]. Similar to importance sampling in stochastic optimization, selective sampling also works in iterations. However, the purposes are quite different. Specifically, selective sampling draws unlabeled instances uniformly at random from a fixed distribution and decides which samples to label; the goal is to reduce the number of labels needed to achieve a certain accuracy. The importance sampling considered in this paper does not reduce the number of labels needed, and its goal is to reduce the training time.
3 Preliminaries
Here, we briefly introduce some key definitions and propositions that are useful throughout the paper (for details, please refer to [31]). We consider functions $\phi: \mathbb{R}^d \to \mathbb{R}$.
Definition 1. For $\sigma \ge 0$, a function $\phi:\mathbb{R}^d\to\mathbb{R}$ is $\sigma$-strongly convex with respect to (w.r.t.) a norm $\|\cdot\|$, if for all $u,v\in\mathbb{R}^d$, we have
$$\phi(u) \ge \phi(v) + \nabla\phi(v)^\top(u-v) + \frac{\sigma}{2}\|u-v\|^2,$$
or equivalently, $\forall s\in[0,1]$,
$$\phi(su+(1-s)v) \le s\phi(u) + (1-s)\phi(v) - \frac{\sigma s(1-s)}{2}\|u-v\|^2.$$
For example, $\phi(w)=\frac{1}{2}\|w\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$.
Definition 2. A function $\phi:\mathbb{R}^d\to\mathbb{R}$ is $L$-Lipschitz w.r.t. a norm $\|\cdot\|$, if for all $u,v\in\mathbb{R}^d$, we have
$$|\phi(u) - \phi(v)| \le L\|u-v\|.$$
Definition 3. A function $\phi:\mathbb{R}^d\to\mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient is $(1/\gamma)$-Lipschitz, or, equivalently, for all $u,v\in\mathbb{R}^d$, we have
$$\phi(u) \le \phi(v) + \nabla\phi(v)^\top(u-v) + \frac{1}{2\gamma}\|u-v\|^2.$$
For example, $\phi(w)=\frac{1}{2}\|w\|_2^2$ is 1-smooth w.r.t. $\|\cdot\|_2$.
Proposition 1. If $\phi$ is $(1/\gamma)$-smooth with respect to a norm $\|\cdot\|_P$, then its dual function $\phi^*$ is $\gamma$-strongly convex with respect to the dual norm $\|\cdot\|_D$, where
$$\phi^*(v) = \sup_w \left( v^\top w - \phi(w) \right),$$
and the dual norm is defined as
$$\|v\|_D = \sup_{\|w\|_P = 1} v^\top w.$$
For example, the dual norm of $\|\cdot\|_2$ is itself; the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$; and the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$, where $1/q + 1/p = 1$.
Definition 4. Let $\psi:\mathbb{R}^d\to\mathbb{R}$ be a continuously-differentiable, real-valued, and strictly convex function. Then the Bregman divergence associated with $\psi$ is
$$B_\psi(u,v) = \psi(u) - \psi(v) - \langle \nabla\psi(v), u - v \rangle,$$
which is the difference between the value of $\psi$ at $u$ and the value of the first-order Taylor expansion of $\psi$ around $v$ evaluated at $u$.
Throughout, $\psi$ denotes a continuously differentiable function that is $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$, so that $B_\psi(u,v) \ge \frac{\sigma}{2}\|u-v\|^2$.
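As a quick sanity check (our illustration, not part of the paper), the following Python sketch evaluates $B_\psi$ numerically and verifies that for $\psi(w)=\frac{1}{2}\|w\|_2^2$ the divergence reduces to $\frac{1}{2}\|u-v\|_2^2$:

```python
import numpy as np

def bregman(psi, grad_psi, u, v):
    """Bregman divergence B_psi(u, v) = psi(u) - psi(v) - <grad psi(v), u - v>."""
    return psi(u) - psi(v) - grad_psi(v) @ (u - v)

# For psi(w) = 0.5 * ||w||_2^2, B_psi(u, v) = 0.5 * ||u - v||_2^2.
psi = lambda w: 0.5 * w @ w
grad_psi = lambda w: w
u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(bregman(psi, grad_psi, u, v), 0.5 * np.sum((u - v) ** 2))
```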
Definition 5. A function $f:\mathbb{R}^d\to\mathbb{R}$ is $\mu$-strongly convex with respect to a differentiable function $\psi$, if for any $u,v$, we have
$$f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle + \mu B_\psi(u,v).$$
For example, when $\psi(w)=\frac{1}{2}\|w\|_2^2$, we recover the usual definition of strong convexity.
Definition 6. A function $f:\mathbb{R}^d\to\mathbb{R}$ is $(1/\gamma)$-smooth with respect to a differentiable function $\psi$, if for any $u,v$, we have
$$f(u) \le f(v) + \langle \nabla f(v), u - v \rangle + (1/\gamma) B_\psi(u,v).$$
4 Stochastic Optimization with Importance Sampling

We consider the following generic optimization problem associated with regularized loss minimization of linear predictors. Let $\phi_1,\phi_2,\ldots,\phi_n$ be $n$ functions from $\mathbb{R}^d$ to $\mathbb{R}$. Our goal is to find an approximate solution of the following optimization problem:
$$\min_{w\in\mathbb{R}^d} P(w) := \underbrace{\frac{1}{n}\sum_{i=1}^n \phi_i(w)}_{f(w)} + \lambda r(w), \qquad (1)$$
where $\lambda > 0$ is a regularization parameter, and $r$ is a regularizer.
For example, given examples $(x_i, y_i)$ where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,+1\}$, the Support Vector Machine problem is obtained by setting $\phi_i(w) = [1 - y_i x_i^\top w]_+$, where $[z]_+ = \max(0,z)$, and $r(w) = \frac{1}{2}\|w\|_2^2$. Regression problems also fall into the above formulation. For example, ridge regression is obtained by setting $\phi_i(w) = (y_i - x_i^\top w)^2$ and $r(w) = \frac{1}{2}\|w\|_2^2$; lasso is obtained by setting $\phi_i(w) = (y_i - x_i^\top w)^2$ and $r(w) = \|w\|_1$.
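For concreteness, here is a minimal sketch (ours; the function names are illustrative) of the composite objective $P(w)$ for the SVM and lasso instances above:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i [1 - y_i <x_i, w>]_+ + (lam/2) * ||w||_2^2  (SVM)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + 0.5 * lam * (w @ w)

def lasso_objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i (y_i - <x_i, w>)^2 + lam * ||w||_1  (lasso)."""
    residual = y - X @ w
    return np.mean(residual ** 2) + lam * np.abs(w).sum()
```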
Let $w_*$ be the optimum of (1). We say that a solution $w$ is $\epsilon_P$-sub-optimal if $P(w) - P(w_*) \le \epsilon_P$. We analyze the convergence rates of the proposed algorithms with respect to the number of iterations.
4.1 Proximal Stochastic Gradient Descent with Importance Sampling

In this subsection, we consider proximal stochastic mirror descent with importance sampling. Because proximal stochastic mirror descent is a generalization of proximal stochastic gradient descent (prox-SGD), with a slight abuse of terminology we will use prox-SGD to refer to proximal stochastic mirror descent.
If we directly apply full or stochastic gradient descent to the optimization problem (1), the solution may not satisfy some desirable properties. For example, when $r(w) = \|w\|_1$, the optimal solution of problem (1) should be sparse, and we would like the approximate solution to be sparse as well. However, if we directly use stochastic (sub-)gradient descent, the resulting solution will not achieve sparsity [4].
To effectively and efficiently solve the optimization problem (1), a well-known method is proximal stochastic (sub-)gradient descent. Specifically, Proximal Stochastic Gradient Descent works in iterations. At each iteration $t = 1,2,\ldots$, an index $i_t$ is drawn uniformly at random from $\{1,2,\ldots,n\}$, and the iterative solution is updated according to
$$w^{t+1} = \arg\min_w \left[ \langle \nabla\phi_{i_t}(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t} B_\psi(w, w^t) \right], \qquad (2)$$
where $B_\psi$ is a Bregman divergence and $\nabla\phi_{i_t}(w^t)$ denotes an arbitrary (sub-)gradient of $\phi_{i_t}$. Intuitively, this method works by minimizing a first-order approximation of the function $\phi_{i_t}$ at the current iterate $w^t$ plus the regularizer $\lambda r(w)$, while forcing the next iterate $w^{t+1}$ to lie close to $w^t$. The step size $\eta_t$ controls the trade-off between these two objectives. Because the expectation of $\nabla\phi_{i_t}(w^t)$ equals $\nabla f(w^t)$, i.e., $\mathbb{E}[\nabla\phi_{i_t}(w^t) \,|\, w^t] = \frac{1}{n}\sum_{i=1}^n \nabla\phi_i(w^t) = \nabla f(w^t)$, the optimization problem (2) is an unbiased estimate of the corresponding problem for proximal gradient descent.
We assume that the exact solution of the above optimization problem (2) can be computed efficiently. For example, when $\psi(w) = \frac{1}{2}\|w\|_2^2$, we have $B_\psi(u,v) = \frac{1}{2}\|u-v\|_2^2$, and the above optimization produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t \lambda r}\left( w^t - \eta_t \nabla\phi_{i_t}(w^t) \right),$$
where $\mathrm{prox}_h(x) = \arg\min_w \left( h(w) + \frac{1}{2}\|w-x\|_2^2 \right)$. Furthermore, it is also assumed that the proximal mapping of $\eta_t \lambda r(w)$, i.e., $\mathrm{prox}_{\eta_t\lambda r}(x)$, is easy to compute. For example, when $r(w) = \|w\|_1$, the proximal mapping of $\lambda r(w)$ is the following shrinkage operation:
$$\mathrm{prox}_{\lambda r}(x) = \mathrm{sign}(x) \odot [|x| - \lambda]_+,$$
where $\odot$ is the element-wise product; this can be computed in $O(d)$ time.
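A minimal sketch of this shrinkage (soft-thresholding) operator, written here for illustration:

```python
import numpy as np

def prox_l1(x, lam):
    """Soft-thresholding: prox of lam * ||.||_1, i.e. sign(x) * max(|x| - lam, 0), in O(d) time."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Components with |x_i| <= lam are zeroed out, which is how the
# l1-regularized update produces sparse iterates.
print(prox_l1(np.array([0.3, -1.5, 0.05]), lam=0.1))  # [0.2, -1.4, 0.0]
```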
The advantage of proximal stochastic gradient descent is that each step relies only on a single derivative $\nabla\phi_{i_t}(\cdot)$, and thus the computational cost is $1/n$ of that of the standard proximal gradient descent. However, a disadvantage of the method is that the randomness introduces variance: this is caused by the fact that $\nabla\phi_{i_t}(w^t)$ equals the gradient $\nabla f(w^t)$ in expectation, but $\nabla\phi_i(w^t)$ varies with $i$. In particular, if the stochastic gradient has a large variance, the convergence will become slow.
We now study prox-SGD with importance sampling to reduce the variance of the stochastic gradient. The idea of importance sampling is, at the $t$-th step, to assign each $i \in \{1,\ldots,n\}$ a probability $p^t_i \ge 0$ such that $\sum_{i=1}^n p^t_i = 1$. We then sample $i_t$ from $\{1,\ldots,n\}$ according to $p^t = (p^t_1,\ldots,p^t_n)^\top$. Under this distribution, proximal SGD with importance sampling works as follows:
$$w^{t+1} = \arg\min_w \left[ \langle (np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t} B_\psi(w, w^t) \right], \qquad (3)$$
which is another unbiased estimate of the optimization problem for proximal gradient descent, because $\mathbb{E}[(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) \,|\, w^t] = \sum_{i=1}^n p^t_i (np^t_i)^{-1}\nabla\phi_i(w^t) = \nabla f(w^t)$.
Similarly, if $\psi(w) = \frac{1}{2}\|w\|_2^2$, proximal SGD with importance sampling produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\left( w^t - \eta_t (np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) \right).$$
In addition, setting the derivative of the objective function in equation (3) to zero, we obtain the following implicit update rule for the iterative solution:
$$\nabla\psi(w^{t+1}) = \nabla\psi(w^t) - \eta_t (np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) - \eta_t \lambda \partial r(w^{t+1}),$$
where $\partial r(w^{t+1})$ is a subgradient.
The key question now is which $p^t$ can optimally reduce the variance of the stochastic gradient. To answer this question, we first prove a lemma that characterizes the relationship between $p^t$ and the convergence rate of prox-SGD with importance sampling.
Lemma 1. Let $w^{t+1}$ be defined by the update (3). Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, and that $f$ is $\mu$-strongly convex and $(1/\gamma)$-smooth with respect to $\psi$. If $r(w)$ is convex and $\eta_t \in (0,\gamma]$, then $w^{t+1}$ satisfies the following inequality for any $t \ge 1$:
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*,w^t) - B_\psi(w_*,w^{t+1})] - \mu\,\mathbb{E}B_\psi(w_*,w^t) + \frac{\eta_t}{\sigma}\,\mathbb{E}\,\mathbb{V}\!\left((np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)\right),$$
where the variance is defined as $\mathbb{V}((np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)) = \mathbb{E}\|(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2$, and the expectation is taken with respect to the distribution $p^t$.
Proof. To simplify the notation, we denote $g_t = (np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)$. Because $f(w)$ is $\mu$-strongly convex w.r.t. $\psi$, and $r(w)$ is convex, we can derive
$$P(w_*) \ge f(w^t) + \langle \nabla f(w^t), w_* - w^t \rangle + \mu B_\psi(w_*,w^t) + \lambda r(w^{t+1}) + \lambda \langle \partial r(w^{t+1}), w_* - w^{t+1} \rangle.$$
Using the fact that $f$ is $(1/\gamma)$-smooth w.r.t. $\psi$, we can further lower bound $f(w^t)$ by
$$f(w^t) \ge f(w^{t+1}) - \langle \nabla f(w^t), w^{t+1} - w^t \rangle - (1/\gamma) B_\psi(w^{t+1},w^t).$$
Combining the above two inequalities, we have
$$P(w_*) \ge P(w^{t+1}) + \langle \nabla f(w^t) + \lambda \partial r(w^{t+1}), w_* - w^{t+1} \rangle + \mu B_\psi(w_*,w^t) - (1/\gamma) B_\psi(w^{t+1},w^t).$$
Considering the second term on the right-hand side, we have
$$\begin{aligned}
\langle \nabla f(w^t) + \lambda \partial r(w^{t+1}), w_* - w^{t+1} \rangle
&= \langle \nabla f(w^t) + [\nabla\psi(w^t) - \nabla\psi(w^{t+1})]/\eta_t - g_t,\; w_* - w^{t+1} \rangle \\
&= \frac{1}{\eta_t}\langle \nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1} \rangle + \langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle.
\end{aligned}$$
Combining the above two inequalities, we get
$$\begin{aligned}
&P(w_*) - P(w^{t+1}) - \mu B_\psi(w_*,w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle \\
&\quad\ge \langle \nabla f(w^t) + \lambda\partial r(w^{t+1}), w_* - w^{t+1} \rangle - (1/\gamma) B_\psi(w^{t+1},w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle \\
&\quad= \frac{1}{\eta_t}\langle \nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1} \rangle - (1/\gamma) B_\psi(w^{t+1},w^t).
\end{aligned}$$
Plugging the following equality (Lemma 11.1 from [32]),
$$B_\psi(w_*,w^{t+1}) + B_\psi(w^{t+1},w^t) - B_\psi(w_*,w^t) = \langle \nabla\psi(w^t) - \nabla\psi(w^{t+1}), w_* - w^{t+1} \rangle,$$
into the previous inequality gives
$$\begin{aligned}
&P(w_*) - P(w^{t+1}) - \mu B_\psi(w_*,w^t) - \langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle \\
&\quad\ge \frac{1}{\eta_t}\left[ B_\psi(w_*,w^{t+1}) + B_\psi(w^{t+1},w^t) - B_\psi(w_*,w^t) \right] - (1/\gamma) B_\psi(w^{t+1},w^t) \\
&\quad\ge \frac{1}{\eta_t}\left[ B_\psi(w_*,w^{t+1}) - B_\psi(w_*,w^t) \right],
\end{aligned}$$
where $\eta_t \in (0,\gamma]$ is used for the final inequality. Re-arranging the above inequality and taking expectations on both sides results in
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*,w^t) - B_\psi(w_*,w^{t+1})] - \mu\,\mathbb{E}B_\psi(w_*,w^t) - \mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle.$$
To upper bound the last inner product term on the right-hand side, we define the proximal full gradient update as $\widehat{w}^{t+1} = \arg\min_w \left[ \langle \nabla f(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t}B_\psi(w,w^t) \right]$, which is independent of $g_t$. Then we can bound $-\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle$ as follows:
$$\begin{aligned}
-\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - w_* \rangle
&= -\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - \widehat{w}^{t+1} + \widehat{w}^{t+1} - w_* \rangle \\
&= -\mathbb{E}\langle g_t - \nabla f(w^t), w^{t+1} - \widehat{w}^{t+1} \rangle - \mathbb{E}\langle g_t - \nabla f(w^t), \widehat{w}^{t+1} - w_* \rangle \\
&\le \mathbb{E}\|g_t - \nabla f(w^t)\|_* \|w^{t+1} - \widehat{w}^{t+1}\| - \mathbb{E}\langle g_t - \nabla f(w^t), \widehat{w}^{t+1} - w_* \rangle \\
&\le \frac{\eta_t}{\sigma}\mathbb{E}\|g_t - \nabla f(w^t)\|_*^2 - \mathbb{E}\langle g_t - \nabla f(w^t), \widehat{w}^{t+1} - w_* \rangle \\
&= \frac{\eta_t}{\sigma}\mathbb{E}\|(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2 = \frac{\eta_t}{\sigma}\mathbb{V}\!\left((np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)\right),
\end{aligned}$$
where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality is due to Lemma 3, and the last equality holds because $\mathbb{E}[\langle g_t - \nabla f(w^t), \widehat{w}^{t+1} - w_* \rangle \,|\, w^t] = 0$. Finally, plugging the above inequality into the previous one concludes the proof of this lemma.
From the above analysis, we observe that the smaller the variance, the greater the reduction in the objective function. In the next subsection, we study how to use importance sampling to reduce the variance. This observation will be made more rigorous below.
4.1.1 Algorithm

According to the result of Lemma 1, to maximize the reduction in the objective value, we should choose $p^t$ as the solution of the following optimization problem:
$$\min_{p^t:\, p^t_i \in [0,1],\, \sum_{i=1}^n p^t_i = 1} \mathbb{V}\!\left((np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)\right) \;\Leftrightarrow\; \min_{p^t:\, p^t_i \in [0,1],\, \sum_{i=1}^n p^t_i = 1} \frac{1}{n^2}\sum_{i=1}^n (p^t_i)^{-1}\|\nabla\phi_i(w^t)\|_*^2. \qquad (4)$$
It is easy to verify that the solution of the above optimization problem is
$$p^t_i = \frac{\|\nabla\phi_i(w^t)\|_*}{\sum_{j=1}^n \|\nabla\phi_j(w^t)\|_*}, \quad \forall i \in \{1,2,\ldots,n\}. \qquad (5)$$
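As a quick numerical illustration (ours, not from the paper), the following sketch compares the objective of (4) under the uniform distribution and under the gradient-norm-proportional distribution (5); by the analysis above, (5) can never do worse:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_norms = rng.exponential(scale=1.0, size=100)  # stand-ins for ||grad phi_i(w^t)||_*

def variance_term(p, norms):
    """The quantity (1/n^2) * sum_i norms_i^2 / p_i, which distribution (5) minimizes."""
    n = len(norms)
    return np.sum(norms ** 2 / p) / n ** 2

p_uniform = np.full(len(grad_norms), 1.0 / len(grad_norms))
p_optimal = grad_norms / grad_norms.sum()   # distribution (5)

# The optimal distribution never does worse than uniform sampling.
assert variance_term(p_optimal, grad_norms) <= variance_term(p_uniform, grad_norms)
```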
Although this distribution minimizes the variance of the $t$-th stochastic gradient, it requires the calculation of $n$ derivatives at each step, which is clearly inefficient. One potential remedy is to calculate the $n$ derivatives only occasionally and then reuse them over many subsequent steps; since the true derivatives change at every step, it may then be better to add a smoothing parameter to the sampling distribution. However, this solution can still be inefficient. A more practical solution is to relax the previous optimization problem (4) as follows:
$$\min_{p^t:\, p^t_i \in [0,1],\, \sum_{i=1}^n p^t_i = 1} \frac{1}{n^2}\sum_{i=1}^n (p^t_i)^{-1}\|\nabla\phi_i(w^t)\|_*^2 \;\le\; \min_{p^t:\, p^t_i \in [0,1],\, \sum_{i=1}^n p^t_i = 1} \frac{1}{n^2}\sum_{i=1}^n (p^t_i)^{-1} G_i^2 \qquad (6)$$
by introducing constants
$$G_i \ge \|\nabla\phi_i(w^t)\|_*, \quad \forall t.$$
Then, we can approximate the distribution in equation (5) by solving the right-hand side of inequality (6), yielding
$$p^t_i = \frac{G_i}{\sum_{j=1}^n G_j}, \quad \forall i \in \{1,2,\ldots,n\},$$
which is independent of $t$.
Based on the above solution, we suggest distributions for two kinds of loss functions: Lipschitz functions and smooth functions. Firstly, if $\phi_i(w)$ is $L_i$-Lipschitz w.r.t. $\|\cdot\|$, then $\|\nabla\phi_i(w)\|_* \le L_i$ for any $w \in \mathbb{R}^d$, and the suggested distribution is
$$p^t_i = \frac{L_i}{\sum_{j=1}^n L_j}, \quad \forall i \in \{1,2,\ldots,n\}.$$
Secondly, if $\phi_i(w)$ is $(1/\gamma_i)$-smooth and $\|w^t\| \le R$ for any $t$, then $\|\nabla\phi_i(w^t)\|_* \le R/\gamma_i$, and the advised distribution is
$$p^t_i = \frac{1/\gamma_i}{\sum_{j=1}^n 1/\gamma_j}, \quad \forall i \in \{1,2,\ldots,n\}.$$
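As an illustration (our sketch; the hinge-loss Lipschitz constant $L_i = \|x_i\|_2$ is a standard bound assumed here), these two distributions can be computed as follows:

```python
import numpy as np

def lipschitz_distribution(X):
    """p_i proportional to L_i; for the hinge loss [1 - y_i <x_i, w>]_+ with the
    Euclidean norm, L_i = ||x_i||_2 (a standard bound, assumed here)."""
    L = np.linalg.norm(X, axis=1)
    return L / L.sum()

def smooth_distribution(inv_gammas):
    """p_i proportional to 1/gamma_i for (1/gamma_i)-smooth losses phi_i."""
    inv_gammas = np.asarray(inv_gammas, dtype=float)
    return inv_gammas / inv_gammas.sum()
```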
Finally, we summarize the proposed proximal SGD with importance sampling in Algorithm 1.

Algorithm 1 Proximal Stochastic Gradient Descent with Importance Sampling (Iprox-SGD)
Input: $\lambda \ge 0$, the learning rates $\eta_1,\ldots,\eta_T > 0$.
Initialize: $w^1 = 0$, $p^1 = (1/n,\ldots,1/n)^\top$.
for $t = 1,\ldots,T$ do
  Update $p^t$;
  Sample $i_t$ from $\{1,\ldots,n\}$ based on $p^t$;
  Update $w^{t+1} = \arg\min_w \left[ \langle (np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t), w \rangle + \lambda r(w) + \frac{1}{\eta_t} B_\psi(w, w^t) \right]$;
end for
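A minimal runnable sketch of Algorithm 1 (ours) for the Euclidean case $\psi(w)=\frac{1}{2}\|w\|_2^2$ with $\ell_1$ regularization and a fixed Lipschitz-based distribution; the loss and step-size schedule are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def iprox_sgd(X, y, lam, T, eta0=1.0, rng=None):
    """Iprox-SGD sketch: hinge loss, l1 regularizer, p_i ~ ||x_i||_2 (fixed over t)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    p = np.linalg.norm(X, axis=1)
    p = p / p.sum()                       # importance sampling distribution
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.choice(n, p=p)            # sample i_t ~ p^t
        eta = eta0 / np.sqrt(t)           # illustrative step size
        margin = y[i] * (X[i] @ w)
        grad_i = -y[i] * X[i] if margin < 1 else np.zeros(d)      # hinge subgradient
        g = grad_i / (n * p[i])           # importance-weighted stochastic gradient
        z = w - eta * g                   # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # prox of eta*lam*||.||_1
    return w
```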
4.1.2 Analysis

This section provides a convergence analysis of the proposed algorithm. Before presenting the results, we make some general assumptions:
$$r(0) = 0, \quad \text{and} \quad r(w) \ge 0 \text{ for all } w.$$
It is easy to see that these two assumptions are generally satisfied by all the well-known regularizers. Under the above assumptions, we first prove a convergence result for proximal SGD with importance sampling using the previous Lemma 1.
Theorem 1. Let $w^t$ be generated by the proposed algorithm. Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, and that $f$ is $\mu$-strongly convex and $(1/\gamma)$-smooth with respect to $\psi$. If $r(w)$ is convex and $\eta_t = \frac{1}{\alpha + \mu t}$ with $\alpha \ge 1/\gamma - \mu$, the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le \frac{1}{T}\left[ \alpha B_\psi(w_*,w^1) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} \right], \qquad (7)$$
where the variance is defined as $\mathbb{V}_t = \mathbb{V}[(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)] = \mathbb{E}\|(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2$, and the expectation is taken with respect to the distribution $p^t$.
Proof. Firstly, it is easy to check that $\eta_t \in (0,\gamma]$. Because the functions $\psi$, $f$, $r$ satisfy the assumptions of Lemma 1, we have
$$\mathbb{E}[P(w^{t+1}) - P(w_*)] \le \frac{1}{\eta_t}\mathbb{E}[B_\psi(w_*,w^t) - B_\psi(w_*,w^{t+1})] - \mu\,\mathbb{E}B_\psi(w_*,w^t) + \frac{\eta_t}{\sigma}\mathbb{E}\,\mathbb{V}\!\left((np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)\right).$$
Summing the above inequality over $t = 1,\ldots,T$, and using $\eta_t = 1/(\alpha + \mu t)$, we get
$$\begin{aligned}
\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - \sum_{t=1}^T P(w_*)
&\le \sum_{t=1}^T (\alpha + \mu t)\,\mathbb{E}\left[ B_\psi(w_*,w^t) - B_\psi(w_*,w^{t+1}) \right] - \mu\sum_{t=1}^T \mathbb{E}B_\psi(w_*,w^t) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} \\
&= \alpha B_\psi(w_*,w^1) - (\alpha + \mu T)\, B_\psi(w_*,w^{T+1}) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)} \\
&\le \alpha B_\psi(w_*,w^1) + \mathbb{E}\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha + \mu t)}.
\end{aligned}$$
Dividing both sides of the above inequality by $T$ concludes the proof.
Corollary 1. Under the same assumptions as Theorem 1, if we further assume $\phi_i(w)$ is $(1/\gamma_i)$-smooth, $\|w^t\| \le R$ for any $t$, and the distribution is set as $p^t_i = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le \frac{1}{T}\left[ \alpha B_\psi(w_*,w^1) + \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sigma\mu n^2}\left( \frac{\mu}{\alpha+\mu} + \ln(\alpha+\mu T) - \ln(\alpha+\mu) \right) \right] = O\!\left( \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sigma\mu n^2}\cdot\frac{\ln(\alpha+\mu T)}{T} \right).$$
In addition, if $\mu = 0$, the above bound is invalid; however, if $\eta_t$ is set as $\sqrt{\sigma B_\psi(w_*,w^1)} \Big/ \left( \sqrt{T}\,\frac{\sum_{i=1}^n R/\gamma_i}{n} \right)$, we can prove the following inequality for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le 2\sqrt{\frac{B_\psi(w_*,w^1)}{\sigma}}\cdot\frac{\sum_{i=1}^n R/\gamma_i}{n}\cdot\frac{1}{\sqrt{T}}.$$
Remark: If $\psi(w) = \frac{1}{2}\|w\|_2^2$ and $r(w) = 0$, then $B_\psi(u,v) = \frac{1}{2}\|u-v\|_2^2$, and the proposed algorithm becomes SGD with importance sampling. Under these assumptions, it is achievable to get rid of the $\ln T$ factor in the convergence bound when the objective function is strongly convex; however, we will not provide the details, for conciseness. For more general Bregman divergences, it is difficult to remove this $\ln T$ factor, because many properties of $\frac{1}{2}\|u-v\|_2^2$, such as symmetry, are not satisfied by the Bregman divergence.
In addition, it is easy to derive high-probability bounds using existing work, such as Theorem 8 of [14]. In that theorem, the high-probability bound depends on the variance of the stochastic gradient, so our sampling strategy can improve the bound, since we are minimizing the variance. However, we do not explicitly provide the resulting bounds, because the derivation is relatively straightforward.
Proof. Firstly, the fact that $\phi_i(w)$ is $(1/\gamma_i)$-smooth and $\|w^t\| \le R$ for any $t$ implies $\|\nabla\phi_i(w^t)\|_* \le R/\gamma_i$. Using this result and the distribution $p^t_i = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, we get
$$\mathbb{V}_t = \mathbb{E}\|(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t) - \nabla f(w^t)\|_*^2 \le \mathbb{E}\|(np^t_{i_t})^{-1}\nabla\phi_{i_t}(w^t)\|_*^2 = \frac{1}{n^2}\sum_{i=1}^n \frac{1}{p^t_i}\|\nabla\phi_i(w^t)\|_*^2 \le \left( \frac{\sum_{i=1}^n R/\gamma_i}{n} \right)^2,$$
where the first inequality is due to $\mathbb{E}\|z - \mathbb{E}z\|^2 = \mathbb{E}\|z\|^2 - \|\mathbb{E}z\|^2$. Using the above inequality gives
$$\begin{aligned}
\sum_{t=1}^T \frac{\mathbb{V}_t}{\sigma(\alpha+\mu t)}
&\le \left( \frac{\sum_{i=1}^n R/\gamma_i}{n} \right)^2 \sum_{t=1}^T \frac{1}{\sigma(\alpha+\mu t)} \\
&\le \left( \frac{\sum_{i=1}^n R/\gamma_i}{n} \right)^2 \frac{1}{\sigma}\left[ \frac{1}{\alpha+\mu} + \int_{t=1}^T \frac{dt}{\alpha+\mu t} \right] \\
&\le \left( \frac{\sum_{i=1}^n R/\gamma_i}{n} \right)^2 \frac{1}{\sigma\mu}\left[ \frac{\mu}{\alpha+\mu} + \ln(\alpha+\mu T) - \ln(\alpha+\mu) \right].
\end{aligned}$$
Plugging the above inequality into inequality (7) concludes the proof of the first part.
To prove the second part, we plug the bound on $\mathbb{V}_t$ and the equality $\eta_t = \sqrt{\sigma B_\psi(w_*,w^1)} \Big/ \left( \sqrt{T}\,\frac{\sum_{i=1}^n R/\gamma_i}{n} \right)$ into inequality (7).
Remark. If the uniform distribution is adopted, it is easy to observe that $\mathbb{V}_t$ is bounded by $\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n}$, and Theorem 1 then results in $\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le O\!\left( \frac{\sum_{i=1}^n (R/\gamma_i)^2}{\sigma\mu n}\cdot\frac{\ln(\alpha+\mu T)}{T} \right)$ for strongly convex $f$, and $\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^{t+1}) - P(w_*) \le 2\sqrt{\frac{B_\psi(w_*,w^1)}{\sigma}}\sqrt{\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n}}\frac{1}{\sqrt{T}}$ for general convex $f$. However, the Cauchy-Schwarz inequality,
$$\frac{\sum_{i=1}^n (R/\gamma_i)^2}{n} \Bigg/ \left( \frac{\sum_{i=1}^n R/\gamma_i}{n} \right)^2 = \frac{n\sum_{i=1}^n (R/\gamma_i)^2}{(\sum_{i=1}^n R/\gamma_i)^2} \ge 1,$$
implies that importance sampling does improve the convergence rate, especially when $\frac{(\sum_{i=1}^n R/\gamma_i)^2}{\sum_{i=1}^n (R/\gamma_i)^2} \ll n$.
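A quick numerical check (ours) of this improvement factor on a skewed set of constants $R/\gamma_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
R = 1.0
bounds = R / rng.exponential(scale=1.0, size=1000)  # stand-ins for R / gamma_i

uniform_factor = np.mean(bounds ** 2)      # variance bound under uniform sampling
importance_factor = np.mean(bounds) ** 2   # variance bound under importance sampling

# By Cauchy-Schwarz the ratio is >= 1; the more skewed the R/gamma_i, the larger it is.
print(uniform_factor / importance_factor)
```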
Theorem 2. Let $w^t$ be generated by the proposed algorithm. Assume that $\psi(\cdot)$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$, $f$ is convex, and $r(w)$ is 1-strongly convex. If $\eta_t$ is set as $1/(\lambda t)$ for all $t$, the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^t) - P(w_*) \le \frac{1}{T}\left[ \lambda B_\psi(w_*,w^1) + \frac{1}{\lambda\sigma}\sum_{t=1}^T \frac{1}{t}\,\mathbb{E}\left\| \frac{\nabla\phi_{i_t}(w^t)}{np^t_{i_t}} \right\|_*^2 \right], \qquad (8)$$
where the expectation is taken with respect to the distribution $p^t$.
Proof. The fact that $r(w)$ is 1-strongly convex implies that $\lambda r(w)$ is $\lambda$-strongly convex. Then, all the assumptions of Corollary 6 of [14] are satisfied, so we have the following inequality:
$$\sum_{t=1}^T \left[ \frac{1}{np^t_{i_t}}\phi_{i_t}(w^t) + \lambda r(w^{t+1}) - \frac{1}{np^t_{i_t}}\phi_{i_t}(w_*) - \lambda r(w_*) \right] \le \lambda B_\psi(w_*,w^1) + \frac{1}{\lambda\sigma}\sum_{t=1}^T \frac{1}{t}\left\| \frac{1}{np^t_{i_t}}\nabla\phi_{i_t}(w^t) \right\|_*^2,$$
which is actually the same as the last display on page 5 of [14]. Taking expectations on both sides of the above inequality and using $r(w^1) = 0$ concludes the proof.
We now use the above theorem to derive two logarithmic convergence bounds.
Corollary 2. Under the same assumptions as Theorem 2, if we further assume $\phi_i(w)$ is $(1/\gamma_i)$-smooth, $\|w^t\| \le R$ for any $t$, and the distribution is set as $p^t_i = \frac{R/\gamma_i}{\sum_{j=1}^n R/\gamma_j}$, then the following inequality holds for any $T \ge 1$:
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}P(w^t) - P(w_*) \le \frac{1}{T}\left[ \lambda B_\psi(w_*,w^1) + \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\lambda\sigma n^2}(\ln T + 1) \right] = O\!\left( \frac{(\sum_{i=1}^n R/\gamma_i)^2}{\lambda\sigma n^2}\cdot\frac{\ln T}{T} \right).$$