JMLR: Workshop and Conference Proceedings vol 40:1–26, 2015
Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions

Alexandre Belloni
The Fuqua School of Business, Duke University

Tengyuan Liang
Department of Statistics, The Wharton School, University of Pennsylvania

Hariharan Narayanan
Department of Statistics and Department of Mathematics, University of Washington

Alexander Rakhlin
Department of Statistics, The Wharton School, University of Pennsylvania
Abstract
We consider the problem of optimizing an approximately convex function over a bounded convex set in R^n using only function evaluations. The problem is reduced to sampling from an approximately log-concave distribution using the Hit-and-Run method, which is shown to have the same O* complexity as sampling from log-concave distributions. In addition to extending the analysis for log-concave distributions to approximately log-concave distributions, the implementation of the unidimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm is based on simulated annealing and does not rely on first-order conditions, which makes it essentially immune to local minima.
We then apply the method to different motivating problems. In the context of zeroth-order stochastic convex optimization, the proposed method produces an ε-minimizer after O*(n^{7.5} ε^{-2}) noisy function evaluations by inducing an O(ε/n)-approximately log-concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.
1. Introduction and Problem Formulation
Let K ⊂ R^n be a convex set, and let F : R^n → R be an approximately convex function over K in the sense that

    sup_{x∈K} |F(x) − f(x)| ≤ ε/n        (1)

for some convex function f : R^n → R and ε > 0. In particular, F may be discontinuous. We seek to find x ∈ K such that

    F(x) − min_{y∈K} F(y) ≤ ε        (2)
using only function evaluations of F. This paper presents a randomized method based on simulated annealing that satisfies (2) in expectation (or with high probability). Moreover, the number of required function evaluations of F is at most O*(n^{4.5}) (see Corollary 18), where O* hides polylogarithmic factors in n and ε^{-1}. Our method requires only a membership oracle for the set K. In Section 7, we consider the case when the amount of non-convexity in (1) can be much larger than ε/n for points away from the optimum.

© 2015 A. Belloni, T. Liang, H. Narayanan & A. Rakhlin.
In the oracle model of computation, access to function values at queried points is referred to as zeroth-order information. Exact function evaluation of F may be equivalently viewed as approximate function evaluation of the convex function f, with the oracle returning a value

    F(x) ∈ [f(x) − ε/n, f(x) + ε/n].        (3)

A closely related problem is that of convex optimization with a stochastic zeroth-order oracle. Here, the oracle returns a noisy function value f(x) + η. If η is zero-mean and sub-Gaussian, the function values can be averaged to emulate, with high probability, the approximate oracle (3). The randomized method we propose has an O*(n^{7.5} ε^{-2}) oracle complexity for convex stochastic zeroth-order optimization, which, to the best of our knowledge, is the best known for this problem. We refer to Section 6 for more details.
The motivation for studying zeroth-order optimization is plentiful, and we refer the reader to Conn et al. (2009) for a discussion of problems where derivative-free methods are essential. In Section 8 we sketch three areas where the algorithm of this paper can be readily applied: private computation with distributed data, two-stage stochastic programming, and online learning algorithms.
2. Prior Work
The present paper rests firmly on the long string of work by Kannan, Lovász, Vempala, and others (Lovász and Simonovits, 1993; Kannan et al., 1997; Kalai and Vempala, 2006; Lovász and Vempala, 2006a,b, 2007). In particular, we invoke the key lower bound on the conductance of Hit-and-Run from Lovász and Vempala (2006a) and use the simulated annealing technique of Kalai and Vempala (2006). Our analysis extends Hit-and-Run to approximately log-concave distributions, which requires new theoretical results and implementation adjustments. In particular, we propose a unidimensional sampling scheme that mixes fast to a truncated approximately log-concave distribution on the line.
Sampling from β-log-concave distributions was already studied in the early work of Applegate and Kannan (1991) with a discrete random walk based on a discretization of the space. In the case of non-smooth densities and unrestricted support, sampling from approximately log-concave distributions has also been studied in Belloni and Chernozhukov (2009), where the hidden convex function f is quadratic. This additional structure was motivated by the central limit theorem in statistical applications and leads to faster mixing rates. Both works used ball-walk-like strategies. Neither work considered random walks that allow for long steps like Hit-and-Run.
The present work was motivated by the question of information-based complexity of zeroth-order stochastic optimization. The paper of Agarwal et al. (2013) studies a somewhat harder problem of regret minimization with zeroth-order feedback. Their method is based on the pyramid construction of Nemirovskii and Yudin (1983) and requires O(n^{33} ε^{-2}) noisy function evaluations to achieve a regret (and, hence, an optimization guarantee) of ε. The method of Liang et al. (2014) improved the dependence on the dimension to O*(n^{14}) using a Ball Walk on the epigraph of the function in the spirit of Bertsimas and Vempala (2004). The present paper further reduces this dependence to O*(n^{7.5}) and still achieves the optimal ε^{-2} dependence on the accuracy. The best known lower bound for the problem is Ω(n² ε^{-2}) (see Shamir (2012)).
Other relevant work includes the recent paper of Dyer et al. (2013), where the authors proposed a simple random walk method that requires only approximate function evaluations. As the authors mention, their algorithm only works for smooth functions and sets K with smooth boundaries, assumptions that we would like to avoid. Furthermore, the effective dependence of Dyer et al. (2013) on accuracy is worse than ε^{-2}.
3. Preliminaries
Throughout the paper, the functions F and f satisfy (1) and f is convex. The Lipschitz constant of f with respect to the ℓ∞ norm will be denoted by L, defined as the smallest number such that |f(x) − f(y)| ≤ L‖x − y‖∞ for x, y ∈ K. Assume the convex body K ⊆ R^n to be well-rounded in the sense that there exist r, R > 0 such that B₂ⁿ(r) ⊆ K ⊆ B₂ⁿ(R) and R/r ≤ O(√n).¹ For a nonnegative function g, denote by π_g the normalized probability measure induced by g and supported on K.
Definition 1 A function h : K → R₊ is log-concave if

    h(αx + (1−α)y) ≥ h(x)^α h(y)^{1−α}

for all x, y ∈ K and α ∈ [0, 1]. A function is called β-log-concave for some β ≥ 0 if

    h(αx + (1−α)y) ≥ e^{−β} h(x)^α h(y)^{1−α}

for all x, y ∈ K and α ∈ [0, 1].
Definition 2 A function g : K → R₊ is ξ-approximately log-concave if there is a log-concave function h : K → R₊ such that

    sup_{x∈K} |log h(x) − log g(x)| ≤ ξ.
Lemma 3 If the function g is β/2-approximately log-concave, then g is β-log-concave.
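The lemma follows by direct substitution; a short verification (ours, not in the original text) reads:

```latex
% Since \sup_{x\in K}|\log h(x)-\log g(x)|\le \beta/2, we have
% e^{-\beta/2}h \le g and h \ge e^{-\beta/2}g pointwise. Hence, for \alpha\in[0,1],
\begin{aligned}
g(\alpha x + (1-\alpha)y)
  &\ge e^{-\beta/2}\, h(\alpha x + (1-\alpha)y)
   \ge e^{-\beta/2}\, h(x)^{\alpha} h(y)^{1-\alpha} \\
  &\ge e^{-\beta/2}\,\bigl(e^{-\beta/2} g(x)\bigr)^{\alpha}
       \bigl(e^{-\beta/2} g(y)\bigr)^{1-\alpha}
   = e^{-\beta}\, g(x)^{\alpha} g(y)^{1-\alpha}.
\end{aligned}
```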
For one-dimensional functions, the above lemma can be reversed:
Lemma 4 (Belloni and Chernozhukov (2009), Lemma 9) If g is a unidimensional β-log-concave function, then there exists a log-concave function h such that

    e^{−β} h(x) ≤ g(x) ≤ h(x)   for all x ∈ R.
Remark 5 (Gap Between β-Log-Concave Functions and ξ-Approximately Log-Concave Functions) A consequence of Lemma 4 is that β-log-concave functions are equivalent to β-approximately log-concave functions when the domain is unidimensional. However, such equivalence no longer holds in higher dimensions. In the case the domain is R^n, Green et al. (1952) and Cholewa (1984) established that β-log-concave functions are (β/2) log₂(2n)-approximately log-concave. Laczkovich (1999) showed that there are functions for which the factor relating these approximations cannot be less than (1/4) log₂(n/2).
1. This condition can be relaxed by applying a pencil construction as in Lovász and Vempala (2007).
4. Sampling from Approximate Log-Concave Distributions via Hit-and-Run
In this section we analyze the Hit-and-Run procedure to simulate random variables from a distribution induced by an approximately log-concave function. The Hit-and-Run procedure is given in Algorithm 1.

Algorithm 1 Hit-and-Run
Input: a target distribution π_g on K induced by a nonnegative function g; x ∈ dom(g); linear transformation Σ; number of steps m
Output: a point x′ ∈ dom(g) generated by a one-step Hit-and-Run walk
initialization: a starting point x ∈ dom(g)
for i = 1, ..., m do
  1. Choose a random line ℓ that passes through x. The direction is uniform over the surface of the ellipse given by Σ acting on the sphere.
  2. On the line ℓ, run the unidimensional rejection sampler with π_g restricted to the line (and supported on K) to propose a successful next step x′.
end
In order to handle approximately log-concave functions, we need to address implementation issues as well as the theoretical difficulties caused by deviations from log-concavity, which can include discontinuities. The main implementation difference lies in the unidimensional sampler. A binary search no longer yields the maximum over the line and its endpoints, since β-log-concave functions can be discontinuous and multimodal. We now turn to these questions.
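As an illustration only (not the paper's implementation), a single Hit-and-Run step with Σ = I can be sketched in Python. The grid-based chord sampler below is a crude stand-in for the unidimensional sampler of Section 4.1, and all function names (`hit_and_run_step`, `log_g`, `in_K`) are ours:

```python
import math
import random

def hit_and_run_step(x, log_g, in_K, radius=1.0, n_grid=64):
    """One Hit-and-Run step (a sketch): pick a uniform random direction,
    then sample along the chord through x proportionally to g, restricted
    to the membership oracle in_K. Sigma = I throughout."""
    n = len(x)
    # Step 1: random direction, uniform on the unit sphere.
    d = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in d))
    d = [v / norm for v in d]
    # Step 2: discretize the chord x + t*d, t in [-radius, radius], and
    # sample a grid point with probability proportional to g (zero outside K).
    ts = [-radius + 2 * radius * i / (n_grid - 1) for i in range(n_grid)]
    pts = [[xi + t * di for xi, di in zip(x, d)] for t in ts]
    ws = [math.exp(log_g(p)) if in_K(p) else 0.0 for p in pts]
    total = sum(ws)
    if total == 0.0:
        return x  # chord missed the support: stay put
    u = random.uniform(0.0, total)
    acc = 0.0
    for p, w in zip(pts, ws):
        acc += w
        if w > 0.0 and u <= acc:
            return p
    return x
```

For example, iterating this step with log_g(p) = −‖p‖² over the unit ball keeps the walk inside the ball while targeting the induced measure.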
4.1. Unidimensional sampling scheme
As a building block of the randomized method for solving the optimization problem (2), we introduce a one-dimensional sampling procedure. Let g be a unidimensional β-log-concave function on a bounded line segment ℓ, and let π_g be the induced normalized measure. The following guarantee will be proved in this section.
Lemma 6 Let g be a β-log-concave function and let ℓ be a bounded line segment on K. Given a target accuracy ε̃ ∈ (0, e^{−2β}/2), Algorithm 2 produces a point X ∈ ℓ with a distribution π̂_{g,ℓ} such that

    d_tv(π_{g,ℓ}, π̂_{g,ℓ}) ≤ 3 e^{2β} ε̃.

Moreover, the method requires O*(1) evaluations of the unidimensional β-log-concave function g if β is O(1).
The proposed method for sampling from the β-log-concave function g is a rejection sampler that requires two initialization steps. We first show how to implement step (a).
For the β-log-concave function g, let h be a log-concave function as in Lemma 4 and let L̃ denote the Lipschitz constant of the convex function −log h. In the following two results, the O* notation hides a log(L̃) factor.
Lemma 7 (Initialization Step (a)) Algorithm 3 finds a point p ∈ ℓ that satisfies g(p) ≥ e^{−3β} max_{z∈ℓ} g(z). Moreover, this step requires O*(1) function evaluations.
Algorithm 2 Unidimensional rejection sampler
Input: a one-dimensional β-log-concave function g defined on a bounded segment ℓ = [x, x̄]; accuracy ε̃ > 0
Output: a sample x with distribution π̂_{g,ℓ} close to π_{g,ℓ}
Initialization: (a) compute a point p ∈ ℓ s.t. g(p) ≥ e^{−3β} max_{z∈ℓ} g(z); (b) given the target accuracy ε̃, find two points e_{−1}, e_1 on the two sides of p s.t.

    e_{−1} = x if g(x) ≥ (1/2) e^{−β} ε̃ g(p);   (1/2) e^{−β} ε̃ g(p) ≤ g(e_{−1}) ≤ ε̃ g(p) otherwise,
    e_{1} = x̄ if g(x̄) ≥ (1/2) e^{−β} ε̃ g(p);   (1/2) e^{−β} ε̃ g(p) ≤ g(e_{1}) ≤ ε̃ g(p) otherwise.        (4)

while sample rejected do
  pick x ~ unif([e_{−1}, e_1]) and r ~ unif([0, 1]) independently. If r ≤ g(x)/{g(p) e^{3β}}, then accept x and stop. Otherwise, reject x.
end
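The rejection loop of Algorithm 2 can be sketched in a few lines of Python; this is a minimal sketch of the accept/reject step only, assuming the initialization has already produced the endpoints e_{−1}, e_1 and the near-maximizer value g(p):

```python
import math
import random

def unidim_rejection_sample(g, e_lo, e_hi, g_p, beta):
    """Rejection step of Algorithm 2 (a sketch): propose x uniformly on
    [e_lo, e_hi] and accept with probability g(x) / (g(p) * e^{3*beta}).
    By Lemma 7, g_p = g(p) satisfies g_p >= e^{-3*beta} * max g, so the
    envelope g_p * e^{3*beta} dominates g on the segment."""
    bound = g_p * math.exp(3.0 * beta)
    while True:
        x = random.uniform(e_lo, e_hi)
        r = random.uniform(0.0, 1.0)
        if r <= g(x) / bound:
            return x
```

For a log-concave g (β = 0) such as g(x) = exp(−x²) with g_p = 1 at p = 0, the envelope is tight and the sampler accepts quickly.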
Algorithm 3 Initialization Step (a)
Input: a unidimensional β-log-concave function g defined on a bounded interval ℓ = [x, x̄]
Output: a point p ∈ ℓ s.t. g(p) ≥ e^{−3β} max_{z∈ℓ} g(z)
while did not stop do
  set x_l = (3/4)x + (1/4)x̄, x_c = (1/2)x + (1/2)x̄, and x_r = (1/4)x + (3/4)x̄
  If |log g(x_l) − log g(x_r)| > β, set [x, x̄] as either [x_l, x̄] or [x, x_r] accordingly
  else if |log g(x_l) − log g(x_c)| > β, set [x, x̄] as either [x_l, x̄] or [x, x_c] accordingly
  else if |log g(x_r) − log g(x_c)| > β, set [x, x̄] as either [x, x_r] or [x_c, x̄] accordingly
  else output p = argmax_{x ∈ {x_l, x_c, x_r}} g(x) and stop
end
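A direct Python transcription of this trisection loop follows; it is a sketch under the stated assumptions (g strictly positive on the interval, and the loop returning the best of the three probes once their log-values agree to within β):

```python
import math

def init_step_a(g, lo, hi, beta):
    """Initialization step (a) of Algorithm 3 (a sketch): shrink [lo, hi]
    toward the side with the larger value whenever two probes differ by
    more than beta in log-scale; otherwise return the best probe."""
    while True:
        xl = 0.75 * lo + 0.25 * hi
        xc = 0.50 * lo + 0.50 * hi
        xr = 0.25 * lo + 0.75 * hi
        fl, fc, fr = math.log(g(xl)), math.log(g(xc)), math.log(g(xr))
        if abs(fl - fr) > beta:
            # keep the half containing the larger of the two outer probes
            lo, hi = (xl, hi) if fr > fl else (lo, xr)
        elif abs(fl - fc) > beta:
            lo, hi = (xl, hi) if fc > fl else (lo, xc)
        elif abs(fr - fc) > beta:
            lo, hi = (lo, xr) if fc > fr else (xc, hi)
        else:
            return max((xl, xc, xr), key=g)
```

Each iteration shrinks the interval by a factor of 3/4, so the loop terminates once the probes' log-values are β-close, consistent with the O*(1) evaluation count of Lemma 7.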
Lemma 8 (Initialization Step (b)) Let ℓ = [x, x̄] and p ∈ ℓ. The binary search algorithm finds e_{−1} ∈ [x, p] and e_1 ∈ [p, x̄] such that (4) holds. Moreover, this step requires O*(1) function evaluations.

According to Lemmas 6, 7, and 8, the unidimensional sampling method produces a sample from a distribution that is close to the desired β-log-concave distribution. Furthermore, the method requires a number of queries that is logarithmic in all the parameters.
4.2. Mixing time
In this section we analyze the mixing time of the Hit-and-Run algorithm with a β/2-approximately log-concave function g; namely, there exists a log-concave h such that sup_K |log g − log h| ≤ β/2. In particular, this implies that g is β-log-concave, according to Lemma 3. We provide the analysis of Hit-and-Run with the linear transformation Σ = I, and remark that the results extend to other linear transformations employed to round the log-concave distributions.
The mixing time of a geometric random walk can be bounded through the spectral gap of the induced Markov chain. In turn, the spectral gap relates to the so-called conductance, which has been a key quantity in the literature. Consider the transition probability of Hit-and-Run with a density g,
namely

    P_u^g(A) = (2/(n π_n)) ∫_A g(x) / ( μ_g(u, x) |x − u|^{n−1} ) dx,

where μ_g(u, x) = ∫_{ℓ(u,x)∩K} g(y) dy. Let π_g(x) = g(x) / ∫_{y∈K} g(y) dy be the probability measure induced by the function g. The conductance for a set S ⊂ K with 0 < π_g(S) < 1 is defined as

    φ^g(S) = ( ∫_{x∈S} P_x^g(K∖S) dπ_g ) / min{ π_g(S), π_g(K∖S) },

and φ^g is the minimum conductance over all measurable sets. The s-conductance is, in turn, defined as

    φ_s^g = inf_{S⊂K, s<π_g(S)≤1/2} ( ∫_{x∈S} P_x^g(K∖S) dπ_g ) / ( π_g(S) − s ).

By definition we have φ^g ≤ φ_s^g for all s > 0.
The following theorem provides an upper bound on the mixing time of the Markov chain based on conductance. Let σ^(0) be the initial distribution and σ^(m) the distribution of the m-th step of the Hit-and-Run random walk with exact sampling from the distribution π_g restricted to the line.
Theorem 9 (Lovász and Simonovits (1993); Lovász and Vempala (2007), Lemma 9.1) Let 0 < s ≤ 1/2 and let g : K → R₊ be arbitrary. Then for every m ≥ 0,

    d_tv(π_g, σ^(m)) ≤ H_0 (1 − (φ^g)²/2)^m   and   d_tv(π_g, σ^(m)) ≤ H_s + (H_s/s)(1 − (φ_s^g)²/2)^m,

where H_0 = sup_{x∈K} π_g(x)/σ^(0)(x) and H_s = sup{ |π_g(A) − σ^(0)(A)| : π_g(A) ≤ s }.
Building on Lovász and Vempala (2006a), we prove the following result, which provides a lower bound on the conductance of Hit-and-Run induced by a log-concave h. The proof of the result below follows the proof of Theorem 3.7 in Lovász and Vempala (2006a) with modifications to allow unbounded sets K without truncating the random walk.
Theorem 10 (Conductance Lower Bound for Log-concave Measures with Unbounded Support) Let h be a log-concave function in R^n such that the level set of measure 1/8 contains a ball of radius r. Define R = (E_h ‖X − z_h‖²)^{1/2}, where z_h = E_h X and X is sampled from the log-concave measure induced by h. Then for any subset S with π_h(S) = p ≤ 1/2, the conductance of Hit-and-Run satisfies

    φ^h(S) ≥ 1 / ( C₁ (nR/r) log²( nR/(rp) ) ),

where C₁ > 0 is a universal constant.
Although Theorem 10 is new, very similar conductance bounds allowing for unbounded sets were established before. Indeed, in Section 3.3 of Lovász and Vempala (2006a) the authors discuss the case of unbounded K and propose to truncate the set to its effective diameter, using the fact that the resulting distribution would be close to the distribution over the unrestricted set. Such a truncation needs to be enforced, which requires changing the implementation of the algorithm and leads to another (small) layer of approximation errors. Theorem 10 avoids this explicit truncation; truncation is done implicitly in the proof only. We note that when applying the simulated annealing technique, even if we start with a bounded set, diminishing the temperature effectively stretches the sets, which essentially requires handling unbounded sets.
We now argue that the conductance of Hit-and-Run with β-approximately log-concave measures can be related to the conductance with log-concave measures.

Theorem 11 Let g be a β/2-approximately log-concave measure and h be any log-concave function with the property in Definition 2. Then the conductance and s-conductance of the random walk induced by g are lower bounded as

    φ^g ≥ e^{−3β} φ^h   and   φ_s^g ≥ e^{−3β} φ^h_{s/e^β}.
We apply Theorem 9 to show contraction of σ^(m) to π_g in terms of the total variation distance.

Theorem 12 (Mixing Time for Approximately Log-concave Measures) Let π_g be the stationary measure associated with the Hit-and-Run walk based on a β/2-approximately log-concave function g, and let M = ‖σ^(0)/π_g‖ = ∫ (dσ^(0)/dπ_g) dσ^(0). There is a universal constant C < ∞ such that for any γ ∈ (0, 1/2), if

    m ≥ C n² (e^{6β} R²/r²) log⁴( e^β M n R / (r γ²) ) log( M/γ ),

then m steps of the Hit-and-Run random walk based on g yield d_tv(π_g, σ^(m)) ≤ γ.
Remark 13 The value M in Theorem 12 bounds the impact of the initial distribution σ^(0), which can be potentially far from the stationary distribution. In the simulated annealing application of the next section, we show in Lemma 15 that we can "warm start" the chain by carefully picking an initial distribution such that M = O(1).

Theorem 12 shows γ-closeness between the distribution σ^(m) and the corresponding stationary distribution. However, the stationary distribution is not exactly the one induced by g, since the unidimensional sampling procedure described earlier truncates the distribution to improve the mixing time. The following theorem shows that these concerns are overtaken by the geometric mixing of the random walk. Let π̂_{g,ℓ} denote the distribution of the unidimensional sampling scheme (Algorithm 2) along the line ℓ, and let π_{g,ℓ} denote the distribution proportional to g along the line ℓ.
Theorem 14 Let σ̂^(m) denote the distribution of Hit-and-Run with the unidimensional sampling scheme (Algorithm 2) after m steps. For any 0 < s < 1/2, the algorithm maintains

    d_tv(σ̂^(m), π_g) ≤ 2 d_tv(σ̂^(0), σ^(0)) + m · sup_{ℓ⊂K} d_tv(π̂_{g,ℓ}, π_{g,ℓ}) + { H_s + (H_s/s)(1 − (φ_s^g)²/2)^m },

where the supremum is taken over all lines ℓ in K. In particular, for a target accuracy γ ∈ (0, 1/e), if d_tv(σ̂^(0), σ^(0)) ≤ γ/8, s is such that H_s ≤ γ/4, m ≥ {2/(φ_s^g)²} log({H_s/s}{4/γ}), and the precision of the unidimensional sampling scheme is ε̃ = γ e^{−2β}/{12m}, then we have

    d_tv(σ̂^(m), π_g) ≤ γ.
5. Optimization via Simulated Annealing
We now turn to the main goal of the paper: to exhibit a method that produces an ε-minimizer of the nearly convex function F in expectation. Fix the pair f, F with the property (1), and define a series of functions

    h_i(x) = exp(−f(x)/T_i),   g_i(x) = exp(−F(x)/T_i)

for a chain of temperatures {T_i, i = 1, ..., K} to be specified later. It is immediate that the h_i's are log-concave. Lemma 3, in turn, implies that the g_i's are (2ε/(nT_i))-log-concave.
We now introduce the simulated annealing method, which proceeds in epochs and employs the Hit-and-Run procedure with the unidimensional sampler introduced in the previous section. The overall simulated annealing procedure is identical to the algorithm of Kalai and Vempala (2006), with differences in the analysis arising from F being only approximately convex.
Algorithm 4 Simulated annealing
Input: a series of temperatures {T_i, 1 ≤ i ≤ K}; K = number of epochs; x ∈ int K
Output: a candidate point x for which F(x) ≤ min_{y∈K} F(y) + ε holds
initialization: well-rounded convex body K and {X_0^j, 1 ≤ j ≤ N} i.i.d. samples from the uniform measure on K; N = number of strands; set K_0 = K and Σ_0 = I
while i-th epoch, 1 ≤ i ≤ K do
  1. calculate the i-th rounding linear transformation T_i based on {X_{i−1}^j, 1 ≤ j ≤ N} and let Σ_i = T_i ∘ Σ_{i−1}
  2. draw N i.i.d. samples {X_i^j, 1 ≤ j ≤ N} from the measure π_{g_i} using Hit-and-Run with linear transformation Σ_i and with N warm-starting points {X_{i−1}^j, 1 ≤ j ≤ N}
end
output x = argmin_{1≤j≤N, 1≤i≤K} F(X_i^j).
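The outer loop of the annealing schedule can be sketched as follows; this is our illustrative Python sketch, with the inner Hit-and-Run sampler abstracted behind a caller-supplied `sample_from_gibbs(T, starts)` (a hypothetical helper, not part of the paper's pseudocode), and the rounding step omitted:

```python
import math

def simulated_annealing(F, sample_from_gibbs, n, eps, N=10):
    """Outer loop of Algorithm 4 (a sketch). sample_from_gibbs(T, starts)
    should return N approximate draws from the measure proportional to
    exp(-F(x)/T), warm-started at `starts`; rounding (Sigma_i) is omitted."""
    K_epochs = int(math.ceil(math.sqrt(n) * math.log(n / eps)))
    T = 1.0
    samples = [None] * N  # epoch-0 starts: uniform draws from K in a real run
    best = None
    for i in range(1, K_epochs + 1):
        T *= (1.0 - 1.0 / math.sqrt(n))  # cooling: T_i = (1 - 1/sqrt(n))^i
        samples = sample_from_gibbs(T, samples)
        for x in samples:
            if best is None or F(x) < F(best):
                best = x
    return best  # argmin of F over all strands and all epochs
```

The schedule matches the text: roughly √n log(n/ε) epochs, each multiplying the temperature by (1 − 1/√n), so the final temperature is on the order of ε/n.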
Before stating the optimization guarantee of the above simulated annealing procedure, we prove the warm-start property of the distributions between successive epochs and the rounding guarantee given by the N samples.

5.1. Warm start and mixing
We need to prove that the measures at successive temperatures are not too far apart in the ℓ₂ sense, so that the samples from the previous epoch can be treated as a warm start for the next epoch. The following result is an extension of Lemma 6.1 in (Kalai and Vempala, 2006) to β-log-concave functions.
Lemma 15 Let g(x) = exp(−F(x)) be a β-log-concave function. Let μ_i be a distribution with density proportional to exp{−F(x)/T_i}, supported on K. Let T_i = T_{i−1}(1 − 1/√n). Then

    ‖μ_i/μ_{i+1}‖ ≤ C_γ = 5 exp(2β/T_i).
Next we account for the impact of using the final distribution from the previous epoch, σ^(0), as a "warm start."
Theorem 16 Fix a target accuracy γ ∈ (0, 1/e) and let g be a β/2-approximately log-concave function in R^n. Suppose the simulated annealing algorithm (Algorithm 4) is run for K = √n log(1/ρ) epochs with temperature parameters T_i = (1 − 1/√n)^i, 0 ≤ i ≤ K. If the Hit-and-Run with the unidimensional sampling scheme (Algorithm 2) is run for the m = O*(n³) number of steps prescribed in Theorem 12, the algorithm maintains

    d_tv(σ̂_i^(m), π_{g_i}) ≤ e γ        (5)

at every epoch i, where σ̂_i^(m) is the distribution of the m-th step of Hit-and-Run. Here, m depends polylogarithmically on ρ^{-1}.
5.2. Optimization guarantee
We prove an extension of Lemma 4.1 in (Kalai and Vempala, 2006):

Theorem 17 Let f be a convex function. Let X be chosen according to a distribution with density proportional to exp{−f(x)/T}. Then

    E_f f(X) − min_{x∈K} f(x) ≤ (n + 1) T.

Furthermore, if F is such that |F − f|_∞ ≤ ρ, then for X chosen from a distribution with density proportional to exp{−F(x)/T}, we have

    E_F f(X) − min_{x∈K} f(x) ≤ (n + 1) T · exp(2ρ/T).
The above theorem implies that the final temperature T_K in the simulated annealing procedure needs to be set as T_K = ε/n. This, in turn, leads to K = √n log(n/ε) epochs. The oracle complexity of optimizing F is then, informally,

    O*(n³) queries per sample × O*(n) parallel strands × O*(√n) epochs = O*(n^{4.5}).

The following corollary summarizes the computational complexity result:

Corollary 18 Suppose F is approximately convex and |F − f|_∞ ≤ ε/n as in (1). The simulated annealing method with K = √n log(n/ε) epochs produces a random point X such that

    E f(X) − min_{x∈K} f(x) ≤ ε,    E F(X) − min_{x∈K} F(x) ≤ 2ε.

Furthermore, the number of oracle queries required by the method is O*(n^{4.5}).
6. Stochastic Convex Zeroth-Order Optimization
Let f : K → R be the unknown convex L-Lipschitz function we aim to minimize. Within the model of convex optimization with a stochastic zeroth-order oracle O, the information returned upon a query x ∈ K is f(x) + ε_x, where ε_x is zero-mean noise. We shall assume that the noise is sub-Gaussian with parameter σ; that is, E exp(λ ε_x) ≤ exp(σ²λ²/2). It is easy to see from Chernoff's bound that for any t ≥ 0, P(|ε_x| ≥ σt) ≤ 2 exp(−t²/2). We can decrease the noise level by repeatedly querying at x. Fix τ > 0, to be determined later. The average ε̄_x of τ observations is concentrated as P(|ε̄_x| ≥ σt/√τ) ≤ 2 exp(−t²/2). To use the randomized optimization method developed in this paper, we view f(x) + ε̄_x as the value of F(x) returned upon a single query at x. Since the randomized method does not re-visit x with probability 1, the function F is "well-defined".
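The averaging step described above is elementary; a minimal sketch (our illustration, with `noisy_f` standing in for a single call to the stochastic oracle):

```python
def averaged_oracle(noisy_f, x, tau):
    """Emulate the approximate oracle (3) from a stochastic zeroth-order
    oracle: average tau independent noisy evaluations at x, which shrinks
    the sub-Gaussian parameter of the noise from sigma to sigma/sqrt(tau).
    The caller pays tau oracle queries for this single averaged value."""
    return sum(noisy_f(x) for _ in range(tau)) / tau
```

With τ large enough (chosen later in this section), the averaged value concentrates around f(x) tightly enough that (1) holds with high probability.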
Let us make the above discussion more precise by describing three oracles. Oracle O′ draws noise ε_x for each x ∈ K prior to optimization. Upon querying x ∈ K, the oracle deterministically returns f(x) + ε_x, even if the same point is queried twice. Given that the optimization method does not query the same point twice (with probability one), this oracle is equivalent to an oblivious version of the oracle O of the original zeroth-order stochastic optimization problem.
To define O_α, let N_α be an α-net in ℓ_∞, which can be taken as a box grid of K. If K ⊆ R B_∞, the size of the net is at most (R/α)^n. The oracle draws ε_x for each element x ∈ N_α, independently. Upon a query x′ ∈ K, the oracle deterministically returns f(x) + ε_x for the x ∈ N_α which is closest to x′. Note that O_α is no more powerful than O′, since the learner only obtains the information on the α-net. Oracle O_α^τ is a small modification of O_α. This modification models a repeated query at the same point, as described earlier. Parametrized by τ (the number of queries at the same point), oracle O_α^τ draws random variables ε_x for each x ∈ N_α, but the sub-Gaussian parameter of ε_x is σ/√τ. The optimization algorithm pays for τ oracle calls upon a single call to O_α^τ.
We argued that O_α^τ is no more powerful than the original zeroth-order oracle, given that the algorithm does not revisit points. In the rest of the section, we will work with O_α^τ as the oracle model. For any x, denote the projection onto N_α by P_{N_α}(x). Define F : K → R as

    F(x) = f(P_{N_α}(x)) + ε_{P_{N_α}(x)},

where P_{N_α}(x) is the point of N_α closest to x in the ℓ_∞ sense. Clearly, |F − f|_∞ ≤ max_{x∈N_α} |ε_x| + αL, where L is the (ℓ_∞) Lipschitz constant. Since (ε_x)_{x∈N_α} define a finite collection of sub-Gaussian random variables with sub-Gaussian parameter σ/√τ, we have that with probability at least 1 − δ,

    max_{x∈N_α} |ε_x| ≤ σ √( (2n log(R/α) + 2 log(1/δ)) / τ ).
From now on, we condition on this event, which we call E. To guarantee (1), we set the above upper bound equal to ε/(2n) = αL, where τ is the parameter from oracle O_α^τ. Solving for τ and α:

    τ = σ²n²(8n log(R/α) + 8 log(1/δ))/ε² = σ²n²(8n log(2LRn/ε) + 8 log(1/δ))/ε² = O*(n³/ε²)

and α = ε/(2Ln). Note that here L affects τ only logarithmically and, in particular, we could have defined the Lipschitz constant with respect to ℓ₂. We also observe that the oracle model depends on α and, hence, on the target accuracy ε. However, because the dependence on α is only logarithmic, we can take α to be much smaller than ε.
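The algebra above can be checked numerically; a short sketch (our helper name, following the choices α = ε/(2Ln) and the high-probability bound set equal to ε/(2n)):

```python
import math

def choose_tau_alpha(n, eps, L, R, sigma, delta):
    """Solve for tau and alpha as in Section 6 (a sketch): set
    alpha = eps/(2*L*n), then pick tau so that
    sigma * sqrt((2n log(R/alpha) + 2 log(1/delta)) / tau) = eps/(2n)."""
    alpha = eps / (2.0 * L * n)
    log_terms = 2.0 * n * math.log(R / alpha) + 2.0 * math.log(1.0 / delta)
    tau = sigma ** 2 * log_terms * (2.0 * n / eps) ** 2  # = O*(n^3 / eps^2)
    return tau, alpha
```

Expanding the square confirms the text's expression τ = σ²n²(8n log(R/α) + 8 log(1/δ))/ε².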
Together with the O*(n^{4.5}) oracle complexity proved in the previous section for optimizing F, the choice of τ = O*(n³ε^{-2}) evaluations per time step yields a total oracle complexity of

    O*(n^{7.5} ε^{-2})

for the problem of stochastic convex optimization with zeroth-order information. We observe that a factor of n² in the oracle complexity comes from the union bound over the exponential-sized discretization of the set. This (somewhat artificial) factor can be reduced or removed under additional assumptions on the noise, such as a draw from a Gaussian process with spatial dependence over K.