JMLR: Workshop and Conference Proceedings vol 40:1–26, 2015

Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions

Alexandre Belloni, The Fuqua School of Business, Duke University
Tengyuan Liang, Department of Statistics, The Wharton School, University of Pennsylvania
Hariharan Narayanan, Department of Statistics and Department of Mathematics, University of Washington
Alexander Rakhlin, Department of Statistics, The Wharton School, University of Pennsylvania

© 2015 A. Belloni, T. Liang, H. Narayanan & A. Rakhlin.

Abstract

We consider the problem of optimizing an approximately convex function over a bounded convex set in R^n using only function evaluations. The problem is reduced to sampling from an approximately log-concave distribution using the Hit-and-Run method, which is shown to have the same O* complexity as sampling from log-concave distributions. In addition to extending the analysis for log-concave distributions to approximately log-concave distributions, the implementation of the one-dimensional sampler of the Hit-and-Run walk requires new methods and analysis. The algorithm is based on simulated annealing and does not rely on first-order conditions, which makes it essentially immune to local minima.

We then apply the method to several motivating problems. In the context of zeroth-order stochastic convex optimization, the proposed method produces an ε-minimizer after O*(n^{7.5} ε^{-2}) noisy function evaluations by inducing an O(ε/n)-approximately log-concave distribution. We also consider in detail the case when the "amount of non-convexity" decays towards the optimum of the function. Other applications of the method discussed in this work include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.

1. Introduction and Problem Formulation

Let K ⊂ R^n be a convex set, and let F : R^n → R be an approximately convex function over K in the sense that

    sup_{x∈K} |F(x) − f(x)| ≤ ε/n                                   (1)

for some convex function f : R^n → R and ε > 0. In particular, F may be discontinuous. We seek to find x ∈ K such that

    F(x) − min_{y∈K} F(y) ≤ ε                                       (2)

using only function evaluations of F. This paper presents a randomized method based on simulated annealing that satisfies (2) in expectation (or with high probability). Moreover, the number of required function evaluations of F is at most O*(n^{4.5}) (see Corollary 18), where O* hides polylogarithmic factors in n and ε^{-1}. Our method requires only a membership oracle for the set K. In Section 7, we consider the case when the amount of non-convexity in (1) can be much larger than ε/n for points away from the optimum.

In the oracle model of computation, access to function values at queried points is referred to as zeroth-order information. Exact function evaluation of F may be equivalently viewed as approximate function evaluation of the convex function f, with the oracle returning a value

    F(x) ∈ [f(x) − ε/n, f(x) + ε/n].                                (3)

A closely related problem is that of convex optimization with a stochastic zeroth-order oracle. Here, the oracle returns a noisy function value f(x) + η. If η is zero-mean and sub-Gaussian, the function values can be averaged to emulate, with high probability, the approximate oracle (3). The randomized method we propose has an O*(n^{7.5} ε^{-2}) oracle complexity for convex stochastic zeroth-order optimization, which, to the best of our knowledge, is the best known for this problem. We refer to Section 6 for more details.

The motivation for studying zeroth-order optimization is plentiful, and we refer the reader to Conn et al. (2009) for a discussion of problems where derivative-free methods are essential. In Section 8 we sketch three areas where the algorithm of this paper can be readily applied: private computation with distributed data, two-stage stochastic programming, and online learning algorithms.
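As an illustration of the averaging reduction just described, the following sketch (ours, not from the paper; the quadratic test function, the noise level, and the number of repetitions are assumptions made for the demo) turns a stochastic zeroth-order oracle into an approximate oracle of the form (3) by averaging repeated queries:

```python
import numpy as np

def make_averaged_oracle(noisy_f, tau):
    """Average tau independent noisy evaluations at each query point.
    For zero-mean sub-Gaussian noise with parameter sigma, the averaged
    noise is sub-Gaussian with parameter sigma/sqrt(tau), so for tau large
    enough the returned value lies within eps/n of f(x) with high
    probability, emulating the approximate oracle (3)."""
    def F(x):
        return float(np.mean([noisy_f(x) for _ in range(tau)]))
    return F

# Illustrative usage (all numbers are assumptions, not choices from the paper).
rng = np.random.default_rng(0)
f = lambda x: float(np.sum(x ** 2))                      # hidden convex function
noisy_f = lambda x: f(x) + 0.1 * rng.standard_normal()   # stochastic oracle, sigma = 0.1
F = make_averaged_oracle(noisy_f, tau=400)
print(F(np.array([0.3, -0.2])))                          # close to f(x) = 0.13
```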
2. Prior Work

The present paper rests firmly on the long string of work by Kannan, Lovász, Vempala, and others (Lovász and Simonovits, 1993; Kannan et al., 1997; Kalai and Vempala, 2006; Lovász and Vempala, 2006a,b, 2007). In particular, we invoke the key lower bound on the conductance of Hit-and-Run from Lovász and Vempala (2006a) and use the simulated annealing technique of Kalai and Vempala (2006). Our analysis extends Hit-and-Run to approximately log-concave distributions, which requires new theoretical results and implementation adjustments. In particular, we propose a unidimensional sampling scheme that mixes fast to a truncated approximately log-concave distribution on the line.

Sampling from β-log-concave distributions was already studied in the early work of Applegate and Kannan (1991) with a discrete random walk based on a discretization of the space. In the case of non-smooth densities and unrestricted support, sampling from approximately log-concave distributions has also been studied in Belloni and Chernozhukov (2009), where the hidden convex function f is quadratic. This additional structure was motivated by the central limit theorem in statistical applications and leads to faster mixing rates. Both works used ball-walk-like strategies. Neither work considered random walks that allow for long steps, like Hit-and-Run.

The present work was motivated by the question of information-based complexity of zeroth-order stochastic optimization. The paper of Agarwal et al. (2013) studies a somewhat harder problem of regret minimization with zeroth-order feedback. Their method is based on the pyramid construction of Nemirovskii and Yudin (1983) and requires O(n^{33} ε^{-2}) noisy function evaluations to achieve a regret (and, hence, an optimization guarantee) of ε. The method of Liang et al. (2014) improved the dependence on the dimension to O*(n^{14}) using a Ball Walk on the epigraph of the function in the spirit of Bertsimas and Vempala (2004). The present paper further reduces this dependence to O*(n^{7.5}) and still achieves the optimal ε^{-2} dependence on the accuracy. The best known lower bound for the problem is Ω(n² ε^{-2}) (see Shamir (2012)).

Other relevant work includes the recent paper of Dyer et al. (2013), where the authors proposed a simple random walk method that requires only approximate function evaluations. As the authors mention, their algorithm only works for smooth functions and sets K with smooth boundaries — assumptions that we would like to avoid. Furthermore, the effective dependence of Dyer et al. (2013) on accuracy is worse than ε^{-2}.

3. Preliminaries

Throughout the paper, the functions F and f satisfy (1) and f is convex. The Lipschitz constant of f with respect to the ℓ_∞ norm will be denoted by L, defined as the smallest number such that |f(x) − f(y)| ≤ L‖x − y‖_∞ for x, y ∈ K. Assume the convex body K ⊆ R^n to be well-rounded in the sense that there exist r, R > 0 such that B₂ⁿ(r) ⊆ K ⊆ B₂ⁿ(R) and R/r ≤ O(√n).¹ For a nonnegative function g, denote by π_g the normalized probability measure induced by g and supported on K.

¹ This condition can be relaxed by applying a pencil construction as in Lovász and Vempala (2007).

Definition 1. A function h : K → R_+ is log-concave if

    h(αx + (1−α)y) ≥ h(x)^α h(y)^{1−α}

for all x, y ∈ K and α ∈ [0, 1]. A function is called β-log-concave for some β ≥ 0 if

    h(αx + (1−α)y) ≥ e^{−β} h(x)^α h(y)^{1−α}

for all x, y ∈ K and α ∈ [0, 1].
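For intuition, Definition 1 can be probed numerically in one dimension. The sketch below (ours; the perturbed quadratic and all numbers are arbitrary illustrations) estimates the smallest β for which the defining inequality holds over random pairs of points:

```python
import numpy as np

def estimate_beta(g, lo, hi, trials=20000, seed=0):
    """Estimate the smallest beta such that
    g(a*x + (1-a)*y) >= exp(-beta) * g(x)**a * g(y)**(1-a)
    for random x, y in [lo, hi] and a in [0, 1] (Definition 1)."""
    rng = np.random.default_rng(seed)
    beta = 0.0
    for _ in range(trials):
        x, y = rng.uniform(lo, hi, size=2)
        a = rng.uniform()
        lhs = np.log(g(a * x + (1 - a) * y))
        rhs = a * np.log(g(x)) + (1 - a) * np.log(g(y))
        beta = max(beta, rhs - lhs)
    return beta

# g = exp(-F) with F a convex quadratic plus a perturbation bounded by 0.1,
# so (by Lemma 3 below) g should be at most 0.2-log-concave.
F = lambda x: x ** 2 + 0.1 * np.sin(40 * x)
g = lambda x: np.exp(-F(x))
print(estimate_beta(g, -1.0, 1.0))   # empirically around 0.2 or smaller
```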
Definition 2. A function g : K → R_+ is ξ-approximately log-concave if there is a log-concave function h : K → R_+ such that

    sup_{x∈K} |log h(x) − log g(x)| ≤ ξ.

Lemma 3. If the function g is β/2-approximately log-concave, then g is β-log-concave.

For one-dimensional functions, the above lemma can be reversed:

Lemma 4 (Belloni and Chernozhukov (2009), Lemma 9). If g is a unidimensional β-log-concave function, then there exists a log-concave function h such that

    e^{−β} h(x) ≤ g(x) ≤ h(x)   for all x ∈ R.

Remark 5 (Gap Between β-Log-Concave Functions and ξ-Approximately Log-Concave Functions). A consequence of Lemma 4 is that β-log-concave functions are equivalent to β-approximately log-concave functions when the domain is unidimensional. However, such equivalence no longer holds in higher dimensions. In the case the domain is R^n, Green et al. (1952) and Cholewa (1984) established that β-log-concave functions are (β/2) log₂(2n)-approximately log-concave. Laczkovich (1999) showed that there are functions for which the factor relating these approximations cannot be less than (1/4) log₂(n/2).

4. Sampling from Approximately Log-Concave Distributions via Hit-and-Run

In this section we analyze the Hit-and-Run procedure to simulate random variables from a distribution induced by an approximately log-concave function. The Hit-and-Run procedure is given in Algorithm 1.

Algorithm 1: Hit-and-Run
Input: a target distribution π_g on K induced by a nonnegative function g; x ∈ dom(g); linear transformation Σ; number of steps m
Output: a point x′ ∈ dom(g) generated by the Hit-and-Run walk
initialization: a starting point x ∈ dom(g)
for i = 1, ..., m do
  1. Choose a random line ℓ that passes through x. The direction is uniform over the surface of the ellipsoid given by Σ acting on the unit sphere.
  2. On the line ℓ, run the unidimensional rejection sampler with π_g restricted to the line (and supported on K) to propose the next step x′.
end

In order to handle approximately log-concave functions we need to address implementation issues as well as the theoretical difficulties caused by deviations from log-concavity, which can include discontinuities. The main implementation difference lies in the unidimensional sampler: a binary search no longer yields the maximum of the function over the line and its endpoints, since β-log-concave functions can be discontinuous and multimodal. We now turn to these questions.
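To make the structure of Algorithm 1 concrete, here is a minimal sketch of a single Hit-and-Run step (ours, not the authors' code). The one-dimensional sampler along the chord is left abstract, and the bisection for the chord endpoints assumes K is bounded with diameter below t_max:

```python
import numpy as np

def hit_and_run_step(x, g, membership, Sigma, sample_on_chord, rng,
                     t_max=1e3, tol=1e-8):
    """One Hit-and-Run step in the spirit of Algorithm 1.

    x               : current point in dom(g)
    g               : nonnegative function inducing the target measure pi_g on K
    membership      : membership oracle for K, membership(y) -> bool
    Sigma           : linear transformation shaping the random direction
    sample_on_chord : 1-D sampler for pi_g restricted to a segment, e.g. the
                      rejection sampler of Algorithm 2 below
    """
    # 1. random direction: Sigma applied to a uniform point on the unit sphere
    u = rng.standard_normal(len(x))
    u = Sigma @ (u / np.linalg.norm(u))

    # 2. find the chord K ∩ {x + t*u : t real} by bisection on each side
    def boundary(sign):
        lo, hi = 0.0, t_max
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if membership(x + sign * mid * u):
                lo = mid
            else:
                hi = mid
        return lo
    t_minus, t_plus = -boundary(-1.0), boundary(+1.0)

    # 3. sample the next point from pi_g restricted to the chord
    t = sample_on_chord(lambda s: g(x + s * u), t_minus, t_plus, rng)
    return x + t * u
```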
4.1. Unidimensional sampling scheme

As a building block of the randomized method for solving the optimization problem (2), we introduce a one-dimensional sampling procedure. Let g be a unidimensional β-log-concave function on a bounded line segment ℓ, and let π_g be the induced normalized measure. The following guarantee will be proved in this section.

Lemma 6. Let g be a β-log-concave function and let ℓ be a bounded line segment on K. Given a target accuracy ε̃ ∈ (0, e^{−2β}/2), Algorithm 2 produces a point X ∈ ℓ with a distribution π̂_{g,ℓ} such that

    d_tv(π_{g,ℓ}, π̂_{g,ℓ}) ≤ 3 e^{2β} ε̃.

Moreover, the method requires O*(1) evaluations of the unidimensional β-log-concave function g if β is O(1).

The proposed method for sampling from the β-log-concave function g is a rejection sampler that requires two initialization steps. We first show how to implement step (a). For the β-log-concave function g, let h be a log-concave function as in Lemma 4 and let L̃ denote the Lipschitz constant of the convex function −log h. In the following two results, the O* notation hides a log(L̃) factor.

Lemma 7 (Initialization Step (a)). Algorithm 3 finds a point p ∈ ℓ that satisfies g(p) ≥ e^{−3β} max_{z∈ℓ} g(z). Moreover, this step requires O*(1) function evaluations.

Algorithm 2: Unidimensional rejection sampler
Input: 1-dimensional β-log-concave function g defined on a bounded segment ℓ = [x, x̄]; accuracy ε̃ > 0
Output: a sample x with distribution π̂_{g,ℓ} close to π_{g,ℓ}
Initialization: (a) compute a point p ∈ ℓ such that g(p) ≥ e^{−3β} max_{z∈ℓ} g(z); (b) given the target accuracy ε̃, find two points e_{−1}, e_1 on the two sides of p such that

    e_{−1} = x if g(x) ≥ (1/2) e^{−β} ε̃ g(p),  and  (1/2) e^{−β} ε̃ g(p) ≤ g(e_{−1}) ≤ ε̃ g(p) otherwise,
    e_1 = x̄ if g(x̄) ≥ (1/2) e^{−β} ε̃ g(p),  and  (1/2) e^{−β} ε̃ g(p) ≤ g(e_1) ≤ ε̃ g(p) otherwise.       (4)

while sample rejected do
  pick x ∼ unif([e_{−1}, e_1]) and pick r ∼ unif([0,1]) independently. If r ≤ g(x)/{g(p) e^{3β}}, accept x and stop. Otherwise, reject x.
end

Algorithm 3: Initialization Step (a)
Input: unidimensional β-log-concave function g defined on a bounded interval ℓ = [x, x̄]
Output: a point p ∈ ℓ such that g(p) ≥ e^{−3β} max_{z∈ℓ} g(z)
while did not stop do
  set x_l = (3/4)x + (1/4)x̄, x_c = (1/2)x + (1/2)x̄ and x_r = (1/4)x + (3/4)x̄
  If |log g(x_l) − log g(x_r)| > β, set [x, x̄] as either [x_l, x̄] or [x, x_r] accordingly
  else if |log g(x_l) − log g(x_c)| > β, set [x, x̄] as either [x_l, x̄] or [x, x_c] accordingly
  else if |log g(x_r) − log g(x_c)| > β, set [x, x̄] as either [x, x_r] or [x_r, x̄] accordingly
  else output p = arg max_{x ∈ {x_l, x_c, x_r}} g(x) and stop
end

Lemma 8 (Initialization Step (b)). Let ℓ = [x, x̄] and p ∈ ℓ. The binary search algorithm finds e_{−1} ∈ [x, p] and e_1 ∈ [p, x̄] such that (4) holds. Moreover, this step requires O*(1) function evaluations.

According to Lemmas 6, 7 and 8, the unidimensional sampling method produces a sample from a distribution that is close to the desired β-log-concave distribution. Furthermore, the method requires a number of queries that is logarithmic in all the parameters.
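The sketch below (ours; simplified) combines the interval-shrinking search of Algorithm 3 with the rejection loop of Algorithm 2. The endpoint search (4) of initialization step (b), which trims the segment so that the acceptance probability stays under control, is omitted, so this illustrates the mechanics rather than the full guarantee of Lemma 6:

```python
import numpy as np

def init_near_max(log_g, lo, hi, beta, max_iter=200):
    """Algorithm 3 (sketch): find p with g(p) >= exp(-3*beta) * max g by
    discarding a quarter of the interval whenever two of the three probe
    values differ by more than beta."""
    for _ in range(max_iter):
        xl, xc, xr = 0.75 * lo + 0.25 * hi, 0.5 * (lo + hi), 0.25 * lo + 0.75 * hi
        vl, vc, vr = log_g(xl), log_g(xc), log_g(xr)
        if abs(vl - vr) > beta:
            lo, hi = (xl, hi) if vr > vl else (lo, xr)
        elif abs(vl - vc) > beta:
            lo, hi = (xl, hi) if vc > vl else (lo, xc)
        elif abs(vr - vc) > beta:
            lo, hi = (xr, hi) if vr > vc else (lo, xr)
        else:
            return max((xl, xc, xr), key=log_g)
    return 0.5 * (lo + hi)   # fallback once the interval is very small

def rejection_sample(log_g, lo, hi, beta, rng):
    """Algorithm 2 (sketch): rejection sampling against the constant envelope
    g(p) * exp(3*beta); initialization step (b) is omitted for brevity."""
    p = init_near_max(log_g, lo, hi, beta)
    log_envelope = log_g(p) + 3.0 * beta   # >= log max g, so acceptance prob <= 1
    while True:
        x = rng.uniform(lo, hi)
        if rng.uniform() <= np.exp(log_g(x) - log_envelope):
            return x
```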
4.2. Mixing time

In this section we analyze the mixing time of the Hit-and-Run algorithm with a β/2-approximately log-concave function g; that is, there exists a log-concave h such that sup_K |log g − log h| ≤ β/2. In particular, this implies that g is β-log-concave, according to Lemma 3. We provide the analysis of Hit-and-Run with the linear transformation Σ = I and remark that the results extend to the other linear transformations employed to round the log-concave distributions.

The mixing time of a geometric random walk can be bounded through the spectral gap of the induced Markov chain. In turn, the spectral gap relates to the so-called conductance, which has been a key quantity in the literature. Consider the transition probability of Hit-and-Run with a density g, namely

    P^g_u(A) = (2 / (n π_n)) ∫_A  g(x) / ( µ_g(u,x) |x − u|^{n−1} ) dx,

where µ_g(u,x) = ∫_{ℓ(u,x)∩K} g(y) dy. Let π_g(x) = g(x) / ∫_{y∈K} g(y) dy be the probability measure induced by the function g. The conductance of a set S ⊂ K with 0 < π_g(S) < 1 is defined as

    φ^g(S) = ∫_{x∈S} P^g_x(K∖S) dπ_g / min{π_g(S), π_g(K∖S)},

and φ^g is the minimum conductance over all measurable sets. The s-conductance is, in turn, defined as

    φ^g_s = inf_{S⊂K, s<π_g(S)≤1/2} ∫_{x∈S} P^g_x(K∖S) dπ_g / (π_g(S) − s).

By definition we have φ^g ≤ φ^g_s for all s > 0.

The following theorem provides an upper bound on the mixing time of the Markov chain based on conductance. Let σ^{(0)} be the initial distribution and σ^{(m)} the distribution of the m-th step of the Hit-and-Run random walk with exact sampling from the distribution π_g restricted to the line.

Theorem 9 (Lovász and Simonovits (1993); Lovász and Vempala (2007), Lemma 9.1). Let 0 < s ≤ 1/2 and let g : K → R_+ be arbitrary. Then for every m ≥ 0,

    d_tv(π_g, σ^{(m)}) ≤ H_0 (1 − (φ^g)²/2)^m   and   d_tv(π_g, σ^{(m)}) ≤ H_s + (H_s/s)(1 − (φ^g_s)²/2)^m,

where H_0 = sup_{x∈K} π_g(x)/σ^{(0)}(x) and H_s = sup{ |π_g(A) − σ^{(0)}(A)| : π_g(A) ≤ s }.

Building on Lovász and Vempala (2006a), we prove the following result, which provides a lower bound on the conductance of Hit-and-Run induced by a log-concave h. The proof of the result below follows the proof of Theorem 3.7 in Lovász and Vempala (2006a), with modifications to allow unbounded sets K without truncating the random walk.

Theorem 10 (Conductance Lower Bound for Log-concave Measures with Unbounded Support). Let h be a log-concave function in R^n such that the level set of measure 1/8 contains a ball of radius r. Define R = (E_h ‖X − z_h‖²)^{1/2}, where z_h = E_h X and X is sampled from the log-concave measure induced by h. Then for any subset S with π_h(S) = p ≤ 1/2, the conductance of Hit-and-Run satisfies

    φ^h(S) ≥ 1 / ( C₁ (nR/r) log²( nR/(rp) ) ),

where C₁ > 0 is a universal constant.

Although Theorem 10 is new, very similar conductance bounds allowing for unbounded sets were established before. Indeed, in Section 3.3 of Lovász and Vempala (2006a) the authors discuss the case of unbounded K and propose to truncate the set to its effective diameter, using the fact that this distribution would be close to the distribution over the unrestricted set. Such truncation needs to be enforced, which requires changing the implementation of the algorithm and leads to another (small) layer of approximation errors. Theorem 10 avoids this explicit truncation; the truncation is done implicitly in the proof only. We note that when applying the simulated annealing technique, even if we start with a bounded set, by diminishing the temperature we are effectively stretching the sets, which would essentially require handling unbounded sets.

We now argue that the conductance of Hit-and-Run with β-approximately log-concave measures can be related to the conductance with log-concave measures.

Theorem 11. Let g be a β/2-approximately log-concave measure and h be any log-concave function with the property in Definition 2. Then the conductance and s-conductance of the random walk induced by g are lower bounded as

    φ^g ≥ e^{−3β} φ^h   and   φ^g_s ≥ e^{−3β} φ^h_{s/e^β}.

We apply Theorem 9 to show contraction of σ^{(m)} to π_g in terms of the total variation distance.

Theorem 12 (Mixing Time for Approximately Log-concave Measures). Let π_g be the stationary measure associated with the Hit-and-Run walk based on a β/2-approximately log-concave function g, and let M = ‖σ^{(0)}/π_g‖ = ∫ (dσ^{(0)}/dπ_g) dσ^{(0)}. There is a universal constant C < ∞ such that for any γ ∈ (0, 1/2), if

    m ≥ C n² (e^{6β} R²/r²) log⁴( e^β M n R / (r γ²) ) log(M/γ),

then m steps of the Hit-and-Run random walk based on g yield d_tv(π_g, σ^{(m)}) ≤ γ.

Remark 13. The value M in Theorem 12 bounds the impact of the initial distribution σ^{(0)}, which can potentially be far from the stationary distribution. In the simulated annealing application of the next section, we show in Lemma 15 that we can "warm start" the chain by carefully picking an initial distribution such that M = O(1).
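As a rough, informal accounting (ours, not spelled out in the text) of where the e^{6β} factor in Theorem 12 comes from: by Theorem 11 the conductance degrades by at most e^{−3β} when passing from h to g, and the number of steps in Theorem 9 scales as 1/(φ^g)², so

    1/(φ^g)² ≤ e^{6β}/(φ^h)² ≤ C₁² e^{6β} (nR/r)² log⁴( nR/(rp) )

by Theorem 10, matching the e^{6β} n² R²/r² log⁴(·) dependence of the bound on m.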
Theorem 12 shows γ-closeness between the distribution σ^{(m)} and the corresponding stationary distribution. However, the stationary distribution is not exactly the one induced by g, since the unidimensional sampling procedure described earlier truncates the distribution to improve the mixing time. The following theorem shows that these concerns are overtaken by the geometric mixing of the random walk. Let π̂_{g,ℓ} denote the distribution of the unidimensional sampling scheme (Algorithm 2) along the line ℓ, and let π_{g,ℓ} denote the distribution proportional to g along the line ℓ.

Theorem 14. Let σ̂^{(m)} denote the distribution of Hit-and-Run with the unidimensional sampling scheme (Algorithm 2) after m steps. For any 0 < s < 1/2, the algorithm maintains

    d_tv(σ̂^{(m)}, π_g) ≤ 2 d_tv(σ̂^{(0)}, σ^{(0)}) + m sup_{ℓ⊂K} d_tv(π̂_{g,ℓ}, π_{g,ℓ}) + { H_s + (H_s/s)(1 − (φ^g_s)²/2)^m },

where the supremum is taken over all lines ℓ in K. In particular, for a target accuracy γ ∈ (0, 1/e), if d_tv(σ̂^{(0)}, σ^{(0)}) ≤ γ/8, s is such that H_s ≤ γ/4, m ≥ {2/(φ^g_s)²} log({H_s/s}{4/γ}), and the precision of the unidimensional sampling scheme is ε̃ = γ e^{−2β}/{12m}, then

    d_tv(σ̂^{(m)}, π_g) ≤ γ.

5. Optimization via Simulated Annealing

We now turn to the main goal of the paper: to exhibit a method that produces an ε-minimizer of the nearly convex function F in expectation. Fix the pair f, F with the property (1), and define a series of functions

    h_i(x) = exp(−f(x)/T_i),   g_i(x) = exp(−F(x)/T_i)

for a chain of temperatures {T_i, i = 1, ..., K} to be specified later. It is immediate that the h_i are log-concave. Lemma 3, in turn, implies that the g_i are (2ε/(nT_i))-log-concave.

We now introduce the simulated annealing method, which proceeds in epochs and employs the Hit-and-Run procedure with the unidimensional sampler introduced in the previous section. The overall simulated annealing procedure is identical to the algorithm of Kalai and Vempala (2006), with differences in the analysis arising from F being only approximately convex.

Algorithm 4: Simulated annealing
Input: a series of temperatures {T_i, 1 ≤ i ≤ K}, K = number of epochs, x ∈ int K
Output: a candidate point x for which F(x) ≤ min_{y∈K} F(y) + ε holds
initialization: well-rounded convex body K and {X_0^j, 1 ≤ j ≤ N} i.i.d. samples from the uniform measure on K, N = number of strands; set K_0 = K and Σ_0 = I
while i-th epoch, 1 ≤ i ≤ K do
  1. calculate the i-th rounding linear transformation T_i based on {X_{i−1}^j, 1 ≤ j ≤ N} and let Σ_i = T_i ∘ Σ_{i−1}
  2. draw N i.i.d. samples {X_i^j, 1 ≤ j ≤ N} from the measure π_{g_i} using Hit-and-Run with linear transformation Σ_i and with the N warm-starting points {X_{i−1}^j, 1 ≤ j ≤ N}
end
output x = argmin_{1≤j≤N, 1≤i≤K} F(X_i^j).
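The following sketch mirrors the structure of Algorithm 4 (temperature schedule, rounding update, warm-started strands). It is our illustration rather than the authors' implementation: hit_and_run and update_rounding are assumed helpers (the former running Algorithm 1 with the unidimensional sampler and a membership oracle for K), and the warm-start initialization is a placeholder for i.i.d. uniform samples on K.

```python
import numpy as np

def simulated_annealing(F, x0, n, eps, N, m, hit_and_run, update_rounding, rng):
    """Sketch of Algorithm 4.  hit_and_run(g, x, Sigma, m) is assumed to run m
    Hit-and-Run steps targeting the measure induced by g; update_rounding(points)
    is assumed to return the next rounding linear transformation."""
    K_epochs = int(np.ceil(np.sqrt(n) * np.log(n / eps)))
    temperatures = [(1.0 - 1.0 / np.sqrt(n)) ** i for i in range(1, K_epochs + 1)]
    Sigma = np.eye(n)
    # Placeholder for N i.i.d. draws from the uniform measure on K.
    strands = [np.array(x0, dtype=float) for _ in range(N)]
    best = min(strands, key=F)
    for T in temperatures:
        g = lambda x, T=T: np.exp(-F(x) / T)              # target density proportional to exp(-F/T)
        Sigma = update_rounding(strands) @ Sigma          # step 1: rounding update
        strands = [hit_and_run(g, x, Sigma, m) for x in strands]  # step 2: warm-started walks
        best = min(strands + [best], key=F)
    return best
```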
Before stating the optimization guarantee of the above simulated annealing procedure, we prove the warm-start property of the distributions between successive epochs and the rounding guarantee given by the N samples.

5.1. Warm start and mixing

We need to prove that the measures at successive temperatures are not too far apart in the ℓ_2 sense, so that the samples from the previous epoch can be treated as a warm start for the next epoch. The following result is an extension of Lemma 6.1 in (Kalai and Vempala, 2006) to β-log-concave functions.

Lemma 15. Let g(x) = exp(−F(x)) be a β-log-concave function. Let µ_i be a distribution with density proportional to exp{−F(x)/T_i}, supported on K. Let T_i = T_{i−1}(1 − 1/√n). Then

    ‖µ_i/µ_{i+1}‖ ≤ C_γ = 5 exp(2β/T_i).

Next we account for the impact of using the final distribution from the previous epoch σ^{(0)} as a "warm start."

Theorem 16. Fix a target accuracy γ ∈ (0, 1/e) and let g be a β/2-approximately log-concave function in R^n. Suppose the simulated annealing algorithm (Algorithm 4) is run for K = √n log(1/ρ) epochs with temperature parameters T_i = (1 − 1/√n)^i, 0 ≤ i ≤ K. If the Hit-and-Run walk with the unidimensional sampling scheme (Algorithm 2) is run for the m = O*(n³) steps prescribed in Theorem 12, the algorithm maintains

    d_tv(σ̂_i^{(m)}, π_{g_i}) ≤ eγ                                   (5)

at every epoch i, where σ̂_i^{(m)} is the distribution of the m-th step of Hit-and-Run. Here, m depends polylogarithmically on ρ^{-1}.

5.2. Optimization guarantee

We prove an extension of Lemma 4.1 in (Kalai and Vempala, 2006):

Theorem 17. Let f be a convex function. Let X be chosen according to a distribution with density proportional to exp{−f(x)/T}. Then

    E_f f(X) − min_{x∈K} f(x) ≤ (n+1)T.

Furthermore, if F is such that |F − f|_∞ ≤ ρ, then for X chosen from a distribution with density proportional to exp{−F(x)/T}, we have

    E_F f(X) − min_{x∈K} f(x) ≤ (n+1)T · exp(2ρ/T).

The above theorem implies that the final temperature T_K in the simulated annealing procedure needs to be set as T_K = ε/n. This, in turn, leads to K = √n log(n/ε) epochs. The oracle complexity of optimizing F is then, informally,

    O*(n³) queries per sample × O*(n) parallel strands × O*(√n) epochs = O*(n^{4.5}).

The following corollary summarizes the computational complexity result:

Corollary 18. Suppose F is approximately convex and |F − f| ≤ ε/n as in (1). The simulated annealing method with K = √n log(n/ε) epochs produces a random point X such that

    E f(X) − min_{x∈K} f(x) ≤ ε,    E F(X) − min_{x∈K} F(x) ≤ 2ε.

Furthermore, the number of oracle queries required by the method is O*(n^{4.5}).
6. Stochastic Convex Zeroth-Order Optimization

Let f : K → R be the unknown convex L-Lipschitz function we aim to minimize. Within the model of convex optimization with a stochastic zeroth-order oracle O, the information returned upon a query x ∈ K is f(x) + ε_x, where ε_x is zero-mean noise. We shall assume that the noise is sub-Gaussian with parameter σ; that is, E exp(λε_x) ≤ exp(σ²λ²/2). It is easy to see from Chernoff's bound that for any t ≥ 0, P(|ε_x| ≥ σt) ≤ 2 exp(−t²/2). We can decrease the noise level by repeatedly querying at x. Fix τ > 0, to be determined later. The average ε̄_x of τ observations is concentrated as P(|ε̄_x| ≥ σt/√τ) ≤ 2 exp(−t²/2). To use the randomized optimization method developed in this paper, we view f(x) + ε̄_x as the value of F(x) returned upon a single query at x. Since the randomized method does not re-visit x with probability 1, the function F is "well-defined."

Let us make the above discussion more precise by describing three oracles. Oracle O′ draws noise ε_x for each x ∈ K prior to optimization. Upon querying x ∈ K, the oracle deterministically returns f(x) + ε_x, even if the same point is queried twice. Given that the optimization method does not query the same point (with probability one), this oracle is equivalent to an oblivious version of the oracle O of the original zeroth-order stochastic optimization problem.

To define O_α, let N_α be an α-net in ℓ_∞, which can be taken as a box grid of K. If K ⊆ R·B_∞, the size of the net is at most (R/α)^n. The oracle draws ε_x for each element x ∈ N_α, independently. Upon a query x′ ∈ K, the oracle deterministically returns f(x) + ε_x for the x ∈ N_α closest to x′. Note that O_α is no more powerful than O′, since the learner only obtains information on the α-net. Oracle O_α^τ is a small modification of O_α, modeling a repeated query at the same point as described earlier. Parametrized by τ (the number of queries at the same point), oracle O_α^τ draws random variables ε_x for each x ∈ N_α, but the sub-Gaussian parameter of ε_x is σ/√τ. The optimization algorithm pays for τ oracle calls upon a single call to O_α^τ.

We argued that O_α^τ is no more powerful than the original zeroth-order oracle, given that the algorithm does not revisit the same point. In the rest of the section, we will work with O_α^τ as the oracle model. For any x, denote the projection onto N_α by P_{N_α}(x). Define F : K → R as

    F(x) = f(P_{N_α}(x)) + ε_{P_{N_α}(x)},

where P_{N_α}(x) is the point of N_α closest to x in the ℓ_∞ sense. Clearly, |F − f|_∞ ≤ max_{x∈N_α} |ε_x| + αL, where L is the (ℓ_∞) Lipschitz constant. Since (ε_x)_{x∈N_α} form a finite collection of sub-Gaussian random variables with sub-Gaussian parameter σ/√τ, we have that with probability at least 1 − δ,

    max_{x∈N_α} |ε_x| ≤ σ √( (2n log(R/α) + 2 log(1/δ)) / τ ).

From now on, we condition on this event, which we call E. To guarantee (1), we set the above upper bound equal to ε/(2n) = αL, where τ is the parameter from the oracle O_α^τ. Solving for τ and α:

    τ = σ²n²(8n log(R/α) + 8 log(1/δ)) / ε² = σ²n²(8n log(2LRn/ε) + 8 log(1/δ)) / ε² = O*(n³/ε²)

and α = ε/(2Ln). Note that L affects τ only logarithmically and, in particular, we could have defined the Lipschitz constant with respect to ℓ_2. We also observe that the oracle model depends on α and, hence, on the target accuracy ε. However, because the dependence on α is only logarithmic, we can take α to be much smaller than ε.

Together with the O*(n^{4.5}) oracle complexity proved in the previous section for optimizing F, the choice of τ = O*(n³ε^{-2}) evaluations per time step yields a total oracle complexity of

    O*(n^{7.5} ε^{-2})

for the problem of stochastic convex optimization with zeroth-order information. We observe that a factor of n² in the oracle complexity comes from the union bound over the exponential-sized discretization of the set. This (somewhat artificial) factor can be reduced or removed under additional assumptions on the noise, such as a draw from a Gaussian process with spatial dependence over K.
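As a worked instance of these parameter choices (our sketch; the numeric inputs are illustrative, not values used in the paper):

```python
import numpy as np

def oracle_parameters(n, eps, sigma, L, R, delta):
    """Net resolution alpha and repetition count tau per the display above:
    alpha = eps / (2*L*n) and
    tau   = sigma^2 * n^2 * (8*n*log(R/alpha) + 8*log(1/delta)) / eps^2."""
    alpha = eps / (2.0 * L * n)
    tau = sigma**2 * n**2 * (8 * n * np.log(R / alpha) + 8 * np.log(1.0 / delta)) / eps**2
    return alpha, int(np.ceil(tau))

print(oracle_parameters(n=10, eps=0.1, sigma=1.0, L=1.0, R=1.0, delta=0.01))
# -> alpha = 5e-3 and tau on the order of 10^6 repeated queries per point
```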
