
Parallel algorithms and probability of large deviation for stochastic optimization problems

Pavel Dvurechensky¹, Alexander Gasnikov², Anastasia Lagunovskaya³

arXiv:1701.01830v2 [math.OC] 18 Jan 2017

Abstract

We consider convex stochastic optimization problems under different assumptions on the properties of the available stochastic subgradient. It is known that, if the value of the objective function is available, one can obtain, in parallel, several independent approximate solutions in terms of the objective residual expectation. Then, choosing the solution with the minimum function value, one can control the probability of large deviation of the objective residual. On the contrary, in this short paper, we address the situation when the value of the objective function is unavailable or is too expensive to calculate. Under a "light-tail" assumption for the stochastic subgradient, and in the general case with moderate large deviation probability, we show that parallelization combined with averaging gives bounds for the probability of large deviation similar to those of a serial method. Thus, in these cases, one can benefit from parallel computations and reduce the computational time without loss in the solution quality.

Keywords: Stochastic Convex Optimization, Probability of Large Deviation, Mirror Descent, Parallel Algorithm

Mathematics Subject Classification: 90C15, 90C25

1 Introduction

We consider the following general stochastic optimization problem over a convex compact set $Q$:

$$\min_{x \in Q \subset E} \left\{ f(x) := \mathbb{E}_{\xi}\left[ f(x,\xi) \right] \right\}, \tag{1}$$

where $E$ is a finite-dimensional real vector space, $\xi$ is a random vector, $f(x,\xi)$ is a closed convex function w.r.t. $x$ for a.e. $\xi$, and $x$ and $\xi$ are independent. Under these assumptions, this problem is a convex optimization problem.

Our main goal is to approximately solve this problem using some algorithm. Usually, in the stochastic optimization literature [1, 2], two measures for the quality of an approximate solution $\bar{x}$ are considered.
The first is the expectation of the objective residual. In this case, $\bar{x}$ is an $\varepsilon$-solution of (1) for $\varepsilon > 0$ iff $\mathbb{E} f(\bar{x}) - f_* \le \varepsilon$, where $f_*$ is the optimal value in (1) and the expectation is taken with respect to all the randomness arising in the algorithmic process. The second is a bound for the probability of large deviation of the objective residual. In this case, $\bar{x}$ is an $(\varepsilon,\sigma)$-solution of (1) for $\varepsilon > 0$, $\sigma \in (0,1)$ iff $\mathbb{P}\{ f(\bar{x}) - f_* > \varepsilon \} \le \sigma$. We mainly focus on the latter quality measure in this paper.

It is known that, if the value of the objective function is available (e.g. in randomized methods [3]), one can obtain in parallel a number of independent $\varepsilon$-solutions that is logarithmic in $\sigma^{-1}$. Then, the solution with the minimum function value is an $(\varepsilon,\sigma)$-solution. Nevertheless, the value of the objective function can be unavailable or too expensive to calculate or approximate. The latter is easy to imagine since, for $\xi \in \mathbb{R}^p$, the computational effort for calculating $\bar{f}$ s.t. $|\bar{f} - f(x)| \le \delta$ can amount to $O(\delta^{-p})$ calculations of $f(x,\xi)$ at different $\xi$. Our goal is to propose a technique which allows one to obtain an $(\varepsilon,\sigma)$-solution based on a number of $\varepsilon$-solutions computed in parallel, without calculating the value of the function $f(x)$. Our approach is based on the Stochastic Mirror Descent algorithm [2].

¹ corresponding author, [email protected], Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstr. 39, 10117 Berlin, Germany; Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, build. 1, Moscow, Russia 127051
² [email protected], Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russia 141700; Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, build. 1, Moscow, Russia 127051
³ [email protected], Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russia 141700
It turns out that, under some mild assumptions, an $\varepsilon$-solution to (1) obtained by Stochastic Mirror Descent is also an $(\tilde{\varepsilon},\sigma)$-solution. We use this fact and calculate in parallel a number of independent $\varepsilon$-solutions that is logarithmic in $\sigma^{-1}$, average them, and prove that this average is an $(\varepsilon,\sigma)$-solution. Thus, we can benefit from parallelization and reduce the computation time without any loss in the solution quality.

2 Stochastic Mirror Descent

This section is devoted to the description of Stochastic Mirror Descent (SMD) [2, 3] and its convergence properties in terms of the expectation of the objective residual and also in terms of the probability of large deviation of this residual. These convergence results provide the basis of our approach for constructing an $(\varepsilon,\sigma)$-solution by parallelization.

Let us choose some norm $\|\cdot\|$ on $E$ and denote the conjugate norm by $\|\cdot\|_*$. We assume that, at any point $x \in Q$, a stochastic subgradient $\nabla_x f(x,\xi)$ of $f(x)$ is available and satisfies

$$\mathbb{E}_{\xi}\left[ \nabla_x f(x,\xi) \right] \in \partial f(x), \qquad \mathbb{E}_{\xi}\left[ \|\nabla_x f(x,\xi)\|_*^2 \right] \le M^2 \tag{2}$$

for some constant $M > 0$. We choose a prox-function $d(x)$ which is 1-strongly convex in $\|\cdot\|$. Let $x^0 = \arg\min_{x \in Q} d(x)$. W.l.o.g. we assume that $d(x^0) = 0$. The algorithm uses Bregman's divergence $V_z(x) = d(x) - d(z) - \langle \nabla d(z), x - z \rangle$. Let $x_*$ be a solution of (1), $R$ be a number s.t. $V_{x^0}(x_*) \le R^2$, and $\bar{R}$ be a number s.t. $\max_{x \in Q} V_x(x_*) \le \bar{R}$.

Stochastic Mirror Descent [2, 3] iterates as follows, starting from $x^0 \in Q$,

$$x^{k+1} = \mathrm{Mirr}_{x^k}\left( h \nabla_x f(x^k,\xi^k) \right), \qquad \mathrm{Mirr}_{x^k}(v) := \arg\min_{x \in Q} \left\{ \langle v, x - x^k \rangle + V_{x^k}(x) \right\}, \tag{3}$$

where $h > 0$ is the stepsize and $\{\xi^k\}_{k \ge 0}$ is an i.i.d. sample of $\xi$. The main property of the SMD step [3] is

$$2 V_{x^{k+1}}(x) \le 2 V_{x^k}(x) + 2h \left\langle \nabla_x f(x^k,\xi^k), x - x^k \right\rangle + h^2 \left\| \nabla_x f(x^k,\xi^k) \right\|_*^2, \quad \forall x \in Q.$$
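As an illustration of iteration (3), here is a minimal Python sketch for the Euclidean setup. It assumes $d(x) = \frac{1}{2}x^2$ on $Q = [-1, 1]$, in which case the mirror step reduces to a projected subgradient step. The toy problem $f(x,\xi) = |x - \xi|$ with $\xi$ uniform on $[0,1)$, the function names, and the constants $M = 1$, $R^2 = 1/8$ are assumptions made for this example and are not taken from the paper. The step size and the number of iterations follow (4) and (5), which are derived later in this section.

```python
import math
import random


def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto Q = [lo, hi]."""
    return min(max(x, lo), hi)


def stochastic_subgradient(x, xi):
    """A subgradient in x of f(x, xi) = |x - xi|; its magnitude is at most M = 1."""
    return 1.0 if x > xi else -1.0


def smd_average(eps, M=1.0, R_sq=0.125, seed=0):
    """Run Euclidean SMD and return (averaged point, number of steps N).

    R_sq bounds V_{x^0}(x_*) = 0.5 * (0.5 - 0)^2 for the toy problem (assumed).
    """
    rng = random.Random(seed)
    n_steps = math.ceil(2.0 * M**2 * R_sq / eps**2)       # N from (5)
    h = math.sqrt(R_sq) / M * math.sqrt(2.0 / n_steps)    # h from (4)
    x = 0.0            # x^0 = argmin of d(x) = x^2 / 2 over Q
    total = 0.0
    for _ in range(n_steps):
        total += x                     # accumulate x^0, ..., x^{N-1}
        xi = rng.random()              # i.i.d. sample of xi ~ U[0, 1)
        # With d(x) = x^2 / 2, Mirr_x(v) is the projection of x - v onto Q.
        x = project(x - h * stochastic_subgradient(x, xi))
    return total / n_steps, n_steps


def true_objective(x):
    """Closed form of f(x) = E|x - xi| on Q = [-1, 1]; available only in this toy problem."""
    if x <= 0.0:
        return 0.5 - x
    return 0.5 * x**2 + 0.5 * (1.0 - x)**2


x_bar, n_steps = smd_average(eps=0.05)
residual = true_objective(x_bar) - true_objective(0.5)   # f_* = f(0.5) = 0.25
print(n_steps, x_bar, residual)
```

The closed form of $f$ is used here only to inspect the result on the toy problem; the iteration itself touches nothing but stochastic subgradients.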
Further, using convexity of $f(x)$, for any $\nabla f(x^k) \in \partial f(x^k)$ and $x \in Q$,

$$f(x^k) - f(x) \le \left\langle \nabla f(x^k), x^k - x \right\rangle \le \left\langle \nabla f(x^k) - \nabla_x f(x^k,\xi^k), x^k - x \right\rangle + \frac{1}{h}\left( V_{x^k}(x) - V_{x^{k+1}}(x) \right) + \frac{h}{2} \left\| \nabla_x f(x^k,\xi^k) \right\|_*^2.$$

Taking conditional expectation w.r.t. $\xi^1, \dots, \xi^{k-1}$ and using (2), one obtains

$$f(x^k) - f(x) \le \frac{1}{h}\left( V_{x^k}(x) - \mathbb{E}\left[ V_{x^{k+1}}(x) \mid \xi^1, \dots, \xi^{k-1} \right] \right) + \frac{h}{2} \underbrace{\mathbb{E}\left[ \left\| \nabla_x f(x^k,\xi^k) \right\|_*^2 \,\middle|\, \xi^1, \dots, \xi^{k-1} \right]}_{\overset{(2)}{\le} M^2}.$$

Since $\{\xi^k\}_{k \ge 0}$ is an i.i.d. sample, taking the full expectation of both sides of these inequalities for $k = 0, \dots, N-1$, summing them up and taking $x = x_*$ gives, by convexity of $f(x)$,

$$\mathbb{E}\left[ f(\bar{x}^N) \right] - f_* \le \frac{1}{N} \sum_{k=0}^{N-1} \mathbb{E}\left[ f(x^k) \right] - f_* \le \frac{1}{hN} V_{x^0}(x_*) + \frac{M^2 h}{2} \le \sqrt{\frac{2 M^2 R^2}{N}},$$

where

$$R \text{ is s.t. } V_{x^0}(x_*) \le R^2, \qquad \bar{x}^N := \frac{1}{N} \sum_{k=0}^{N-1} x^k, \qquad h = \frac{R}{M} \sqrt{\frac{2}{N}}. \tag{4}$$

Choosing

$$N = \left\lceil \frac{2 M^2 R^2}{\varepsilon^2} \right\rceil, \tag{5}$$

we obtain that $\bar{x}^N$ satisfies $\mathbb{E}\left[ f(\bar{x}^N) \right] - f_* \le \varepsilon$ and, hence, is an $\varepsilon$-solution. Note that this bound for $N$ is optimal [3] up to a multiplicative constant factor for the class of convex stochastic programming problems (1) with a.e. bounded stochastic subgradients.

It turns out that it is possible to prove bounds for the probability of large deviation of $f(\bar{x}^N) - f_*$.

Proposition 1 ([2, 5, 6]). Assume that one of the following assumptions holds.
a) $\|\nabla_x f(x,\xi)\|_* \le M$ for a.e. $\xi$;
b) $\mathbb{E}_{\xi}\left[ \exp\left( \|\nabla_x f(x,\xi)\|_*^2 / M^2 \right) \right] \le \exp(1)$ and $\ln \sigma^{-1} \ll N$;
c) There exists some $\alpha > 2$ s.t., for all $t \ge 0$, $\mathbb{P}\left\{ \|\nabla_x f(x,\xi)\|_2^2 / M^2 \ge t \right\} \le \frac{1}{(t+1)^{\alpha}}$, and $\sigma^{-1/(\alpha-1)} \ll N$.
Then the point $\bar{x}^N$ generated by SMD (3), (4) satisfies

$$\mathbb{P}\left\{ f(\bar{x}^N) - f_* \le \frac{C_1 M}{\sqrt{N}} \left( R + C_2 \bar{R} \sqrt{\ln(1/\sigma)} \right) \right\} \ge 1 - \sigma, \tag{6}$$

where in case a) $C_1 = \sqrt{2}$, $C_2 = 2\sqrt{2}$; in case b) $C_1 = C_2 = 2\sqrt{2}$; in case c) $C_1 = C_1(\alpha)$, $C_2 = 1$.

Corollary 1. Let any of the three assumptions of Proposition 1 hold. Choose $N = \left\lceil \frac{C M^2 \bar{R}^2}{\varepsilon^2} \right\rceil$, where the constant $C$ depends on $C_1, C_2$. Then the point $\bar{x}^N$ generated by SMD (3), (4) satisfies, for any $c \ge 0$,

$$\mathbb{P}\left\{ f(\bar{x}^N) - f_* \ge c \right\} \le \mathbb{P}\{ \eta \ge c \}, \tag{7}$$

where $\eta \in \mathcal{N}(\varepsilon, \varepsilon^2)$ is a normal random variable with mean $\varepsilon$ and variance $\varepsilon^2$.

3 Parallelization and bounds for probability of large deviation

In this section, we first discuss a known way to obtain an $(\varepsilon,\sigma)$-solution using a number of $\varepsilon$-solutions calculated in parallel. Then, we suggest a new way of doing this without calculating the value of the objective $f(x)$, and state and prove the main result.

Assume that $\bar{x}^N$ is an $\varepsilon/2$-solution, obtained by SMD (3), (4) with $N = \left\lceil \frac{8 M^2 R^2}{\varepsilon^2} \right\rceil$. Then, using the Markov inequality [5], we obtain

$$\mathbb{P}\left\{ f(\bar{x}^N) - f_* \ge \varepsilon \right\} \le \frac{\mathbb{E}\left[ f(\bar{x}^N) - f_* \right]}{\varepsilon} \le \frac{1}{2}.$$

If one calculates in parallel $K = \left\lceil \log_2(\sigma^{-1}) \right\rceil$ independent SMD $\varepsilon/2$-solutions $\{\bar{x}^{N,i}\}_{i=1}^{K}$ and chooses the one, $\bar{x}^N_{\min}$, which minimizes $f(\bar{x}^{N,i})$, then, in total $\left\lceil \frac{8 M^2 R^2}{\varepsilon^2} \right\rceil \left\lceil \log_2(\sigma^{-1}) \right\rceil$ calculations of the stochastic subgradient, one obtains $\mathbb{P}\left\{ f(\bar{x}^N_{\min}) - f_* \ge \varepsilon \right\} \le \sigma$. Thus, $\bar{x}^N_{\min}$ is an $(\varepsilon,\sigma)$-solution of (1). The crucial point here is the possibility to calculate the value of the function $f(x)$.

We now suggest a technique which does not rely on the assumption that the value of the function $f(x)$ is available. This assumption may not hold [1] in many real stochastic programming problems, e.g. in the maximum likelihood approach used in mathematical statistics.

Theorem 1.
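Let any of the three assumptions of Proposition 1 hold. Let $K = \left\lceil 2 \ln(\sigma^{-1}) \right\rceil$ and $\{\bar{x}^{N,i}\}_{i=1}^{K}$ be independent points obtained by SMD (3), (4) with $N = \left\lceil \frac{4 C M^2 \bar{R}^2}{\varepsilon^2} \right\rceil$. Then the point $\bar{x}^K = \frac{1}{K} \sum_{i=1}^{K} \bar{x}^{N,i}$ is an $(\varepsilon,\sigma)$-solution of (1).

Proof. Using Corollary 1 with $c = \varepsilon$, we obtain, for all $i = 1, \dots, K$,

$$\mathbb{P}\left\{ f(\bar{x}^{N,i}) - f_* \ge \varepsilon \right\} \le \mathbb{P}\{ \eta_i \ge \varepsilon \}, \qquad \eta_i \in \mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4} \right). \tag{8}$$

By convexity of $f(x)$, since $\{\bar{x}^{N,i}\}_{i=1}^{K}$ are i.i.d. and $\{\eta_i\}_{i=1}^{K}$ are i.i.d.,

$$\mathbb{P}\left\{ f(\bar{x}^K) - f_* \ge \varepsilon \right\} \le \mathbb{P}\left\{ \frac{1}{K} \sum_{i=1}^{K} \left( f(\bar{x}^{N,i}) - f_* \right) \ge \varepsilon \right\} \overset{(8)}{\le} \tag{9}$$

$$\mathbb{P}\left\{ \frac{1}{K} \sum_{i=1}^{K} \eta_i \ge \frac{\varepsilon}{2} + \frac{\varepsilon}{2} \right\} = \mathbb{P}\left\{ \frac{1}{K} \sum_{i=1}^{K} \eta_i - \frac{\varepsilon}{2} \ge \frac{\varepsilon}{2} \right\} = \mathbb{P}\left\{ \eta - \frac{\varepsilon}{2} \ge \frac{\varepsilon}{2} \sqrt{K} \right\} \le \sigma, \tag{10}$$

where $\eta \in \mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4} \right)$. Here we used well-known facts about sums of independent normal random variables [4],

$$\mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4} \right) + \dots + \mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4} \right) \overset{d}{=} \mathcal{N}\left( \frac{K\varepsilon}{2}, \frac{K\varepsilon^2}{4} \right), \qquad \frac{1}{K} \mathcal{N}\left( \frac{K\varepsilon}{2}, \frac{K\varepsilon^2}{4} \right) \overset{d}{=} \mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4K} \right),$$

the fact that $K = \left\lceil 2 \ln(\sigma^{-1}) \right\rceil$ and that, for $\eta \in \mathcal{N}\left( \frac{\varepsilon}{2}, \frac{\varepsilon^2}{4} \right)$, $\mathbb{P}\left\{ \eta - \frac{\varepsilon}{2} \ge \frac{\varepsilon}{2} \sqrt{2 \ln(\sigma^{-1})} \right\} \le \sigma$. Thus, the point $\bar{x}^K$ is an $(\varepsilon,\sigma)$-solution of (1).

To sum up, under some mild assumptions, but without calculation of the objective function value, we propose a way to obtain an $(\varepsilon,\sigma)$-solution using a number of $\varepsilon$-solutions calculated in parallel. This approach allows one to reduce the computational time without any loss in the solution quality. At the same time, we answer a question of Yu. Nesterov [7]. The question can be stated as follows. When is the quality of a solution of problem (1) obtained by one wise old man, thinking for $\Theta\left( M^2 \bar{R}^2 \ln(\sigma^{-1}) / \varepsilon^2 \right)$ days, the same as that obtained by $\Theta\left( \ln(\sigma^{-1}) \right)$ experts, each thinking for $\Theta\left( M^2 \bar{R}^2 / \varepsilon^2 \right)$ days? Our answer is that the quality is equivalent under any of the three assumptions of Proposition 1.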
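The procedure of Theorem 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the Euclidean setup on $Q = [-1, 1]$ with the toy problem $f(x,\xi) = |x - \xi|$, $\xi$ uniform on $[0,1)$, and the values `C = 1.0`, `R_bar = 1.5`, `R = sqrt(1/8)` and all function names are hypothetical choices for this example (the paper only states that $C$ depends on $C_1, C_2$). The sketch also checks numerically the Gaussian tail bound used in the last step of the proof, $\mathbb{P}\{Z \ge \sqrt{K}\} \le \sigma$ for a standard normal $Z$ and $K = \lceil 2 \ln(\sigma^{-1}) \rceil$.

```python
import math
import random


def num_parallel_runs(sigma):
    """K = ceil(2 ln(1/sigma)), as in Theorem 1."""
    return math.ceil(2.0 * math.log(1.0 / sigma))


def gaussian_tail(t):
    """P{Z >= t} for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def smd_run(n_steps, h, rng):
    """One Euclidean SMD run for min over [-1, 1] of E|x - xi|, xi ~ U[0, 1) (toy problem)."""
    x, total = 0.0, 0.0
    for _ in range(n_steps):
        total += x
        g = 1.0 if x > rng.random() else -1.0     # stochastic subgradient, |g| <= M = 1
        x = min(max(x - h * g, -1.0), 1.0)        # projected step = Euclidean mirror step
    return total / n_steps


def averaged_solution(eps, sigma, M=1.0, R=math.sqrt(0.125), R_bar=1.5, C=1.0, seed=0):
    """Average K independent SMD outputs; C and R_bar are assumed values."""
    K = num_parallel_runs(sigma)
    n_steps = math.ceil(4.0 * C * M**2 * R_bar**2 / eps**2)
    h = R / M * math.sqrt(2.0 / n_steps)          # step size (4)
    # The K runs are independent and could execute on K workers; here they run in a loop.
    runs = [smd_run(n_steps, h, random.Random(seed + i)) for i in range(K)]
    return sum(runs) / K, K, n_steps


x_bar_K, K, n_steps = averaged_solution(eps=0.1, sigma=0.05)
# Last step of the proof: P{eta - eps/2 >= (eps/2) sqrt(K)} = P{Z >= sqrt(K)} <= sigma.
print(K, n_steps, x_bar_K, gaussian_tail(math.sqrt(K)))
```

No objective value is evaluated anywhere in this sketch: the $K$ outputs are simply averaged, which is what distinguishes this scheme from selecting the run with the smallest $f$.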
References

[1] Shapiro A., Dentcheva D., Ruszczynski A. Lecture on stochastic programming. Modeling and theory. MPS-SIAM series on Optimization, 2014.
[2] Nemirovski A., Juditsky A., Lan G., Shapiro A. Robust stochastic approximation approach to stochastic programming // SIAM J. Optim. 2009. V. 19. P. 1574-1609.
[3] Nemirovski A. Lectures on modern convex optimization analysis, algorithms, and engineering applications. Philadelphia: SIAM, 2013.
[4] Durrett R. Probability and examples. Cambridge University Press, 2010.
[5] Guiges V., Juditsky A., Nemirovski A. Non-asymptotic confidence bounds for the optimal value of a stochastic program // e-print, 2016. arXiv:1601.07592
[6] Gasnikov A.V. Searching equilibriums in large transport networks. Doctoral thesis. MIPT, 2016. [in Russian] arXiv:1607.03142
[7] Nesterov Yu., Vial J.-Ph. Confidence level solution for stochastic programming // Automatica. 2008. V. 44. no. 6. P. 1559-1568.
