
Convergence Rate of Distributed ADMM over Networks

Ali Makhdoumi and Asuman Ozdaglar
MIT, Cambridge, MA 02139
Emails: [email protected], [email protected]

Abstract: We propose a distributed algorithm based on the Alternating Direction Method of Multipliers (ADMM) to minimize the sum of locally known convex functions using communication over a network. This optimization problem emerges in many applications in distributed machine learning and statistical estimation. We show that when the functions are convex, both the objective function values and the feasibility violation converge with rate O(1/T), where T is the number of iterations. We then show that if the functions are strongly convex and have Lipschitz continuous gradients, the sequence generated by our algorithm converges linearly to the optimal solution. In particular, an ǫ-optimal solution can be computed with O(√κ_f log(1/ǫ)) iterations, where κ_f is the condition number of the problem. Our analysis also highlights the effect of network structure on the convergence rate through the maximum and minimum degrees of the nodes as well as the algebraic connectivity of the network.

I. INTRODUCTION

A. Motivation

Many of today's optimization problems in data science (including statistics, machine learning, and data mining) involve an abundance of data that cannot be handled by a single processor alone. This necessitates distributing the data among multiple processors and processing it in a decentralized manner based on the available local information. Applications in machine learning [1], [2], [3], [4], [5], along with other applications in distributed data processing where information is inherently distributed among many processors (see e.g. distributed sensor networks [6], [7], and coordination and flow control problems [8], [9]), have spearheaded a large literature on distributed multiagent optimization.

In this paper, we focus on the following optimization problem:

    min_{x ∈ R^d}  Σ_{i=1}^n f_i(x),                                        (1)

where f_i : R^d → R is a convex function. We assume f_i is known only to agent i and refer to it as a local objective function.¹ Agents can communicate over a given network, and their goal is to collectively solve this optimization problem. A prominent example where this general formulation emerges is Empirical Risk Minimization (ERM). Suppose that we have M data points {(x_i, y_i)}_{i=1}^M, where x_i ∈ R^d is a feature vector and y_i ∈ R is a target output. The empirical risk minimization is then given by

    min_{θ ∈ R^d}  (1/M) Σ_{i=1}^M L(y_i, x_i, θ) + p(θ),                    (2)

for some convex loss function L : R × R^d × R^d → R and some convex penalty function p : R^d → R. This general formulation captures many statistical scenarios including:

• Least-Absolute Shrinkage and Selection Operator (LASSO):

    min_{θ ∈ R^d}  (1/M) Σ_{i=1}^M (y_i − θ′x_i)² + τ‖θ‖_1.

• Support Vector Machine (SVM) ([10]):

    min_{θ ∈ R^d}  (1/M) Σ_{i=1}^M max{0, 1 − y_i(θ′x_i)} + τ‖θ‖_2².

Suppose our distributed computing system consists of n machines, each with k = M/n data points (without loss of generality suppose M is divisible by n; otherwise one of the machines holds the remaining data points). For each i = 1, ..., n, we define a function based on the data available to machine i as

    f_i(θ) = (1/k) Σ_{j=1+(i−1)k}^{ik} L(y_j, x_j, θ) + p(θ).

Therefore, the empirical risk minimization (2) can be written as min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ), where the function f_i(θ) is available only to machine i, which is an instance of formulation (1). Data is distributed across different machines either because it is collected by decentralized agents [11], [12], [13] or because memory constraints prevent it from being stored on a single machine [5], [14], [15], [16]. The decentralized nature of the data, together with communication constraints, necessitates distributed processing, which has motivated a large literature in optimization and statistical learning on distributed algorithms (see e.g. [17], [18], [19], [20], [21]).

¹We use the terms machine, agent, and node interchangeably.
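To make the data splitting concrete, the following is a minimal NumPy sketch (not code from the paper) of how the global LASSO risk in (2) decomposes into the local objectives f_i above; the synthetic data, the machine count n, and the helper names are illustrative assumptions.

```python
import numpy as np

# Synthetic LASSO instance (illustrative assumption, not from the paper).
rng = np.random.default_rng(0)
M, d, n, tau = 120, 5, 4, 0.1                  # n machines, k = M/n points each
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=M)
k = M // n

def local_lasso_objective(theta, X_i, y_i):
    """f_i(theta): average squared loss on machine i's shard plus the l1 penalty."""
    return np.mean((y_i - X_i @ theta) ** 2) + tau * np.abs(theta).sum()

def global_erm_objective(theta):
    """Centralized empirical risk (2): (1/M) * sum of losses plus the penalty."""
    return np.mean((y - X @ theta) ** 2) + tau * np.abs(theta).sum()

# Contiguous shards of k points, one per machine i = 1, ..., n.
shards = [(X[i * k:(i + 1) * k], y[i * k:(i + 1) * k]) for i in range(n)]

theta = rng.normal(size=d)
avg_local = np.mean([local_lasso_objective(theta, Xi, yi) for Xi, yi in shards])
# (1/n) * sum_i f_i(theta) equals the global ERM objective (2) at any theta.
assert np.isclose(avg_local, global_erm_objective(theta))
```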
B. Related Works and Contributions

Much of this literature builds on the seminal works [22], [23], which proposed gradient methods that can parallelize computations across multiple processors. A number of recent papers proposed subgradient-type methods [24], [25], [26], [27], [28], [29] or a dual averaging method [30] to design distributed optimization algorithms. An alternative approach is to use Alternating Direction Method of Multipliers (ADMM) type methods, which for separable problems lead to decoupled computations (see e.g. [31] and [32] for comprehensive tutorials on ADMM). ADMM was studied extensively in the 80's [33], [34], [35]. More recently, it has found applications in a variety of distributed settings in machine learning such as model fitting, resource allocation, and classification (see e.g. [36], [37], [38], [39], [40], [41], [42], [43], [44]).

In this paper we present a new distributed ADMM algorithm for solving problem (1) over a network. Our algorithm relies on a novel node-based reformulation of (1) and leads to an ADMM algorithm that uses dual variables with dimension given by the number of nodes in the network. This results in a significant reduction in the number of variables stored and communicated with respect to the edge-based ADMM algorithms presented in the literature (see [45], [46]). Our main contribution is a unified convergence rate analysis for this algorithm that applies both to the case when the local objective functions are convex and to the case when the local objective functions are strongly convex with Lipschitz continuous gradients. In particular, our analysis shows that when the local objective functions are convex (with no further assumptions), the objective function at the ergodic average of the estimates generated by our algorithm converges with rate O(1/T). Moreover, when the local objective functions are strongly convex with Lipschitz continuous gradients, we show that the iterates converge linearly, i.e., the iterates reach an ǫ-neighborhood of the optimal solution after O(√κ_f log(1/ǫ)) steps, where κ_f is the condition number defined as L/ν, L is the maximum Lipschitz gradient parameter, and ν is the minimum strong convexity constant of the local objective functions. This matches the best known iteration complexity and condition number dependence for the centralized ADMM (see e.g. [47]). Our convergence rate estimates also highlight a novel dependence on the network structure as well as the communication weights. In particular, for communication weights that are governed by the Laplacian of the graph, we establish a novel iteration complexity

    O( √κ_f  √( d_max⁴ / (d_min a(G)²) )  log(1/ǫ) ),

where d_min is the minimum degree, d_max is the maximum degree, and a(G) is the algebraic connectivity of the network. Finally, we illustrate the performance of our algorithm with numerical examples.

Our paper is most closely related to [45], [46], which studied edge-based ADMM algorithms for solving (1). In [46], the authors consider convex local objective functions and provide an O(1/T) convergence rate. The more recent paper [45] assumes strongly convex local objective functions with Lipschitz gradients and shows a linear convergence rate through a completely different analysis; that analysis does not extend to the node-based ADMM algorithm under these assumptions. In contrast, our paper provides a unified convergence rate analysis for both cases for the node-based distributed ADMM algorithm. Our paper is also related to [48], [47], which study the basic centralized ADMM where the goal is to minimize the sum of two functions with a linearly coupled constraint. Our work is also related to the literature on the convergence of operator splitting schemes, such as Douglas-Rachford splitting and relaxed Peaceman-Rachford [49], [50], [51], [52], [53], [54], [55], [56], [57].

C. Outline

The organization of the paper is as follows. In Section II we give the problem formulation and propose a novel distributed ADMM algorithm. In Section III, we present preliminary results that are used to establish the main results. In Section IV we show the sub-linear convergence rate. In Section V we show the linear convergence rate of our algorithm. Finally, in Section VI we show the effect of the network on the convergence rate and provide numerical results that illustrate the performance of our algorithm, which leads to concluding remarks in Section VII. All omitted proofs are presented in the appendix.
II. FRAMEWORK

A. Problem Formulation

Consider a network represented by a connected graph G = (V, E), where V = {1, ..., n} is the set of agents and E is the set of edges. For any i, let N(i) denote its set of neighbors including agent i itself, i.e., N(i) = {j | (i, j) ∈ E} ∪ {i}, and let d_i denote the degree of agent i, i.e., |N(i)| = d_i + 1. We let d_max = max_{i∈V} d_i and d_min = min_{i∈V} d_i.

The goal of the agents is to collectively solve optimization problem (1), where f_i is a function known only to agent i. In order to solve problem (1), we introduce a variable x_i for each i and write the objective function of problem (1) as Σ_{i=1}^n f_i(x_i), so that the objective function is decoupled across the agents. The constraint that all the x_i's are equal can be imposed using the following matrix.

Definition 1 (Communication Matrix). Let P be an n × n matrix whose entries satisfy the following property: for any i = 1, ..., n, P_ij = 0 for j ∉ N(i). We refer to P as the communication matrix.

Assumption 1. The communication matrix P satisfies null(P) = span{1}, where 1 is the n × 1 vector with all entries equal to one and null(P) denotes the null space of the matrix P.

Example 1. If P_ij < 0 for all j ∈ N(i) \ {i}, the summation of each row of P is zero, and the graph is connected, then Assumption 1 holds. As a particular case, the Laplacian matrix of the graph, given by P_ij = −1 when j ∈ N(i) \ {i} and zero otherwise, and P_ii = d_i, is a communication matrix that satisfies Assumption 1.

We next show that the constraint that all x_i's are equal can be enforced by the linear constraint Ax = 0, where x = (x_1, ..., x_n) with each x_i a sub-vector of dimension d, and A is a dn × dn matrix defined as the Kronecker product between the communication matrix P and I_d, i.e., A = P ⊗ I_d.

Lemma 1. Under Assumption 1, the constraint Ax = 0 guarantees that x_i = x_j for all i, j ∈ V.

Using Lemma 1, under Assumption 1, we can reformulate optimization problem (1) as

    min_{x ∈ R^{nd}}  F(x)
    s.t.  Ax = 0,                                                           (3)

where F(x) = Σ_{i=1}^n f_i(x_i).

Assumption 2. The optimal solution set of problem (3) is non-empty. We let x* denote an optimal solution of problem (3).
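As a concrete illustration of Definition 1, Example 1, and Lemma 1, the sketch below (an illustration under stated assumptions, not code from the paper) builds the graph Laplacian as the communication matrix P for a small connected graph, forms A = P ⊗ I_d, and checks numerically that null(P) = span{1}, so that Ax = 0 enforces consensus.

```python
import numpy as np

# A small connected graph on n = 4 nodes (edge list is an illustrative assumption).
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Communication matrix P chosen as the graph Laplacian (Example 1):
# P_ii = d_i and P_ij = -1 for neighbors j != i.
P = np.zeros((n, n))
for i, j in edges:
    P[i, j] = P[j, i] = -1.0
    P[i, i] += 1.0
    P[j, j] += 1.0

# Assumption 1: the only zero eigenvalue of P has the all-ones eigenvector.
eigvals, eigvecs = np.linalg.eigh(P)
assert np.sum(np.abs(eigvals) < 1e-10) == 1            # one-dimensional null space
ones = np.ones(n) / np.sqrt(n)
assert np.allclose(np.abs(eigvecs[:, 0] @ ones), 1.0)  # null space spanned by 1

# A = P kron I_d acts on the stacked vector x = (x_1, ..., x_n) in R^{nd}.
A = np.kron(P, np.eye(d))

# Lemma 1: consensus vectors (all x_i equal) satisfy Ax = 0.
x_consensus = np.tile(np.array([0.7, -1.3]), n)
assert np.allclose(A @ x_consensus, 0.0)
```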
B. Multiagent Distributed ADMM

In this section, we propose a distributed ADMM algorithm to solve problem (3). We first use a reformulation technique (introduced in [58] to separate optimization variables in a constraint, allowing them to be updated simultaneously in an ADMM iteration), which allows us to separate each constraint associated with a node into multiple constraints that each involve only the variable corresponding to one of the neighboring nodes. We expand the constraint Ax = 0 so that for each node i we have Σ_{j∈N(i)} A_ij x_j = 0, where A_ij = P_ij I_d is a d × d matrix. We let A_ij x_j = z_ij ∈ R^d to obtain the following reformulation:

    min_{x,z}  F(x)
    s.t.  A_ij x_j = z_ij,           for i = 1, ..., n,  j ∈ N(i),
          Σ_{j∈N(i)} z_ij = 0,       for i = 1, ..., n.                     (4)

For each equality constraint in (4), we let λ_ij ∈ R^d be the corresponding Lagrange multiplier and form the augmented Lagrangian function by adding a quadratic penalty with penalty parameter c > 0 for the feasibility violation to the Lagrangian function:

    L_c(x, z, λ) = F(x) + Σ_{i=1}^n Σ_{j∈N(i)} λ_ij′ (A_ij x_j − z_ij)
                   + (c/2) Σ_{i=1}^n Σ_{j∈N(i)} ‖A_ij x_j − z_ij‖_2².

We now use the ADMM algorithm (see e.g. [59]), which generates primal-dual sequences {x_j(t)}, {z_ij(t)}, and {λ_ij(t)} that at iteration t + 1 are updated as follows:

1) For any j = 1, ..., n, we update x_j as

    x_j(t+1) ∈ argmin_{x_j ∈ R^d}  L_c(x, z(t), λ(t)).                       (5)

2) For any i = 1, ..., n, we update the vector z_i = [z_ij]_{j∈N(i)} as

    z_i(t+1) ∈ argmin_{z_i ∈ Z_i}  L_c(x(t+1), z, λ(t)),                     (6)

where Z_i = { z_i | Σ_{j∈N(i)} z_ij = 0 }.

3) For i = 1, ..., n and j ∈ N(i), we update λ_ij as

    λ_ij(t+1) = λ_ij(t) + c ( A_ij x_j(t+1) − z_ij(t+1) ).                   (7)

One can implement this algorithm in a distributed manner, where node i maintains the variables λ_ij(t) and z_ij(t) for all j ∈ N(i) ([46]). However, using the inherent symmetries in the problem, we can significantly reduce the number of variables that each node is required to maintain, from O(|E|) to O(|V|).
We first show that for all t, i, and j ∈ N(i), we have λ_ij(t) = p_i(t). This reduction shows that the algorithm need not maintain dual variables λ_ij(t) for each i and its neighbors j, but instead can operate with the lower dimensional node-based dual variable p_i(t). The dual variable p_i(t) can be updated using the primal variables x_j(t) for all j ∈ N(i). The second observation is that z_ij(t) = A_ij x_j(t) − y_i(t), where

    y_i(t) = (1/(d_i + 1)) ([A]_i)′ x(t),   with   ([A]_i)′ = (P_i1, ..., P_in) ⊗ I_d.

This reduction shows that the algorithm need not maintain primal variables z_ij(t) for each i and its neighbors j, but instead can operate with the lower dimensional node-based primal variables y_i(t), where y_i(t) is node i's estimate of the primal variable (obtained as the average of the primal variables of its own neighbors). These reductions lead to the following algorithm.

Algorithm 1 Multiagent Distributed ADMM
• Initialization: x_i(0), y_i(0), and p_i(0), all in R^d, for any i ∈ V, and matrix A ∈ R^{nd×nd}.
• Algorithm:
  1) For i = 1, ..., n, let
       x_i(t+1) ∈ argmin_{x_i ∈ R^d}  f_i(x_i) + Σ_{j∈N(i)} p_j(t)′ A_ji x_i
                  + (c/2) Σ_{j∈N(i)} ‖ y_j(t) + A_ji ( x_i − x_i(t) ) ‖_2².
  2) For i = 1, ..., n, let
       y_i(t+1) = (1/(d_i + 1)) Σ_{j∈N(i)} A_ij x_j(t+1).
  3) For i = 1, ..., n, let
       p_i(t+1) = p_i(t) + c y_i(t+1).
• Output: {x_i(t)}_{t=0}^∞ for any i ∈ V.

The aforementioned reductions are summarized in the following proposition.

Proposition 1. The sequence {x_i(t)}_{t=0}^∞ for i = 1, ..., n generated by implementing the steps presented in Algorithm 1 is the same as the sequence generated by the ADMM algorithm.

The steps of the algorithm can be implemented in a distributed way, meaning that each node first updates her estimates based on the information received from her neighboring nodes and then broadcasts her updated estimates to her neighboring nodes. Each node i maintains the local variables x_i(t), y_i(t), and p_i(t) and updates these variables using communication with its neighbors as follows:

• At the end of iteration t, each node i sends out p_i(t) and y_i(t) to all of its neighbors, and then each node j uses y_i(t) and p_i(t) of all i ∈ N(j) to update x_j(t+1) as in step 1.
• Each node j sends out x_j(t+1) to all of its neighbors, and then each node i computes y_i(t+1) as in step 2.
• Each node i updates p_i(t+1) as in step 3.

Using this algorithm, agent i needs to store only three variables, x_i(t), y_i(t), and p_i(t), and update them at each iteration. Also, each agent needs to communicate only with (broadcast her estimates to) its neighbors. Therefore, the overall storage requirement is 3|V| and the overall communication requirement at each iteration is |E|.
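For concreteness, the following is a minimal, self-contained simulation sketch of Algorithm 1 (not code from the paper). It uses the scalar quadratic objectives f_i(x) = ½(x − a_i)² of the distributed-estimation experiment described later in Example 2, for which the x-update in step 1 has a closed form (derived here by setting the gradient of the local subproblem to zero); the graph, the penalty parameter c = 1, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Small connected graph and Laplacian communication matrix (illustrative assumption).
n, c, T = 6, 1.0, 200
rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
P = np.zeros((n, n))
for i, j in edges:
    P[i, j] = P[j, i] = -1.0
    P[i, i] += 1.0
    P[j, j] += 1.0
deg = np.diag(P).copy()                      # d_i (Laplacian diagonal)
a = rng.normal(size=n)                       # local data: f_i(x) = 0.5 * (x - a_i)^2
x_star = a.mean()                            # minimizer of sum_i f_i

# Node-based variables of Algorithm 1 (scalar case, d = 1, so A = P).
x = np.zeros(n)
y = np.zeros(n)
p = np.zeros(n)
M_diag = (P ** 2).sum(axis=0)                # M_ii = sum_{j in N(i)} A_ji^2

for t in range(T):
    # Step 1: closed-form x_i-update for quadratic f_i.
    x_new = np.empty(n)
    for i in range(n):
        nbrs = np.nonzero(P[:, i])[0]        # j with A_ji != 0, i.e., j in N(i)
        num = a[i] - P[nbrs, i] @ (p[nbrs] + c * y[nbrs]) + c * M_diag[i] * x[i]
        x_new[i] = num / (1.0 + c * M_diag[i])
    x = x_new
    # Step 2: y_i(t+1) = (1/(d_i+1)) * sum_{j in N(i)} A_ij x_j(t+1).
    y = (P @ x) / (deg + 1.0)
    # Step 3: p_i(t+1) = p_i(t) + c * y_i(t+1).
    p = p + c * y

print("final consensus error:", np.max(np.abs(x - x_star)))   # should be close to 0
```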
III. PRELIMINARY RESULTS

In this section, we present the preliminary results that we will use to establish our convergence rates. We define

    ∂F(x) = { h ∈ R^{nd} : h = ( h_1(x_1)′, ..., h_n(x_n)′ )′,  h_i(x_i) ∈ ∂f_i(x_i) },

where for each i, ∂f_i(x_i) denotes the subdifferential of f_i at x_i, i.e., the set of all subgradients of f_i at x_i. In what follows, for notational simplicity we assume d = 1, i.e., in (3) x_i ∈ R. All the analysis generalizes to the case with x_i ∈ R^d. We first provide a compact representation of the evolution of the primal vector x(t) that will be used in the convergence proof. This is a core step in proving the convergence rate, as it eliminates the dependence on the other variables y_i(t) and p_i(t). Let M be the n × n diagonal matrix with M_ii = Σ_{j∈N(i)} A_ji² and let D be the n × n diagonal matrix with D_ii = d_i + 1.

Lemma 2 (Perturbed Linear Update). The update of Algorithm 1 can be written as

    x(t+1) = −(1/c) M^{−1} h(x(t+1)) + ( I − M^{−1} A′D^{−1}A ) x(t) − M^{−1} (A′D^{−1}A) Σ_{s=0}^t x(s),

for some h(x(t+1)) ∈ ∂F(x(t+1)).

Lemma 2 shows that x(t+1) can be written as a perturbed linear combination of {x(s)}_{s=0}^t, with the perturbation being the term −(1/c) M^{−1} h(x(t+1)). The intuition behind the convergence rate analysis is that the linear term relating x(t+1) to x(0), ..., x(t) guarantees that the sequence x(t) converges to a consensus point where x_i(t) = x_j(t) for all i, j ∈ V, and the perturbation term −(1/c) M^{−1} h(x(t+1)) guarantees that the converging point minimizes the objective function F(x) = Σ_{i=1}^n f_i(x_i).

IV. SUB-LINEAR RATE OF CONVERGENCE

In this section, we show the sub-linear rate of convergence. We define two auxiliary sequences that we will use in proving the convergence rates. Since A′D^{−1}A is positive semidefinite (see Lemma 6 in the appendix), we can define Q = (A′D^{−1}A)^{1/2}. In other words, we let Q = VΣ^{1/2}V′, where A′D^{−1}A = VΣV′ is the singular value decomposition of the symmetric matrix A′D^{−1}A. We define the auxiliary sequences

    r(t) = Σ_{s=0}^t Q x(s)    and    q(t) = ( r(t), x(t) )′.

We also let

    G = [ I   0
          0   M − A′D^{−1}A ].

Next, we show a proposition that bounds the function value at each iteration.

Proposition 2. For any r ∈ R^d and any t, the sequence generated by Algorithm 1 satisfies

    (2/c) ( F(x(t+1)) − F(x*) ) + 2 r′Q x(t+1)
        ≤ ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G² − ‖q(t) − q(t+1)‖_G²,

where q* = ( r, x* )′.

In order to obtain the O(1/T) convergence rate, we consider the performance of the algorithm at the ergodic vector defined as x̂(T) = (x̂_1(T), ..., x̂_n(T)), where

    x̂_i(T) = (1/T) Σ_{t=1}^T x_i(t),

for any i = 1, ..., n. Note that each agent i can construct this vector by simple recursive time-averaging of its estimate x_i(t). Let (x*, r̂) be a primal-dual optimal solution of

    min_{Qx=0} F(x).

Since null(Q) = null(P), under Assumption 1 the optimal primal solution of this problem is the same as that of the original problem (3). Next, we show that both the objective function and the feasibility violation converge with rate O(1/T) to the optimal value.

Theorem 1. For any T, we have

    |F(x̂(T)) − F(x*)| ≤ (c/(2T)) ‖x(0) − x*‖²_{M−A′D^{−1}A}
                         + (c/(2T)) max{ ‖r(0) − 2r̂‖_2², ‖r(0)‖_2² }.

We also have

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x(0) − x*‖²_{M−A′D^{−1}A} + (1/(2T)) ( 2‖r(0) − r̂‖_2² + 2 ).

This theorem shows that the objective function at the ergodic average of the sequence of estimates generated by Algorithm 1 converges with rate O(1/T) to the optimal value. We next characterize the network effect on the performance guarantee.

Theorem 2. For any T, starting from x(0) = 0, we have

    |F(x̂(T)) − F(x*)| ≤ (c/(2T)) ‖x*‖_2² λ_M + (2/(cT)) ( U² / λ̃_m ),

and

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x*‖_2² λ_M + (1/(2T)) ( 2 + 2U²/(c²λ̃_m) ),

where U is a bound on the subgradients of the function F at x*, i.e., ‖v‖ ≤ U for all v ∈ ∂F(x*), λ̃_m is the smallest non-zero eigenvalue of A′D^{−1}A, and λ_M is the largest eigenvalue of M − A′D^{−1}A.

Remark 1. Both the optimality of the objective function value at the ergodic average and the feasibility violation converge with rate O(1/T). Our guaranteed rates show a novel dependency on the network structure and the communication matrix through λ̃_m and λ_M. Therefore, for a given function, in order to obtain a better performance guarantee we need to maximize λ̃_m and minimize λ_M. In Section VI we show that these terms depend on the algebraic connectivity of the network and provide explicit dependencies solely on the network structure when the communication matrix is the Laplacian of the graph.
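The constants λ̃_m and λ_M appearing in Theorem 2 (and again in Section V) are spectral quantities of the chosen communication matrix, so they can be computed ahead of time for a given graph. The sketch below is an illustration under the Laplacian choice of P from Example 1; the example graph is an arbitrary assumption.

```python
import numpy as np

def network_constants(P):
    """Return (lambda_tilde_m, lambda_M, a_G) for a communication matrix P.

    lambda_tilde_m: smallest non-zero eigenvalue of A'D^{-1}A (d = 1, so A = P),
    lambda_M:       largest eigenvalue of M - A'D^{-1}A with M_ii = sum_j A_ji^2,
    a_G:            algebraic connectivity, i.e., the second-smallest Laplacian
                    eigenvalue (meaningful when P is the graph Laplacian).
    """
    D_inv = np.diag(1.0 / (np.diag(P) + 1.0))      # D_ii = d_i + 1 for Laplacian P
    ADA = P.T @ D_inv @ P
    M = np.diag((P ** 2).sum(axis=0))
    eig_ada = np.sort(np.linalg.eigvalsh(ADA))
    lambda_tilde_m = eig_ada[np.argmax(eig_ada > 1e-10)]   # smallest non-zero eigenvalue
    lambda_M = np.linalg.eigvalsh(M - ADA).max()
    a_G = np.sort(np.linalg.eigvalsh(P))[1]
    return lambda_tilde_m, lambda_M, a_G

# Example: the 4-node graph used in the earlier sketch (assumption), Laplacian P.
edges, n = [(0, 1), (1, 2), (2, 3), (0, 2)], 4
P = np.zeros((n, n))
for i, j in edges:
    P[i, j] = P[j, i] = -1.0
    P[i, i] += 1.0
    P[j, j] += 1.0
print(network_constants(P))
```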
V. LINEAR RATE OF CONVERGENCE

In order to show the linear rate of convergence, we adopt the following standard assumptions.

Assumption 3 (Strong Convexity and Lipschitz Gradient). For any i = 1, ..., n, the function f_i is differentiable and has Lipschitz continuous gradient, i.e.,

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L_{f_i} ‖x − y‖_2,   for any x, y ∈ R^d,

for some L_{f_i} ≥ 0. The function f_i is also strongly convex with parameter ν_{f_i} > 0, i.e., f_i(x) − (ν_{f_i}/2)‖x‖_2² is convex.

We let ν = min_{1≤i≤n} ν_{f_i} and L = max_{1≤i≤n} L_{f_i}, and define the condition number of F(x) (or the condition number of problem (3)) as κ_f = L/ν. Note that when the functions are differentiable, we have

    ∇F(x) = ( ∇f_1(x_1)′, ..., ∇f_n(x_n)′ )′ ∈ R^{nd}.

Assumption 3 results in the following standard inequalities for the aggregate function F(x).

Lemma 3. (a) Under Assumption 3, for any x, y ∈ R^{nd}, we have

    ( ∇F(x) − ∇F(y) )′ (x − y) ≥ ν ‖x − y‖_2².

(b) Under Assumption 3, for any x, y ∈ R^{nd}, we have

    ( ∇F(x) − ∇F(y) )′ (x − y) ≥ (1/L) ‖∇F(x) − ∇F(y)‖_2².

(c) Under the convexity assumption, for any x, y ∈ R^{nd} and h(x) ∈ ∂F(x), we have

    (x − y)′ h(x) ≥ F(x) − F(y).

Under Assumption 3, we show that the sequence generated by Algorithm 1 converges linearly to the optimal solution (which is unique under these assumptions). The idea is to use the strong convexity and Lipschitz gradient property of F(x) in order to show that the G-norm of the sequence q(t) − q* contracts at each iteration, providing a linear rate.

Theorem 3. Suppose Assumptions 1, 2, and 3 hold. For any value of the penalty parameter c > 0 and any β ∈ (0, 1), the sequence {x(t)}_{t=1}^∞ generated by Algorithm 1 satisfies

    ‖x(t) − x*‖_2² ≤ ( 1/(1+δ) )^t ‖q(0) − q*‖_G²,

where δ is any constant satisfying

    δ ≤ min{ 2βν / ( cλ_M (1 + 2/λ̃_m) ),  (1−β) c λ̃_m / L },

λ̃_m is the smallest non-zero eigenvalue of A′D^{−1}A, and λ_M is the largest eigenvalue of M − A′D^{−1}A.

The rate of convergence in Theorem 3 holds for any choice of the penalty parameter c > 0. In other words, for any choice of c > 0, the convergence rate is linear. We now optimize the rate of convergence over all choices of c and provide an explicit convergence rate estimate that highlights the dependence on the condition number of the problem.

Theorem 4. Suppose Assumptions 1, 2, and 3 hold. Let {x(t)}_{t=1}^∞ be the sequence generated by Algorithm 1. There exists c > 0 for which we have

    ‖x(t) − x*‖_2² ≤ ρ^t ‖q(0) − q*‖_2²,

where the rate ρ < 1 is given by

    ρ = ( 1 + (1/2) √( 2λ̃_m² / ( λ_M (2 + λ̃_m) κ_f ) ) )^{−1}.

Remark 2. This result shows that within O(√κ_f log(1/ǫ)) iterations, the estimates {x(t)} reach an ǫ-neighborhood of the optimal solution. Our rate estimate has a √κ_f dependence, which improves on the linear condition number dependence provided in the convergence analysis of edge-based ADMM in [47]. The network dependence in our rate estimates is captured through λ̃_m and λ_M. In particular, a larger λ̃_m and a smaller λ_M result in a faster rate of convergence. In Section VI we will explicitly show the network effect in the convergence rate and provide numerical results that illustrate the performance for networks with different connectivity properties.

VI. NETWORK EFFECTS

We can choose the communication matrix P (and the corresponding matrix A) in Algorithm 1 to be any matrix that satisfies Assumption 1. One natural choice for the matrix P is the Laplacian of the graph, which leads to having A_ij = A_ji = −1 for all j ∈ N(i) \ {i} and A_ii = d_i. Using the Laplacian as the communication matrix, we can now capture the effect of the network structure on the convergence rate.

A. Network Effect in the Sub-linear Rate

The following proposition explicitly shows the network dependence in the bounds provided in Theorem 2.

Proposition 3. For any T, starting from x(0) = 0 and using the standard Laplacian as the communication matrix, we have

    |F(x̂(T)) − F(x*)| ≤ (c/(2T)) ‖x*‖_2² ( 4d_max² ) + (2/(cT)) U² ( 2d_max / a(G)² ),

and

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x*‖_2² ( 4d_max² ) + (1/(2T)) ( 2 + (2U²/c²) ( 2d_max / a(G)² ) ),

where U is a bound on the subgradients of the function F at x*, i.e., ‖v‖ ≤ U for all v ∈ ∂F(x*), and a(G) is the algebraic connectivity of the graph.

Therefore, highly connected graphs with larger algebraic connectivity have a faster convergence rate (see e.g. [60], [61] for an overview of the results on algebraic connectivity).
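Theorems 3 and 4 translate into a back-of-the-envelope iteration estimate once λ̃_m, λ_M, and κ_f are known. The helper below is a sketch of that arithmetic (the function names and example numbers are illustrative assumptions, not from the paper): it evaluates the optimized contraction factor ρ of Theorem 4 and the implied O(√κ_f log(1/ǫ)) iteration count.

```python
import math

def contraction_factor(lambda_tilde_m, lambda_M, kappa_f):
    """rho from Theorem 4: rho = 1 / (1 + 0.5*sqrt(2*lt^2 / (lM*(2+lt)*kappa_f)))."""
    delta_star = 0.5 * math.sqrt(
        2.0 * lambda_tilde_m ** 2 / (lambda_M * (2.0 + lambda_tilde_m) * kappa_f)
    )
    return 1.0 / (1.0 + delta_star)

def iterations_to_eps(lambda_tilde_m, lambda_M, kappa_f, eps):
    """Smallest t with rho**t <= eps, i.e., t >= log(1/eps) / log(1/rho)."""
    rho = contraction_factor(lambda_tilde_m, lambda_M, kappa_f)
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / rho))

# Example numbers (illustrative assumption): a 10-regular expander with a(G) ~ d gives
# lambda_tilde_m = a(G)^2/(d+1) and lambda_M = d*(d+1), as in Example 2 below.
d, aG, kappa_f = 10, 10.0, 100.0
lt, lM = aG ** 2 / (d + 1), d * (d + 1)
print(contraction_factor(lt, lM, kappa_f), iterations_to_eps(lt, lM, kappa_f, 1e-6))
```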
B. Network Effect in the Linear Rate

The following proposition explicitly shows the network dependence in the bound provided in Theorem 4.

Proposition 4. Suppose Assumptions 1, 2, and 3 hold. Using the standard Laplacian as the communication matrix, in order to reach an ǫ-optimal solution, O( √κ_f √( d_max⁴ / (d_min a(G)²) ) log(1/ǫ) ) iterations suffice.

Both of our guaranteed rates, for the sub-linear and the linear case, depend on the three parameters d_max, d_min, and a(G). The convergence rate is faster for larger d_min and smaller d_max. Finally, the convergence rate is faster for larger algebraic connectivity a(G).

Example 2. To provide more intuition on the network dependence, we focus on d-regular graphs with the matrix P equal to the Laplacian of the graph. In this setting, we have λ̃_m = a(G)²/(d+1) and λ_M = d(d+1), where a(G) is the algebraic connectivity of the graph. Thus, in this case the iteration complexity is O( √κ_f √( d³/a(G)² ) log(1/ǫ) ) (note that this bound matches the one provided in Proposition 4). For d-regular graphs there exist good expanders, such as Ramanujan graphs, for which a(G) = O(d) (see e.g. [62]). In Figure 1, we compare the performance of our algorithm for several regular graphs. The choice of functions is F(x) = (1/2) Σ_{i=1}^n (x − a_i)², where a_i is a scalar that is known only to machine i, for i = 1, ..., n. The communication matrix used in these experiments is the Laplacian of the graph. This problem appears in distributed estimation, where the goal is to estimate the parameter x* using local measurements a_i = x* + N_i at each machine i = 1, ..., n. Here N_i represents measurement noise, which we assume to be jointly Gaussian with mean zero and variance one. The maximum likelihood estimate is the minimizer x* of F(x).

[Fig. 1: Performance of Algorithm 1 for three d-regular graphs with d = 10, 20, 30. The plot shows ‖x(t) − x*‖² against the iteration number; the y axis is logarithmic to show the linear convergence rate.]
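The closed forms stated in Example 2 are easy to verify numerically. The following sketch (assuming a circulant d-regular graph; any d-regular graph would do) checks that λ̃_m = a(G)²/(d+1) and λ_M = d(d+1) when P is the graph Laplacian.

```python
import numpy as np

def circulant_regular_laplacian(n, d):
    """Laplacian of a circulant d-regular graph on n nodes (d even, illustrative)."""
    P = np.zeros((n, n))
    for i in range(n):
        for k in range(1, d // 2 + 1):
            for j in (i + k) % n, (i - k) % n:
                P[i, j] = -1.0
        P[i, i] = d
    return P

n, d = 20, 6
P = circulant_regular_laplacian(n, d)
ADA = P @ np.diag(1.0 / (np.diag(P) + 1.0)) @ P      # A'D^{-1}A with A = P, D = (d+1)I
M = np.diag((P ** 2).sum(axis=0))                    # M_ii = d^2 + d = d(d+1)

eig_ada = np.sort(np.linalg.eigvalsh(ADA))
lambda_tilde_m = eig_ada[np.argmax(eig_ada > 1e-8)]
lambda_M = np.linalg.eigvalsh(M - ADA).max()
a_G = np.sort(np.linalg.eigvalsh(P))[1]

# Example 2 predicts lambda_tilde_m = a(G)^2/(d+1) and lambda_M = d(d+1).
print(lambda_tilde_m, a_G ** 2 / (d + 1))            # these two should match
print(lambda_M, d * (d + 1))                         # and so should these
```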
VII. CONCLUSION

We proposed a novel distributed algorithm based on the Alternating Direction Method of Multipliers (ADMM) to minimize the sum of locally known convex functions. We first showed that ADMM can be implemented by only keeping track of node-based variables. We then showed that our algorithm reaches an ǫ-optimal solution in O(1/ǫ) iterations for convex functions and in O(√κ_f log(1/ǫ)) iterations for strongly convex functions with Lipschitz continuous gradients. Our analysis shows that the performance of our algorithm depends on the algebraic connectivity of the graph, the minimum degree of the nodes, and the maximum degree of the nodes. Finally, we illustrated the performance of our algorithm with numerical examples.

APPENDIX

A. Proof of Lemma 1

Consider the k-th coordinate of all the x_i's and form an n × 1 vector x^k. From Ax = 0 and the fact that A = P ⊗ I_d, we obtain that Px^k = 0. This shows that x^k ∈ null(P). Using Assumption 1, we have x^k ∈ span{1}, which guarantees that all entries of x^k are equal. Similarly, for any k = 1, ..., d, the k-th entries of all the x_i's are equal. This leads to x_i = x_j for all i, j ∈ V.

B. Proof of Lemma 2

Using the first step of Algorithm 1 for node i, we can write x_i(t+1) as the solution of

    h_i(x(t+1)) + Σ_{j∈N(i)} A_ji p_j(t)
        + c Σ_{j∈N(i)} A_ji ( y_j(t) + A_ji x_i(t+1) − A_ji x_i(t) ) = 0,       (8)

where h_i(x(t+1)) = f_i′(x_i(t+1)) for differentiable functions and h_i(x(t+1)) ∈ ∂f_i(x_i(t+1)) in general. We next use the second and third steps of Algorithm 1 to write p_i(t) and y_i(t) in terms of (x(0), ..., x(t)). Using the update for p_j, we have

    Σ_{j∈N(i)} p_j(t) A_ji = c Σ_{j∈N(i)} A_ji Σ_{s=0}^t (1/(d_j + 1)) ([A]_j)′ x(s)
                           = c Σ_{s=0}^t [ A′D^{−1}A x(s) ]_i.                  (9)

Moreover, we can write the term Σ_{j∈N(i)} A_ji y_j(t) based on the sequence (x(0), ..., x(t)) as

    Σ_{j∈N(i)} A_ji y_j(t) = Σ_{j∈N(i)} A_ji (1/(d_j + 1)) ([A]_j)′ x(t)
                           = [ A′D^{−1}A x(t) ]_i.                              (10)

Substituting (9) and (10) into (8), we can write the update of x_i(t+1) in terms of the sequence (x(0), ..., x(t)), which can then compactly be written as

    c M x(t+1) = −h(x(t+1)) + c ( M − A′D^{−1}A ) x(t) − c (A′D^{−1}A) Σ_{s=0}^t x(s),

where h(x(t+1)) = ∇F(x(t+1)) if the functions are differentiable and h(x(t+1)) ∈ ∂F(x(t+1)) in general. Left-multiplying by (1/c) M^{−1} completes the proof.
C. Proof of Proposition 2

We first show a lemma that we will use in the proof of this proposition. The following lemma shows the relation between the auxiliary sequence r(t) and the primal sequence x(t).

Lemma 4. Suppose Assumptions 1 and 2 hold. The sequence {x(t), r(t)}_{t=0}^∞ satisfies

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −Q r(t+1) − (1/c) h(x(t+1)),

for some h(x(t+1)) ∈ ∂F(x(t+1)).

Proof: Using Lemma 2 we have

    M(x(t+1) − x(t)) = −(1/c) h(x(t+1)) − (A′D^{−1}A) x(t) − (A′D^{−1}A) Σ_{s=0}^t x(s).

We subtract (A′D^{−1}A) x(t+1) from both sides and rearrange the terms to obtain

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −(A′D^{−1}A) Σ_{s=0}^{t+1} x(s) − (1/c) h(x(t+1)).

Using QQ = A′D^{−1}A, this yields

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −Q r(t+1) − (1/c) h(x(t+1)).

Back to the proof of Proposition 2: using Lemma 4, Lemma 3 part (c), and Qx* = 0, we have

    (2/c)(F(x(t+1)) − F(x*)) + 2 r′Q x(t+1)
      ≤ (2/c)(x(t+1) − x*)′ h(x(t+1)) + 2 r′Q x(t+1)
      = 2(x(t+1) − x*)′( −Q(r(t+1) − r) − (M − A′D^{−1}A)(x(t+1) − x(t)) )
      = 2(r(t+1) − r(t))′( −r(t+1) + r ) − 2(x(t+1) − x*)′(M − A′D^{−1}A)(x(t+1) − x(t))
      = ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G² − ‖q(t) − q(t+1)‖_G².

D. Proof of Theorem 1

Taking the summation of the relation given in Proposition 2 from t = 0 up to t = T, we obtain

    Σ_{t=0}^T ( F(x(t)) − F(x*) + c r′Q x(t) ) ≤ (c/2) ‖q(0) − q‖_G².

Using convexity of the functions and Jensen's inequality, we obtain

    F(x̂(T)) − F(x*) + c r′Q x̂(T) ≤ (c/(2T)) ‖q(0) − q‖_G².

Letting r = 0 yields

    F(x̂(T)) − F(x*) ≤ (c/(2T)) ( ‖x(0) − x*‖²_{M−A′D^{−1}A} + ‖r(0)‖_2² ).      (11)

From the saddle point inequality, we have

    F(x*) ≤ F(x̂(T)) + c r̂′Q x̂(T),                                             (12)

which implies

    F(x*) − F(x̂(T)) ≤ c r̂′Q x̂(T).                                             (13)

Next, we bound the term r̂′Q x̂(T). We add the term c r̂′Q x̂(T) to both sides of (12) to obtain

    c r̂′Q x̂(T) ≤ F(x̂(T)) − F(x*) + 2c r̂′Q x̂(T).                               (14)

Again, using Proposition 2 (with r = 2r̂) to bound the right-hand side of (14), we obtain

    c r̂′Q x̂(T) ≤ (c/(2T)) ( ‖x(0) − x*‖²_{M−A′D^{−1}A} + ‖r(0) − 2r̂‖_2² ).      (15)

Using (15) to bound the right-hand side of (13), and then combining the result with (11), we obtain

    |F(x̂(T)) − F(x*)| ≤ (c/(2T)) ‖x(0) − x*‖²_{M−A′D^{−1}A}
                         + (c/(2T)) max{ ‖r(0) − 2r̂‖_2², ‖r(0)‖_2² }.

We next bound the feasibility violation. Using Proposition 2 with r = r̂ + Qx̂(T)/‖Qx̂(T)‖, we have

    F(x̂(T)) − F(x*) + c r̂′Q x̂(T) + c ‖Qx̂(T)‖
      ≤ (c/(2T)) ( ‖x(0) − x*‖²_{M−A′D^{−1}A} + ‖ r(0) − r̂ − Qx̂(T)/‖Qx̂(T)‖ ‖_2² ).

Since (x*, r̂) is a primal-dual optimal solution, using the saddle point inequality, we have

    F(x̂(T)) − F(x*) + c r̂′Q x̂(T) ≥ 0.

Combining the two previous relations, we obtain

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x(0) − x*‖²_{M−A′D^{−1}A}
               + (1/(2T)) ‖ r(0) − r̂ − Qx̂(T)/‖Qx̂(T)‖ ‖_2².

Since ‖ r(0) − r̂ − Qx̂(T)/‖Qx̂(T)‖ ‖_2² ≤ 2‖r(0) − r̂‖_2² + 2, we can further bound ‖Qx̂(T)‖ as

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x(0) − x*‖²_{M−A′D^{−1}A} + (1/(2T)) ( 2‖r(0) − r̂‖_2² + 2 ).
E. Proof of Theorem 2

We first show a lemma that bounds the norm of a dual optimal solution of (16).

Lemma 5. Let x* be an optimal solution of problem (3). There exists an optimal dual solution r̃ for the problem

    min_{cQx=0} F(x)                                                           (16)

that satisfies

    ‖r̃‖_2² ≤ U² / (c²λ̃_m),

where U is a bound on the subgradients of the function F at x*, i.e., ‖v‖ ≤ U for all v ∈ ∂F(x*), and λ̃_m is the smallest non-zero eigenvalue of A′D^{−1}A.

Proof: There exists an optimal primal-dual solution for problem (16) such that (x*, r̂) is a saddle point of the Lagrangian function, i.e., for any x ∈ R^n,

    F(x*) − F(x) ≤ c r̂′Q (x − x*).                                              (17)

Note that (x*, r̂) satisfies the saddle point inequality if and only if it satisfies the inequality given in (17). Equation (17) shows that c r̂′Q ∈ ∂F(x*). Let c r̂′Q = v′ ∈ ∂F(x*). We will use this r̂ to construct r̃ such that c r̃′Q = v′, and hence we would have

    F(x*) − F(x) ≤ c r̃′Q (x − x*),

meaning that (x*, r̃) satisfies the saddle point inequality. This shows that (x*, r̃) is an optimal primal-dual solution (see Section 6 of [59]). Moreover, we choose r̃ to satisfy the statement of the lemma.

Let Q = Σ_{i=1}^r u_i σ_i v_i′ be the singular value decomposition of Q, where rank(Q) = r. Since c r̂′Q = v′, v belongs to the span of {c v_1, ..., c v_r} and can be written as v = c Σ_{i=1}^r c_i v_i for some coefficients c_i. Let r̃ = Σ_{i=1}^r (c_i/σ_i) u_i. By this choice we have c r̃′Q = v′, and this choice also yields

    ‖r̃‖² = Σ_{i=1}^r c_i²/σ_i² ≤ (1/σ_min²) Σ_{i=1}^r c_i² = (1/λ̃_m) Σ_{i=1}^r c_i²
         = (1/(c²λ̃_m)) ‖v‖² ≤ U²/(c²λ̃_m),

where we used the bound on the subgradient to obtain the last inequality. Since c r̃′Q = v′ ∈ ∂F(x*), (x*, r̃) satisfies the saddle point inequality.

Next, we use this lemma to analyze the network effect. Using Theorem 1 with the zero initial condition, we have

    |F(x̂(T)) − F(x*)| ≤ (c/(2T)) ‖x*‖²_{M−A′D^{−1}A} + (2/(cT)) ( U²/λ̃_m ),

and

    ‖Qx̂(T)‖ ≤ (1/(2T)) ‖x*‖²_{M−A′D^{−1}A} + (1/(2T)) ( 2 + 2U²/(c²λ̃_m) ).

Using ‖x*‖²_{M−A′D^{−1}A} ≤ λ_M ‖x*‖_2² completes the proof.

F. Proof of Lemma 3

For any i, since f_i(x) − (ν_{f_i}/2)‖x‖_2² is convex, its gradient is monotone and we have

    ⟨ ∇f_i(x) − ∇f_i(y), x − y ⟩ − ν_{f_i} ‖x − y‖² ≥ 0.

Since ν ≤ ν_{f_i}, we obtain

    ⟨ ∇f_i(x) − ∇f_i(y), x − y ⟩ − ν ‖x − y‖² ≥ 0,

which results in

    ( ∇F(x) − ∇F(y) )′ (x − y) ≥ ν ‖x − y‖_2²;

this completes the proof of part (a). We now prove part (b). For any x, y ∈ R^n, we have

    f_i(y) = f_i(x) + ∫_0^1 ⟨ ∇f_i(x + τ(y − x)), y − x ⟩ dτ
           = f_i(x) + ⟨ ∇f_i(x), y − x ⟩ + ∫_0^1 ⟨ ∇f_i((1−τ)x + τy) − ∇f_i(x), y − x ⟩ dτ
           ≤ f_i(x) + ⟨ ∇f_i(x), y − x ⟩ + ∫_0^1 ‖∇f_i((1−τ)x + τy) − ∇f_i(x)‖ ‖y − x‖ dτ
           ≤ f_i(x) + ⟨ ∇f_i(x), y − x ⟩ + L_{f_i} ‖y − x‖² ∫_0^1 τ dτ
           = f_i(x) + ⟨ ∇f_i(x), y − x ⟩ + (L_{f_i}/2) ‖y − x‖².

Let φ_x(y) = f_i(y) − ⟨ ∇f_i(x), y ⟩. Note that φ_x(y) has Lipschitz gradient with parameter L_{f_i}. Moreover, we have min_y φ_x(y) = φ_x(x), since ∇φ_x(y) = ∇f_i(y) − ∇f_i(x) is zero for y = x. Therefore, using the previous relation, we have that

    φ_x(y) − φ_x(x) ≥ (1/(2L_{f_i})) ‖∇φ_x(y)‖² = (1/(2L_{f_i})) ‖∇f_i(y) − ∇f_i(x)‖².

Using the previous relation, we have

    f_i(y) − f_i(x) − ⟨ ∇f_i(x), y − x ⟩ ≥ (1/(2L_{f_i})) ‖∇f_i(y) − ∇f_i(x)‖².

We also have, by exchanging the roles of x and y,

    f_i(x) − f_i(y) − ⟨ ∇f_i(y), x − y ⟩ ≥ (1/(2L_{f_i})) ‖∇f_i(y) − ∇f_i(x)‖².

We add the two preceding relations to obtain

    ⟨ ∇f_i(x) − ∇f_i(y), x − y ⟩ ≥ (1/L_{f_i}) ‖∇f_i(y) − ∇f_i(x)‖²,

for any i = 1, ..., n. Combining this relation for all i's (and using L ≥ L_{f_i}) completes the proof of part (b).

Finally, we prove part (c). Let h = (h_1′, ..., h_n′)′. By the definition of a subgradient, for any i ∈ V we have

    h_i(x_i)′ (x_i − y_i) ≥ f_i(x_i) − f_i(y_i).

Taking the summation of this inequality over all i = 1, ..., n shows that

    h(x)′(x − y) ≥ F(x) − F(y).
G. Proof of Theorem 3

We first show two lemmas that we use in the proof of this theorem. The first lemma shows that both matrices M − A′D^{−1}A and A′D^{−1}A are positive semidefinite, and the second lemma shows a relation between the sequences q(t), r(t), and x(t) analogous to the one shown in Lemma 4.

Lemma 6. The matrices M − A′D^{−1}A and A′D^{−1}A are positive semidefinite.

Proof: Both matrices are clearly symmetric. We first show that M − A′D^{−1}A is positive semidefinite. By definition, we have

    [M − A′D^{−1}A]_ii = M_ii − Σ_l A_li A_li /(d_l+1) = Σ_l A_li² d_l/(d_l+1).

We also have, for j ≠ i,

    [M − A′D^{−1}A]_ij = − Σ_l A_li A_lj /(d_l+1).

Therefore,

    Σ_{j≠i} | [M − A′D^{−1}A]_ij | = Σ_{j≠i} | Σ_l A_li A_lj /(d_l+1) |
        ≤ Σ_l ( |A_li| /(d_l+1) ) Σ_{j≠i} |A_lj| = Σ_l A_li² /(d_l+1),

where we used the fact that Σ_{j=1}^n A_lj = 0 for any l. Therefore, by the Gershgorin Circle Theorem, any eigenvalue µ of M − A′D^{−1}A satisfies, for some i,

    |µ − [M − A′D^{−1}A]_ii| ≤ Σ_{j≠i} | [M − A′D^{−1}A]_ij |,

which leads to

    µ ≥ [M − A′D^{−1}A]_ii − Σ_{j≠i} | [M − A′D^{−1}A]_ij | ≥ Σ_l A_li² (d_l − 1)/(d_l+1) ≥ 0,

where we used the fact that d_l ≥ 1, which evidently holds. We next show that A′D^{−1}A is positive semidefinite. We have

    [A′D^{−1}A]_ii = Σ_l A_li² /(d_l+1),   [A′D^{−1}A]_ij = Σ_l A_li A_lj /(d_l+1),

so that

    Σ_{j≠i} | [A′D^{−1}A]_ij | ≤ Σ_l ( |A_li| /(d_l+1) ) Σ_{j≠i} |A_lj| = Σ_l A_li² /(d_l+1).

Since [A′D^{−1}A]_ii ≥ Σ_{j≠i} | [A′D^{−1}A]_ij |, similarly by the Gershgorin Circle Theorem, the matrix A′D^{−1}A is positive semidefinite.

Lemma 7. Suppose Assumptions 1 and 2 hold. For differentiable functions, the sequence {x(t), r(t)}_{t=0}^∞ satisfies

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −Q( r(t+1) − r* ) − (1/c)( ∇F(x(t+1)) − ∇F(x*) ),

for some r* that satisfies Q r* + (1/c)∇F(x*) = 0. Moreover, r* belongs to the column span of Q.

Proof: Using Lemma 2 for differentiable functions we have

    M(x(t+1) − x(t)) = −(1/c)∇F(x(t+1)) − (A′D^{−1}A) x(t) − (A′D^{−1}A) Σ_{s=0}^t x(s).

We subtract (A′D^{−1}A) x(t+1) from both sides and rearrange the terms to obtain

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −(A′D^{−1}A) Σ_{s=0}^{t+1} x(s) − (1/c)∇F(x(t+1)).

Using QQ = A′D^{−1}A, this yields

    (M − A′D^{−1}A)(x(t+1) − x(t)) = −Q r(t+1) − (1/c)∇F(x(t+1)).

We next show that there exists r* such that Q r* + (1/c)∇F(x*) = 0. First note that the column space (range) and the null space of Q and of A′D^{−1}A are the same. Since span(Q) ⊕ null(Q) = R^n, we have ∇F(x*) ∈ span(Q) ⊕ null(Q) = span(Q) ⊕ span{1}, as null(Q) = span{1}. Since 1′∇F(x*) = 0, we can write ∇F(x*) as a linear combination of the column vectors of Q. Therefore, there exists r such that −(1/c)∇F(x*) = Qr. Let r* = Proj_Q r, so that Qr = Q r* and r* lies in the column space of Q. Subtracting Q r* + (1/c)∇F(x*) = 0 from the previous displayed relation completes the proof.

Back to the proof of Theorem 3: note that since M − A′D^{−1}A is positive semidefinite, the map ⟨·,·⟩ : R^{2n} × R^{2n} → R given by ⟨q_1, q_2⟩ = q_1′ G q_2, with

    G = [ I   0
          0   M − A′D^{−1}A ],

is a semi-inner product.² We first show that for a δ given by the statement of the theorem, we have

    ‖q(t+1) − q*‖_G² ≤ ( 1/(1+δ) ) ‖q(t) − q*‖_G².                              (18)

Using Lemma 3 and Lemma 7, we have

    (2ν/c) ‖x(t+1) − x*‖_2²
      ≤ (2/c)(x(t+1) − x*)′( ∇F(x(t+1)) − ∇F(x*) )
      = 2(x(t+1) − x*)′ Q( r* − r(t+1) ) + 2(x(t+1) − x*)′(M − A′D^{−1}A)( x(t) − x(t+1) )
      = 2(r(t+1) − r(t))′( r* − r(t+1) ) + 2(x(t+1) − x(t))′(M − A′D^{−1}A)( x* − x(t+1) )
      = 2(q(t+1) − q(t))′ G ( q* − q(t+1) )
      = ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G² − ‖q(t) − q(t+1)‖_G².                  (19)

Again, using Lemma 3 and Lemma 7, we have

    (2/(cL)) ‖∇F(x(t+1)) − ∇F(x*)‖_2²
      ≤ ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G² − ‖q(t) − q(t+1)‖_G².                  (20)

Using (19) and (20), for any β ∈ (0,1), we have

    β (2ν/c) ‖x(t+1) − x*‖_2² + (1−β)(2/(cL)) ‖∇F(x(t+1)) − ∇F(x*)‖_2²
      ≤ ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G² − ‖q(t) − q(t+1)‖_G².                  (21)

This yields

    ‖q(t) − q*‖_G² − ‖q(t+1) − q*‖_G²
      ≥ ‖q(t) − q(t+1)‖_G² + β(2ν/c)‖x(t+1) − x*‖_2²
        + (1−β)(2/(cL))‖∇F(x(t+1)) − ∇F(x*)‖_2².                                 (22)

Comparing this relation with (18), it remains to show that

    ‖q(t) − q(t+1)‖_G² + β(2ν/c)‖x(t+1) − x*‖_2²
      + (1−β)(2/(cL))‖∇F(x(t+1)) − ∇F(x*)‖_2² ≥ δ ‖q(t+1) − q*‖_G²,              (23)

which is equivalent to

    ‖q(t) − q(t+1)‖_G² + ‖x(t+1) − x*‖²_{(2νβ/c)I − δ(M−A′D^{−1}A)}
      + (1−β)(2/(cL))‖∇F(x(t+1)) − ∇F(x*)‖_2² ≥ δ ‖r(t+1) − r*‖_2².              (24)

Using Lemma 6, in order to show this inequality it suffices to show

    ‖x(t+1) − x*‖²_{(2νβ/c)I − δ(M−A′D^{−1}A)}
      + (1−β)(2/(cL))‖∇F(x(t+1)) − ∇F(x*)‖_2² ≥ δ ‖r(t+1) − r*‖_2².              (25)

Since both r(t+1) and r* are orthogonal to 1 and null(Q) = span{1}, using Lemma 7, we obtain

    δ ‖r(t+1) − r*‖_2² ≤ (δ/λ̃_m) ‖Q( r(t+1) − r* )‖_2²
      = (δ/λ̃_m) ‖ (M − A′D^{−1}A)(x(t+1) − x(t)) + (1/c)( ∇F(x(t+1)) − ∇F(x*) ) ‖_2²
      ≤ (2δ/λ̃_m) ‖ (M − A′D^{−1}A)(x(t+1) − x(t)) ‖_2²
        + (2δ/(c²λ̃_m)) ‖∇F(x(t+1)) − ∇F(x*)‖_2²
      ≤ (2δλ_M/λ̃_m) ‖x(t+1) − x(t)‖²_{M−A′D^{−1}A}
        + (2δ/(c²λ̃_m)) ‖∇F(x(t+1)) − ∇F(x*)‖_2².                                (26)

Comparing (26) and (25), it suffices to have

    δ ≤ min{ 2βν / ( cλ_M (1 + 2/λ̃_m) ),  (1−β) c λ̃_m / L }.

This shows that (18) holds. Using (18) along with (19) completes the proof.

²This means that ⟨·,·⟩ satisfies conjugate symmetry, linearity, and semipositive-definiteness (instead of positive-definiteness).

H. Proof of Theorem 4

The largest possible δ that satisfies the constraint given in Theorem 3, obtained by maximizing over β ∈ (0,1), is the solution of

    2βν / ( cλ_M (1 + 2/λ̃_m) ) = (1−β) c λ̃_m / L,                               (27)

which is β* = c²λ_M(2 + λ̃_m) / ( 2νL + c²λ_M(2 + λ̃_m) ). This in turn shows that the maximum δ equals δ = 2β*ν / ( cλ_M (1 + 2/λ̃_m) ). We then maximize δ over the choices of c, leading to

    δ* = (1/2) √( 2λ̃_m² / ( λ_M (2 + λ̃_m) κ_f ) ).

I. Proof of Proposition 3

The bound provided in Theorem 2 depends on λ̃_m and λ_M. We have that

    λ̃_m ≥ a(G)² / (d_max + 1),

where a(G) is the algebraic connectivity of the graph, which is the smallest non-zero eigenvalue of the Laplacian matrix. Moreover, we have that

    λ_M ≤ ‖M − A′D^{−1}A‖_2 ≤ d_max(d_max + 1) + 4d_max² / (d_min + 1).

Plugging these two bounds into Theorem 2 and using the fact that d_min ≥ 1 completes the proof.
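As a numerical sanity check on Lemma 6 and on the spectral bounds used in the proof of Proposition 3, the sketch below (with an arbitrarily chosen connected graph, an illustrative assumption) verifies the positive semidefiniteness of M − A′D^{−1}A and A′D^{−1}A, as well as the bounds λ̃_m ≥ a(G)²/(d_max+1) and ‖M − A′D^{−1}A‖_2 ≤ d_max(d_max+1) + 4d_max²/(d_min+1), for the Laplacian choice of P.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random connected graph (assumption): a path plus a few random extra edges.
n = 12
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
for _ in range(8):
    i, j = rng.choice(n, size=2, replace=False)
    adj[i, j] = adj[j, i] = 1.0

deg = adj.sum(axis=1)
P = np.diag(deg) - adj                              # Laplacian communication matrix
D_inv = np.diag(1.0 / (deg + 1.0))                  # D_ii = d_i + 1
ADA = P @ D_inv @ P                                 # A'D^{-1}A (d = 1, so A = P)
M = np.diag((P ** 2).sum(axis=0))

# Lemma 6: both matrices are positive semidefinite.
assert np.linalg.eigvalsh(ADA).min() > -1e-9
assert np.linalg.eigvalsh(M - ADA).min() > -1e-9

# Bounds used in the proof of Proposition 3.
a_G = np.sort(np.linalg.eigvalsh(P))[1]
lam = np.sort(np.linalg.eigvalsh(ADA))
lambda_tilde_m = lam[np.argmax(lam > 1e-9)]
d_max, d_min = deg.max(), deg.min()
assert lambda_tilde_m >= a_G ** 2 / (d_max + 1) - 1e-9
assert np.linalg.norm(M - ADA, 2) <= d_max * (d_max + 1) + 4 * d_max ** 2 / (d_min + 1) + 1e-9
```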
