SMART: The Stochastic Monotone Aggregated Root-Finding Algorithm

Damek Davis

arXiv:1601.00698v2 [math.OC] 9 Jun 2016

June 13, 2016

Abstract. We introduce the Stochastic Monotone Aggregated Root-Finding (SMART) algorithm, a new randomized operator-splitting scheme for finding roots of finite sums of operators. These algorithms are similar to the growing class of incremental aggregated gradient algorithms, which minimize finite sums of functions; the difference is that we replace gradients of functions with black boxes called operators, which represent subproblems to be solved during the algorithm. By replacing gradients with operators, we increase our modeling power, and we simplify the application and analysis of the resulting algorithms. The operator point of view also makes it easy to extend our algorithms to allow arbitrary sampling and updating of blocks of coordinates throughout the algorithm. Implementing and running an algorithm like this on a computing cluster can be slow if we force all computing nodes to be synched up at all times. To take better advantage of parallelism, we allow computing nodes to delay updates and break synchronization.

This paper has several technical and practical contributions. We prove the weak, almost sure convergence of a new class of randomized operator-splitting schemes in separable Hilbert spaces; we prove that this class of algorithms converges linearly in expectation when a weak regularity property holds; we highlight connections to other algorithms; and we introduce a few new algorithms for large-scale optimization.

This material is based upon work supported by the National Science Foundation under Award No. 1502405.

D. Davis, School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850, USA. E-mail: [email protected]

1 Introduction

The guiding problems in optimization are evolving. While all optimization problems can be reduced to minimizing functions over sets, prototypical optimization problems, like min_{x in C} f(x), hide structure that is found throughout modern applications, and this structure is useful for designing algorithms for large-scale problems (e.g., problems with gigabytes to terabytes of data). Among these structures, the finite sum is the most pervasive:

    \underset{x \in C}{\text{minimize}} \ \frac{1}{n}\sum_{i=1}^n f_i(x).    (1.1)

Large sums (n >> 0) are common in statistical and machine learning, where the functions f_i often, but not always, correspond one-to-one with points in a dataset. When the f_i are summed together, the minimization problem (1.1) grows in complexity, but the prevailing and realistic assumption in applications is that it is drastically simpler (in terms of memory or computational complexity) to perform operations, like differentiation, on the f_i than it is to perform the same operations on the entire sum.

Large sums, like (1.1), come from large datasets, even low dimensional ones. But modern applications often involve high dimensional datasets. When the decision variable x in R^m is high-dimensional (m >> 0), the prevailing and realistic assumption in applications is that it is drastically simpler to compute partial derivatives of the f_i, or other componentwise quantities, than it is to compute full derivatives of the f_i.

These two structural assumptions, and the host of algorithms that adopt them, have led to big improvements in large-scale algorithm design (for examples, see [28,17,34,36,35,16,20,15,25,29,9,5,8,37,4,27,6,31]). And we do not have to look far to find problems for which these assumptions make sense.
A simple problem with n >> 0 and m >> 0 comes from regularized least squares problems with matrix A = [A_1^T, ..., A_n^T]^T in R^{n x m}, vector b in R^n, and regularizer induced by K in R^{m x m}:

    \underset{x \in \mathbb{R}^m}{\text{minimize}} \ \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\Big\| A_i \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} - b_i \Big\|^2 + \frac{1}{2}\langle Kx, x\rangle.    (1.2)

The two prevailing assumptions are clearly satisfied for this example whenever each row of A_i is sparse: each term in the sum is simple, and partial derivatives of the terms are easier to compute than full derivatives. Absent other special structure in the matrices A and K (like bandlimitedness), a conceptually simple operation like differentiating the objective function requires O(max{mn, m^2}) operations. With problems of this scale, full gradient computations cannot lie at the heart of a practical iterative algorithm for solving (1.2).

The algorithms that have sprung from these structural assumptions all look alike. They all form a sequence of points {x^k}_{k in N} that converges (in some appropriate sense of the word) to a minimizer of (1.1). They all make a series of choices at each time step: given x^k,

1. choose (randomly or otherwise) indices i in {1,...,n} and j in {1,...,m};
2. get x_j^{k+1} (in some undetermined way) from x^k and grad f_i(x^k) by only changing a single component of x^k (possibly using some combination of the previous terms {x^l}_{l in {0,...,k}}).

At this moment, the differences between these algorithms are immaterial. What matters is the contrast between these algorithms and the overwhelmingly costlier algorithms that form sequences {x^k}_{k in N} by getting x^{k+1} from x^k and grad[sum_{i=1}^n f_i](x^k); we simply cannot afford to compute these full gradients for every k in N.

When looking at this 2-step prototypical algorithm, there is a looming temptation to generalize and to replace the partial derivatives of the f_i with something else. To find a good generalization, we only need to look at the optimality conditions of (1.1):

    \text{Find } x^* \text{ such that: } \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^*) = 0.    (1.3)

At once we reduce our cognitive load by changing notation and forgetting that there ever was a function called f_i. We simply replace every occurrence of grad f_i with a mapping called S_i:

    \text{Find } x^* \text{ such that: } S(x^*) := \frac{1}{n}\sum_{i=1}^n S_i(x^*) = 0.    (1.4)

The S_i take, as input, points in a space H, like R^n, and output points in H. We follow mathematical tradition and call the S_i operators. We do not, however, forget about our structural assumptions: for any given x in H, we still expect S_i(x) to be drastically simpler to compute than n^{-1} sum_{i=1}^n S_i(x), and we still expect the jth component, denoted by (S_i(x))_j, to be drastically simpler to compute than the full operator S_i(x). We gain a lot of flexibility from this generalization.

Each S_i(x) can, in principle, be any mapping. But we would never be so optimistic. Instead, a coherence condition suggests itself: we assume that for each root x^* of (1.4), we have

    (\exists\, \beta_{ij} > 0) : (\forall x \in \mathcal{H}) \quad \sum_{j=1}^m \sum_{i=1}^n \beta_{ij}\,\|(S_i(x))_j - (S_i(x^*))_j\|_j^2 \;\le\; \langle S(x),\, x - x^*\rangle.    (1.5)

This condition^1 is weaker than what can be guaranteed when each S_i(.) is the gradient of a convex function. In that case, the x^* in (1.5) can even vary beyond the roots of n^{-1} sum_{i=1}^n S_i(.) to any point in H; the beta_{ij} are akin to inverse Lipschitz constants of gradients. But (1.4) is not limited to smooth optimization problems, because (1.5) will be satisfied for S_i that are, for example, compositions of proximal and gradient operators.
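The two-step template is easy to make concrete. The sketch below (in Python, and not yet the SMART algorithm) fills in the unspecified choices in the simplest way, using illustrative quadratic operators S_i(x) = A_i^T(A_i x - b_i) as stand-ins for operators satisfying (1.5); the problem sizes and the stepsize are assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 5                              # number of operators and of coordinates
    A = rng.standard_normal((n, m))
    b = A @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)

    def S_i(i, x):
        # one operator in the sum: the gradient of (1/2)(A_i x - b_i)^2
        return A[i] * (A[i] @ x - b[i])

    def S(x):
        # the averaged operator S(x) = (1/n) sum_i S_i(x); a root solves (1.4)
        return np.mean([S_i(i, x) for i in range(n)], axis=0)

    x, step = np.zeros(m), 0.05
    print("initial ||S(x)|| =", np.linalg.norm(S(x)))
    for k in range(20000):
        i = rng.integers(n)                   # 1. choose an operator index ...
        j = rng.integers(m)                   #    ... and a coordinate
        x[j] -= step * S_i(i, x)[j]           # 2. move only coordinate j, using (S_i(x))_j
    print("final   ||S(x)|| =", np.linalg.norm(S(x)))

The point of the sketch is only the access pattern: each iteration touches one operator and one coordinate, never the full sum.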
Beyond the benefits of increased modeling power, though, we make this generalization because it is easy to design algorithms for (1.4) that hide all of the complicated details that appear in algorithms which solve specific models; for example, it can be quite technical to design algorithms for (1.1) when two of the terms, say f_1 and f_2, are nonsmooth. Instead, we treat S_i just like we treat grad f_i.

A need for increased modeling power and algorithms with smaller per-iteration complexity has led us to Problem (1.4) and a class of algorithms that solve it. We want to parallelize these algorithms. But they are inherently sequential because each step updates just a single coordinate using just a single function or operator, and the other coordinates cannot be updated until the active one finishes its work. A big slowdown occurs if some partial derivatives or operator coordinates are much more complicated to evaluate than all the others, because we will spend most of our time updating just a few coordinates and leaving the others fixed. A solution is to eliminate the stalling between coordinate updates and allow multiple processors to work at their own pace, updating whenever they complete their work. We take this approach in this paper and allow asynchronous updates in our algorithm.

In total, we have taken the basic template for incremental gradient and block-coordinate optimization algorithms and increased its modeling power by introducing the root-finding Problem (1.4) and an algorithm to solve it. This stochastic monotone aggregated root-finding (SMART)^2 algorithm is inspired by another algorithm called SAGA [15]. In this work we have taken SAGA and generalized it: the SMART algorithm applies to operators, allows block-coordinate updates, allows asynchronous updates, requires less memory than SAGA, and requires less computation per iteration than SAGA. The theoretical guarantees for SMART are also stronger: the sequence of points formed by the algorithm, and not just the function values, will converge with probability 1 when a solution exists (even in infinite dimensional separable Hilbert spaces, but in this case SMART converges weakly). Like SAGA, SMART converges linearly when n^{-1} sum_{i=1}^n f_i is strongly convex, and beyond that, it converges linearly when n^{-1} sum_{i=1}^n S_i is essentially quasi-strongly monotone. It even converges linearly in the asynchronous, block-coordinate case. The rest of this paper describes SMART and proves that it converges.

^1 The technical name might be quasi-cocoercivity; we could not find it discussed elsewhere.
^2 We use the term monotone because S is a quasi-monotone operator.

2 Assumptions and Notation

The SMART algorithm solves (1.4) in a separable Hilbert space, like R^n, which we call H. We assume the Hilbert space H = H_1 \oplus ... \oplus H_m is a direct sum of m in N other Hilbert spaces H_1, ..., H_m. Given a vector x in H, we denote its jth component by x_j in H_j. Given a sequence {x^k}_{k in N} and a vector h in N^m, we define

    (\forall k \in \mathbb{N}) \quad x^{k-h} = \big(x_1^{k-h_1}, \dots, x_m^{k-h_m}\big)

and use the convention that x_j^k = x_j^0 if k <= 0.
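The componentwise delay convention x^{k-h} = (x_1^{k-h_1}, ..., x_m^{k-h_m}), with x_j^k = x_j^0 for k <= 0, is worth seeing in code; the following minimal Python sketch assumes the iterates are kept in a list named history, with history[k] holding x^k.

    import numpy as np

    def delayed_iterate(history, k, h):
        # x^{k-h}: coordinate j is read from iterate k - h_j, clamped at x^0
        m = history[0].shape[0]
        return np.array([history[max(k - h[j], 0)][j] for j in range(m)])

    history = [np.array([float(k), 10.0 + k]) for k in range(4)]   # x^0, ..., x^3
    print(delayed_iterate(history, k=3, h=[1, 5]))                 # -> [2., 10.]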
For j in {1,...,m}, we let <.,.>_j : H_j x H_j -> R denote the inner product on H_j, and we let ||.||_j be the corresponding norm. For all x, y in H, we let <x,y>_prod = sum_{j=1}^m <x_j, y_j>_j and ||x||_prod := sqrt(<x,x>_prod) be the standard inner product and norm on H. We also fix an inner product <.,.> : H x H -> R and denote the corresponding norm by ||.||. We make one assumption about this norm:

    (\exists\, M_j, \overline{M}_j > 0) : (\forall x \in \mathcal{H}) \quad \sum_{j=1}^m M_j \|x_j\|_j^2 \;\le\; \|x\|^2 \;\le\; \sum_{j=1}^m \overline{M}_j \|x_j\|_j^2.

We often choose inner products associated to self-adjoint linear maps P, which are defined for all x, y in H by <x,y> = <x,y>_P := <Px,y>_prod = <x,Py>_prod. We work with an underlying probability space denoted by (Omega, F, P), and we assume that the space H is equipped with the Borel sigma-algebra. We always let sigma(X), a subset of F, denote the sub-sigma-algebra generated by a random variable X. We use the shorthand a.s. to denote almost sure convergence of a sequence of random variables. We also assume that the operators S_i are measurable.

We say that a map T : H -> H is nonexpansive if it is 1-Lipschitz continuous, i.e., ||Tx - Ty|| <= ||x - y|| for all x, y in H. A map S : H -> H is called demiclosed at 0 if whenever a sequence {x^k}_{k in N} converges weakly to a point x in H and S(x^k) converges strongly to 0 in H, then S(x) = 0. Given any nonempty, closed, convex set C, let P_C(x) := argmin_{x* in C} ||x - x*|| denote its projection operator, and let d_C(x) := min_{x* in C} ||x - x*|| denote its distance function. For any closed convex set C in H, we let N_C denote the normal cone of C [2, Definition 6.37]. We define

    S := \frac{1}{n}\sum_{i=1}^n S_i \qquad \text{and} \qquad \mathcal{S} := \operatorname{zer}(S) = \{x \in \mathcal{H} \mid S(x) = 0\}.

Beyond (1.5), we use a single regularity property

    (\exists\, \mu > 0) : (\forall x \in \mathcal{H}) \quad \langle S(x),\, x - P_{\mathcal{S}}(x)\rangle \;\ge\; \mu\,\|x - P_{\mathcal{S}}(x)\|^2.    (2.1)

Operators that satisfy (2.1) are called essentially strongly quasi-monotone. In order for our most general results to hold, S need not satisfy (2.1), but when it is satisfied, our algorithm converges linearly. Although it is hardly comprehensive, one example is noteworthy: property (2.1) holds if S = A^* o grad f o A for a strongly convex function f and a matrix A [25, p. 287]. See [7,38,24] for information on convex error bounds.

Most of the concepts that we use in this paper can be found in [2]. See Table C.1 for a list of symbols.

3 The SMART Algorithm

We develop an iterative algorithm that solves (1.4). The algorithm forms a sequence of primal variables {x^k}_{k in N} in H that converges to a root of (1.4). The algorithm also maintains a sequence of dual variables, one per operator, which is denoted by {(y_1^k, ..., y_n^k)}_{k in N} in H^n.

We assume that at least one of m, the number of components, and n, the number of operators, is large, and so it costs a lot to obtain x^{k+1} from x^k by updating all m of its components, using all n of the operators. To lower the cost, we introduce an IID sequence of 2^{{1,...,m}}-valued (subsets of {1,...,m}) random variables {S_k}_{k in N} that determines which components of x^k we update at the kth iteration. The component-choosing variable S_k is coupled with an IID sequence of {1,...,n}-valued random variables {i_k}_{k in N} that determine which one of the n operators S_1, ..., S_n is evaluated at the kth iteration. If i_k = i and j in S_k, then the jth component of x^k is updated at iteration k using an evaluation of (S_i(.))_j; the other operators and components are left alone. The user can choose {i_k}_{k in N} and {S_k}_{k in N} however they like as long as

    q_j := P(j \in S_0) > 0 \ \text{for all } j; \qquad \text{and} \qquad p_{ij} := P(i_0 = i \mid j \in S_0) > 0 \ \text{exactly when } (S_i(\cdot))_j \not\equiv 0.

Unlike the point sequence {x^k}_{k in N}, the dual variables need not be updated at every iteration. Instead, we introduce an IID sequence of {0,1}-valued random variables {eps_k}_{k in N} that determines if and when we update the dual variables. If eps_k = 1, then the dual variables are updated at iteration k; otherwise, the dual variables are left alone. The user can choose {eps_k}_{k in N} however they like as long as

    \rho := P(\varepsilon_0 = 1) > 0.
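The conditions on q_j, p_{ij}, and rho leave the sampling scheme almost entirely free. One admissible choice, sketched below in Python, includes each coordinate in S_k independently with probability q_j, draws i_k from a fixed distribution (so that p_{ij} does not depend on j), and draws eps_k as a Bernoulli(rho) variable; the specific numbers are placeholders, not recommendations from this paper.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 3, 4
    q = np.full(m, 0.5)               # q_j = P(j in S_0) > 0
    p = np.array([0.5, 0.3, 0.2])     # operator distribution; here p_ij = p_i for every j
    rho = 0.25                        # rho = P(eps_0 = 1) > 0

    def sample_step():
        S_k = np.flatnonzero(rng.random(m) < q)   # random subset of {0,...,m-1}
        i_k = rng.choice(n, p=p)                  # operator index
        eps_k = int(rng.random() < rho)           # dual-update decision
        return S_k, i_k, eps_k

    print(sample_step())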
If eps_k = 1, and thus the dual variables must update at iteration k, we only require, at the absolute minimum, that y^k_{i_k} be updated; the rest of the variables may stay fixed. However, we allow the update of y^k_{i_k} to trigger the update of any subset of the other dual variables. The trigger graph G = (V,E), with vertices V = {1,...,n}, uses the edge set E in {1,...,n}^2 to encode, for each i in V, the set of dual variables that must be updated when i_k = i:

    (i, i') \in E \ \text{if, and only if, } i_k = i \text{ triggers the update of dual variable } y^k_{i'}.

When (i, i') in E, we simply say that i triggers i'. We require that for all i, (i,i) in E, but otherwise there are no constraints on G; it can be absolutely any graph, from a completely disconnected graph to a complete graph on n vertices. And one quantity, figuring only in our linear rate of convergence, is important: the probability that i is triggered and coordinate j is sampled simultaneously,

    p^T_{ij} := P\big((i_0, i) \in E,\ j \in S_0\big) = P\big((i_0, i) \in E \mid j \in S_0\big)\, q_j = \sum_{(i',i) \in E} p_{i'j}\, q_j,

which is easily computable, but it need not be known to guarantee convergence.

Often, the matrix of optimal operator values

    S^* := \big((S_i(x^*))_j\big)_{ij}, \qquad x^* \in \mathcal{S},

has some entries which are zero (by (1.5), S^* is independent of x^* in the solution set). For these zero entries, we simply set the corresponding dual variable to zero: y^k_{i,j} = 0 if S^*_{ij} = 0. In the extreme case that S^* = 0, all operators are zero at all the solutions, (1.4) reduces to the common zero problem, and the dual variables {(y_1^k, ..., y_n^k)}_{k in N} in H^n are not needed. Setting these particular dual variables to zero is not necessary, but by doing so, we use less memory than we otherwise would.

An algorithm that solves (1.4) might use the following 3-step process: given x^k,

1. sample S_k, i_k, and eps_k;
2. get x^{k+1} from x^k using {(S_{i_k}(x^k))_j | j in S_k} and (y_1^k, ..., y_n^k);
3. if eps_k = 1, get (y_1^{k+1}, ..., y_n^{k+1}) using x^k and (y_1^k, ..., y_n^k); otherwise set (y_1^{k+1}, ..., y_n^{k+1}) = (y_1^k, ..., y_n^k).

After i_k and S_k are sampled, the inactive operators and the inactive components stall until the active ones finish Steps 2 and 3. If we have access to a parallel computing device, stalling is wasteful, so in our algorithm we let all operators work in parallel and update x^k whenever they finish their work. Mathematically, we form sequences of delays {d_k}_{k in N} in {0,1,...,tau_p}^m and {e^i_k}_{k in N} in {0,1,...,tau_d}^m, and we replace x^k and y^k with iterates from the past whose coordinates were formed in the last tau_p or tau_d iterations, respectively. The final algorithm is below:

Algorithm 1 (SMART). Let {lambda_k}_{k in N} be a sequence of stepsizes. Choose x^0 in H and y_1^0, ..., y_n^0 in H arbitrarily, except that y^0_{i,j} = 0 if S^*_{ij} = 0. Then, for k in N, perform the following three steps:

1. Sampling: choose a set of coordinates S_k, an operator index i_k, and a dual update decision eps_k.

2. Primal update: set

    (\forall j \in S_k) \quad x_j^{k+1} = x_j^k - \frac{\lambda_k}{q_j m}\left(\frac{1}{n p_{i_k j}}\big(S_{i_k}(x^{k-d_k})\big)_j - \frac{1}{n p_{i_k j}}\, y_{i_k,j}^{\,k-e_k^{i_k}} + \frac{1}{n}\sum_{i=1}^n y_{i,j}^{\,k-e_k^{i}}\right);
    (\forall j \notin S_k) \quad x_j^{k+1} = x_j^k.

3. Dual update: if i_k triggers i, set

    (\forall j \in S_k \ \text{with } S^*_{ij} \neq 0) \quad y_{i,j}^{k+1} = y_{i,j}^k + \varepsilon_k\big((S_i(x^{k-d_k}))_j - y_{i,j}^k\big);
    (\forall j \notin S_k) \quad y_{i,j}^{k+1} = y_{i,j}^k.

Otherwise, set y_{i,j}^{k+1} = y_{i,j}^k.
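The following Python sketch instantiates Algorithm 1 in the simplest synchronous setting (d_k = 0, e^i_k = 0), with uniform operator probabilities p_{ij} = 1/n (so that 1/(n p_{ij}) = 1), a common coordinate probability q_j = q, only self-loops in the trigger graph, and rho = 1. The least-squares operators, the data, and the conservative heuristic stepsize are assumptions made for illustration; the admissible stepsize range is the one given by Theorem 3.1 below, not the choice used here.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, q = 10, 4, 0.5
    A = rng.standard_normal((n, m))
    b = A @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)

    def S(i, x):
        # operator i: gradient of (1/2)(A_i x - b_i)^2, so S(x) = (1/n) A^T (A x - b)
        return A[i] * (A[i] @ x - b[i])

    L = (A ** 2).sum(axis=1).max()
    lam = q * m / (3.0 * L)                  # heuristic stepsize, so lam/(q*m) = 1/(3L)

    x = np.zeros(m)
    y = np.zeros((n, m))                     # dual variables y_i, one per operator
    for k in range(20000):
        # 1. Sampling (eps_k = 1 every iteration since rho = 1)
        S_k = np.flatnonzero(rng.random(m) < q)
        i_k = rng.integers(n)
        # 2. Primal update on the sampled coordinates only
        S_ik = S(i_k, x)
        for j in S_k:
            x[j] -= lam / (q * m) * (S_ik[j] - y[i_k, j] + y[:, j].mean())
        # 3. Dual update: i_k triggers only itself
        for j in S_k:
            y[i_k, j] = S_ik[j]

    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # the root of S(x) = 0
    print("||x - x*|| =", np.linalg.norm(x - x_star))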
The x^{k-d_k} iterate in SMART can be totally synchronous, in which case d_{k,j} = 0 for all j; it can be consistent-read asynchronous, in which case d_{k,j} = d_{k,j'} for all j and j'; or it can be inconsistent-read asynchronous, in which case d_{k,j} != d_{k,j'} for some j and j'. Totally synchronous iterates are not delayed at all, so x^{k-d_k} = x^k for all k; consistent-read asynchronous iterates are delayed, but all of their coordinates are delayed by the same amount, so x^{k-d_k} in {x^k, x^{k-1}, ..., x^{k-tau_p}} for all k; inconsistent-read asynchronous iterates are delayed, and their coordinates can be delayed by different amounts, so x_j^{k-d_{k,j}} in {x_j^k, x_j^{k-1}, ..., x_j^{k-tau_p}} for all j, but x^{k-d_k} is not necessarily an element of {x^k, x^{k-1}, ..., x^{k-tau_p}} for all k.

Likewise, the y_i^{k-e^i_k} iterate in SMART can be totally synchronous, consistent-read asynchronous, or inconsistent-read asynchronous, corresponding to the cases e^i_{k,j} = 0 for all j, e^i_{k,j} = e^i_{k,j'} for all j and j', or e^i_{k,j} != e^i_{k,j'} for some j and j', respectively.

The delays {d_k}_{k in N} and {e^i_k}_{k in N} come with no particular order. They can induce maximal delays at all times, e.g., d_{k,j} = tau_p for all j and k, in which case the oldest information possible is used in every iteration; they can be cyclic, e.g., d_{k,j} = k mod (tau_p + 1) for all j and k, in which case the same information is used for tau_p consecutive iterations in a row, and all the intermediate information is thrown away; but in general, the delays can be arbitrary. These delays are artificially imposed, but we can also incur delays that are beyond our control.

Uncontrolled delays can occur in the x^k iterate when a processor, called Proc_1, attempts to read coordinates x_1^k, ..., x_m^k while another processor, called Proc_2, attempts to replace x^k with x^{k+1}. It can happen that Proc_1 successfully reads coordinate x_1^k before Proc_2 replaces it with x_1^{k+1}, but that Proc_2 replaces x_2^k, ..., x_m^k with x_2^{k+1}, ..., x_m^{k+1} before Proc_1 attempts to read this group of coordinates. When Proc_1 finishes reading, it will have the iterate (x_1^k, x_2^{k+1}, ..., x_m^{k+1}), which is not necessarily equal to any previous iterate x^t with t <= k. This effect is exacerbated if multiple processors attempt to write and read simultaneously, but in SMART, x^{k-d_k} is inconsistent-read asynchronous, so these uncontrolled delays cause no trouble.

In general, including inconsistent-read asynchronous iterates leads to tedious convergence proofs with many-term recursive identities and complicated stepsizes. The recursive identities are necessary to control algorithm progress in a chaotic environment. The stepsizes, on the other hand, can be optimized and have a clear dependence on the delays, sampling probabilities, and problem data. In both the inconsistent-read and consistent-read cases, the algorithm will converge with the same range of parameters. However, the rates of convergence depend on a measure of inconsistency, which we call delta:

    \delta := \sup_{k \in \mathbb{N}}\ \max_{j, j' \in \{1,\dots,m\}} |d_{k,j} - d_{k,j'}|.    (3.1)

When delta = 0, the x^{k-d_k} are consistent-read asynchronous, and the convergence rates improve. Otherwise delta in {0,...,tau_p}, and the convergence rates degrade with increasing delta. Of course, when delta is not known explicitly, it can be replaced by its upper bound tau_p.
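A small Python sketch of an inconsistent read and of the quantity in (3.1); the stored iterates and the per-coordinate delay values are arbitrary illustrative choices.

    import numpy as np

    history = [np.array([k + 0.0, k + 10.0, k + 100.0]) for k in range(6)]   # x^0, ..., x^5
    k, d_k = 5, np.array([0, 2, 3])          # per-coordinate delays at iteration k

    x_read = np.array([history[k - d_k[j]][j] for j in range(3)])
    delta_k = d_k.max() - d_k.min()          # max_{j,j'} |d_{k,j} - d_{k,j'}|

    print(x_read)    # [5., 13., 102.] -- equal to no single past iterate x^t
    print(delta_k)   # 3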
If all of the sampling variables are statistically independent and the stepsizes are chosen small enough, SMART converges:

Theorem 3.1 (Convergence of SMART). For all k >= 0, let I_k = sigma((i_k, S_k)), let E_k = sigma(eps_k), let

    \mathcal{F}_k := \sigma\big(\{x^l\}_{l=0}^k \cup \{y_1^l, \dots, y_n^l\}_{l=0}^k\big),

and suppose that {I_k, E_k, F_k} are independent. Suppose that (1.5) holds. Finally, suppose that there are constants \underline{\lambda}, \overline{\lambda} > 0 such that {lambda_k}_{k in N} is contained in [\underline{\lambda}, \overline{\lambda}], and

    \overline{\lambda}^2 < \min_{i,j} \frac{2\underline{\lambda}\, n^2 p_{ij}\beta_{ij}}{3 m M_j \sqrt{\tau_p q_j} + 2 q_j M_j m} \ \text{if } S^* = 0; \qquad
    \overline{\lambda}^2 < \min_{i,j} \frac{\underline{\lambda}\, n^2 p_{ij}\beta_{ij}}{M_j m \sqrt{\tau_p q_j}\,\sqrt{2(\tau_d+2)} + M_j (m \tau_d q_j + 2)} \ \text{otherwise}.    (3.2)

Then

1. Convergence of operator values. For i in {1,...,n}, the sequence of H-valued random variables {S_i(x^{k-d_k})}_{k in N} a.s. converges strongly to S_i(x^*).

2. Weak convergence. Suppose that S is demiclosed at 0. Then the sequence of H-valued random variables {x^k}_{k in N} a.s. weakly converges to an S-valued (solution-valued) random variable.

3. Linear convergence. Let eta < min_{i,j}{rho p^T_{ij}}, let alpha in [0,1), let q = min_j{q_j}, let M = min_j{M_j}, and let

    \overline{\lambda} \le \min_{i,j} \frac{2\eta(1-\alpha)\, n^2 p_{ij}\beta_{ij}}{\,2 M_j q_j (m\tau_d + 2)\Big(1 + \frac{\delta\eta}{\tau_d+1} + \frac{m M\big(5\sqrt{2(\tau_d+2)}\,\alpha + 2\tau_p\delta\sqrt{q}\big)}{\sqrt{2(\tau_d+2)}}\Big) + \frac{M_j \tau_p m \sqrt{2(\tau_d+2)}}{\sqrt{q}}\Big(2 + \frac{\eta}{1-\eta}\Big) + 4\mu(\tau_d+1)\alpha(1-\alpha)\, n^2 p_{ij}\beta_{ij}\,}.    (3.3)

Then, if (2.1) holds, there exists a constant C(x^0, phi^0) in R_{>=0}, depending on x^0 and phi^0, such that for all k in N,

    \mathbb{E}\big[d_{\mathcal{S}}^2(x^k)\big] \;\le\; \Big(1 - \frac{2\alpha\mu\underline{\lambda}}{\tau_p+1}\Big)^{k/(\tau_p+1)} \Big(d_{\mathcal{S}}^2(x^0) + C(x^0, \varphi^0)\Big).

The proof of Theorem 3.1 without delay (i.e., d_k = 0 and e^i_k = 0) is presented in Section 7. The proof of the full theorem is in Appendix A.

    Algorithm                                  | Stepsize lambda that is largest possible | Stepsize lambda that gives the best rate | Rate for the best lambda
    SAGA (4.2)                                 | 1/(2L)                                   | 1/(4L + mu N)                            | 1 - mu/(4L + mu N)
    SVRG (4.3), average update frequency tau   | 1/(2L)                                   | 1/(4L + mu tau)                          | 1 - mu/(4L + mu tau)
    SVRG (4.4), scheduled update frequency tau | 1/((tau+2)L)                             | 1/(2L(tau+2) + mu(tau+1))                | 1 - mu/(2L(tau+2) + mu(tau+1))
    Finito (4.5)                               | 1; gamma = 2/L                           | 1; gamma = 1/L                           | 1 - (1/(4N))(1 - sqrt(1 - mu_hat/L))
    SDCA (4.8)                                 | 3/4                                      | 3/8                                      | 1 - 3 mu_0/(8(L + mu_0 N))
    Alternating Projections (4.10)             | 1                                        | 1/2                                      | 1 - min{1/2, N eps^2 mu_hat L^{-2}}
    Kaczmarz (4.11)                            | 1                                        | 1/2                                      | 1 - 1/(2N ||A^{-1}||_2^2)

Table 4.1: The stepsizes and convergence rates for the special cases of SMART introduced in Section 4.

4 Connections with Other Algorithms

4.1 SAGA, SVRG, and S2GD

In the simplest case of (1.4),

    \underset{x \in \mathcal{H}_0}{\text{minimize}} \ \frac{1}{N}\sum_{i=1}^N f_i(x),    (4.1)

where each f_i is convex and differentiable, and grad f_i is L-Lipschitz, we set S_i := grad f_i. (We also set H = H_0, use the canonical norm ||.|| = ||.||_0, and ignore coordinates and partial derivatives for the moment.) By Proposition C.1, we can set beta_{ij} = (LN)^{-1} for all i and j. With this choice of operators, the condition (2.1) is, of course, implied by the mu-strong convexity of f := N^{-1} sum_{i=1}^N f_i. But whether or not f is strongly convex, if lambda is set according to Theorem 3.1 or Table 4.1, SMART will converge.

The SAGA algorithm [15] applied to (4.1) selects a function uniformly at random, performs a primal update, and updates a single dual variable (i.e., the trigger graph is completely disconnected). When, for all i, we set y_i^0 = grad f_i(phi_i^0) with phi_i^0 in H, SAGA takes the form:^3

    x^{k+1} = x^k - \lambda\Big(\nabla f_{i_k}(x^k) - y_{i_k}^k + \frac{1}{N}\sum_{i=1}^N y_i^k\Big);
    y_i^{k+1} = \nabla f_i(x^k) \ \text{if } i_k = i; \qquad y_i^{k+1} = y_i^k \ \text{otherwise}.    (4.2)

^3 Use n = N, m = 1, d_k = 0, e^i_k = 0, q_1 = 1, tau_p = tau_d = 0, p_{ij} = N^{-1}, rho = 1, E = {(i,i) | i in V}, and p^T_{ij} = N^{-1}.
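A direct Python sketch of the SAGA update (4.2), assuming least-squares terms f_i(x) = (1/2)(<a_i, x> - b_i)^2 and a conservative stepsize 1/(3L) rather than the tuned constants of Table 4.1; the data are synthetic placeholders.

    import numpy as np

    rng = np.random.default_rng(2)
    N, d = 50, 5
    a = rng.standard_normal((N, d))
    b = a @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

    grad = lambda i, x: a[i] * (a[i] @ x - b[i])   # gradient of f_i
    L = (a ** 2).sum(axis=1).max()                 # Lipschitz constant of the grad f_i
    lam = 1.0 / (3.0 * L)

    x = np.zeros(d)
    y = np.array([grad(i, x) for i in range(N)])   # y_i^0 = grad f_i(phi_i^0), phi_i^0 = x^0
    y_avg = y.mean(axis=0)
    for k in range(5000):
        i = rng.integers(N)
        g = grad(i, x)
        x -= lam * (g - y[i] + y_avg)              # primal update of (4.2)
        y_avg += (g - y[i]) / N                    # keep the running average consistent
        y[i] = g                                   # dual update of (4.2): only y_i changes

    x_star, *_ = np.linalg.lstsq(a, b, rcond=None)
    print("||x - x*|| =", np.linalg.norm(x - x_star))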
In the SAGA algorithm, each dual variable is just a gradient, y_i^k := grad f_i(phi_i^k), for a past iterate phi_i^k. The SAGA algorithm stores the stale gradients {grad f_i(phi_i^k) | i = 1,...,N}, which, if x in R^d, is the size of a d x N matrix. However, in logistic and least squares regression problems, the functions f_i have a simple form, f_i(x) = psi_i(<a_i, x>), where the psi_i : R -> R are differentiable functions and the a_i in R^d are data points; in this case, grad f_i(x) = psi_i'(<a_i, x>) a_i, so the cost of storing {grad f_i(phi_i^k) | i = 1,...,N} can be reduced to that of (psi_1'(<a_1, phi_1^k>), ..., psi_N'(<a_N, phi_N^k>))^T in R^N, an N-dimensional vector. But not all problems have this parametric form, so the SAGA algorithm is somewhat limited in scope.

The Stochastic Variance Reduced Gradient (SVRG) algorithm [20], and the similar S2GD algorithm [23], solve (4.1), but in contrast to SAGA, these algorithms store just a single vector, namely grad f(x_tilde^k). As a consequence, SVRG and S2GD must make repeated, though infrequent, evaluations of the full gradient grad f, in addition to one extra evaluation of grad f_{i_k} per iteration:

    x^{k+1} = x^k - \lambda\Big(\nabla f_{i_k}(x^k) - \nabla f_{i_k}(\tilde{x}^k) + \frac{1}{N}\sum_{i=1}^N \nabla f_i(\tilde{x}^k)\Big);
    \tilde{x}^{k+1} = x^{k-t_k} \ \text{if } k \equiv 0 \ \mathrm{mod}\ \tau - 1; \qquad \tilde{x}^{k+1} = \tilde{x}^k \ \text{otherwise};

where {i_k}_{k in N} is an IID sequence of uniformly distributed {1,...,N}-valued random variables and {t_k}_{k in N} is an IID sequence of uniformly distributed {0,...,tau-1}-valued random variables. The full gradient grad f(x_tilde^k) and the point x_tilde^k are only updated once every tau in N iterations.

The SVRG algorithm^4 solves the SAGA storage problem, but it requires f to be strongly convex, so it is also somewhat limited in scope. SMART can mimic SVRG, even without strong convexity, by selecting a function uniformly at random, performing a primal update, and updating all dual variables with probability tau^{-1} (i.e., the trigger graph is the complete graph). When, for all i, we set y_i^0 = grad f_i(phi^0) with phi^0 in H, our SVRG clone takes the form:^5

    x^{k+1} = x^k - \lambda\Big(\nabla f_{i_k}(x^k) - y_{i_k}^k + \frac{1}{N}\sum_{i=1}^N y_i^k\Big);
    (\forall i) \quad y_i^{k+1} = y_i^k + \varepsilon_k\big(\nabla f_i(x^k) - y_i^k\big).    (4.3)

As in SAGA, each dual variable is just a gradient y_i^k = grad f_i(phi^k), for a past iterate phi^k, but unlike SAGA, the past iterate is the same for all i. On average, all dual variables y_i^k and, hence, the full gradient grad f(phi^k), are only updated once every tau iterations.

Another clone of SVRG, this time with dual variables that update once every tau iterations, comes from SMART as applied in (4.3), but with a cyclic uniform dual variable delay e^i_k = e_k := k mod (tau + 1):^6

    x^{k+1} = x^k - \lambda\Big(\nabla f_{i_k}(x^k) - y_{i_k}^{k-e_k} + \frac{1}{N}\sum_{i=1}^N y_i^{k-e_k}\Big);
    (\forall i) \quad y_i^{k+1} = \nabla f_i(x^k).    (4.4)

The dual variable y_i^k = grad f_i(x^k), and hence the full gradient grad f(x^k), is only updated once every tau iterations.

^4 From here on, we will ignore the distinction between SVRG and S2GD.
^5 Use n = N, m = 1, M_1 = 1, d_k = 0, e^i_k = 0, q_j = 1, tau_p = tau_d = 0, p_{ij} = N^{-1}, rho = tau^{-1}, E = V x V, and p^T_{ij} = 1.
^6 Use n = N, m = 1, M_1 = 1, d_k = 0, e^i_k = k mod tau, q_j = 1, tau_p = 0, tau_d = tau, p_{ij} = N^{-1}, rho = 1, E = V x V, and p^T_{ij} = 1.
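A Python sketch of the SVRG-like special case (4.3): the primal step looks like SAGA's, but all dual variables are refreshed at once with probability 1/tau, so only the point phi at which the full gradient was last evaluated, together with that gradient, needs to be kept. The data and the stepsize are illustrative assumptions, not the constants of Table 4.1.

    import numpy as np

    rng = np.random.default_rng(3)
    N, d, tau = 50, 5, 25
    a = rng.standard_normal((N, d))
    b = a @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

    grad = lambda i, x: a[i] * (a[i] @ x - b[i])
    full_grad = lambda x: a.T @ (a @ x - b) / N
    L = (a ** 2).sum(axis=1).max()
    lam = 1.0 / (4.0 * L)

    x = np.zeros(d)
    phi = x.copy()                      # point where all duals y_i = grad f_i(phi) live
    g_full = full_grad(phi)             # (1/N) sum_i y_i
    for k in range(8000):
        i = rng.integers(N)
        direction = grad(i, x) - grad(i, phi) + g_full   # uses the current duals y^k
        if rng.random() < 1.0 / tau:                     # eps_k = 1: refresh every dual at x^k
            phi = x.copy()
            g_full = full_grad(phi)
        x -= lam * direction                             # primal update of (4.3)

    x_star, *_ = np.linalg.lstsq(a, b, rcond=None)
    print("||x - x*|| =", np.linalg.norm(x - x_star))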
4.2 Finito

The Finito algorithm [16] solves (4.1), but unlike SAGA and SVRG, Finito stores one point and one gradient per function, or a superposition of the two:

    x_i^{k+1} = \frac{1}{N}\sum_{l=1}^N \Big(x_l^k - \frac{1}{2\hat{\mu}}\nabla f_l(x_l^k)\Big) \ \text{if } i = i_k; \qquad x_i^{k+1} = x_i^k \ \text{otherwise},

where each f_i is mu_hat-strongly convex. For each function f_i, Finito stores x_i^k - (2 mu_hat)^{-1} grad f_i(x_i^k), which can be substantially costlier than storing a matrix of gradients. In addition, only when each function f_i(x) is mu_hat-strongly convex and the bound N >= 2L mu_hat^{-1} holds is Finito known to converge.

SMART can mimic SAGA and SVRG with multiple operators, but with one operator S = S_1, SMART recovers the Finito algorithm. To get Finito, recast (4.1) into an equivalent form with duplicated variables,

    \underset{(x_1,\dots,x_N) \in \mathcal{H}_0^N}{\text{minimize}} \ \frac{1}{N}\sum_{i=1}^N f_i(x_i) \quad \text{subject to: } x_1 = x_2 = \cdots = x_N,

let D := {(x,...,x) in H_0^N | x in H_0} denote the diagonal set, and define the operator (for a fixed gamma > 0)

    (\forall x \in \mathcal{H}_0^N) \quad S(x) := S_1(x) = x - P_D\big(x_1 - \gamma\nabla f_1(x_1), \dots, x_N - \gamma\nabla f_N(x_N)\big).

Then the Finito algorithm selects a coordinate (and hence a function) uniformly at random and performs a primal update; the sole dual variable is set to zero at all iterations:^7

    x_j^{k+1} = (1-\lambda)\,x_j^k + \frac{\lambda}{N}\sum_{l=1}^N \big(x_l^k - \gamma\nabla f_l(x_l^k)\big) \ \ \forall j \in S_k; \qquad x_j^{k+1} = x_j^k \ \text{otherwise}.    (4.5)

Unlike the standard Finito algorithm [16], Algorithm (4.5) converges with or without strong convexity. The constants beta_{1j} depend on the constant gamma and the inner products that we place on H_j := H_0 and H := H_0^N. The projection operator P_D also depends on these inner products. As long as^8 gamma <= 2L^{-1}, the operator S satisfies (1.5) with beta_{1j} = 4^{-1} gamma L, and if mu := 1 - sqrt(1 - 2 gamma mu_hat + gamma^2 mu_hat L), it is mu-essentially strongly quasi-monotone, provided that each f_i is mu_hat-strongly convex and we make the choices <x_j, z_j>_j := <x_j, z_j>_0 and ||.|| = ||.||_prod; the operator S need not be strongly monotone unless each f_i is strongly convex. See Proposition C.2.

The space H = H_0^N is high-dimensional, so even storing a single vector x^k is expensive. And in practice, the gradients should also be stored, unless they are simple to recompute. Thus, it is clear that Finito, like SAGA, but unlike SVRG, is impracticable if m is too large and memory is limited. Nevertheless, Finito performs well in practice, often better than other incremental gradient methods.

^7 Use n = 1, m = N, M_j = 1, d_k = 0, e^i_k = 0, S^* = 0, q_j = N^{-1}, tau_p = 0, tau_d = 0, p_{ij} = 1, rho = 1, E = {(1,1)}, and p^T_{ij} = N^{-1}.
^8 See Proposition C.2.
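Finally, a Python sketch of the Finito-like special case (4.5) on the same kind of least-squares sum: each "coordinate" of the product space H_0^N is a full copy of the decision variable, one per function, and the sampled copy is pulled toward the average of the gradient-step points x_l - gamma grad f_l(x_l). The choices gamma = 1/(2L) and lambda = 1/2, and the data, are illustrative assumptions and are not the tuned constants of Table 4.1.

    import numpy as np

    rng = np.random.default_rng(4)
    N, d = 30, 5
    a = rng.standard_normal((N, d))
    b = a @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

    grad = lambda i, z: a[i] * (a[i] @ z - b[i])
    L = (a ** 2).sum(axis=1).max()
    gamma, lam = 1.0 / (2.0 * L), 0.5

    X = np.zeros((N, d))                               # x = (x_1, ..., x_N) in H_0^N
    G = np.array([grad(i, X[i]) for i in range(N)])    # cached gradients at the x_i
    for k in range(20000):
        j = rng.integers(N)                            # sample one block (and function)
        avg = (X - gamma * G).mean(axis=0)             # (1/N) sum_l (x_l - gamma grad f_l(x_l))
        X[j] = (1.0 - lam) * X[j] + lam * avg          # primal update of (4.5)
        G[j] = grad(j, X[j])                           # refresh the cached gradient for block j

    x_star, *_ = np.linalg.lstsq(a, b, rcond=None)
    print("||x_1 - x*|| =", np.linalg.norm(X[0] - x_star))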
