Breaking Locality Accelerates Block Gauss-Seidel

Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, Benjamin Recht
University of California, Berkeley

Abstract

Recent work by Nesterov and Stich [16] showed that momentum can be used to accelerate the rate of convergence for block Gauss-Seidel in the setting where a fixed partitioning of the coordinates is chosen ahead of time. We show that this setting is too restrictive, constructing instances where breaking locality by running non-accelerated Gauss-Seidel with randomly sampled coordinates substantially outperforms accelerated Gauss-Seidel with any fixed partitioning. Motivated by this finding, we analyze the accelerated block Gauss-Seidel algorithm in the random coordinate sampling setting. Our analysis captures the benefit of acceleration with a new data-dependent parameter which is well behaved when the matrix sub-blocks are well-conditioned. Empirically, we show that accelerated Gauss-Seidel with random coordinate sampling provides speedups for large scale machine learning tasks when compared to non-accelerated Gauss-Seidel and the classical conjugate-gradient algorithm.

1 Introduction

The randomized Gauss-Seidel method is a commonly used iterative algorithm for computing the solution of an $n \times n$ linear system $Ax = b$ by updating a single coordinate at a time in a randomized order. While this approach is known to converge linearly to the true solution when $A$ is positive definite (see e.g. [9]), in practice it is often more efficient to update a small block of coordinates at a time due to the effects of cache locality.

In extending randomized Gauss-Seidel to the block setting, a natural question that arises is how one should sample the next block. At one extreme, a fixed partition of the coordinates is chosen ahead of time; the algorithm is restricted to randomly selecting blocks from this fixed partitioning, thus favoring data locality. At the other extreme, we break locality by sampling a new set of random coordinates to form a block at every iteration.

Theoretically, the fixed partition case is well understood both for Gauss-Seidel [20, 5] and its Nesterov accelerated variant [16]. More specifically, at most $O(\mu_{\mathrm{part}}^{-1}\log(1/\varepsilon))$ iterations of Gauss-Seidel are sufficient to reach a solution with at most $\varepsilon$ error measured in the $\ell_2$ norm, where $\mu_{\mathrm{part}}$ is a quantity which measures how well the matrix $A$ is preconditioned by the block diagonal matrix containing the sub-blocks corresponding to the fixed partitioning. When acceleration is used, Nesterov and Stich [16] show that the rate improves to $O\big(\sqrt{\tfrac{n}{p}\mu_{\mathrm{part}}^{-1}}\,\log(1/\varepsilon)\big)$, where $p$ denotes the size of each partition.

For the random coordinate selection model, the existing literature is less complete. While it is known [20, 5] that the iteration complexity with random coordinate selection is $O(\mu_{\mathrm{rand}}^{-1}\log(1/\varepsilon))$ for an $\varepsilon$ error solution, $\mu_{\mathrm{rand}}$ is another instance dependent quantity which is not directly comparable to $\mu_{\mathrm{part}}$. Hence it is not obvious how much better, if at all, one expects random coordinate selection to perform compared to fixed partitioning.

Our first contribution in this paper is to show that, when compared to the random coordinate selection model, the fixed partition model can perform very poorly in terms of the iteration complexity needed to reach a pre-specified error.
Specifically, we present a family of instances (similar to the matrices recently studied by Lee and Wright [7]) where non-accelerated Gauss-Seidel with random coordinate selection performs arbitrarily faster than both non-accelerated and even accelerated Gauss-Seidel, using any fixed partition.

This finding motivates us to further study the benefits of acceleration under the random coordinate selection model. Interestingly, the benefits are more nuanced under this model. We show that acceleration improves the rate from $O(\mu_{\mathrm{rand}}^{-1}\log(1/\varepsilon))$ to $O\big(\sqrt{\nu\,\mu_{\mathrm{rand}}^{-1}}\,\log(1/\varepsilon)\big)$, where $\nu$ is a new instance dependent quantity that satisfies $\nu \leq \mu_{\mathrm{rand}}^{-1}$. We derive a bound on $\nu$ which suggests that if the sub-blocks of $A$ are all well conditioned, then acceleration can provide substantial speedups. We note that this is merely a sufficient condition, and our experiments suggest that our bound is conservative.

In the process of deriving our results, we also develop a general proof framework for randomized accelerated methods based on Wilson et al. [29] which avoids the use of estimate sequences in favor of an explicit Lyapunov function. Using our proof framework we are able to recover recent results [16, 1] on accelerated coordinate descent. Furthermore, our proof framework allows us to immediately transfer our results on Gauss-Seidel over to the randomized accelerated Kaczmarz algorithm, extending a recent result by Liu and Wright [11] on updating a single constraint at a time to the block case.

Finally, we empirically demonstrate that despite its theoretical nuances, accelerated Gauss-Seidel using random coordinate selection can provide significant speedups in practical applications over Gauss-Seidel with fixed partition sampling, as well as the classical conjugate-gradient (CG) algorithm. As an example, for a kernel ridge regression (KRR) task in machine learning on the CIFAR-10 dataset, acceleration with random coordinate sampling performs up to 2× faster than acceleration with a fixed partitioning to reach an error tolerance of $10^{-2}$, with the gap substantially widening for smaller error tolerances. Furthermore, it performs over 5× faster than conjugate-gradient on the same task.

2 Background

In this section, we develop the relevant context to explain our results in more detail. For the remainder of this paper, we assume that we are given an $n \times n$ matrix $A$ which is positive definite, and an $n$ dimensional response vector $b$. We also fix an integer $p$ which denotes a block size. To avoid special edge cases we will assume that $1 < p < n$. We will also assume for simplicity that $n$ is an integer multiple of $p$.

Under the assumption of $A$ being positive definite, the function $f : \mathbb{R}^n \to \mathbb{R}$ defined as
$$f(x) = \frac{1}{2} x^T A x - x^T b, \qquad (2.1)$$
is a strongly convex and smooth function on $\mathbb{R}^n$. Recent analysis of Gauss-Seidel [5] proceeds by noting the connection between Gauss-Seidel and (block) coordinate descent on $f$. This is the point of view we will take in this paper.
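To make this coordinate-descent viewpoint concrete, the following minimal numpy sketch (ours, not from the paper) checks that exactly minimizing $f$ over a block of coordinates $J$, with the remaining coordinates held fixed, reproduces the block Gauss-Seidel update; the helper names are illustrative.

```python
import numpy as np

def f(x, A, b):
    """Quadratic objective f(x) = 0.5 x^T A x - x^T b from (2.1)."""
    return 0.5 * x @ A @ x - x @ b

def block_cd_step(x, A, b, J):
    """Exactly minimize f over the coordinates in J, holding the others fixed.
    Setting the gradient restricted to J to zero gives A[J,J] x[J] = b[J] - A[J,Jc] x[Jc]
    (Jc = complement of J), which is precisely the block Gauss-Seidel update."""
    x = x.copy()
    r = b - A @ x                                   # residual at the current iterate
    x[J] += np.linalg.solve(A[np.ix_(J, J)], r[J])  # solve the p x p sub-problem
    return x

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                         # a positive definite test matrix
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
x1 = block_cd_step(x0, A, b, [1, 4])
assert f(x1, A, b) <= f(x0, A, b)                   # exact block minimization never increases f
```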
Notation. The notation is standard. $[n] = \{1, 2, \dots, n\}$ refers to the set of integers from 1 to $n$, and $2^{[n]}$ refers to the set of all subsets of $[n]$. We let $1_n \in \mathbb{R}^n$ denote the vector of all ones. Given a square matrix $M$ with real eigenvalues, we let $\lambda_{\max}(M)$ (resp. $\lambda_{\min}(M)$) denote the maximum (resp. minimum) eigenvalue of $M$. For two symmetric matrices $M, N$, the notation $M \succeq N$ (resp. $M \succ N$) means that the matrix $M - N$ is positive semi-definite (resp. positive definite). Every such $M \succ 0$ defines a real inner product space via the inner product $\langle x, y \rangle_M = x^T M y$. We refer to its induced norm as $\|x\|_M = \sqrt{\langle x, x \rangle_M}$. The standard Euclidean inner product and norm will be denoted as $\langle \cdot, \cdot \rangle$ and $\|\cdot\|_2$, respectively. For an arbitrary matrix $M$, we let $M^\dagger$ denote its Moore-Penrose pseudo-inverse and $P_M$ the orthogonal projector onto the range of $M$, which we denote as $\mathcal{R}(M)$. When $M \succeq 0$, we let $M^{1/2}$ denote its unique Hermitian square root. Finally, for a square $n \times n$ matrix $M$, $\mathrm{diag}(M)$ is the $n \times n$ diagonal matrix which contains the diagonal elements of $M$.

Partitions on $[n]$. In what follows, unless stated otherwise, whenever we discuss a partition of $[n]$ we assume that the partition is given by $\bigcup_{i=1}^{n/p} J_i$, where
$$J_1 = \{1, 2, \dots, p\}, \quad J_2 = \{p+1, p+2, \dots, 2p\}, \quad \dots$$
This is without loss of generality because for any arbitrary equal sized partition of $[n]$, there exists a permutation matrix $\Pi$ such that all our results apply under the change of variables $A \leftarrow \Pi^T A \Pi$ and $b \leftarrow \Pi^T b$.

2.1 Existing rates for randomized block Gauss-Seidel

We first describe the sketching framework of [20, 5] and show how it yields rates for Gauss-Seidel when blocks are chosen via a fixed partition or randomly at every iteration. While the sketching framework is very general, we focus only on the special case where the sketch matrix represents column sampling. We introduce the sketching framework solely because it allows us to provide a unified analysis of both cases; many of our structural results will be stated in terms of the sketching framework.

To be more precise, let $D$ be a distribution over $\mathbb{R}^{n \times p}$, and let $S_k \sim D$ be drawn iid from $D$. If we perform block coordinate descent by minimizing $f$ along the range of $S_k$, then the randomized block Gauss-Seidel update is given by
$$x_{k+1} = x_k - S_k (S_k^T A S_k)^\dagger S_k^T (A x_k - b), \qquad k = 0, 1, \dots. \qquad (2.2)$$

Column sampling. Every index set $J \in 2^{[n]}$ with $|J| = p$ induces a sketching matrix $S(J) = (e_{J(1)}, \dots, e_{J(p)})$, where $e_i$ denotes the $i$-th standard basis vector in $\mathbb{R}^n$ and $J(1), \dots, J(p)$ is any ordering of the elements of $J$ (the ordering is irrelevant because the quantity $S(J)(S(J)^T A S(J))^\dagger S(J)^T$ is invariant under any ordering). By equipping different probability measures on $2^{[n]}$, one can easily describe fixed partition sampling as well as random coordinate sampling (and many other sampling schemes). The former puts uniform mass on the index sets $J_1, \dots, J_{n/p}$, whereas the latter puts uniform mass on all $\binom{n}{p}$ index sets of size $p$. Furthermore, in the sketching framework there is no requirement to use a uniform distribution, nor to use a fixed $p$ at every iteration. For this paper, however, we will restrict our attention to these cases.

Existing rates. Under the assumptions stated above, [20, 5] show that for every $k \geq 0$, the sequence (2.2) satisfies
$$\mathbb{E}[\|x_k - x_*\|_A] \leq (1 - \mu)^{k/2} \|x_0 - x_*\|_A, \qquad \mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}]). \qquad (2.3)$$
The expectation on the left hand side is taken with respect to the randomness of $S_0, S_1, \dots$, and the expectation in the definition of $\mu$ is taken with respect to $S \sim D$. Under both fixed partitioning and random coordinate selection, $\mu > 0$ is guaranteed (see e.g. [5], Lemma 4.3). Thus, (2.2) achieves a linear rate of convergence to the true solution, with the rate governed by the quantity $\mu$ shown above.
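The column-sampling sketches and the update (2.2) are simple to write down explicitly. The following numpy sketch (ours, for illustration; in practice one indexes the block directly rather than forming $S$) implements (2.2) and the two sampling schemes just described.

```python
import numpy as np

def sketch_columns(J, n):
    """Column-sampling sketch S(J): column i is the standard basis vector e_{J[i]}."""
    S = np.zeros((n, len(J)))
    S[J, np.arange(len(J))] = 1.0
    return S

def gauss_seidel(A, b, x0, sample_block, iters):
    """Randomized block Gauss-Seidel (2.2): x <- x - S (S^T A S)^+ S^T (A x - b)."""
    x = x0.copy()
    for _ in range(iters):
        J = sample_block()
        S = sketch_columns(J, A.shape[0])
        x = x - S @ np.linalg.pinv(S.T @ A @ S) @ S.T @ (A @ x - b)
    return x

rng = np.random.default_rng(0)
n, p = 12, 3
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                        # positive definite test matrix
b = rng.standard_normal(n)

partition = [np.arange(i, i + p) for i in range(0, n, p)]
# Fixed partition sampling: pick one of the pre-chosen blocks J_1, ..., J_{n/p}.
fixed = lambda: partition[rng.integers(len(partition))]
# Random coordinate sampling: draw p coordinates uniformly without replacement.
uniform = lambda: rng.choice(n, size=p, replace=False)

x_fixed = gauss_seidel(A, b, np.zeros(n), fixed, 200)
x_unif = gauss_seidel(A, b, np.zeros(n), uniform, 200)
```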
We now specialize (2.3) to fixed partitioning and random coordinate sampling, and provide some intuition for why we expect the latter to outperform the former in terms of iteration complexity. We first consider the case when the sampling distribution corresponds to fixed partitioning. Here, $\mu = \mu_{\mathrm{part}}$ corresponds to a measure of how close the product of $A$ with the inverse of its block diagonal is to the identity matrix, defined as
$$\mu_{\mathrm{part}} = \frac{p}{n}\,\lambda_{\min}\!\left( A \begin{pmatrix} A_{J_1}^{-1} & 0 & \cdots & 0 \\ 0 & A_{J_2}^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{J_{n/p}}^{-1} \end{pmatrix} \right). \qquad (2.4)$$
Above, $A_{J_i}$ denotes the $p \times p$ sub-matrix of $A$ indexed by the $i$-th partition. A loose lower bound on $\mu_{\mathrm{part}}$ is
$$\mu_{\mathrm{part}} \geq \frac{p}{n}\,\frac{\lambda_{\min}(A)}{\max_{1 \leq i \leq n/p} \lambda_{\max}(A_{J_i})}. \qquad (2.5)$$
On the other hand, in the random coordinate case, Qu et al. [20] derive a lower bound on $\mu = \mu_{\mathrm{rand}}$ as
$$\mu_{\mathrm{rand}} \geq \frac{p}{n}\left( \frac{p-1}{n-1} + \left(1 - \frac{p-1}{n-1}\right)\frac{\max_{1 \leq i \leq n} A_{ii}}{\lambda_{\min}(A)} \right)^{-1}. \qquad (2.6)$$
Using the lower bounds (2.5) and (2.6), we can upper bound the iteration complexity of fixed partition Gauss-Seidel ($N_{\mathrm{part}}$) and random coordinate Gauss-Seidel ($N_{\mathrm{rand}}$) as
$$N_{\mathrm{part}} \leq O\!\left( \frac{n}{p}\,\frac{\max_{1 \leq i \leq n/p} \lambda_{\max}(A_{J_i})}{\lambda_{\min}(A)}\,\log(1/\varepsilon) \right), \qquad (2.7)$$
$$N_{\mathrm{rand}} \leq O\!\left( \frac{n}{p}\,\frac{\max_{1 \leq i \leq n} A_{ii}}{\lambda_{\min}(A)}\,\log(1/\varepsilon) \right). \qquad (2.8)$$
Comparing the bound (2.7) to the bound (2.8), it is not unreasonable to expect that random coordinate sampling has better iteration complexity than fixed partition sampling in certain cases. In Section 3, we verify this by constructing instances $A$ such that fixed partition Gauss-Seidel takes arbitrarily more iterations to reach a pre-specified error tolerance compared with random coordinate Gauss-Seidel.

2.2 Accelerated rates for fixed partition Gauss-Seidel

We now discuss previous work on accelerating Gauss-Seidel. Based on the interpretation of Gauss-Seidel as block coordinate descent on the function $f$, we can use an existing result to recover an accelerated algorithm for Gauss-Seidel when using fixed partitions. Specifically, Theorem 1 of Nesterov and Stich [16] yields a procedure and a rate for accelerating (2.2). The procedure is described in Algorithm 1, and the details to recover Algorithm 1 from the ACDM procedure of Nesterov and Stich are described in Section C.3.

Algorithm 1 Accelerated randomized block Gauss-Seidel for fixed partitions [16].
Require: $A \in \mathbb{R}^{n \times n}$, $A \succ 0$, $b \in \mathbb{R}^n$, $x_0 \in \mathbb{R}^n$, block size $p$, $\mu_{\mathrm{part}}$ defined in (2.4).
1: Set $A_0 = 0$, $B_0 = 1$.
2: Set $\sigma = \frac{n}{p}\mu_{\mathrm{part}}$.
3: Set $y_0 = z_0 = x_0$.
4: for $k = 0, \dots, T-1$ do
5:   $i_k \leftarrow$ uniform from $\{1, 2, \dots, n/p\}$.
6:   $S_k \leftarrow$ column selector associated with partition $J_{i_k}$.
7:   $a_{k+1} \leftarrow$ positive solution to $a_{k+1}^2 (n/p)^2 = (A_k + a_{k+1})(B_k + \sigma a_{k+1})$.
8:   $A_{k+1} = A_k + a_{k+1}$, $B_{k+1} = B_k + \sigma a_{k+1}$.
9:   $\alpha_k = \frac{a_{k+1}}{A_{k+1}}$, $\beta_k = \sigma \frac{a_{k+1}}{B_{k+1}}$.
10:  $y_k = \frac{(1-\alpha_k) x_k + \alpha_k (1-\beta_k) z_k}{1 - \alpha_k \beta_k}$.
11:  $x_{k+1} = y_k - S_k (S_k^T A S_k)^{-1} S_k^T (A y_k - b)$.
12:  $z_{k+1} = (1-\beta_k) z_k + \beta_k y_k - \frac{n}{p}\frac{a_{k+1}}{B_{k+1}} S_k (S_k^T A S_k)^{-1} S_k^T (A y_k - b)$.
13: end for
14: return $x_T$.

The convergence guarantee of Algorithm 1 is that for all $k \geq 0$,
$$\mathbb{E}[\|x_k - x_*\|_A] \leq O\!\left( \Big(1 - \sqrt{\tfrac{p}{n}\mu_{\mathrm{part}}}\Big)^{k/2} \|x_0 - x_*\|_A \right). \qquad (2.9)$$
Above, $\mu_{\mathrm{part}}$ is the same quantity defined in (2.4). Comparing (2.9) to the non-accelerated Gauss-Seidel rate given in (2.3), we see that acceleration improves the iteration complexity to reach a solution with $\varepsilon$ error from $O(\mu_{\mathrm{part}}^{-1}\log(1/\varepsilon))$ to $O\big(\sqrt{\tfrac{n}{p}\mu_{\mathrm{part}}^{-1}}\,\log(1/\varepsilon)\big)$, as discussed in Section 1.
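The following numpy sketch (ours, 0-indexed, for illustration only) implements Algorithm 1 as reconstructed above, solving the scalar quadratic for $a_{k+1}$ in closed form; `mu_part` is assumed to be supplied, e.g. computed from (2.4) or lower bounded via (2.5), and `rng` is a `numpy.random.Generator`.

```python
import numpy as np

def accelerated_gs_fixed_partition(A, b, x0, p, mu_part, T, rng):
    """Sketch of Algorithm 1 over the fixed partition J_1 = {0..p-1}, J_2 = {p..2p-1}, ..."""
    n = A.shape[0]
    m = n // p                                     # number of blocks, n/p
    blocks = [np.arange(i * p, (i + 1) * p) for i in range(m)]
    sigma = (n / p) * mu_part
    Ak, Bk = 0.0, 1.0
    x, z = x0.copy(), x0.copy()
    for _ in range(T):
        J = blocks[rng.integers(m)]
        # positive root of a^2 m^2 = (Ak + a)(Bk + sigma a), i.e.
        # (m^2 - sigma) a^2 - (Bk + sigma Ak) a - Ak Bk = 0
        qa, qb, qc = m ** 2 - sigma, -(Bk + sigma * Ak), -Ak * Bk
        a = (-qb + np.sqrt(qb ** 2 - 4 * qa * qc)) / (2 * qa)
        Ak, Bk = Ak + a, Bk + sigma * a            # now hold A_{k+1}, B_{k+1}
        alpha, beta = a / Ak, sigma * a / Bk
        y = ((1 - alpha) * x + alpha * (1 - beta) * z) / (1 - alpha * beta)
        g = np.zeros(n)
        g[J] = np.linalg.solve(A[np.ix_(J, J)], (A @ y - b)[J])   # S (S^T A S)^{-1} S^T (A y - b)
        x = y - g
        z = (1 - beta) * z + beta * y - (n / p) * (a / Bk) * g
    return x
```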
2.3 Other related work

We split the related work into two broad categories of interest: (a) work related to coordinate descent (CD) methods on convex functions, and (b) randomized solvers designed for solving consistent linear systems. See [6] for classical background on numerical methods for solving linear systems, and see [30] for a more recent survey on coordinate descent methods in optimization.

As mentioned earlier, whenever $A$ is positive definite, Gauss-Seidel can be interpreted as an instance of coordinate descent on a strongly convex quadratic function. We therefore review related work on both non-accelerated and accelerated coordinate descent, focusing on the randomized setting instead of the more classical cyclic order or Gauss-Southwell rule for selecting the next coordinate. See [27] for a discussion of non-random selection rules.

Nesterov's original paper [15] first considered randomized CD on convex functions, assuming a partitioning of coordinates fixed ahead of time. The analysis included both non-accelerated and accelerated variants for convex functions. This work sparked a resurgence of interest in CD methods for large scale data analysis. Most relevant to our paper are extensions to the block setting [23], to handling arbitrary sampling distributions [17, 18, 4], and to second order updates for quadratic functions [19]. For accelerated CD, Lee and Sidford [8] generalize the analysis of Nesterov [15]. While the analysis of [8] was limited to selecting a single coordinate at a time, several follow-on works [17, 10, 12, 3] generalize to block and non-smooth settings. More recently, both Allen-Zhu et al. [1] and Nesterov and Stich [16] independently improve the results of [8] by using a different non-uniform sampling distribution. One of the most notable aspects of the analysis in [1] is a departure from the (probabilistic) estimate sequence framework of Nesterov (see [14] for background on estimate sequences). Instead, the authors of [1] construct a valid Lyapunov function for coordinate descent, although they do not explicitly mention this. In our work, we make this Lyapunov point of view explicit. The constants we choose in our acceleration updates arise from a particular discretization and Lyapunov function outlined in Wilson et al. [29]. Using this framework makes our proof particularly transparent, and allows us to recover results for strongly convex functions from [1] and [16] as a special case.

From the numerical analysis side, we focus on the literature for solving consistent systems. Both the Gauss-Seidel and Kaczmarz algorithms are considered classical. Strohmer and Vershynin [26] were the first to prove a linear rate of convergence for randomized Kaczmarz, and Leventhal and Lewis [9] provide a similar kind of analysis for randomized Gauss-Seidel. Both [26] and [9] were in the single constraint/coordinate setting. The block setting was later analyzed by Needell and Tropp [13]. More recently, Gower and Richtárik [5] provide a unified analysis for both randomized block Gauss-Seidel and Kaczmarz in the sketching framework, which we adopt in this paper. Finally, Liu and Wright [11] provide an accelerated analysis of randomized Kaczmarz, once again in the single constraint setting. In the appendix, we describe how our results also apply to randomized Kaczmarz, and extend the results to the block setting as well.

3 Results

We first present in Section 3.1 a family of instances for which Gauss-Seidel with random coordinate sampling outperforms any fixed partitioning. Next, we propose in Section 3.2 an accelerated Gauss-Seidel algorithm (Algorithm 2) in the sketching framework described in Section 2.1 and state its convergence guarantee in Theorem 3.4. We conclude in Section 3.3 by specializing Theorem 3.4 to the case of random coordinate sampling.
All proofs are deferred to the appendix.

3.1 Separation between fixed partition and random coordinate sampling

Consider the particular family of $n \times n$ positive definite matrices $\mathcal{A}$ given by $\mathcal{A} = \{A_{\alpha,\beta} : \alpha > 0, \alpha + \beta > 0\}$ with $A_{\alpha,\beta}$ defined as
$$A_{\alpha,\beta} = \alpha I_n + \frac{\beta}{n} 1_n 1_n^T. \qquad (3.1)$$
The family $\mathcal{A}$ exhibits the crucial property that $\Pi^T A_{\alpha,\beta} \Pi = A_{\alpha,\beta}$ for every $n \times n$ permutation matrix $\Pi$. Lee and Wright [7] recently exploited this invariance to illustrate the behavior of cyclic versus randomized permutations in coordinate descent.

We are mainly interested in the behavior of Gauss-Seidel as the matrices $A_{\alpha,\beta}$ become ill-conditioned. To do this, we explore a particular parameterization which holds the minimum eigenvalue equal to one and sends the maximum eigenvalue to infinity. We make this explicit via the sub-family $\{A_{1,\beta}\}_{\beta > 0}$. Our first proposition characterizes the behavior of Gauss-Seidel with fixed partitions on instances of $A_{1,\beta}$ matrices.

Proposition 3.1. Fix any $\beta > 0$ and integers $n, p$ such that $1 \leq p \leq n$ and $n/p = k$ for some integer $k \geq 1$. Let $J_1, \dots, J_k$ be any partition of $\{1, \dots, n\}$ with $|J_i| = p$ for all $1 \leq i \leq k$. Define the matrices $S_i \in \mathbb{R}^{n \times p}$ to be the column selector matrices corresponding to the partitions $J_i$. Suppose the random matrix $S \in \mathbb{R}^{n \times p}$ takes on value $S_i$ with probability $1/k$. For every $A_{1,\beta} \in \mathcal{A}$ we have that
$$\mu_{\mathrm{part}} = \lambda_{\min}(\mathbb{E}[P_{A_{1,\beta}^{1/2} S}]) = \frac{p}{n + \beta p}. \qquad (3.2)$$

Next, we perform a similar calculation under the random column sampling model.

Proposition 3.2. Fix any $\beta > 0$ and integers $n, p$ such that $1 < p < n$. Suppose the random matrix $S \in \mathbb{R}^{n \times p}$ is defined such that each column of $S$ is chosen uniformly at random from $\{e_1, \dots, e_n\}$ without replacement. For every $A_{1,\beta} \in \mathcal{A}$ we have that
$$\mu_{\mathrm{rand}} = \lambda_{\min}(\mathbb{E}[P_{A_{1,\beta}^{1/2} S}]) = \frac{p}{n + \beta p} + \frac{(p-1)\beta p}{(n-1)(n + \beta p)}. \qquad (3.3)$$

The differences between (3.2) and (3.3) are striking. Let us assume that $\beta$ is much larger than $n$. Then we have that $\mu_{\mathrm{part}} \approx 1/\beta$ for (3.2), whereas $\mu_{\mathrm{rand}} \approx \frac{p-1}{n-1}$ for (3.3). That is, $\mu_{\mathrm{part}}$ can be made arbitrarily smaller than $\mu_{\mathrm{rand}}$. The next proposition shows that this gap between $\mu_{\mathrm{part}}$ and $\mu_{\mathrm{rand}}$ implies a separation between fixed partition and random coordinate Gauss-Seidel.

Proposition 3.3. Let $A$ be an $n \times n$ positive definite matrix, and let $S$ be a random matrix satisfying $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}]) > 0$. Let $x_*$ denote the solution to $Ax = b$. There exists a starting point $x_0 \in \mathbb{R}^n$ such that the sequence (2.2) satisfies, for all $k \geq 0$,
$$\mathbb{E}[\|x_k - x_*\|_A] \geq (1 - \mu)^k \|x_0 - x_*\|_A. \qquad (3.4)$$

By applying (2.3) we see that Gauss-Seidel using random coordinates computes a solution $x_k$ satisfying $\mathbb{E}[\|x_k - x_*\|_{A_{1,\beta}}] \leq \varepsilon$ in at most $k = O(\tfrac{n}{p}\log(1/\varepsilon))$ iterations. On the other hand, Proposition 3.3 states that for any fixed partition, there exists an input $x_0$ such that $k = \Omega(\beta \log(1/\varepsilon))$ iterations are required to reach the same $\varepsilon$ error tolerance. Furthermore, the situation does not improve even if we use acceleration (Algorithm 1). Proposition 3.5, which we describe later, implies that for any fixed partition there exists an input $x_0$ such that $k = \Omega\big(\sqrt{\tfrac{n}{p}\beta}\,\log(1/\varepsilon)\big)$ iterations are required for Algorithm 1 to reach $\varepsilon$ error. Hence as $\beta \to \infty$, the gap between random coordinate and fixed partitioning can be made arbitrarily large. These findings are numerically verified in Section 4.1.
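The closed forms (3.2) and (3.3) are easy to check numerically on a small instance. The sketch below (ours, not from the paper) forms $\mathbb{E}[P_{A^{1/2}S}]$ exactly, averaging over the $n/p$ partition blocks in the fixed-partition case and over all $\binom{n}{p}$ index sets in the random coordinate case, and compares $\lambda_{\min}$ with the stated formulas.

```python
import numpy as np
from itertools import combinations

def sym_sqrt(A):
    """Unique Hermitian square root A^{1/2}."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(w)) @ V.T

def mu(A, index_sets, probs):
    """mu = lambda_min(E[P_{A^{1/2} S(J)}]) for column-sampling sketches S(J)."""
    Ahalf = sym_sqrt(A)
    EP = np.zeros_like(A)
    for J, q in zip(index_sets, probs):
        Q, _ = np.linalg.qr(Ahalf[:, list(J)])     # orthonormal basis of range(A^{1/2} S(J))
        EP += q * (Q @ Q.T)
    return np.linalg.eigvalsh(EP)[0]

n, p, beta = 12, 3, 100.0
A = np.eye(n) + (beta / n) * np.ones((n, n))       # A_{1,beta} from (3.1)

partition = [tuple(range(i, i + p)) for i in range(0, n, p)]
mu_part = mu(A, partition, [p / n] * (n // p))
all_sets = list(combinations(range(n), p))
mu_rand = mu(A, all_sets, [1.0 / len(all_sets)] * len(all_sets))

print(mu_part, p / (n + beta * p))                 # should match (3.2)
print(mu_rand, p / (n + beta * p) + (p - 1) * beta * p / ((n - 1) * (n + beta * p)))  # (3.3)
```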
3.2 A convergence result for accelerated Gauss-Seidel

Motivated by our findings in the last section, we investigate the behavior of accelerated Gauss-Seidel beyond fixed partition sampling. Algorithm 2 describes an accelerated Gauss-Seidel algorithm which generalizes Algorithm 1 to the sketching framework of [20, 5] discussed in Section 2. We note that the difference in the structure of the updates between Algorithm 1 and Algorithm 2 is superficial. In Section C of the appendix, we show that a minor tweak to Algorithm 1 makes it coincide with Algorithm 2 invoked with sketching matrices corresponding to a fixed partition. Our main convergence result concerning Algorithm 2 is presented in Theorem 3.4.

Algorithm 2 Accelerated randomized block Gauss-Seidel.
Require: $A \in \mathbb{R}^{n \times n}$, $A \succ 0$, $b \in \mathbb{R}^n$, sketching matrices $\{S_k\}_{k=0}^{T-1} \subseteq \mathbb{R}^{n \times p}$, $x_0 \in \mathbb{R}^n$, $\mu \in (0,1)$, $\nu \geq 1$.
1: Set $\tau = \sqrt{\mu/\nu}$.
2: Set $y_0 = z_0 = x_0$.
3: for $k = 0, \dots, T-1$ do
4:   $x_{k+1} = \frac{1}{1+\tau} y_k + \frac{\tau}{1+\tau} z_k$.
5:   $y_{k+1} = x_{k+1} - S_k (S_k^T A S_k)^\dagger S_k^T (A x_{k+1} - b)$.
6:   $z_{k+1} = z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu} S_k (S_k^T A S_k)^\dagger S_k^T (A x_{k+1} - b)$.
7: end for
8: return $y_T$.

Theorem 3.4. Let $A$ be an $n \times n$ positive definite matrix and $b \in \mathbb{R}^n$. Let $x_* \in \mathbb{R}^n$ denote the unique vector satisfying $Ax_* = b$. Suppose each $S_k$, $k = 0, 1, 2, \dots$ is an independent copy of a random matrix $S \in \mathbb{R}^{n \times p}$, and let $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}])$. Suppose the distribution of $S$ satisfies $\mu > 0$. Invoke Algorithm 2 with $\mu$ and $\nu$, where $\nu$ is defined as
$$\nu = \lambda_{\max}\!\left(\mathbb{E}\big[(G^{-1/2} H G^{-1/2})^2\big]\right), \qquad G = \mathbb{E}[H], \qquad H = S(S^T A S)^\dagger S^T. \qquad (3.5)$$
Then we have that for all $k \geq 0$,
$$\mathbb{E}[\|y_k - x_*\|_A] \leq \sqrt{2}\left(1 - \sqrt{\frac{\mu}{\nu}}\right)^{k/2} \|x_0 - x_*\|_A. \qquad (3.6)$$

Note that in the setting of Theorem 3.4, by the definitions of $\nu$ and $\mu$, it is always the case that $\nu \leq 1/\mu$. Therefore, the iteration complexity of acceleration (Algorithm 2) is at least as good as the iteration complexity without acceleration (Equation 2.2).

The proof of Theorem 3.4 follows as a corollary of Theorem B.1 in the appendix, which establishes a general proof framework for randomized accelerated coordinate methods on strongly convex functions and which we develop in Section B. This framework builds upon the Lyapunov-style analysis described in Wilson et al. [29]. In the context of Gauss-Seidel, the proof works by showing that the function $V_k$ defined as
$$V_k = f(y_k) - f(x_*) + \frac{\mu}{2}\|z_k - x_*\|_{G^{-1}}^2, \qquad f(x) = \frac{1}{2} x^T A x - x^T b,$$
satisfies, for all $k \geq 0$, the inequality
$$\mathbb{E}[V_{k+1}] \leq \left(1 - \sqrt{\frac{\mu}{\nu}}\right)\mathbb{E}[V_k].$$
Unrolling this recursion and noting that $f(y_k) - f(x_*) = \frac{1}{2}\|y_k - x_*\|_A^2$ yields the claimed result. The details can be found in Section B.1.

To highlight the generality of our proof framework, in Section C we describe how Theorem B.1 can be used to recover the results of Nesterov and Stich [16] and Allen-Zhu et al. [1]. Furthermore, Section D describes how a minor modification of the argument proving convergence of accelerated Gauss-Seidel yields a convergence guarantee for accelerated Kaczmarz, extending the result of Liu and Wright [11] to the block setting.

Finally, we describe the analogue of Proposition 3.3 for Algorithm 2, which shows that our analysis in Theorem 3.4 is tight order-wise. The following proposition applies to Algorithm 1 as well; as mentioned previously, Algorithm 1 can be viewed as a special case of Algorithm 2.

Proposition 3.5. Under the setting of Theorem 3.4, there exist starting positions $y_0, z_0 \in \mathbb{R}^n$ such that the iterates $\{(y_k, z_k)\}_{k \geq 0}$ produced by Algorithm 2 satisfy
$$\mathbb{E}[\|y_k - x_*\|_A + \|z_k - x_*\|_A] \geq \left(1 - \sqrt{\frac{\mu}{\nu}}\right)^k \|y_0 - x_*\|_A. \qquad (3.7)$$
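Below is a numpy sketch (ours) of Algorithm 2 specialized to column-sampling sketches, together with a crude Monte Carlo estimate of the $\nu$ parameter in (3.5) under uniform random coordinate sampling; `sample_block`, `nu_random_coordinates`, and the sample count are illustrative choices rather than anything prescribed by the paper, and the estimator reuses the same draws for $G$ and the outer expectation, so it is only a rough diagnostic.

```python
import numpy as np

def accelerated_gs(A, b, x0, mu, nu, sample_block, T):
    """Sketch of Algorithm 2 where S_k selects the coordinates returned by sample_block()."""
    tau = np.sqrt(mu / nu)
    y, z = x0.copy(), x0.copy()
    for _ in range(T):
        J = sample_block()
        x = y / (1.0 + tau) + tau * z / (1.0 + tau)
        g = np.zeros_like(x)
        g[J] = np.linalg.solve(A[np.ix_(J, J)], (A @ x - b)[J])  # S (S^T A S)^+ S^T (A x - b)
        y = x - g
        z = z + tau * (x - z) - (tau / mu) * g
    return y

def nu_random_coordinates(A, p, n_samples, rng):
    """Rough Monte Carlo estimate of nu = lambda_max(E[(G^{-1/2} H G^{-1/2})^2]) from (3.5).
    n_samples should be large enough that every coordinate is sampled at least once."""
    n = A.shape[0]
    def H(J):
        M = np.zeros((n, n))
        M[np.ix_(J, J)] = np.linalg.inv(A[np.ix_(J, J)])         # S (S^T A S)^{-1} S^T
        return M
    samples = [H(rng.choice(n, size=p, replace=False)) for _ in range(n_samples)]
    G = sum(samples) / n_samples
    w, V = np.linalg.eigh(G)
    Gih = V @ np.diag(1.0 / np.sqrt(w)) @ V.T                    # G^{-1/2}
    M2 = sum((Gih @ Hs @ Gih) @ (Gih @ Hs @ Gih) for Hs in samples) / n_samples
    return np.linalg.eigvalsh(M2)[-1]
```

In practice $\mu$ and $\nu$ are not known in closed form; the experiments in Section 4 report using the best values found when tuning each experiment.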
3.3 Specializing Gauss-Seidel to random coordinate sampling

We now instantiate Theorem 3.4 for random coordinate sampling. The $\mu$ quantity which appears in Theorem 3.4 is identical to the quantity appearing in the rate (2.3) of non-accelerated Gauss-Seidel. That is, the iteration complexity to reach tolerance $\varepsilon$ is $O\big(\sqrt{\nu\,\mu_{\mathrm{rand}}^{-1}}\,\log(1/\varepsilon)\big)$, and the only new term here is $\nu$. In order to provide a more intuitive interpretation of the $\nu$ quantity, we present an upper bound on $\nu$ in terms of an effective block condition number defined as follows. Given an index set $J \subseteq [n]$, define the effective block condition number of a matrix $A$ as
$$\kappa_{\mathrm{eff},J}(A) = \frac{\max_{i \in J} A_{ii}}{\lambda_{\min}(A_J)}. \qquad (3.8)$$
Note that $\kappa_{\mathrm{eff},J}(A) \leq \kappa(A_J)$ always. The following lemma gives upper and lower bounds on the $\nu$ quantity.

Lemma 3.6. Let $A$ be an $n \times n$ positive definite matrix and let $p$ satisfy $1 < p < n$. We have that
$$\frac{n}{p} \leq \nu \leq \frac{n}{p}\left( \frac{p-1}{n-1} + \left(1 - \frac{p-1}{n-1}\right)\kappa_{\mathrm{eff},p}(A) \right), \qquad \kappa_{\mathrm{eff},p}(A) = \max_{J \subseteq [n]\,:\,|J| = p} \kappa_{\mathrm{eff},J}(A),$$
where $\nu$ is defined in (3.5) and the distribution of $S$ corresponds to uniformly selecting $p$ coordinates without replacement.

Lemma 3.6 states that if the $p \times p$ sub-blocks of $A$ are well-conditioned as measured by the effective block condition number $\kappa_{\mathrm{eff},J}(A)$, then the speed-up of accelerated Gauss-Seidel with random coordinate selection over its non-accelerated counterpart parallels the case of fixed partition sampling (i.e. the rate described in (2.9) versus the rate in (2.3)). This is a reasonable condition, since very ill-conditioned sub-blocks will lead to numerical instabilities in solving the sub-problems when implementing Gauss-Seidel. On the other hand, we note that Lemma 3.6 provides merely a sufficient condition for speed-ups from acceleration. In Section 4, we show that on many instances we encountered, the $\nu$ parameter behaves much closer to the lower bound $n/p$ than Lemma 3.6 suggests. We also compare numerically the bound from Lemma 3.6 to the actual $\nu$ value on various instances in Section 4.4. We leave a more thorough theoretical analysis of this parameter to future work.

With Lemma 3.6 in place, we can combine Theorem 3.4 with (2.6) to derive the following upper bound on the iteration complexity of accelerated Gauss-Seidel with random coordinates:
$$N_{\mathrm{rand,acc}} \leq O\!\left( \frac{n}{p}\sqrt{\frac{\max_{1 \leq i \leq n} A_{ii}}{\lambda_{\min}(A)}\,\kappa_{\mathrm{eff},p}(A)}\;\log(1/\varepsilon) \right).$$

We conclude this section by constructing an instance for which this effective block condition number $\kappa_{\mathrm{eff},p}$ is well behaved. Consider the sub-family $\{A_\delta\}_{\delta > 0} \subseteq \mathcal{A}$, with
$$A_\delta = A_{n+\delta,\,-n}, \qquad \delta > 0. \qquad (3.9)$$
We have that
$$(A_\delta)_{ii} = n - 1 + \delta, \qquad \lambda_{\min}((A_\delta)_J) = n - p + \delta, \qquad \kappa_{\mathrm{eff},p}(A_\delta) = \frac{n - 1 + \delta}{n - p + \delta}.$$
Plugging these calculations into the bound from Lemma 3.6, we conclude that for any $\delta > 0$,
$$\frac{n}{p} \leq \nu(A_\delta) \leq \frac{n}{p}\left(1 + \frac{p-1}{n-1}\right). \qquad (3.10)$$
Therefore this bound is sharp up to a constant factor of two.
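The quantities entering Lemma 3.6 and the bound (3.10) are easy to evaluate on a small member of the family (3.9); the following sketch (ours, illustrative) computes $\kappa_{\mathrm{eff},p}(A_\delta)$ by brute force and the resulting upper and lower bounds on $\nu$.

```python
import numpy as np
from itertools import combinations

n, p, delta = 10, 2, 0.5
A = (n + delta) * np.eye(n) - np.ones((n, n))        # A_delta = A_{n+delta, -n} from (3.9)

def kappa_eff(A, J):
    """Effective block condition number (3.8): max_{i in J} A_ii / lambda_min(A_J)."""
    J = list(J)
    return max(A[i, i] for i in J) / np.linalg.eigvalsh(A[np.ix_(J, J)])[0]

kappa = max(kappa_eff(A, J) for J in combinations(range(n), p))
print(kappa, (n - 1 + delta) / (n - p + delta))      # closed form for A_delta

# Lemma 3.6 bounds on nu under uniform random coordinate sampling.
lower = n / p
upper = (n / p) * ((p - 1) / (n - 1) + (1 - (p - 1) / (n - 1)) * kappa)
print(lower, upper, (n / p) * (1 + (p - 1) / (n - 1)))   # (3.10): upper stays below this for all delta
```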
Let us finally combine our bound on $\nu$ with Theorem 3.4 and compare the iteration complexity of Gauss-Seidel with and without acceleration under the random coordinate sampling model. For ease of comparison, let us assume $p = o(n)$ and $\delta \in (0,1)$. By a nearly identical calculation to Proposition 3.2, it is straightforward to show (see Proposition A.1 in the appendix) that
$$\mu_{\mathrm{rand}} = \lambda_{\min}(\mathbb{E}[P_{A_\delta^{1/2} S}]) = \frac{p\delta}{n(n - p + \delta)}.$$
With this calculation, Theorem 3.4 states that at most $O\big(\tfrac{n^{3/2}}{p\sqrt{\delta}}\log(1/\varepsilon)\big)$ iterations are sufficient for an $\varepsilon$-accurate solution. On the other hand, when we do not use acceleration, (2.3) states that $O\big(\tfrac{n^2}{p\delta}\log(1/\varepsilon)\big)$ iterations are sufficient, and Proposition 3.3 shows there exists a starting position for which this is necessary. Hence, as either $n$ grows large or $\delta$ tends to zero, the benefits of acceleration on this instance become more pronounced.

4 Experiments

In this section we experimentally validate our theoretical results on how our accelerated algorithms can improve convergence rates. Our experiments use a combination of synthetic matrices and matrices from large scale machine learning tasks.

Setup. We run all our experiments on a 4 socket Intel Xeon CPU E7-8870 machine with 18 cores per socket and 1 TB of DRAM. We implement all our algorithms in Python using numpy, and use the Intel MKL library which ships with the Anaconda distribution for numerical operations. To study how the execution time relates to the runtime complexity proposed in Section 3, we configure MKL to use only a single thread for execution. We discuss opportunities for parallelizing Gauss-Seidel in Section 5. We run each randomized algorithm with 5 different random seeds and plot the average value across trials, presenting the minimum and maximum values as shaded regions. We report errors as relative errors, i.e. $\|x_k - x_*\|_A^2 / \|x_*\|_A^2$. Finally, we use the best values of $\mu$ and $\nu$ which we found when tuning each experiment.

4.1 Fixed partitioning versus random coordinate sampling

Our first set of experiments numerically verifies the separation between fixed partitioning sampling and random coordinate sampling.

4.1.1 Example from Section 3.1

Figure 1 shows the progress per iteration on solving $A_{1,\beta} x = b$, with $A_{1,\beta}$ defined as in Section 3.1. Here we set $n = 5000$, $p = 500$, $\beta = 1000$, and $b \sim N(0, I)$.

[Figure 1 ("Id+RankOne SymPos, n=5000, p=500"): Experiments comparing fixed partitions versus random coordinate sampling for the example from Section 3.1 with $n = 5000$ coordinates and block size $p = 500$; relative error versus iteration for GS Fixed-Partition, GS-Acc Fixed-Partition, GS-Uniform Sampling, and GS-Acc-Uniform Sampling.]

[Figure 2 ("MNIST Random Features, n=5000"): The effect of block size on the accelerated Gauss-Seidel method, measured as time to reach $10^{-5}$ error (s) for block sizes $p \in \{50, 100, 200, 500, 800, 1000\}$. For the MNIST dataset (pre-processed using random features [21]) we see that a block size of $p = 500$ works best.]

Figure 1 verifies our analytical findings in Section 3.1: the fixed partition scheme is substantially worse than uniform sampling on this instance. It also shows that, in this case, acceleration provides little benefit under random coordinate sampling. This is because both $\mu$ and $1/\nu$ are order-wise $p/n$, and hence the rates for accelerated and non-accelerated coordinate descent coincide. However, we note that this only applies to matrices where $\mu$ is as large as it can be (i.e. $p/n$), that is, instances for which Gauss-Seidel is already converging at the optimal rate (see [5], Lemma 4.2).
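A scaled-down version of this synthetic comparison is easy to reproduce; the sketch below (ours; the paper's actual experiment uses $n = 5000$, $p = 500$, and more iterations) runs plain block Gauss-Seidel with each sampling scheme on $A_{1,\beta}$ and prints the relative error defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, beta, iters = 500, 50, 1000.0, 300
A = np.eye(n) + (beta / n) * np.ones((n, n))             # A_{1,beta}
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

def run(sample_block):
    x = np.zeros(n)
    for _ in range(iters):
        J = sample_block()
        r = b - A @ x
        x[J] += np.linalg.solve(A[np.ix_(J, J)], r[J])   # block Gauss-Seidel step (2.2)
    e = x - x_star
    return (e @ A @ e) / (x_star @ A @ x_star)           # relative error ||x - x*||_A^2 / ||x*||_A^2

partition = [np.arange(i, i + p) for i in range(0, n, p)]
err_fixed = run(lambda: partition[rng.integers(len(partition))])
err_unif = run(lambda: rng.choice(n, size=p, replace=False))
print(err_fixed, err_unif)   # uniform sampling reaches a far smaller error on this instance
```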
4.1.2 Kernel ridge regression

We next evaluate how fixed partitioning and random coordinate sampling affect the performance of Gauss-Seidel on large scale machine learning tasks. To do this we use two popular image classification datasets, MNIST (http://yann.lecun.com/exdb/mnist/) and CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html). The task we evaluate is kernel ridge regression (KRR) with a Gaussian kernel. Specifically, given a labeled dataset $\{(x_i, y_i)\}_{i=1}^n$, we solve the linear system
$$(K + \lambda I)\alpha = Y, \qquad K_{ij} = \exp(-\gamma\|x_i - x_j\|_2^2), \qquad Y = (y_1, \dots, y_n)^T, \qquad (4.1)$$
where $\lambda, \gamma > 0$ are tunable parameters. (See e.g. [24] for background on KRR.) The key property of kernel ridge regression is that the kernel matrix $K$ is always positive semi-definite, and hence Algorithm 2 applies to solving (4.1).

For the MNIST dataset, we apply the Gaussian kernel on the raw pixels to generate the kernel matrix $K$ with $n = 60000$ rows and columns. For CIFAR-10, we first apply standard pre-processing steps used in image classification [2, 25] and then apply the Gaussian kernel on top of our pre-processed images; the resulting kernel matrix has $n = 50000$ rows and columns.

Results from running 500 iterations of the random coordinate sampling and fixed partitioning algorithms are shown in Figure 4. We plot the convergence rates both across time and across iterations. Comparing convergence across iterations, we see that for both datasets random coordinate sampling is essential to achieve errors of $10^{-4}$ or lower.

In terms of running time, fixed partitioning is desirable in scenarios where the inputs are spread across machines or stored on disk drives, as random access is expensive in these cases. Furthermore, the partitions can be computed once and cached across iterations. However, we see that this speedup in implementation
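To make the KRR setup concrete, here is a toy, self-contained sketch (ours) that forms the Gaussian kernel system (4.1) for synthetic data and solves it with non-accelerated block Gauss-Seidel under random coordinate sampling; the data, $\lambda$, $\gamma$, and problem sizes are arbitrary placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, lam, gamma, iters = 400, 10, 40, 1e-3, 0.5, 200
X = rng.standard_normal((n, d))                      # toy features
Y = rng.standard_normal(n)                           # toy labels

sq = np.sum(X ** 2, axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gaussian kernel from (4.1)
A = K + lam * np.eye(n)                              # positive definite system matrix

alpha, alpha_star = np.zeros(n), np.linalg.solve(A, Y)
for _ in range(iters):
    J = rng.choice(n, size=p, replace=False)         # random coordinate sampling
    r = Y - A @ alpha
    alpha[J] += np.linalg.solve(A[np.ix_(J, J)], r[J])

e = alpha - alpha_star
print((e @ A @ e) / (alpha_star @ A @ alpha_star))   # relative error in the A-norm
```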
