THE ADAPTIVE S-STEP CONJUGATE GRADIENT METHOD ERIN CARSON∗ Abstract. On modern large-scale parallel computers, the performance of Krylov subspace iterativemethods islimitedbyglobal synchronization. This hasinspiredthe development ofs-step (orcommunication-avoiding)Krylovsubspacemethodvariants,inwhichiterationsarecomputedin blocksofs. Thisreformulationcanreducethenumberofglobalsynchronizations periterationbya 7 factorofO(s),andhasbeenshowntoproducespeedups inpracticalsettings. 1 Althoughthes-stepvariants aremathematicallyequivalent totheir classicalcounterparts, they 0 canbehavequitedifferentlyinfiniteprecisiondependingontheparameters. Ifsischosentoolarge, the s-step method can suffer a convergence delay and a decrease in attainable accuracy relative to 2 the classicalmethod. Thismakes itdifficultforapotential userof suchmethods - thesvalue that n minimizesthe timeper iteration maynot bethe best sforminimizingthe overall time-to-solution, a andfurthermaycauseanunacceptable decreaseinaccuracy. J Towardsimprovingthereliabilityandusabilityofs-stepKrylovsubspacemethods,inthiswork 5 we derive the adaptive s-step CG method, a variable s-step CG method where in block k, the 1 parameter sk is determined automatically such that a user-specified accuracy is attainable. The method fordeterminingsk isbasedonabound ongrowthofthe residualgapwithinblockk,from ] which we derive a constraint on the condition numbers of the computed O(sk)-dimensional Krylov A subspace bases. The computations required for determining the block size sk can be performed without increasing the number of global synchronizations per block. Our numerical experiments N demonstrate that the adaptive s-step CG method is able to attain up to the same accuracy as . classicalCGwhilestillsignificantlyreducingthetotal numberofglobalsynchronizations. s c [ 1. Introduction. Inthispaper,weconsidertheuseofKrylovsubspacemethods 1 for solving large, sparse linear systems Ax = b, where A Rn×n. We will focus ∈ v on the conjugate gradient (CG) method [25], which is used when A is symmetric 9 positivedefinite. Givenaninitialapproximatesolutionx1 andcorrespondingresidual 8 9 r1 = b−Ax1, the CG method iteratively updates the approximate solution using a 3 sequence of nested Krylov subspaces 1(A,r1) 2(A,r1) , where K ⊂K ⊂··· 0 1. Ki(A,r1)=span(r1,Ar1,...,Ai−1r1) 0 7 denotes the i-dimension Krylov subspace with matrix A and starting vector v. In 1 iteration i, the updated approximate solution xi+1 x1 + i(A,r1) is chosen by v: imposing the Galerkin condition ri+1 =b Axi+1 ∈i. K − ⊥K i Thus each iteration of CG requires a matrix-vector product with A in order to X expand the dimension of the Krylov subspace and a number of inner products to r a perform the orthogonalization. On modern computer architectures, the speed with which the matrix-vector products and inner products can be computed is limited by communication (i.e., the movement of data). This limits the potential speed of individual iterations attainable by an implementation of CG. To perform a sparse matrix-vector product in parallel, each processor must communicate entries of the source vector and/or the destination vector that it owns to neighboring processors. Innerproductsrequireaglobalsynchronization,i.e.,thecomputationcannotproceed untilallprocessorshavefinishedtheirlocalcomputationandcommunicatedtheresult to other processors. For large-scalesparse problems on large-scalemachines, the cost of synchronization between parallel processors can dominate the run-time (see, e.g., the exascale computing report [19, pp. 28]). ∗Courant Institute of Mathematical Sciences, New York University, New York, NY. ([email protected],http://math.nyu.edu/~erinc/). 1 Research efforts toward removing the performance bottlenecks caused by com- munication in CG and other Krylov subspace methods have produced various ap- proaches. One such approach are the s-step Krylov subspace methods (also called “communication-avoiding” Krylov subspace methods); for a thorough treatment of background, related work, and performance experiments, see, e.g., the theses [5, 27]. Ins-stepKrylovsubspacemethods,insteadofperformingoneiterationatatime, the iterations are performed in blocks of s; i.e., in each iteration, the Krylov subspace is expandedbyO(s)dimensionsbycomputingO(s)newbasisvectorsandthenallinner productsbetweenthenewbasisvectorsneededforthenextsiterationsarecomputed in one block operation. In this way, computing the inner products for s iterations only requires a single globalsynchronization,decreasingthe synchronizationcost per iteration by a factor of O(s). This approach has been shown to lead to significant speedups foranumberofproblemsandreal-worldapplications(see,e.g.,[35,43]). In theremainderofthe paper,wewillrefertothematriceswhosecolumnsconsistofthe O(s) basis vectorscomputedin eachblockas s-step basis matrices. Further details of the s-step CG method are discussed in section 2. We emphasize that our use of the overloadedterm“s-stepmethods”herediffersfromotherworks,e.g.,[14,30]and[21, 9.2.7], in which ‘s-step method’ refers to a type of restarted Lanczos procedure. § In exact arithmetic the s-step CG method produces the exact same iterates as the classicalCGmethod, buttheir behaviorcandiffer significantlyinfinite precision. In both s-step and classical Krylov subspace methods, rounding errors due to finite precision arithmetic have two basic effects: a decrease in attainable accuracy and a delayofconvergence. Ithaslongbeenknownthatfors-stepKrylovsubspacemethods, ass is increased(andso the conditionnumbers ofthe s-stepbasis matricesincrease), the attainable accuracy decreases and the convergence delay increases relative to the classical CG method (see, e.g., [10]). At the extreme, if the parameter s is chosen to be too large,the O(s)-dimensional bases computedfor eachblock canbe numerically rank deficient and the s-step method can fail to converge. This sensitive numerical behavior poses a practical obstacle to optimizing the performance of s-step methods, and diminishes their usability and reliability. In a setting where the performance of CG is communication-bound, we expect that up to some point, increasing s will decrease the time per iteration. If we pick s only based on minimizing the time per iteration, however, we can run into problems. First, the finite precision error may cause a large convergence delay, negating any potential performance gain with respect to the overallruntime. Since the number of iterations requiredforconvergenceforagivensvalueisnotknownapriori,choosingthesvalue that results in the fastest time-to-solution is a difficult problem. Second, the chosen s parameter may cause s-step CG to fail to converge to the user-specified accuracy. In this case, the particular problem is unsolvable by the s-step CG method. Requiring the user to choose the parameter s thus diminishes the practical us- ability of s-stepKrylovsubspace methods. It is therefore imperative that we develop a better understanding ofthe convergencerate andaccuracyinfinite precisions-step CG and other s-step Krylov subspace methods. Our hope is that by studying the theoretical properties of methods designed for large-scale computations in finite pre- cision, we can develop methods and techniques that are efficient, capable of meeting application-dependent accuracy constraints, and which do not require that the user have extensive knowledge of numerical linear algebra. Towardthis goal,inthis paper we developa variables-stepCG method inwhich the s values can vary between blocks and are chosen automatically by the algorithm 2 such that a user-specified accuracy is attained. We call this the adaptive s-step CG method. The condition for determining s in each block is derived using a bound on the size of the gap between the true residuals b Ax and the updated residuals r i i − computed in finite precision. We show that in order to prevent a loss of attainable accuracy, in a given block of iterations, the condition number of the s-step basis ma- trix should be no larger than a quantity related to the inverse of the 2-norms of the residualscomputedwithinthatblock. Inotherwords,asthemethodconverges,more ill-conditioned bases (i.e., larger s values) can be used without a loss of attainable accuracy. Furthermore, we show that the adaptive s-step CG method can be im- plemented in a way that does not increase the synchronization cost per block versus (fixed) s-step CG. Whereasthefixeds-stepCGmethodmaynotattainthesameaccuracyasclassical CG, our numerical experiments on a set of example matrices demonstrate that our adaptives-stepCGmethodcanachievethe sameaccuracyasclassicalCG,while still reducing the number of synchronizations required by up to over a factor of 4. In section 2 we give a brief background on s-step Krylov subspace methods and detail the s-step CG method, and in section 3 we discuss related work. Section 4 presents our main contribution; in section 4.1, we derive a theoretical bound on how large the s-step basis matrix condition number can be in terms of the norms of the updated residuals without affecting the attainable accuracy, and in section 4.2, we detail the adaptive s-stepCG algorithmthat makes use ofthis bound in determining the number ofiterationsto performineachblock. Numericalexperiments forasmall set of test matrices are presented in section 5. We discuss possible extensions and future work and briefly conclude in section 6. 2. Background: s-step Krylov subspace methods. The basic idea behind s-step Krylov subspace methods is to grow the underlying Krylov subspace O(s) di- mensions at a time and perform the necessary orthogonalizationwith a single global synchronization. Withinablockofs iterations,thevectorupdatesareperformedim- plicitly by updating the coordinates of the iteration vectors in the O(s)-dimensional basis spanned by the columns of the s-step basis matrix. The algorithmic details vary depending on the particular Krylov subspace method considered, but the gen- eral approach is the same. There is a wealth of literature related to s-step Krylov subspace methods. We give a brief summary of early related work below; a thorough bibliography can be found in [27, Table 1.1]. The s-step approach was first used in the context of Krylov subspace methods by Van Rosendale [41], who developed a variant of the parallel conjugate gradient method which minimizes inner product data dependencies with the goal of exposing more parallelism. The term “s-step Krylov subspace methods” was first used by Chronopoulosand Gear,who developedan s-stepCG method [9, 10]. This workwas followedinsubsequentyearsbythedevelopmentofothers-stepvariantsofOrthomin and GMRES [11], Arnoldi and Symmetric Lanczos [31, 32], MINRES, GCR, and Orthomin [8, 13], Nonsymmetric Lanczos [33], and Orthodir [12]. Walker similarly used s-step bases, but with the goal of improving stability in GMRES by replacing the modified Gram-Schmidt orthogonalizationprocess with Householder QR [42]. Many of the earliest s-step Krylov subspace methods used monomial bases (i.e., [v,Av,A2v,...])forthecomputeds-stepbasismatrices,butitwasfoundthatconver- genceoftencouldnotbeguaranteedfors>5(see,e.g.,[10]). Thismotivatedresearch into the use of other more well-conditioned bases for the Krylov subspaces, including scaled monomial bases [26], Chebyshev bases [29, 16, 17], and Newton bases [1, 20]. 3 Thegrowingcostofcommunicationinlarge-scalesparseproblemshascreatedarecent resurgenceofinterestintheimplementation,optimization,anddevelopmentofs-step Krylovsubspacemethods;see,e.g.,therecentworks[18,27,35,46,23,34,44,45,43]. 2.1. The s-step conjugate gradient method. In this work, we focus on the s-step conjugate gradient method (Algorithm 2), which is equivalent to the s-step CG method (Algorithm 2) in exact arithmetic. In the remainder of this section, we give a brief explanation of the s-step CG method; further details on the derivation, implementation, and performance of the s-step CG method can be found in, e.g., [5, 27]. Algorithm 1 Conjugate gradient (CG) Input: n n symmetric positive definite matrix A, length-n vector b, and initial × approximation x1 to Ax=b Output: Approximate solution xi+1 to Ax=b with updated residual ri+1 1: r1 =b Ax1, p1 =r1 − 2: for i=1,2,..., until convergence do 3: αi =riTri/pTi Api 4: qi =αipi 5: xi+1 =xi+qi 6: ri+1 =ri Aqi − 7: βi =riT+1ri+1/riTri 8: pi+1 =ri+1+βipi 9: end for Thes-stepCGmethodconsistsofanouterloop,indexedbyk,whichiteratesover theblocksofiterations,andaninnerloop,whichiteratesoveriterationsj 1,...,s ∈{ } within a block. For clarity, we globally index iterations by i sk+j. It follows from ≡ the properties of CG that for j 0,...,s , ∈{ } psk+j+1,rsk+j+1 s+1(A,psk+1)+ s(A,rsk+1) and ∈K K xsk+j+1 xsk+1 s(A,psk+1)+ s−1(A,rsk+1). (2.1) − ∈K K Then the CG vectors for the next s iterations lie in the sum of the column spaces of the matrices k,s =[ρ0(A)psk+1,...,ρs(A)psk+1], with span( k,s)= s+1(A,psk+1) and P P K k,s =[ρ0(A)rsk+1,...,ρs−1(A)rsk+1], with span( k,s)= s(A,rsk+1), (2.2) R R K where ρ (z) is a polynomial of degree j satisfying the three-term recurrence j ρ0(z)=1, ρ1(z)=(z θ0)ρ0(z)/γ0, and − ρj(z)=((z θj−1)ρj−1(z) σj−2ρj−2(z))/γj−1. (2.3) − − We define the s-step basis matrix = [ , ] and define to be the Yk,s Pk,s Rk,s Yk,s sameas exceptwithallzerosincolumnss+1and2s+1. Notethatthes’sinthe k,s Y subscripts are unnecessary here, but will be useful in describing the variable s-step CG method later on. Under certain assumptions on the nonzero structure of A, k,s Y can be computed in parallel in each outer loop for the same asymptotic latency cost 4 as a single matrix-vectorproductusing the so-called“matrix powerskernel”(see [18] for details). Since the columns of satisfy (2.3), we can write k,s Y A = , (2.4) Yk,s Yk,sBk,s where is a (2s+1) (2s+1) block tridiagonal matrix defined by the 3-term k,s B × recurrence coefficients in (2.3). By (2.1), there exist vectors p′ , r′ , and x′ that represent the coordi- k,j+1 k,j+1 k,j+1 nates of the CG iterates psk+j+1, rsk+j+1, and xsk+j+1 xsk+1, respectively, in k,s − Y for j 0,...,s . That is, ∈{ } psk+j+1 =Yk,sp′k,j+1, rsk+j+1 =Yk,srk′,j+1, and (2.5) xsk+j+1−xsk+1 =Yk,sx′k,j+1. Therefore,intheinnerloopofs-stepCG,ratherthanupdatetheCGvectorsexplicitly, we instead update their coordinates in , i.e., for j 1,...,s , k,s Y ∈{ } x′k,j+1 =x′k,j +αsk+jp′k,j, rk′,j+1 =rk′,j −αsk+jBk,sp′k,j, and p′k,j+1 =rk′,j+1+βsk+jp′k,j. Thus the matrix-vector product with A in each iteration becomes a matrix-vector productwiththemuchsmallermatrix . Thisalongwiththelength-(2s+1)vector k,s B updates can be performed locally by each processor in each inner loop iteration. We can also reduce communication in computing the inner products. Letting Gk,s =YkT,sYk,s, and using (2.5) and (2.4), αsk+j and βsk+j can be computed by αsk+j = rk′T,jGk,srk′,j / p′kT,jGk,sBk,sp′k,j and βsk+j =(cid:0)rk′T,j+1Gk,srk′(cid:1),j+(cid:0)1 / rk′T,jGk,srk′,j(cid:1). The matrix G can be co(cid:0)mputed with one(cid:1) g(cid:0)lobal synchr(cid:1)onization per outer loop k,s iteration, which, in terms of asymptotic parallel latency, costs the same as a single inner product. As k,s and Gk,s are dimension (2s+1) (2s+1), αsk+j and βsk+j B × can be computed locally by each processor within the inner loop. 3. Related work. Inthissection,wediscussedrelatedworkintheareasofvari- able s-step Krylov subspace methods, the analysis of maximum attainable accuracy in finite precision CG, and inexact Krylov subspace methods. 3.1. Variable s-step Krylov subspace methods. Williams et al. [43] use a variable s-step BICGSTAB as the coarse grid solver in a multigrid method. The techniqueusedhere,termed“telescoping”,ismotivatedbytheobservationthansome coarse grid solves converge after only a few iterations, whereas other solves take significantlylonger. Bystartingwithasmallsvalueandgraduallyincreasingit,they ensurethats-stepBICGSTABwithlargersisonlyusedwhenthe solvetakesenough iterations to amortize the additional costs associated with s-step CG. Imberti and Erhel [28] have recently developed an s-step GMRES method that allowsvariablessizes. Theyrecommendchoosings valuesaccordingtotheFibonacci k sequence up to some maximum value smax, i.e., starting with s0 = s1 = 1 and for k 2, sk = min(smax,sk−1+sk−2). In this way, the sequence si used by Imberti ≥ { } and Erhel is predetermined. 5 Algorithm 2 s-step conjugate gradient Input: n n symmetric positive definite matrix A, length-n vector b, and initial × approximation x1 to Ax=b Output: Approximate solution xsk+s+1 to Ax=b with updated residual rsk+s+1 1: r1 =b Ax1, p1 =r1 − 2: for k =0,1,..., until convergence do 3: Compute k,s =[ k,s, k,s] according to (2.2) Y P R 4: Compute Gk,s =YkT,sYk,s 5: Assemble k,s such that (2.4) holds B 6: p′k,1 =[1, 01,2s]T, rk′,1 =[01,s+1, 1, 01,s−1]T, x′k,1 =[01,2s+1]T 7: for j =1 to s do 8: αsk+j = rk′T,jGk,srk′,j / p′kT,jGk,sBk,sp′k,j 9: qk′,j =αs(cid:0)k+jp′k,j (cid:1) (cid:0) (cid:1) 10: x′k,j+1 =x′k,j +qk′,j 11: rk′,j+1 =rk′,j −Bk,sqk′,j) 12: βsk+j = rk′T,j+1Gk,srk′,j+1 / rk′T,jGk,srk′,j 13: p′k,j+1 =(cid:0)rk′,j+1+βsk+jp′k,(cid:1)j (cid:0) (cid:1) 14: end for 15: Recover iterates psk+s+1,rsk+s+1,xsk+s+1 according to (2.5) { } 16: end for In contrast, in our approach, the sequence s is dynamically chosen based on i { } the2-normoftheupdatedresidual(whichisnotnecessarilymonotonicallydecreasing in CG). In addition, our method is designed such that a user-specified accuracy ε∗ can be attained when ε∗ εCG, where εCG is the accuracy attained by classical CG ≥ for the same problem. 3.2. Maximum attainable accuracy in Krylov subspace methods. In boths-stepandclassicalvariantsofKrylovsubspacemethods,finiteprecisionroundoff error in updates to the approximate solution x and the residual r in each iteration i i can cause the updated residual r and the true residual b Ax to grow further and i i − further apart as the iterations proceed. If this deviation grows large, it can limit the maximum attainable accuracy, i.e.,the accuracywithwhichwecansolveAx=bona computerwithunitround-offε. AnalysesofmaximumattainableaccuracyinCGand otherclassicalKSMsaregivenbyGreenbaum[22],vanderVorstandYe[40],Sleijpen, vanderVorst,andFokkema[37],Sleijpen,vanderVorst,andModersitzki[38],Bj¨orck, Elfving, and Strakoˇs [2], and Gutknecht and Strakoˇs [24]. One important result of these analyses is the insight that loss of accuracy can be caused at a very early stage of the computation, which can not be corrected in later iterations. Analyses of the maximum attainable accuracy in s-step CG and the s-step biconjugate gradient method (BICG) can be found in [6, 5]. 3.3. Inexact Krylov subspace methods. Theterm“inexactKrylovsubspace methods” refersto Krylovsubspacemethods inwhichthe matrix-vectorproducts Av are computed inexactly as (A+E)v where E represents some error matrix (due to either finite precision computation or intentional approximation). It is assumed that allothercomputations(orthogonalizationandvectorupdates)areperformedexactly. Using an analysis of the deviation of the true and updated residuals, it is shown that under these assumptions the size of the perturbation term E can be allowed to 6 grow inversely proportional to the norm of the updated residual without adversely affecting the attainable accuracy. This theory has potential application in mixed- precision Krylov subspace methods, as well as in situations where the operator A is expensive to apply and can be approximated in order to gain performance. Much work has been dedicated to inexact Krylov subspace methods. In [36], Simoncini and Szyld provide a general theory for inexact variants applicable to both symmetric and nonsymmetric linear systems andeigenvalue problemsand use this to derive criteria for bounding the size of E in such a way that does not diminish the attainableaccuracyofthemethod. AsimilaranalysiswasgivenbyvandeEschofand Sleijpen [39]. These analyses confirm and improve upon the earlier empirical results forthe inexactGMRESmethodofBourasandFraysse´e[3]andthe inexactconjugate gradient method of Bouras, Frayss´e,and Giraud [4]. OurworkisverycloselyrelatedtothetheoryofinexactKrylovsubspacemethods in that our analysis results in a somewhat analogous “relaxation strategy”, where insteadofrelaxingtheaccuracyofthematrix-vectorproductsweareinsteadrelaxing a constraint on the condition numbers of the computed s-step basis matrices. In addition, our analysis assumes that all parts of the computation are performed in a fixed precision ε. Algorithm 3 Variable s-step conjugate gradient Input: n n symmetric positive definite matrix A, length-n vectorb, initial approx- × imation x1 to Ax=b, sequence of s values (s0,s1,...) Output: Approximate solution xm+sk+1 to Ax=b with updated residual rm+sk+1 1: r1 =b Ax1, p1 =r1 − 2: m=0 3: for k =0,1,..., until convergence do 4: Compute Yk,sk =[Pk,sk, Rk,sk] according to (2.2) 5: Compute Gk,sk =YkT,skYk,sk 6: Assemble Bk,sk such that (2.4) holds 7: p′k,1 =[1, 01,2sk]T, rk′,1 =[01,sk+1, 1, 01,sk−1]T, x′k,1 =[01,2sk+1]T 8: for j =1 to sk do 9: αm+j = rk′T,jGk,skrk′,j / p′kT,jGk,skBk,skp′k,j 10: qk′,j =αm(cid:0)+jp′k,j (cid:1) (cid:0) (cid:1) 11: x′k,j+1 =x′k,j +qk′,j 12: rk′,j+1 =rk′,j −Bk,skqk′,j) 13: βm+j = rk′T,j+1Gk,skrk′,j+1 / rk′T,jGk,skrk′,j 14: p′k,j+1 =(cid:0)rk′,j+1+βm+jp′k,j(cid:1) (cid:0) (cid:1) 15: end for 16: Recover iterates {pm+sk+1,rm+sk+1,xm+sk+1} according to (2.5) 17: m=m+sk 18: end for 4. Variable s-step CG. As discussed, the conditioning of the s-step basis ma- tricesins-stepKrylovsubspacemethodsaffectstherateofconvergenceandattainable accuracy. In the s-stepCG method (Algorithm 2), the Krylovsubspace basis is com- puted in blocks of size s in eachouter loop. This method can be generalizedto allow the blocks to be of varying size; in other words, a different s value can be used in each outer loop iteration. We denote the block size in outer iteration k by s , where k 1 sk smax for some maximum s value smax. We call this a variable s-step Krylov ≤ ≤ 7 subspace method. A general variable s-step CG method is shown in Algorithm 3. In this section, we derive a variable s-step CG method which automatically de- termines s in each outer loop such that a user-specified accuracy can be attained, k whichwecalltheadaptives-stepCGmethod. Ouranalysisisbasedonthemaximum attainable accuracyanalysisfor s-stepCGin [6], whichshowsthat the attainable ac- curacy of s-step CG depends on the condition numbers of the s-step basis matrices computed in each outer loop. In summary, we show that if in outer loop k, k,s Y s is selected such that the condition number of the basis matrix is inversely k Yk,sk proportional to the maximum 2-norm of the residual vectors computed within outer loop k, then the approximate solution produced by the variable s-step CG method canbeasaccurateastheapproximatesolutionproducedbytheclassicalCGmethod. Further, our method allows for the user to specify the desired accuracy, so in the case that a less accurate solution is required, our method automatically adjusts to allowforhighers values. Thiseffectivelyexploitsthetradeoffbetweenaccuracyand k performance in s-step Krylov subspace methods. Inthissectionwederiveouradaptives-stepapproach,whichstemsfromabound on the growth in outer loop k of the gap between the true and updated residuals in finite precision. 4.1. A constraint on basis matrix condition number. In the remainder of the paper, hats denote quantities computed in finite precision, ε denotes the unit round-off, denotes the 2-norm, and κ(A) denotes the 2-norm condition number A−1 A .kT·kosimplifythe analysis,wedropallO(ε2)terms(andhigherorderterms k kk k in ε). Writing the true residual b Ax =b Ax r +r , we can write the bound i i i i − − − b Ax b Ax r + r . i i i i k − k≤bk − −b k b k bk Thenas r 0(i.e.,asthemethodconverges),wehave b Ax b Ax r . i i i i The sizekofkth→e residual gap, i.e.b, b Ax br , btherefobrekde−termikne≈skthe−attai−nabkle i i k − − k accuracy obf the method (see the related references in section 3.2)b. b b We begin by considering the rounding errors made in the variable s-step CG b b method (Algorithm 3). In outer loop k and inner loops j 1,...,s of finite k ∈ { } precision variable s-step CG, using standard models of floating point error (e.g., [21, 2.4]) we have § Aˆ = E , (4.1) Yk,sk Yk,skBk,sk − k where E ε (3+N ) A ˆ +4 , b k kk≤ A k kkYk,skk kYk,skkkBk,skk x′k,j+1 =x′k,j +qk′,j +ξk(cid:16),j+1, b (cid:17) (4.2) where kξk,j+1k≤εkx′k,j+1k, b b b rk′,j+1 =rk′,j −Bk,skqk′,j +ηk,j+1, (4.3) b where kηk,j+1k≤ε(krk′,j+1k+3kBk,skkkqk′,jk), b b b xm+j+1 =Yk,skx′k,j+1+xm+1b+φm+j+1, b (4.4) b whebre kbφm+j+1kb≤ε(kxm+j+1k+3jkYk,skkkx′k,j+1k), and rm+j+1 =Yk,skrk′,j+1+ψm+j+1b,, b b (4.5) b whebre kbψm+j+1k≤3jεkYk,skkkrk′,j+1k, whereN denotesthe maximumnumber ofnonzerosperrowinAandm= k−1s . A b b i=0 i Let δ b Ax r denote the residual gap in iteration i. Similar to the analysis i i i ≡ − − P 8 b b in [6], for variable s-step CG we can write the growthof the residual gap in iteration m+j+1 in outer loop k as j δm+j+1−δm+1 =− AYˆk,skξk,i+1+Yk,skηk,i+1−Ekqk′,i −Aφm+j+1−ψm+j+1. Xi=1(cid:16) (cid:17) Our goal is now to manipulate a bobund on this qubantity in order to derive a method for determining the largest s value we can use in outer loop k such that k the desired accuracy is attained. Using standard techniques, we have the norm-wise bound j j kδm+j+1−δm+1k≤εkAkkYˆk,skk kx′k,i+1k+εkYk,skk krk′,i+1k i=1 i=1 X X b b b j +ε (3+N ) A ˆ +7 q′ A k kkYk,skk kYk,skkkBk,skk k k,ik (cid:16) (cid:17)Xi=1 +εkAkkxm+j+1k+3jεkAkkYk,bskkkx′k,j+1k+3jεkYbk,skkkrk′,j+1k j ε (7+2bN ) A ˆ +1b4 b bx′ b ≤ A k kkYk,skk kYk,skkkBk,skk k k,i+1k (cid:16) (cid:17)Xi=1 j b b +ε r′ kYk,skk k k,i+1k i=1 X +εkAbkkxm+j+1kb+3jεkAkkYk,skkkx′k,j+1k+3jεkYk,skkkrk′,j+1k, where we have used ji=1kqbk′,ik ≤ (2+ε) ji=1bkx′k,i+1bk. To simplifby the nbotation, weletτ = / A andN =7+2N +14τ . We note that depends on k kBk,skk kPk k AP k kBk,skk the polynomial basis used and is not expected to be too large;e.g., for the monomial b b basis, =1. The above bound can then be written kBk,skk j j kδm+j+1−δm+1k≤εNkkAkkYk,skk kx′k,i+1k+εkYk,skk krk′,i+1k (4.6) i=1 i=1 X X +εkAkkxmb+j+1k+3jbεkAkkYk,skkbkx′k,j+1k+3bjεkYk,skkkrk′,j+1k. Note that assuming Yk,skbis full rank, we havbe b b b kYk,skkkbrk′,i+1k=kYk,skkkYk+,skYk,skrk′,i+1k b b ≤κ(bYk,sk)kbYk,skbrm+ib+1−ψm+i+1k =κ(Ybk,sk)krbm+i+b1k+O(ε), and, (4.7) x′ = + x′ kYk,skkk k,i+1k kYbk,skkkYbk,skYk,sk k,i+1k κ( ) x′ . (4.8) b b ≤ bYk,sk kbYk,skbk,i+b1k (4.9) b b b To bound x′ in terms of the size of the residuals, notice that by (4.4), kYk,sk k,i+1k Yk,skbx′k,i+b1 =xm+i+1−xm+1−φm+i+1 9 b b b b =(x xm+1) (x xm+i+1) φm+i+1 − − − − =A−1(b Axm+1) A−1(b Axm+i+1) φm+i+1 − − − − =A−1(rbm+1+δm+1)b A−1(rm+i+1+δm+i+1) φm+i+1 − − =A−1(rm+1b rm+i+1)+A−1(δmb+1 δm+i+1) φm+i+1, − − − b b so b b kYk,skx′k,i+1k≤kA−1k(krm+1k+krm+i+1k)+O(ε). Thus together witbh (4.b8) we can write b b kYk,skkkx′k,i+1k≤κ(Yk,sk)kA−1k(krm+1k+krm+i+1k)+O(ε). This allobws us tbo write the bbound (4.6) inbthe form b j kδm+j+1−δm+1k≤2εNkκ(A)κ(Yk,sk) krm+i+1k i=1 X +ε Nkjκ(A)bκ(Yk,sk)+b3jκ(A)κ(Yk,sk) krm+1k (cid:16) (cid:17) +ε 3jκ(A)κ(Ykb,sk)+3jκ(Yk,sk) bkrm+j+1bk +εk(cid:16)Akkxm+j+b1k. b (cid:17) b Using x−xm+j+1 =A−1rm+j+1+δbm+j+1, we have b kxmb+j+1−xk≤kA−1kkrm+j+1k+O(ε), and writing xm+j+1 =bxm+j+1−x+x, we havbe j b b kδm+j+1−δm+1k≤2Nkεκ(A)κ(Yk,sk) j(krm+1k+krm+j+1k)+ krm+i+1k! i=1 X +ε A x . b b b b k kk k Then kδm+j+1−δm+1k≤6Nkjεκ(A)κ(Yk,sk) ℓ∈{m0,a..x.,j}krm+ℓ+1k +εkAkkxk, (cid:18) (cid:19) and letting c =6N jκ(A), this boundbcan be written b k,j k kδm+j+1−δm+1k≤ck,jεκ(Yk,sk) ℓ∈{m0,a..x.,j}krm+ℓ+1k +εkAkkxk. (4.10) (cid:18) (cid:19) Thisgivesanorm-wiseboundontbhegrowthoftherbesidualgapduetofiniteprecision errors made in outer iteration k. Letting m = ℓ−1s , notice that we have ℓ i=0 i P k kδmk+sk+1k≤kδ1k+ kδmℓ+sℓ+1−δmℓ+1k. ℓ=0 X 10