Alternating minimal energy methods for linear systems in higher dimensions. Part I: SPD systems*

Sergey V. Dolgov† and Dmitry V. Savostyanov‡

January 25, 2013

* Partially supported by RFBR grants 12-01-00546-a, 11-01-12137-ofi-m-2011, 11-01-00549-a, 12-01-33013, 12-01-31056, Russian Fed. Gov. contracts No. Π1112, 14.740.11.0345, 16.740.12.0727 at the Institute of Numerical Mathematics, Russian Academy of Sciences, and EPSRC grant EP/H003789/1 at the University of Southampton.
† Max-Planck-Institut für Mathematik in den Naturwissenschaften, Inselstr. 22-26, D-04103 Leipzig, Germany ([email protected]).
‡ University of Southampton, Department of Chemistry, Highfield Campus, Southampton SO17 1BJ, United Kingdom ([email protected]).

Abstract

We introduce a family of numerical algorithms for the solution of linear systems in higher dimensions with the matrix and right-hand side given and the solution sought in the tensor train format. The proposed methods are rank-adaptive and follow the alternating directions framework, but in contrast to ALS methods, in each iteration a tensor subspace is enlarged by a set of vectors chosen similarly to the steepest descent algorithm. The convergence is analysed in the presence of approximation errors, and the geometrical convergence rate is estimated and related to the one of the steepest descent. The complexity of the presented algorithms is linear in the mode size and dimension, and the convergence demonstrated in the numerical experiments is comparable to the one of the DMRG-type algorithm.

Keywords: high-dimensional problems, tensor train format, ALS, DMRG, steepest descent, convergence rate, superfast algorithms.

1 Introduction

Linear systems arising from high-dimensional problems usually cannot be solved by standard numerical algorithms. If the equation is considered in d dimensions on an n_1 × n_2 × ... × n_d grid, the number of unknowns n_1⋯n_d scales exponentially with d, and even for moderate dimension d and mode sizes n_k the numerical complexity lies far beyond the technical possibilities of modern workstations and parallel systems. To make the problem tractable, different approximations are proposed, including sparse grids [38, 3] and tensor product methods [24, 22, 23, 14]. In this paper we consider the linear system Ax = y, where the matrix A and right-hand side y are given and an approximate solution x is sought in the tensor train (TT) format. Methods based on the TT format, also known as a linear tensor network, are novel and particularly interesting among all tensor product methods due to their robustness and simplicity.

The numerical optimization on tensor networks was first considered in the quantum physics community by S. White [42], who introduced the matrix product states (MPS) formalism to represent the ground state of a spin system together with the density matrix renormalization group (DMRG) optimization scheme. The tensor train format and some computational methods were independently re-discovered in the papers of Oseledets and Tyrtyshnikov (see [30] and references therein) before the results of White et al. were popularized in the numerical mathematics community by R. Schneider [18]. The questions concerning the convergence properties of alternating schemes for different tensor product formats were immediately raised and studied. The experimental results from quantum physics show the notably fast convergence of DMRG for the ground state problem, i.e., finding the minimal eigenstate of a system, but give no theoretical justification for this observation.
The alternating least squares (ALS) algorithm has been used in multilinear analysis for the computation of canonical tensor decompositions since the early results of Hitchcock [17] and is known for its monotone but very slow convergence. For ALS there is also a lack of convergence estimates, both in the classical papers [16, 4] and in the recent ones, where ALS was applied to the Tucker model [5, 33], tensor trains [29], the hierarchical Tucker format [25] and high-dimensional interpolation [34]. In recent papers by Uschmajew [41, 35] the local convergence of ALS is proven for the canonical and tensor train decompositions. This is a major theoretical breakthrough, which unfortunately does not immediately lead to practical algorithms due to the local character of the convergence studied, unjustified assumptions on the structure of the Hessian, and very strong requirements on the accuracy of the initial guess. The convergence rate of ALS is difficult to estimate, partly due to the complex geometrical structure of manifolds defined by tensor networks. This problem is now approached from several directions, and we might expect new results soon [12, 11].

In contrast to ALS schemes, which operate on manifolds of fixed dimension, the DMRG algorithm changes the ranks of a tensor format. This allows one to choose the ranks adaptively to the desired error threshold or the accuracy of the result and to develop more practical algorithms which do not rely on an a priori choice of ranks. The DMRG was adopted for novel tensor formats (see references above) and new problems, including adaptive high-dimensional interpolation [37] and the solution of linear systems [9, 18]. The geometrical analysis, e.g. the convergence of the nonlinear Gauss–Seidel method, is however even more difficult when the dimensions of the underlying manifolds are not fixed.

Apart from working with the tensor format structure directly, as ALS and DMRG do, standard algorithms from numerical linear algebra can be applied with tensor approximations and other tensor arithmetics. Following this paradigm, the solution of linear problems in tensor product formats was addressed in [36, 1, 6]. The usual considerations of linear algebra can be used in this case to analyze the convergence. A first notable example is the method of conjugate-gradient type for the Rayleigh quotient minimization in higher dimensions, for which the global convergence was proven by O. Lebedeva [27].

We develop a framework which combines the ALS optimization steps (ranks are fixed, convergence estimates not yet possible) with steps where the tensor subspaces are increased and the ranks of the tensor format grow. Choosing the new vectors in accordance with standard linear algebra algorithms, we recast the classical convergence estimates for the proposed algorithm in higher dimensions. In this paper we consider the case of symmetric positive definite (SPD) matrices and analyze the convergence in the A-norm, i.e. minimize the energy function. The basis enrichment choice follows the steepest descent (SD) algorithm, and the convergence of the resulting method is analyzed with respect to the one of the steepest descent. We show that the basis enrichment step combined with the ALS step can be seen as a certain computationally cheap approximation of the DMRG step. The complexity of the resulting method is equal to the one of ALS and is linear in the mode size and dimension. Our choice of the basis enrichment appears to be very good for practical computations, and for the considered numerical examples the proposed methods converge almost as fast as the DMRG algorithm.
Summarizing the above, the proposed algorithms have (1) proven geometrical convergence with an estimated rate, (2) practical convergence comparable to the one of DMRG, (3) numerical complexity comparable to the one of ALS.

The paper is organized as follows. In Section 2 we introduce the tensor train notation and necessary definitions. In Section 3 we introduce the basic notation for ALS and DMRG schemes. We also study how the modification of one TT-block affects the ALS problem for its neighbor and describe this in terms of the Galerkin correction method. In Section 4 we develop the family of steepest descent methods for the problems in one, two and many dimensions. The proposed methods have an inner-outer structure, i.e., a steepest descent step in d dimensions is followed by a steepest descent step in d−1 dimensions, etc., cf. the interpolation algorithms [32, 13]. The convergence of the recursive algorithms in higher dimensions is analyzed using the Galerkin correction framework. The effect of roundoff/approximation errors is also studied. Since we make no assumptions on the TT-ranks of the solution, the ranks of the vectors in the proposed algorithms can grow at each iteration and make the algorithm inefficient. In Section 5 we discuss the implementation details, in particular the steps where a tensor approximation is required to reduce the ranks. In Section 6 the model numerical experiments demonstrate the efficiency of the proposed method and compare it with the other algorithms mentioned in the paper.

2 Tensor train notation and definitions

The tensor train (TT) representation of a d-dimensional tensor x = [x(i_1,...,i_d)] is written as the following multilinear map (cf. [35])

    x = τ(X̄) = τ(X^{(1)},...,X^{(d)}),
    x(i_1,...,i_d) = X^{(1)}_{α_0,α_1}(i_1) X^{(2)}_{α_1,α_2}(i_2) ⋯ X^{(d−1)}_{α_{d−2},α_{d−1}}(i_{d−1}) X^{(d)}_{α_{d−1},α_d}(i_d),      (1)

where i_k = 1,...,n_k are the mode (physical) indices, α_k = 1,...,r_k are the rank indices, X^{(k)} are the tensor train cores (TT-cores) and X̄ = (X^{(1)},...,X^{(d)}) denotes the whole tensor train. Here and later we use the Einstein summation convention [10], which assumes a summation over every pair of repeated indices. Therefore, in Eq. (1) we assume the summation over all rank indices α_k, k = 1,...,d−1. We also imply the closed boundary conditions r_0 = r_d = 1 to make the right-hand side a scalar for each (i_1,...,i_d). Eq. (1) is written in the elementwise form, i.e., the equation is assumed over all free (unpaired) indices. This form is often convenient in higher dimensions and will be used throughout the paper.

The indices can be written either in the subscript x_j or in brackets x(j). For the summation, there is no difference. The subscripted indices are usually considered as row and column indices of a matrix, while the indices in brackets are seen as parameters. For example, each TT-core X^{(k)} is considered as a matrix depending on the parameter i_k, with the row index α_{k−1} and the column index α_k, as follows

    X^{(k)} = [X^{(k)}_{α_{k−1},α_k}(i_k)] ∈ C^{r_{k−1}×n_k×r_k},    X^{(k)}(i_k) ∈ C^{r_{k−1}×r_k}.

In our notation X^{(k)}(i_k) is a matrix, to which standard algorithms like orthogonalization (QR) and singular value decomposition (SVD) can be applied. We will freely transfer indices from subscripts to brackets in order to make the equations easier to read or to emphasize a certain transposition of elements in tensors. This brings the notation in consistence with previous papers on numerical tensor methods, e.g. [18, 9, 8, 35] and others.
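To make the elementwise map (1) concrete, the following sketch evaluates a single entry of a TT tensor by multiplying the core slices X^{(k)}(i_k). It is an illustration, not code from the paper; the storage convention (a list of 3-dimensional NumPy arrays of shape (r_{k−1}, n_k, r_k)) and all names are our assumptions.

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate x(i_1,...,i_d) for a tensor given by its TT-cores, following Eq. (1).

    cores[k] is assumed to have shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1,
    so the entry is the product of the matrix slices X^{(k)}(i_k)."""
    v = np.ones((1, 1))
    for X, i in zip(cores, idx):
        v = v @ X[:, i, :]           # multiply by the r_{k-1} x r_k slice X^{(k)}(i_k)
    return v[0, 0]                   # r_0 = r_d = 1, hence a scalar

# toy usage: a random 4-dimensional tensor train with mode size 3 and TT-ranks 2
rng = np.random.default_rng(0)
ranks = [1, 2, 2, 2, 1]
cores = [rng.standard_normal((ranks[k], 3, ranks[k + 1])) for k in range(4)]
print(tt_entry(cores, (0, 2, 1, 0)))
```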
We will reshape arrays into matrices and vectors by using the index grouping, i.e., combining two or more indices α,...,ζ into a single multi-index α...ζ. Following [35] we define the interface matrices X^{≤k} ∈ C^{n_1...n_k×r_k} and X^{>k} ∈ C^{r_k×n_{k+1}...n_d} as follows

    X^{≤k}(i_1 i_2 ... i_k, α_k) = X^{(1)}_{α_1}(i_1) X^{(2)}_{α_1,α_2}(i_2) ⋯ X^{(k)}_{α_{k−1},α_k}(i_k),
    X^{>k}(α_k, i_{k+1} ... i_{d−1} i_d) = X^{(k+1)}_{α_k,α_{k+1}}(i_{k+1}) ⋯ X^{(d−1)}_{α_{d−2},α_{d−1}}(i_{d−1}) X^{(d)}_{α_{d−1}}(i_d),      (2)

and similarly for the symbols X^{<k} and X^{≥k}. Using the τ notation defined in (1) we can write x = τ(X^{≤k}, X^{>k}). For a tensor x = [x(i_1,...,i_d)] we also define the unfolding matrix, which consists of the entries of the original tensor, as follows

    X^{{k}}(i_1...i_k, i_{k+1}...i_d) = x(i_1,...,i_d),    X^{{k}} ∈ C^{n_1...n_k×n_{k+1}...n_d}.

For x in the TT-format (1) it holds X^{{k}} = X^{≤k} X^{>k} and therefore rank X^{{k}} ≤ r_k. In [30] the reverse is proven: for any tensor x there exists the representation (1) with TT-ranks r_k = rank X^{{k}}. This gives the term TT-rank a definite algebraic meaning. As a result, the tensor train representation with fixed TT-ranks yields a closed manifold, and the rank-(r_1,...,r_{d−1}) approximation problem is well-posed. We can also approximate a given tensor by a tensor train with quasi-optimal ranks using a simple and robust approximation (rank truncation, or tensor rounding) algorithm [30]. This is the case for all tensor networks without cycles, e.g. Tucker [40], HT [15], QTT-Tucker [8], etc. In contrast, the MPS formalism originally assumes the periodic boundary conditions α_0 = α_d and a sum over these indices, which leads to Tr(X^{(1)}...X^{(d)}), where all matrices can be shifted in a cycle under the trace. The optimization in tensor networks of this type is difficult, because they form unclosed manifolds and the best approximation does not always exist.

The tensor train representation of a matrix is made similarly, with the TT-cores depending on two parameters i_k, j_k. Hence, x = τ(X̄) is sought in the form (1), and A and y are given in the TT-format as follows

    A(i_1,...,i_d; j_1,...,j_d) = A^{(1)}(i_1,j_1) ⋯ A^{(d)}(i_d,j_d),
    y(i_1,...,i_d) = Y^{(1)}(i_1) ⋯ Y^{(d)}(i_d).      (3)

For A and x given in the TT-format, the matrix-vector product c = Ax is also a TT-format, computed as follows

    c(i_1,...,i_d) = (A^{(1)}(i_1,j_1) ⊗ X^{(1)}(j_1)) ⋯ (A^{(d)}(i_d,j_d) ⊗ X^{(d)}(j_d)),

where ⊗ denotes the tensor (Kronecker) product of two matrices, defined as follows

    A = [A(i,j)],   B = [B(p,q)],   C = A ⊗ B = [C(ip, jq)] = [A(i,j)B(p,q)].

We refer to [30] for more details on basic tensor operations in the TT-format.

In this paper we will use the standard l_2 scalar product (·,·) and the A-scalar product (·,·)_A defined by a symmetric positive definite (SPD) matrix A as follows

    (u,v)_A = u^*Av,    ‖u‖_A^2 = (u,u)_A.

For a given nonsingular matrix U we define the A-orthogonal projector R_U as follows: for all v and all w ∈ span U it holds

    R_U v ∈ span U,    (w, R_U v)_A = (w,v)_A,    R_U = U(U^*AU)^{−1}U^*A.

We will use vector notations for the mode indices i = (i_1,...,i_d) and the rank indices r = (r_0,...,r_d). We also denote the subspace of tensor trains X̄ = (X^{(1)},...,X^{(d)}) with tensor ranks r as

    T_r = ⨉_{i=1}^{d} C^{r_{i−1}×n_i×r_i}.
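The Kronecker-product formula for c = Ax can be carried out core by core: the k-th core of c is C^{(k)}(i_k) = Σ_{j_k} A^{(k)}(i_k,j_k) ⊗ X^{(k)}(j_k), so the TT-ranks of c are products of the ranks of A and x. The sketch below is our illustration of this fact under assumed storage conventions (matrix cores as arrays of shape (r^A_{k−1}, n_k, n_k, r^A_k), vector cores as in the previous snippet); an efficient implementation, see [30, 9], avoids the explicit Kronecker blow-up.

```python
import numpy as np

def tt_matvec(A_cores, x_cores):
    """TT-cores of c = A x: C^{(k)}(i) = sum_j A^{(k)}(i,j) kron X^{(k)}(j)."""
    c_cores = []
    for A, X in zip(A_cores, x_cores):
        rA1, n, _, rA2 = A.shape           # A core indexed (alpha, i, j, beta)
        r1, _, r2 = X.shape                # x core indexed (gamma, j, delta)
        C = np.einsum('aijb,cjd->acibd', A, X)   # contract over j, keep rank pairs
        c_cores.append(C.reshape(rA1 * r1, n, rA2 * r2))
    return c_cores

# quick check for d = 2 against the dense Kronecker-product representation
rng = np.random.default_rng(7)
A1 = rng.standard_normal((1, 3, 3, 2)); A2 = rng.standard_normal((2, 3, 3, 1))
X1 = rng.standard_normal((1, 3, 2));    X2 = rng.standard_normal((2, 3, 1))
A_dense = sum(np.kron(A1[0, :, :, a], A2[a, :, :, 0]) for a in range(2))
x_dense = sum(np.kron(X1[0, :, a], X2[a, :, 0]) for a in range(2))
C = tt_matvec([A1, A2], [X1, X2])
c_dense = sum(np.kron(C[0][0, :, a], C[1][a, :, 0]) for a in range(C[0].shape[2]))
print(np.allclose(c_dense, A_dense @ x_dense))    # True
```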
3 Alternating minimization methods

3.1 ALS-like minimization with fixed TT-ranks

The MPS formalism was proposed in quantum physics, where the representation (1) was used for the minimization of the Rayleigh quotient (x,Ax)/(x,x). Similarly, the solution of a linear system Ax = y with A = A^* can be sought through the minimization of an energy function

    J(x) = ‖x_* − x‖_A^2 = (x,Ax) − 2ℜ(x,y) + const,      (4)

where x_* denotes the exact solution. We consider the Hermitian matrix A = A^* and the right-hand side y given in the TT-format (3), and solve the minimization problem with x sought in the TT-format (1) with fixed TT-ranks r, i.e., X̄_* = argmin_{X̄∈T_r} J(τ(X̄)). This heavy nonlinear minimization problem can hardly be solved unless a (very) accurate initial guess is available (see, e.g., [35]). To make it tractable, we can use the alternating linear optimization framework and substitute the global minimization over the tensor train X̄ ∈ T_r by the linear minimization over the cores X^{(1)},...,X^{(d)} subsequently in a cycle. Solving the local problem, we assume that all cores but the k-th of the current tensor train X̄ = (X^{(1)},...,X^{(d)}) are 'frozen', and the minimization is done over X^{(k)} as follows

    X̄_new = (X^{(1)},...,X^{(k−1)}, X^{(k)}_new, X^{(k+1)},...,X^{(d)}),   where
    X^{(k)}_new = argmin_{X^{(k)}} J(τ(X̄)),   s.t. X^{(k)} ∈ C^{r_{k−1}×n_k×r_k}.      (5)

Clearly, the energy function does not grow during the sequence of ALS updates and the solution will converge to a local minimum.

To write each ALS step as a linear problem, let us stretch all entries of the TT-core X^{(k)} into the vector x_k(α_{k−1} i_k α_k) = X^{(k)}_{α_{k−1},α_k}(i_k). From (1) we see that x = X_{≠k} x_k, where X_{≠k} = P_{≠k}(X̄) is the n_1...n_d × r_{k−1}n_k r_k matrix defined as follows

    X_{≠k}(i_1...i_d, α_{k−1} j_k α_k) = X^{(1)}_{α_1}(i_1) ⋯ X^{(k−1)}_{α_{k−2},α_{k−1}}(i_{k−1}) δ(i_k,j_k) X^{(k+1)}_{α_k,α_{k+1}}(i_{k+1}) ⋯ X^{(d)}_{α_{d−1}}(i_d),
    X_{≠k} = P_{≠k}(X̄) = X^{<k} ⊗ I_{n_k} ⊗ (X^{>k})^⊤,      (6)

where δ(i,j) is the Kronecker symbol, i.e., δ(i,j) = 1 if i = j and δ(i,j) = 0 elsewhere.

[Figure 1: Tensor network corresponding to the quadratic form (Ax,x) with the matrix A and vector x given in the tensor train format. The boxes are tensors with lines (legs) denoting indices. Each bond between two tensors assumes a summation over the joint index.]

If J(τ(X̄)) is considered as a function of x_k, it is also a second-order energy function

    J(τ(X̄)) = (AX_{≠k}x_k, X_{≠k}x_k) − 2(y, X_{≠k}x_k) = (X_{≠k}^*AX_{≠k}x_k, x_k) − 2(X_{≠k}^*y, x_k),

where the gradient w.r.t. x_k is zero when¹

    (X_{≠k}^*AX_{≠k}) x_k = X_{≠k}^*y.      (7)

The solution of the local minimization problem (5) is therefore equivalent to the solution of the original system Ax = y in the reduced basis X_{≠k} = P_{≠k}(X̄), defined by (6).

¹ For illustration see Fig. 1, for the detailed derivation see [18, 9].

The tensor train representation (1) is non-unique. Indeed, two representations X̄ and Ȳ map to one tensor τ(X̄) = τ(Ȳ) as soon as

    Y^{(k)}(i_k) = H_{k−1}^{−1} X^{(k)}(i_k) H_k,   k = 1,...,d,

where H_0 = H_d = 1 and H_k ∈ C^{r_k×r_k}, k = 1,...,d−1, are arbitrary nonsingular matrices. Given a vector in the TT-format x = τ(X̄), any transformation H = (H_0,...,H_d) does not change the energy level, since J(τ(X̄)) = J(τ(Ȳ)), but gives us some flexibility in the choice of the reduced basis, since P_{≠k}(X̄) ≠ P_{≠k}(Ȳ). The proper choice of the representation X̄ essentially defines the reduced basis and affects the properties of the local problem (7).
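For very small problems the frame matrix X_{≠k} of (6) and the reduced system (7) can be assembled explicitly, which is useful for illustration and testing. The sketch below does exactly that; it is our toy example (dense matrices, random data, hypothetical helper names), whereas a practical solver never forms X_{≠k} explicitly and uses the structured matvec of [9].

```python
import numpy as np

def left_interface(cores):
    """X^{<k} of Eq. (2): rows are the grouped indices i_1...i_{k-1}, columns alpha_{k-1}."""
    M = np.ones((1, 1))
    for X in cores:
        r1, n, r2 = X.shape
        M = (M @ X.reshape(r1, n * r2)).reshape(-1, r2)
    return M

def right_interface(cores):
    """X^{>k} of Eq. (2): rows are alpha_k, columns the grouped indices i_{k+1}...i_d."""
    M = np.ones((1, 1))
    for X in reversed(cores):
        r1, n, r2 = X.shape
        M = (X.reshape(r1 * n, r2) @ M).reshape(r1, -1)
    return M

def frame(cores, k):
    """Dense X_{!=k} = X^{<k} (x) I_{n_k} (x) (X^{>k})^T of Eq. (6); exponential in d."""
    n_k = cores[k].shape[1]
    return np.kron(np.kron(left_interface(cores[:k]), np.eye(n_k)),
                   right_interface(cores[k + 1:]).T)

# one ALS update of the middle core of a 3D problem, Eq. (7)
rng = np.random.default_rng(1)
n, d, r = 3, 3, 2
ranks = [1, r, r, 1]
cores = [rng.standard_normal((ranks[j], n, ranks[j + 1])) for j in range(d)]
N = n ** d
M = rng.standard_normal((N, N)); A = M @ M.T + N * np.eye(N)     # a random SPD matrix
y = rng.standard_normal(N)

k = 1
P = frame(cores, k)                                   # reduced basis P_{!=k}(X)
xk = np.linalg.solve(P.T @ A @ P, P.T @ y)            # local system (7)
cores[k] = xk.reshape(cores[k].shape)                 # write the solution back into the core
print(xk @ (P.T @ A @ P) @ xk - 2 * (P.T @ y) @ xk)   # reduced energy, up to a constant
```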
A prominent transformation H is the TT-orthogonalization algorithm proposed in [30]. It chooses the matrices H_k by applying the QR factorization to the reshaped TT-cores, i.e., matrices of size r_{k−1}n_k × r_k and/or r_{k−1} × n_k r_k. The transformation H given by the TT-orthogonalization implies the left-orthogonality constraints on the TT-cores Y^{(1)},...,Y^{(k−1)} and right-orthogonality on Y^{(k+1)},...,Y^{(d)}, which results in the orthogonality of the interfaces Y^{<k} and Y^{>k} and hence of the reduced basis Y_{≠k} = P_{≠k}(Ȳ). Such a normalization step will be assumed in many algorithms throughout the paper; in most cases we will do this without introducing a new representation Ȳ, just by 'claiming' the necessary orthogonalization pattern of the TT representation we use. If the reduced basis method is applied and the representation X̄ is chosen so that X_{≠k} = P_{≠k}(X̄) = P is orthogonal, the spectrum of the reduced matrix P^*AP lies between the minimum and maximum eigenvalues of the matrix A. Indeed, using the Rayleigh quotient [19], we write

    λ_min(P^*AP) = min_{‖v‖=1} (Pv, APv) = min_{u∈span P, ‖u‖=1} (u,Au) ≥ min_{‖u‖=1} (u,Au) = λ_min(A),

and similarly for the maximum values. It follows that the reduced matrix is conditioned not worse than the original, cond(X_{≠k}^*AX_{≠k}) ≤ cond(A). Therefore, the orthogonality of TT-cores ensures the stability of the local problems, and we will silently assume it for all reduced problems in this paper.

To conclude this part, let us calculate the complexity of the local problem (7). As was pointed out in [9], either a direct elimination or an iterative linear solver with fast matrix-by-vector products (matvecs) may be applied. If the direct solution method is used, the costs required to form the r_{k−1}n_k r_k × r_{k−1}n_k r_k matrix of the local problem (7) are smaller than the complexity of the Gaussian elimination, i.e., the overall cost is O(n^3 r^6).² If an iterative method is used to solve the local problem, one multiplication by X_{≠k}^*AX_{≠k} requires O(n r_A r^3 + n^2 r_A^2 r^2) operations, where r and r_A denote the TT-ranks of the current solution x and of the matrix A, respectively. Careful implementation of the matvec is essential to reach this complexity, see [9] for details. The complexity of the normalization step is only O(d n r^3) operations and can be neglected.

² We will always assume that n_1 = ... = n_d = n and r_1 = ... = r_{d−1} = r in the complexity estimates.

3.2 DMRG-like minimization and adaptivity of TT-ranks

In practical numerical work the TT-ranks of the solution are usually not known in advance, which puts a restriction on the use of methods with fixed TT-ranks. The underestimation of TT-ranks leads to a low accuracy of the solution, while the overestimation results in a large computational overhead. This motivates the development of methods which can choose and modify the TT-ranks on-the-fly, adaptively to the desired accuracy level. A prominent example of such methods is the Density Matrix Renormalization Group (DMRG) algorithm [42], developed in the quantum physics community for the solution of the ground state problem. DMRG performs similarly to the ALS but at each step combines two succeeding blocks X^{(k)} and X^{(k+1)} into one superblock

    w_k(α_{k−1} i_k i_{k+1} α_{k+1}) = W^{(k)}_{α_{k−1},α_{k+1}}(i_k i_{k+1}),    W^{(k)}(i_k i_{k+1}) = X^{(k)}(i_k) X^{(k+1)}(i_{k+1}),      (8)

and performs the minimization over w_k. Classical DMRG minimizes the Rayleigh quotient; our version minimizes the energy function J(x), see [18, 9]. Similarly to (6), (7), we write the local DMRG problem B w_k = g_k as follows

    P = P_{≠{k,k+1}}(X̄) = X^{<k} ⊗ I_{n_k} ⊗ I_{n_{k+1}} ⊗ (X^{>k+1})^⊤ ∈ C^{n_1...n_d × r_{k−1}n_k n_{k+1}r_{k+1}},
    B = P^*AP,    g_k = P^*y ∈ C^{r_{k−1}n_k n_{k+1}r_{k+1}}.      (9)

When w_k is computed, the new TT-blocks are obtained by a low-rank decomposition, i.e. the right-hand side of (8) is recomputed and the k-th rank is updated adaptively to the chosen accuracy.
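The low-rank split of the superblock is typically done by a truncated SVD of w_k reshaped into an r_{k−1}n_k × n_{k+1}r_{k+1} matrix. The snippet below is a sketch of this step under our storage conventions; the particular truncation criterion (a relative Frobenius threshold eps) is an assumption for illustration, not a prescription from the text above.

```python
import numpy as np

def split_superblock(w, r_km1, n_k, n_kp1, r_kp1, eps=1e-8):
    """Split the superblock w_k of Eq. (8) into two TT-cores by a truncated SVD,
    choosing the new rank r_k adaptively from the singular values."""
    W = w.reshape(r_km1 * n_k, n_kp1 * r_kp1)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    tail = np.cumsum(s[::-1] ** 2)[::-1]                  # tail[r] = sum of s[r:]**2
    r_k = next((r for r in range(len(s)) if tail[r] <= (eps * np.linalg.norm(s)) ** 2), len(s))
    r_k = max(r_k, 1)
    X_k = U[:, :r_k].reshape(r_km1, n_k, r_k)             # left-orthogonal new core
    X_kp1 = (np.diag(s[:r_k]) @ Vt[:r_k]).reshape(r_k, n_kp1, r_kp1)
    return X_k, X_kp1

# quick check: with eps = 0 the superblock is reproduced exactly
w = np.random.default_rng(6).standard_normal(2 * 3 * 3 * 2)
Xk, Xk1 = split_superblock(w, 2, 3, 3, 2, eps=0.0)
print(np.allclose(np.einsum('aib,bjc->aijc', Xk, Xk1).ravel(), w))   # True
```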
The minimization over the O(n^2 r^2) components of w_k leads to the complexity O(n^3) and seriously increases the computational time for systems with large mode sizes.

3.3 One-block enrichment as a Galerkin reduction of the two-dimensional system

Suppose that we have just solved (7) and updated the TT-block X^{(k)}. Before we move to the next step, we would like to improve the reduced basis P_{≠k+1}(X̄) by adding a few vectors to it. Denote the current solution vector by t = τ(T̄) and suppose we add a step s = τ(S̄). Then the updated solution x = t + s has the TT-representation x = τ(X̄) defined as follows

    X^{(1)}(i_1) := [T^{(1)}(i_1)  S^{(1)}(i_1)],
    X^{(p)}(i_p) := \begin{bmatrix} T^{(p)}(i_p) & 0 \\ 0 & S^{(p)}(i_p) \end{bmatrix},    X^{(d)}(i_d) := \begin{bmatrix} T^{(d)}(i_d) \\ S^{(d)}(i_d) \end{bmatrix},      (10)

where p = 2,...,d−1. We will denote this tensor train as X̄ = T̄ + S̄.³ The considered update affects the solution process in two ways: first, naturally, it adds a certain correction to the solution, and second, it enlarges the reduced basis that we will use at the next step of the ALS minimization. Indeed, it can easily be seen from definition (2) that

    X^{<k} = [T^{<k}  S^{<k}],    (X^{>k})^⊤ = [(T^{>k})^⊤  (S^{>k})^⊤].

From (6) we conclude that P_{≠k}(T̄+S̄) = [T^{<k}  S^{<k}] ⊗ I ⊗ [(T^{>k})^⊤  (S^{>k})^⊤] and hence

    X_{≠k} = P_{≠k}(T̄+S̄) = [ P_{≠k}(T̄)    S^{<k} ⊗ I ⊗ (T^{>k})^⊤    T^{<k} ⊗ I ⊗ (S^{>k})^⊤    P_{≠k}(S̄) ].      (11)

A clever choice of s = τ(S̄) allows us to add the essential vectors to span P_{≠k}(T̄+S̄) and therefore improve the convergence of ALS.

A random choice of s ∈ T_r with some small TT-ranks r (cf. the random kick proposed in [37, 29]) may lead to a slow convergence. It also introduces an unwanted perturbation of the solution. A more robust idea is to choose s in accordance with some one-step iterative method, for instance, take s ≈ z = y − At and construct a steepest descent or minimal residual method with approximations. This choice allows us to derive the convergence estimate similarly to the classical one and will be discussed in Sec. 4.

To stay within methods of linear complexity, we restrict ourselves to zero shifts s = 0 with the simple TT-structure S̄ = (0,...,0,S^{(k)},0,...,0). The tensor train X̄ = T̄ + S̄ has the following structure⁴

    X^{(k)}(i_k) = [T^{(k)}(i_k)  S^{(k)}(i_k)],    X^{(k+1)}(i_{k+1}) = \begin{bmatrix} T^{(k+1)}(i_{k+1}) \\ 0 \end{bmatrix},      (12)

and X^{(p)} = T^{(p)} for other p. Note that since s = 0, the enrichment step does not affect the energy, J(τ(X̄)) = J(τ(T̄)). Therefore, we can choose S^{(k)} freely and develop (probably heuristic) approaches to improve the convergence of our scheme.

³ Due to the non-uniqueness of the TT-format, other representations (probably with smaller TT-ranks) can exist for x = t + s.
⁴ We give a description for the forward sweep, i.e. the one with increasing k = 1,...,d. For the backward sweep the construction is done analogously.

[Figure 2: Linear system Ax = y in the reduced basis P_{≠k+1}(X̄), shown by tensor networks. The reduced system has r_k n_{k+1} r_{k+1} unknowns, shown by the dark box. Gray boxes show the X^{(k)} which is updated by S^{(k)} to improve the convergence. White boxes contribute to the local matrix B and right-hand side g of the 2D system (9).]
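The block structure (10) of x = t + s is straightforward to form core by core. The sketch below is our illustration of this construction (same assumed storage as in the previous snippets); it also checks on random data that the entries of the result are indeed the sums of the entries of t and s.

```python
import numpy as np

def tt_add(T, S):
    """TT-cores of t + s built by the block structure of Eq. (10)."""
    d = len(T)
    cores = []
    for k, (Tk, Sk) in enumerate(zip(T, S)):
        t1, n, t2 = Tk.shape
        s1, _, s2 = Sk.shape
        if k == 0:
            C = np.concatenate([Tk, Sk], axis=2)       # row block [T^(1)(i)  S^(1)(i)]
        elif k == d - 1:
            C = np.concatenate([Tk, Sk], axis=0)       # column block stacking T^(d), S^(d)
        else:
            C = np.zeros((t1 + s1, n, t2 + s2))        # block-diagonal middle cores
            C[:t1, :, :t2] = Tk
            C[t1:, :, t2:] = Sk
        cores.append(C)
    return cores

def entry(cores, idx):
    v = np.ones((1, 1))
    for X, i in zip(cores, idx):
        v = v @ X[:, i, :]
    return v[0, 0]

rng = np.random.default_rng(3)
shapes = [(1, 3, 2), (2, 3, 2), (2, 3, 1)]
T = [rng.standard_normal(s) for s in shapes]
S = [rng.standard_normal(s) for s in shapes]
print(np.isclose(entry(tt_add(T, S), (0, 1, 2)),
                 entry(T, (0, 1, 2)) + entry(S, (0, 1, 2))))        # True
```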
The reduced basis X_{≠k+1} = P_{≠k+1}(T̄+S̄) depends on the choice of S^{(k)} as follows (cf. (6))

    X_{≠k+1}(i_1...i_d, α_k j_{k+1} α_{k+1}) = X^{<k}(i_1...i_{k−1}, α_{k−1}) X^{(k)}_{α_{k−1},α_k}(i_k) δ(i_{k+1},j_{k+1}) X^{>k+1}(α_{k+1}, i_{k+2}...i_d),
    X_{≠k+1} = X^{<k}_{α_{k−1}} ⊗ X^{(k)}_{α_{k−1}} ⊗ I_{n_{k+1}} ⊗ (X^{>k+1})^⊤,
    X^{(k)}_{α_{k−1}} = [T^{(k)}_{α_{k−1}}  S^{(k)}_{α_{k−1}}],      (13)

where X^{<k}_{α_{k−1}} is a column of X^{<k} and X^{(k)}_{α_{k−1}} is the n_k × r_k matrix which is the slice of the 3-tensor X^{(k)} = [X^{(k)}(α_{k−1},i_k,α_k)] corresponding to the fixed α_{k−1}, and similarly for S^{(k)}_{α_{k−1}} and T^{(k)}_{α_{k−1}}. Below we write the local system (7) at the step k+1 and see how it is affected by the choice of S^{(k)}.

The two-dimensional system defined by (9) is shown by gray boxes in Fig. 2. It appears here as the local problem in the DMRG method, but in the same framework we may consider the whole initial system with d = 2 and k = 1, depending on what type of analysis we would like to perform.

Now the reduced system for the elements of x_{k+1}(β_k j_{k+1} β_{k+1}) = X^{(k+1)}(β_k,j_{k+1},β_{k+1}) writes

    X^{(k)}_{α_k,a} B_{ab,a′b′} X^{(k)}_{a′,β_k} X^{(k+1)}_{β_k,b′} = X^{(k)}_{α_k,a} G_{a,b},
    [X^{(k)} ⊗ I]^* B [X^{(k)} ⊗ I] x_{k+1} = [X^{(k)} ⊗ I]^* g,      (14)

where the following multi-indices are introduced for brevity of notation

    α_{k−1} i_k = a,    β_{k−1} j_k = a′,    a, a′ = 1,...,r_{k−1}n_k,
    i_{k+1} α_{k+1} = b,    j_{k+1} β_{k+1} = b′,    b, b′ = 1,...,r_{k+1}n_{k+1},

and X^{(k)} ∈ C^{r_{k−1}n_k×r_k}, I = I_{r_{k+1}n_{k+1}}. The system (14) has r_k n_{k+1} r_{k+1} unknowns. At the same time it is the reduction of the 2D system Bw = g, which has r_{k−1}n_k n_{k+1} r_{k+1} unknowns. Therefore, the choice of the enrichment S^{(k)} (as a part of X^{(k)}) can be considered as a cheaper approximation of the 2D system solution. Taking into account the structure of X^{(k)} from (13), we rewrite (14) as follows

    [T  S]^* B [T  S] x_{k+1} = [T  S]^* g,    T = T^{(k)} ⊗ I,    S = S^{(k)} ⊗ I.      (15)

The system (15) is difficult to analyze. However, we may propose a certain approximation to its solution, and estimate the quality of the solution of the whole system (15) via the properties of the approximation. Namely, let us consider the zero-padded TT-core X^{(k+1)} in (12) as the initial guess, i.e., some information about the solution x_{k+1} that we want to use. For instance, we can apply the block Gauss–Seidel step, restricting the unknown block to the form

    X^{(k+1)}(i_{k+1}) = \begin{bmatrix} T^{(k+1)}(i_{k+1}) \\ V(i_{k+1}) \end{bmatrix},    t(α′_k i_{k+1} α_{k+1}) = T^{(k+1)}_{α′_k,α_{k+1}}(i_{k+1}),    v(α′′_k i_{k+1} α_{k+1}) = V_{α′′_k,α_{k+1}}(i_{k+1}).      (16)

Then (15) writes as the following overdetermined system

    \begin{bmatrix} T^* \\ S^* \end{bmatrix} B (Tt + Sv) = \begin{bmatrix} T^* \\ S^* \end{bmatrix} g,

and following the Gauss–Seidel step we solve it considering only the lower part

    (S^*BS) v = S^*(g − BTt).      (17)

Equation (17) is a Galerkin reduction method with the basis S applied to the system Bw = g with the initial guess (8), and the TT-cores X^{(k)} and X^{(k+1)} defined by (12). After (17) is solved, the updated superblock W^{(k)}_new writes as follows

    X^{(k)}(i_k) = [T^{(k)}(i_k)  S^{(k)}(i_k)],    X^{(k+1)}_new(i_{k+1}) = \begin{bmatrix} T^{(k+1)}(i_{k+1}) \\ V(i_{k+1}) \end{bmatrix},
    W^{(k)}_new(i_k i_{k+1}) = T^{(k)}(i_k) T^{(k+1)}(i_{k+1}) + S^{(k)}(i_k) V(i_{k+1}) = W^{(k)}(i_k i_{k+1}) + S^{(k)}(i_k) V(i_{k+1}),      (18)

which allows us to consider the proposed method as a solver for the 2D system which performs a low-rank correction of the superblock rather than recomputing it from scratch.

Equations (14) and (17) can be considered as certain approximate approaches to the solution of the 2D system (9). Different such approaches can be collected into Table 1, sorted from the highest to the lowest accuracy.
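The Galerkin correction (17) is easy to exercise on a small dense model of the 2D system (9). The sketch below uses random stand-ins for B, g, T^{(k)}, S^{(k)} and the current guess t (all sizes and names are our assumptions) and verifies the defining property of the step: minimizing the energy over the correction Sv never increases it.

```python
import numpy as np

rng = np.random.default_rng(4)
r0, n1, n2, r2, rk = 2, 3, 3, 2, 2                 # r_{k-1}, n_k, n_{k+1}, r_{k+1}, r_k
N = r0 * n1 * n2 * r2                              # size of the 2D system (9)
M = rng.standard_normal((N, N)); B = M @ M.T + N * np.eye(N)   # dense SPD stand-in for P*AP
g = rng.standard_normal(N)

Tk = rng.standard_normal((r0 * n1, rk))            # T^(k), reshaped to r_{k-1}n_k x r_k
Sk = rng.standard_normal((r0 * n1, rk))            # enrichment S^(k)
I = np.eye(n2 * r2)
T, S = np.kron(Tk, I), np.kron(Sk, I)              # T and S of Eq. (15)

t = rng.standard_normal(rk * n2 * r2)              # current guess for x_{k+1}
v = np.linalg.solve(S.T @ B @ S, S.T @ (g - B @ T @ t))   # Galerkin correction (17)

J = lambda w: w @ B @ w - 2 * g @ w                # energy of the 2D system, up to a constant
print(J(T @ t + S @ v) <= J(T @ t))                # True: the correction does not increase J
```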
4 Steepest descent schemes

4.1 Steepest descent with perturbation

Given the initial guess t, the steepest descent (SD) step minimizes the energy function (4) over vectors x = t + sα, where the step is chosen as follows

    s = −grad J(t) = y − At = z,    α = argmin_α J(t + zα) = (z,z)/(z,Az).
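In dense linear algebra this step reads as follows; the sketch is a plain NumPy illustration of the classical iteration (in the methods of this paper the residual z and the update are, in addition, represented or approximated in the TT format).

```python
import numpy as np

def sd_step(A, y, t):
    """One steepest descent step: t + alpha*z with z = y - A t, alpha = (z,z)/(z,Az)."""
    z = y - A @ t
    alpha = (z @ z) / (z @ (A @ z))
    return t + alpha * z

# toy usage on a small SPD system
rng = np.random.default_rng(5)
n = 50
M = rng.standard_normal((n, n)); A = M @ M.T + n * np.eye(n)
y = rng.standard_normal(n)
x = np.zeros(n)
for _ in range(20):
    x = sd_step(A, y, x)       # the error decreases monotonically in the A-norm
print(np.linalg.norm(A @ x - y) / np.linalg.norm(y))
```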