ebook img

A Framework of Constraint Preserving Update Schemes for Optimization on Stiefel Manifold PDF

0.43 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview A Framework of Constraint Preserving Update Schemes for Optimization on Stiefel Manifold

A Framework of Constraint Preserving Update Schemes for Optimization on Stiefel Manifold Bo Jiang · Yu-Hong Dai 3 1 0 2 c RevisedDecmeber26,2013 e D Abstract This paper considers optimization problems on the Stiefel manifold XTX = I , where X Rn p 7 p × ∈ 2 is the variable and Ip is the p-by-p identity matrix. A framework of constraint preserving update schemes is proposed by decomposing each feasible point into the range space of X and the null space of XT. While this ] C general framework can unify many existing schemes, a new update scheme with low complexity cost is also O discovered. Then we study a feasible Barzilai-Borwein-like method under the new update scheme. The global convergence of the method is established with an adaptive nonmonotone line search. The numerical tests on . h the nearest low-rank correlation matrix problem, the Kohn-Sham total energy minimization and a specific t a problem from statistics demonstrate the efficiency of the new method. In particular, the new method performs m remarkablywell for the nearestlow-rankcorrelationmatrix problemin terms of speed and solutionquality and [ is considerably competitive with the widely used SCF iteration for the Kohn-Sham total energy minimization. 2 Keywords Stiefel manifold orthogonality constraint sphere constraint range space null space Barzilai- v · · · · · 2 Borwein-like method feasible global convergence adaptive nonmonotone line search low-rank correlation · · · · 7 matrix Kohn-Sham total energy minimization heterogeneous quadratic functions 1 · · 0 Mathematics Subject Classification (2010) 49Q99 65K05 90C30 . · · 1 0 3 1 Introduction 1 : v In this paper, we consider general feasible methods for optimization on the Stiefel manifold, i X T min (X), s.t. X X =I , (1.1) r p a X Rn×p F ∈ where I is the p-by-p identity matrix and (X): Rn p R is a differentiable function. The feasible set p × := X Rn p :XTX =I is referred toFas the Stiefel→manifold, which was due to Stiefel in 1935 [48]. n,p × p S { ∈ } Problem (1.1) captures many applications, for instance, nearest low-rank correlation matrix problem [25, 38, 44], linear eigenvalue problem [23, 45], Kohn-Sham total energy minimization [53], orthogonal Procrustes ThisworkwaspartlysupportedbytheChineseNSFgrants(nos. 10831106 and81211130105), theCASgrant(no. kjcx-yw-s7-03) andtheChinaNationalFundsforDistinguishedYoungScientists(no.11125107). B.Jiang·Y.-H.Dai LSEC,ICMSEC,AcademyofMathematicsandSystemsSciences,ChineseAcademyofSciences,Beijing100190, CHINA. B.Jiang E-mail:[email protected] Y.-H.Dai E-mail:[email protected] CorrespondingAuthor. 2 BoJiang,Yu-HongDai problem [19, 47], maximization of sums of heterogeneous quadratic functions from statistics [9, 40], sparse principalcomponent analysis[16, 30, 58], leakage interference minimization [33,37],joint diagonalization(blind source separation) [29, 50] and determining a minimal set of localized orbitals from atomic chemistry [10]. For other applications,we referinterestedreadersto [18, 52] andthe referencestherein.In general,problem(1.1)is difficult to solve due to the nonconvexity of the orthogonality constraint. In fact, some of the above examples, including the maxcut problem and the leakage interference minimization [33], are NP-hard. Withthewideapplicabilityandfundamentaldifficulty,problem(1.1)hasattractedmanyresearchers.Based on the geometric structure, Rapcs´ak [40, 41] reformulated it as a smooth nonlinear program by introducing a new coordinate representation.From the point of view of manifold, some authors proposed a variety of feasible algorithms to solve problem (1.1). These algorithms include steepest descent methods [1, 3, 34, 35], Barzilai- Borwein(BB)method[52],conjugategradientmethods[2,3,18],trustregionmethods[3,55],Newtonmethods [3, 18], quasi-Newton methods [46] and subspace methods [54]. Unlike the unconstrained case, it is not trivial to keep the whole iterations in the Stiefel manifold and the concept of retraction has played an important role (see [4, Theorem 15] and [3, Definition 4.1.1] for a detailed description on retractions). Simply speaking, the retractiondefines an update scheme which preservesthe orthogonalityconstraint.The existing update schemes employedbytheaforementionedmethodsaresomespecificchoicesofretractions.Theycanbeclassifiedintotwo types:geodesic-likeandprojection-likeupdate schemes.Brieflyspeaking,geodesic-likeupdate schemespreserve the constraint by moving a point along the geodesic or quasi-geodesic while projection-like update schemes do so by (approximately) projecting a point into the constraint. We will delve into the details of these update schemes in 2.1. § Inthis paper,we firstlydevelopa frameworkofconstraintpreservingupdate schemesbasedona novelidea; T i.e., decomposing each feasible point into the range space of X and the null space of X . This framework can not only unify most existing schemes including a kind of geodesic, gradient projection, Manton’s projection, polar decomposition, QR factorizationand Wen-Yin update schemes (they will be mentioned in 2.1), but also § leads to the discovery of a new scheme with low complexity cost. Secondly, under the new update scheme, we look for a suitable descent feasible curve along which the objective function can achieve a certain decrease by taking a suitable stepsize. Then the original problem can be treatedas anunconstrainedoptimizationproblem. We considerto combine the efficientBB method [7] with our new update scheme. To ensure the global convergence, we adopt the adaptive nonmonotone line search [11], leading to an adaptive feasible Barzilai-Borwein-like(AFBB) method for problem (1.1). Note that certain feasible BB-like method with a different nonmonotone line search was first studied in [52], but the convergence issue was not discussed there. We prove the global convergence of the AFBB method in the numerical sense under some mild assumptions. Although our update scheme is also a retraction, the convergence of retraction- based line search methods in [3] cannot be applied to our methods. To the best of our knowledge, this is the firstglobalconvergenceresultforfeasiblemethodswithnonmonotonelinesearchforoptimizationontheStiefel manifold. Furthermore, our convergence analysis can also be extended to feasible BB-like methods based on monotone or some other nonmonotone Armijo-type line search techniques. Thirdly, we extend the proposed update scheme and algorithm to deal with optimization with multiple generalized orthogonality constraints: min (X ,...,X ), s.t. X H X =K , ..., X H X =K , (1.2) X1∈Cn1×p1,...,Xq∈Cnq×pq F 1 q 1∗ 1 1 1 q∗ q q q where H1 Rn1×n1,...,Hq Rnq×nq are symmetric positive semidefinite matrices, and K1 Rp1×p1, ..., ∈ ∈ ∈ Kq Rpq×pq are symmetric positive definite matrices. Note that problem (1.1) is a special case of problem ∈ (1.2). Finally, to demonstrate the efficiency of the proposed method, we apply the new method to a variety of problems. For the nearest low-rank correlation matrix problem, our new method performs remarkably well in terms of speed and solution quality. We also modify the new method to deal with the extra fixed constraints for the nearest low-rank correlationmatrix problem through the augmented Lagrangianfunction. The prelimi- nary numerical results show the potential of the new method to handle some more general constraints beyond AFrameworkofConstraintPreservingUpdateSchemes forOptimizationonStiefelManifold 3 the sphere constraints. For the Kohn-Sham total energy minimization problem arising in electronic structure calculations, the new method is considerably competitive with the widely used SCF iteration. The rest of this paper is organized as follows. In 2, we review the existing update schemes and give the § first-order optimality condition of problem (1.1). In 3, we introduce our framework of constraint preserving § updateschemesandproposeanewupdate scheme.Somepropertiesofthe newupdateschemeandcomparisons with existing update schemes are also stated in this section. We present the AFBB method in 4.1, establish § its global convergence in 4.2 and then discuss how to deal with the cumulative feasibility error during the § iterations in 4.3. The AFBB method is also extended to more general problems in 5. Some numerical tests § § on an extensive collection of problems are presented to demonstrate the efficiency of our AFBB method in 6. § Finally, conclusions are drawn in the last section. Notation: For matrix M Rn n, we define sym(M)=(M+MT)/2.The maximal andminimal eigenvalues × ∈ of M are denoted by λ (M) and λ (M), respectively. We use diag(M) to stand for the vector formed max min by the diagonal entries of M. Meanwhile, we use Diag(θ ,...,θ ) to represent the diagonal matrix whose 1 n diagonal entries are θ ,...,θ . The set of n-by-n symmetric matrices is denoted by n. For S n, if S is 1 n S ∈ S positive semidefinite (positive definite), we mark S 0 (S 0). The Euclidean inner product of two matrices A,B Rm n is defined as A,B = tr(ATB), whe(cid:23)re tr()≻is the trace operator. We denote by A the i-th × (i) ∈ h i · column of A. The decomposition A = qr(A)upp(A) is the unique QR factorization with qr(A) Rn p being × ∈ a matrix with orthonormal columns and upp(A) Rp p an upper triangular matrix with positive diagonal × ∈ entries. The condition number of A is defined as cond(A) = λmax(ATA) 1/2. Denote by e the i-th unit vector λmin(ATA) i of an appropriate size. For X , we define P = I (cid:16)1XXT. Th(cid:17)e gradient of with respect to X is ∈ Sn,p X n − 2 F G:= (X)= ∂F(X) , whereas the gradient in the tangent space is denoted by . DF ∂Xij ∇F (cid:16) (cid:17) 2 Preliminaries 2.1 Overview of existing update schemes GivenanytangentdirectionD Rn psatisfyingXTD+DTX =0withX ,webrieflyreviewthegeodesic- × n,p ∈ ∈S likeandprojection-likeupdate schemes.Notethattheparameterτ 0inthe followingupdateschemesissome ≥ stepsize. Geodesic-like update schemes. In 1998, Edelman et al. [18] proposed a computable geodesic update scheme, in which the iterations lie in the curve defined by T T X D R I Ygeo1(τ;X)= X, Q exp τ − R −0 0p , (2.1) (cid:18) (cid:20) (cid:21)(cid:19)(cid:20) (cid:21) (cid:2) (cid:3) T T where QR = (I XX )D is the unique QR factorization of (I XX )D. This strategy requires com- n n − − − − puting the exponential of a 2p-by-2p matrix and the QR factorization of an n-by-p matrix at each iteration. Consequently, the flops will be high when p n/2. Another geodesic approach is proposed by Abrudan et al. ≥ [1].Givenann-by-nskew-symmetricmatrixA ,theyconsideredthecurveY (τ)=exp( τA )X.Comparing 1 geo2 1 − with (2.1), this formula can efficiently deal with the case when p n/2. Nevertheless, it still requires great ≥ effortstocomputetheexponentialofann-by-nmatrix.Toavoidcomputingexponentialsofmatrices,Nishimori and Akaho [35] proposed a kind of quasi-geodesic approach. Given an n-by-n skew-symmetric matrix A , a 2 special case of their update schemes is the Cayley transformation scheme τ 1 τ Y (τ;X)= I A − I + A X. (2.2) qgc n 2 n 2 − 2 2 (cid:16) (cid:17) (cid:16) (cid:17) The computation cost for (2.2) is O(n3). This is considerably high even for small p. In 2010, by some clever derivations, Wen and Yin [52] developed a simple and efficient constraint preserving update scheme, known as 4 BoJiang,Yu-HongDai theCrank-Nicholson-likeupdatescheme.ThisschemeisequivalenttotheCayleytransformationupdatescheme. Their update scheme is described as follows: τ 1 Y (τ;X)=X τU I + VTU − VTX, (2.3) wy 2p − 2 (cid:16) (cid:17) whereU =[P D, X],V =[X, P D] Rn 2p.Forconvenience,wecallitWen-Yinupdateschemethroughout X X × − ∈ this paper. Formula (2.3) has the lowestcomputation complexity per iterationamong the existing geodesic-like approacheswhenp<n/2.However,whenp n/2,the costisstillexpensive.Todealwiththis case,alow-rank ≥ update scheme is explored in [52]. They were also the first to combine the feasible update scheme with the nonmonotone curvilinear search for optimization with orthogonality constraints. Projection-like update schemes. In spite ofthe nonconvexityofthe Stiefel manifold,it is possible to preserve the constraintbythe projection.TheprojectionofarankpmatrixC Rn p onto is definedasthe unique × n,p ∈ S solution of T min X C , s.t. X X =I , (2.4) F p X Rn×p k − k ∈ where is the Frobenius norm. For any symmetric positive definite matrix B Rp p, denote by B1/2 the F × uniqueks·qkuare root of B. It is easy to see that the solution of (2.4) is (C) =∈C(CTC) 1/2. Then we can PSn,p − extend the gradient projection method for optimization with convex constraints for solving (1.1), yielding the update scheme Y (τ;X)= (X τG). (2.5) gp PSn,p − Infact, the famous powermethod[23] forthe extremeeigenvalueproblemofsymmetric matrixis a specialcase of this gradient projection update scheme. Manton [34] considered another different projection scheme Y (τ;X)= (X τD). mp PSn,p − Absil et al. [3] proposed the polar decomposition Y (τ;X)=(X τD)(I +τ2DTD) 1/2. (2.6) pd p − − The polar decomposition is equivalent to Manton’s projection update scheme, but has lower complexity cost1. It is then mainly considered in this paper. It is worth noting that the QR factorization update scheme Y (τ;X)=qr(X τD) (2.7) qr − proposed in [3] can be regarded as an approximation to the polar decomposition update scheme. 2.2 First-order optimality condition To begin with, we introduce some basic concepts related to the Stiefel manifold as in [18]. For any X , n,p define the tangent space at X as := ∆ : XT∆+∆TX = 0 . There are two different metrics on ∈.SThe X X T { } T first one is the Euclidean metric d (∆,∆) = ∆,∆ , which is independent of the point X. The second one is e h i the canonical metric d (∆,∆) = ∆,P ∆ , which is related to X. Then the projection of any Z Rn p onto c X × h i T ∈ the tangent space under the Euclidean or canonical metric is (Z)=Z Xsym(X Z). X The gradient T of a differential function (X) : Rn pPT R on th−e Stiefel manifold is defined such X × ∇F ∈ T F → that G,∆ =d ( ,∆) ,P ∆ (2.8) c X h i ∇F ≡h∇F i T for all tangent vectors ∆ at X. Solving (2.8) for such that X is skew-symmetric yields ∇F ∇F T =G XG X. ∇F − T Notice that is notthe projectionofG ontothe tangentspaceatX.The latter shouldbe G Xsym(X G). ∇F − We now give the first-order optimality condition without proof. It is analogous to Lemma 2.1 in [52]. 1 In[34],MantonusedtheSVDdecompositionofX−τDtoobtaintheprojection,andthecostishigherthanthatofthepolar decomposition. AFrameworkofConstraintPreservingUpdateSchemes forOptimizationonStiefelManifold 5 Lemma 2.1 (First-order optimality condition) Suppose X is a local minimizer of problem (1.1). Then X satisfies the first-order optimality condition T (X,Λ)=G XG X =0; X D L − T i.e., =0, with the associated Lagrange multiplier Λ=G X. Besides, =0 if and only if ∇F ∇F T T G X 2ρG X +(1 2ρ)X G =0, for any ρ>0. − − (cid:0) (cid:1) 3 Constraint Preserving update schemes For a feasible point X , denote = XR : R Rp p and = W Rn p : XTW = 0 to be the n,p X × X × ∈ S R T { ∈ } N { ∈ } range space of X and the null space of X , respectively. It is well known that the two spaces are orthogonal to each other and their sum forms the whole space Rn p. As a result, any point in Rn p can uniquely be × × decomposedinto the sumof twopoints,whichbelong to the twospaces,respectively.With this observation,we introduce our idea for a framework of constraint preserving update schemes for problem (1.1). Given a matrix W in the null space , consider the following curve X N Y(τ;X)=XR(τ)+WN(τ), (3.1) where R(τ),N(τ) Rp p and τ 0 is some parameter. In other words, this curve can be divided into two × ∈ ≥ T parts; i.e., XR(τ) in the range space of X and WN(τ) in the null space of X . Our goal is to determine T appropriate R(τ) and N(τ) such that the curve Y(τ;X) is still feasible; i.e., Y Y =I (notice that if there is p no confusion, we will write Y(τ;X) as Y(τ) or even simply Y, etc.). If we assume that R(τ) = R(τ)Z(τ) and N(τ)=N(τ)Z(τ), where Z(τ) is always invertible, the curve (3.1) can be expressed by e e Y(τ;X)= XR(τ)+WN(τ) Z(τ). (3.2) (cid:16) (cid:17) The condition that YTY =I requires e e p T T T T Z(τ) R(τ) R(τ)+N(τ) W WN(τ) Z(τ)=I , p (cid:16) (cid:17) or, equivalently, e e e e Z(τ)−TZ(τ)−1 =R(τ)TR(τ)+N(τ)TWTWN(τ). (3.3) Once W, R(τ) andN(τ) have beendetermined in someway, Z(τ) canbe calculatedby (3.3).Then R(τ), N(τ) e e e e and hence the whole curve Y(τ;X) will be given. Now wee focus onethe choices of W, R(τ) and N(τ). As Y(τ;X) with τ 0 is a curve starting from the ≥ current iteration X, it is natural to impose that Y(0;X)=X. To meet this condition, by (3.2), we may ask e e R(0)=I , N(0)=0, Z(0)=I . (3.4) p p With these choices, we further have bye(3.2) that e Y (0;X)=X R(0)+Z (0) +WN (0). (3.5) ′ ′ ′ ′ (cid:16) (cid:17) Assume that some matrix E is chosen, which is inteended for Y (0;Xe); i.e., E = Y (0;X). Then (3.5) holds ′ ′ − − if T T W = (I XX )E, R(0)+Z (0)= X E, N (0)=I . (3.6) n ′ ′ ′ p − − − In the following, we consider three approaches of choosing R(τ), N(τ) and Z(τ) such that (3.3) holds. e e e e 6 BoJiang,Yu-HongDai Approach I.Considerthe simple case thatZ(τ) I .This,with the secondequationin (3.6),indicates that p ≡ R(τ) T R(0)= X E. Denoting (τ)= , we know from (3.4) and (3.6) that ′ − A "WN(τ)# e e T ′(0)=e −X E B Ip and (0)= Ip , A W F 0 A 0 (cid:20) (cid:21)(cid:20) (cid:21) (cid:20) (cid:21) where B Rn p, F Rp p. Solving the above ordinary differential equations, we get that × × ∈ ∈ T X E B I (τ)=exp τ − p . (3.7) A W F 0 (cid:18) (cid:20) (cid:21)(cid:19)(cid:20) (cid:21) Since Z(τ) I , we know from (3.3) and the definition of (τ) that p ≡ A Y(τ)= X, I (τ), p A and (τ)T (τ)=I .Itfollowsthatthe matrixinsid(cid:2)e the br(cid:3)acketsof(3.7)willbe skew-symmetric.This means p A A T T T T thatB = W ,F+F =0andX E+E X =0(i.e., E ). LetW =QRbe the unique QRfactorization X of W and−assume that F takes the form F =QFQT, where∈FT Rp p is any skew-symmetric matrix. Then we × ∈ can write e e T T I 0 X E R I 0 I Y(τ)= X, I exp τ p − − p p p 0 Q R F 0 QT 0 (cid:18) (cid:20) (cid:21)(cid:20) (cid:21)(cid:20) (cid:21)(cid:19)(cid:20) (cid:21) (cid:2) (cid:3) T T X E R I = X, Q exp τ − − ep , (3.8) R F 0 (cid:18) (cid:20) (cid:21)(cid:19)(cid:20) (cid:21) (cid:2) (cid:3) where the secondequalityis due thatQ has orthonormalcolumns.The update scheme (3.8)canbe regardedas e a generalized geodesic update scheme, since letting F =0, (3.8) reduces to the geodesic update scheme (2.1). In the next two approaches,to satisfy the conditions N(0)=0 and N (0)=I , we choose ′ p e N(τ)=τI . (3.9) e p e Noticing that ∂Z(∂ττ)−1 =−Z(τ)−1Z′(τ)Z(τ)−1 andeusing (3.3) and (3.9), we can get that T T Z′(0) +Z′(0)= R′(0) +R′(0) . − (cid:16) (cid:17) T This, with the second condition in (3.6), means that X Eemust beea skew-symmetric matrix; i.e., E . X ∈T Approach II. To meet (3.3) and R(0)=I , we consider to choose p R(τ)=I +τR(0). (3.10) e p ′ Since N(τ)=τI , we can get Z(τ) from (3.3) by the polar decomposition or the Cholesky factorization. p e e If by the polar decomposition, Z(τ) and Z (τ) are always symmetric. In this case, it is easy to show that ′ Y(τ;Xe)= XR(τ)+WN(τ) , which with (3.6), (3.9) and (3.10) further implies that PSn,p (cid:0) (cid:1)Y(τ;X)= X τE τXZ (0) , (3.11) e e PSn,p − − ′ whereZ (0)isanyp-by-psymmetricmatrix.IfZ (0)(cid:0)=0,(3.11)reducest(cid:1)othe ordinarypolardecompositionor ′ ′ T T Manton’s projectionupdate scheme.If Z (0)=sym(X G) and E =G Xsym(X G), it becomes the ordinary ′ − gradient projection update scheme. If by the Cholesky factorization, Z(τ) and Z (τ) are always upper triangular. Similarly, we can derive ′ Y(τ;X)=qr X τE τXZ (0) , (3.12) ′ − − where Z (0) is any p-by-p upper triangular matrix.(cid:0)When Z (0)=0, (3.(cid:1)12) reduces to the ordinary QR factor- ′ ′ ization update scheme. AFrameworkofConstraintPreservingUpdateSchemes forOptimizationonStiefelManifold 7 Approach III. In this approach, we regard R(τ) as some function of Z(τ). To solve (3.3) easily, we may assume R(τe)=2Ip Z(τ)−1. (3.13) − Substituting (3.9) and (3.13) into (3.3) leads to e τ2 Z(τ) T+Z(τ) 1 =2I + WTW. − − p 2 Consequently, Z(τ) must be of the form τ2 1 Z(τ)= I + WTW +L(τ) − , (3.14) p 4 (cid:16) (cid:17) where L(τ) is any p-by-p skew-symmetric matrix with L(0) = 0. Notice that the above inverse always exists since L(τ) is skew-symmetric. The relations (3.13) and (3.14) indicate that R(0) = Z (0) = L(0). Further, ′ ′ ′ 1 − T by the second relation in (3.6), we must have that L(0)= X E. Thus we can choose ′ 2 e T L(τ)=g(τ)X E, where g(τ) is any function satisfying g(0) = 0 and g (0) = 1/2. For simplicity, we choose g(τ) = τ/2. See 4.2 ′ § and 6.4 for more choices of g(τ). § To sum up, given any matrix E , we can define the following update scheme X ∈T T W = (I XX )E, n − − J(τ)=Ip+ τ42WTW + τ2XTE, (3.15) Y(τ;X)=(2X +τW)J(τ)−1 X. − Some geometrical meanings of the above scheme will be discussed in the paragraphbefore Lemma 3.3. A few remarks on the framework and the direction E are made here. Firstly, it follows from (3.8), (3.11) and (3.12) that the framework can unify the famous geodesic, gradient projection, polar decomposition and QR factorization update schemes. Meanwhile, it can yield generalized geodesic, polar decomposition or QR factorization update scheme by choosing different F or Z (0). We will mainly consider the new update scheme ′ (3.15) in the remainder of this paper. Secondly, like unconstrained optimization, many possible choices for E, forinstance,the gradientdescent,conjugategradienet,orquasi-Newtondirection,canbe usedunder the update scheme (3.15). However, we focus on the gradient descent direction in 3.1 due to its simplicity. § 3.1 Choice of E In this subsection, we consider to seek an appropriate E such that the update scheme Y(τ;X) given by (3.15) defines a descent curve. Lemma 3.1 For any feasible point X , consider the curve given by (3.15), where E =D and n,p ρ ∈S T T T D =(I (1 2ρ)XX ) =G X 2ρG X +(1 2ρ)X G , ρ>0. (3.16) ρ n − − ∇F − − Then the following properties hold: (cid:0) (cid:1) T (i) Y(τ) Y(τ)=I ; p (ii) Y (0)= D is a descent direction and ′ ρ − ∂ (Y(τ)) Fτ′(Y(0)):= F∂τ ≤−min{ρ,1}k∇Fk2F; (cid:12)τ=0 (cid:12) (iii) for any τ >0 and any p-by-p orthogonal matrix Q ,(cid:12)we have YQ =X; p(cid:12) p 6 (iv) cond(J) (5+υ2)/4, where υ =τ D . ρ F ≤ k k 8 BoJiang,Yu-HongDai In particular, if p = n, Y(τ) = X(2J 1 I ) and J = I + τXTE; if p = 1, the matrices X, Y, G reduce to − − n n 2 the vectors x, y, g, respectively, and the update scheme becomes T 2+τx g τ y(τ)= 1 x g. 1+ τ2(gTg (xTg)2) − − 1+ τ2(gTg (xTg)2) (cid:18) 4 − (cid:19) 4 − T Proof The fact that X D is skew-symmetric implies that J is invertible, thus Y(τ) is well-defined. (i) follows ρ from the construction of the update scheme (3.15). Meanwhile, we know that Y (0)= D , which with the chain rule shows that ′ ρ − Fτ′(Y(0))=−hG,Dρi=−h∇F,PXDρi, where the second equality is due to D and the definition (2.8) of . Recall that P = I 1XXT. ρ ∈ TX ∇F X n − 2 Substituting (3.16) into the above equation yields Fτ′(Y(0))=−h∇F,(In+(ρ−1)XXT)∇Fi≤−min{ρ,1}k∇Fk2F. So (ii) is true. T We prove (iii) by contradiction. Multiplying X from both sides the expression of Y(τ) in (3.15) and using XTW = 0, we get that 2J 1 I = XTY. Assume that there exists a p-by-p orthogonal matrix Q such that − p p Y =XQ .Thenwehave2J 1− I =Q ;i.e.,2I J =Q J.Itfollowsfrom(2I J)T(2I J)=(Q J)TQ J p − p p p p p p p p T − − − − and Q Q =I that p p p T J +J =4I , p which is a contradiction to the definition of J. The contradiction shows the truth of (iii). For (iv), it is obvious that J 1+ τ2 W 2 + τ XTD , where is the spectral norm. It follows from D =W XXTD andkXkT2W≤=0 th4akt DkF 2 2=k W 2ρk+F XTD k2.·kN2otice that υ =τ D . Then we ρ ρ ρ F F ρ F ρ F − − k k k k k k k k have υ2 W 2 υ XTD υ2 υ J 1+ k kF + k ρkF 1+ max (1 t2)+ t (5+υ2)/4. k k2 ≤ 4 · kDρk2F 2 · kDρkF ≤ 0≤t≤1(cid:18) 4 − 2 (cid:19)≤ On the other hand, the relation 2J 1 =XTY +I indicates that J 1 1. Thus (iv) is true. − p − 2 k k ≤ In the case that p=1 or n, it is not difficult to simplify the correspondingupdate schemes and we omit the details here. Thus, we complete the proof. ⊓⊔ Before we proceed, several remarks on Lemma 3.1 are in order. Firstly, without otherwise specification, we will always choose E = D in the remainder of this paper. In this case, by (3.15) and the definition of D , we ρ ρ must have that T T W = (I XX )D (I XX )G. (3.17) n ρ n − − ≡− − There are two special choices of ρ. One is ρ = 1/2, yielding Y (0) = ; the other one is ρ = 1/4 yielding ′ T −∇F T Y (0) = G+Xsym(X G). Generally, the two directions are not the same except the case when X G = ′ T − G X. We will compare the two directions in 6.4; the results therein shows that ρ = 1/4 is a better choice. § Secondly, recalling that the Grassmann manifold is defined as the set of all p-dimensional subspaces of an n,p G n-dimensionalspace,anytwoorthogonalmatriceswhosecolumnsspanthe samep-dimensionalsubspacecanbe regarded as the same point in . By (iii) of Lemma 3.1, we know that any Y(τ) with τ >0 and X must be n,p G differentpoints in .Thus ourupdate scheme canbe also usedfor optimizationonthe Grassmannmanifold, n,p G that is, problem (1.1) with the additional homogeneity assumption that (XQ ) = (X), where Q is any p p F F p-by-porthogonalmatrix.Finally,statement(iv)ofLemma3.1indicatesthattheconditionnumberofJ canbe controlled by the term τ Dρ F, this fact will guide us to set a safeguard for the stepsize, as addressed in (4.4). k k As mentionedbefore, one typicalchoiceof E is E =D .Consider the casewhen p n, it is quite expensive ρ ≈ to calculate J 1 directly. However, in this case we may construct a low-rank matrix E so that J 1 can be − − cheaply obtained. The following lemma shows the possibility of choosing a rank-2 matrix E, with which J 1 − can be analytically givenandfast computed. Similarly,we may also forma rank-2r matrix E with 1 r <n/2 ≤ inthesamevein.Nevertheless,itisworthnotingthatseekinganappropriatelow-rankmatrixE facesatrade-off between the computational cost and the quality of the search curve. AFrameworkofConstraintPreservingUpdateSchemes forOptimizationonStiefelManifold 9 Lemma 3.2 For any feasible X , define D(i) = G eT X GT X. Consider the curve Y(τ) given by ∈ Sn,p (i) i − (i) (i) (3.15) with E =D(q), where q is the column index given by q :=argmax G,D(i) =argmax eT GT e . (3.18) h i i ∇F i i=1,...,p i=1,...,p (cid:0) (cid:1) Then we have that T 1 α 1 e J−1 =Ip− 1+α eq, b (cid:20)1 −1 (cid:21)"bTq #, (3.19) (cid:2) (cid:3) where α = τ2GT I X XT G , b = τXT G and X = X X eT. Moreover, Y(τ) is a descent 4 (q) n− (q) (q) (q) 2 −q (q) −q − (q) q curve satisfying (cid:16) (cid:17) 1 (Y(0)) 2. Fτ′ ≤−2pk∇FkF Proof The relation G,D(i) =eT GT e canbe easily verifiedfrom the definition of D(i). With the defini- tion of D(q) and XThX =I i, we hiave t∇haFt i p (cid:0) (cid:1) 2 XTD(q) =XTG eT e GT X = beT e bT . (q) q − q (q) τ q − q ∈TX (cid:0) (cid:1) Thus Y(τ) is well-defined. Moreover,we have that W = (I XXT)D(q) = (I XXT)G eT − n− − n− (q) q Hence, the matrix J =I + τ2WTW + τXTD(q) can be expressed as p 4 2 T J =Ip+ξeqeTq +beTq −eqbT =Ip+ eq, b (cid:20)1ξ −01(cid:21)"ebTq #!, (3.20) (cid:2) (cid:3) where ξ = τ2GT (I XXT)G . By the formulations of b and ξ, we easily see that 4 (q) n− (q) T T e b=0, ξ =α b b. q − Applying the SMW formula to (3.20) and using the above relations, we can obtain (3.19). Moreover,notice that = p D(i). Thus we have ∇F i=1 P 1 p (Y(0))= p G,D(q) G, 2, Fτ′ − h i≤−h ∇Fi≤−2k∇FkF where the first inequality follows from the choice of q in (3.18) and the second one is due to (ii) of Lemma 3.1. The proof is completed. ⊓⊔ Geometrically, the proposed framework is not geodesic expect for the case when p = 1. Instead, it can be T T regardedasageneralizedgradientprojectionscheme.InthespecialcasewhentherealwaysholdsX G G X, T ≡ T wecanexplicitlygettheprojectionoperatorofnewupdatescheme(3.15).NoticethattheconditionX G G X ≡ holdsforawiderangeofproblems,suchasthelineareigenvalueproblemandthevectorcase(p=1)ofproblem (1.1). T T Lemma 3.3 Assume that X G G X for any feasible point X . Then the update scheme (3.15) with n,p ≡ ∈ S E =D can be expressed by ρ τ2 T T T Y(τ)= X I +τX G G (I XX )G τG . PSn,p p − 4 n− − (cid:18) (cid:16) (cid:17) (cid:19) 10 BoJiang,Yu-HongDai Proof The condition XTG GTX implies that XTD = 0, which with (3.15) means that J = I + τ2WTW. ≡ ρ p 4 Rewrite Y(τ) in (3.15) as Y(τ)= X(2I J)+τW J 1. p − − Thus it follows from Y(τ)TY(τ)=I and the a(cid:0)bove expressions of(cid:1)Y(τ) and J that p T X(2I J)+τW X(2I J)+τW =J2. p p − − (cid:0) (cid:1) (cid:0) (cid:1) Recalling the definition of () for (2.4), we have that PSn,p · Y(τ)= X(2I J)+τW PSn,p p− (cid:0) T (cid:1)τ2 T T = X I +τX G G (I XX )G τG , PSn,p p − 4 n− − (cid:18) (cid:16) (cid:17) (cid:19) where the second equality uses (3.17). This completes the proof. ⊓⊔ 3.2 Comparison with the existing update schemes To begin with, we address a relationship between the update scheme (3.15) and the Wen-Yin update scheme. We show that the Wen-Yin update scheme can be regarded as a special member of the update scheme (3.15) with g(τ) = τ/2 under the condition that I + τXTD is invertible. Notice that this invertible condition is p 4 indispensable for the well definition of the Wen-Yin update scheme, but is not necessary for the validity of the update scheme (3.15). T T Lemma 3.4 Assume that T = T11 T12 , where T22 and T11−T12T2−21T21 are invertible. Then T is invertible (cid:20) 21 22(cid:21) and T−1 ="−T2−2(1TT1211(−T1T112−TT2−2112TT22−12)1−T121)−1 T2−21T−21((TT1111−−TT1122TT22−−2211TT2211))−−11TT1122TT22−−2211+T2−21#. Proposition 3.1 For any feasible point X , if the tangent direction D is such that I + τXTD ∈ Sn,p ∈ TX p 4 is invertible, the update scheme (2.3) is well-defined and it is equivalent to the update scheme (3.15) with g(τ)=τ/2 and E =D. Proof First, we show that the update scheme (2.3) is well-defined, provided that I + τXTD is invertible. p 4 Consider the update scheme (2.3) with U =[P D, X] and V =[X, P D]. It follows from P =I 1XXT that XTP D = 1XTD. Combining this with XXTX =I and XTD b−einXg skew-symmetric, weXcan renw−rit2e X 2 p τ I + τXTD τI I + VTU = p 4 2 p . (3.21) 2p 2 "−τ2DTPXTPXD Ip+ τ4XTD# Moreover,we derive that T T T T T 1 T 2 W W =D (I XX )D =D P P D+ X D , n− X X 4 (cid:0) (cid:1) which with (3.15) implies that τ 2 τ2 T T T J = I + X D + D P P D. p 4 4 X X (cid:16) (cid:17) Plugging this into (3.21) yields I + τXTD τI I + τVTU := T11 T12 = p 4 2 p , (3.22) 2p 2 (cid:20)T21 T22(cid:21) "τ2 Ip+ τ4XTD 2−J Ip+ τ4XTD# (cid:16)(cid:0) (cid:1) (cid:17)

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.