Decomposable Norm Minimization with Proximal-Gradient Homotopy Algorithm

Reza Eghbali · Maryam Fazel

Abstract  We study the convergence rate of the proximal-gradient homotopy algorithm applied to norm-regularized linear least squares problems, for a general class of norms. The homotopy algorithm reduces the regularization parameter in a series of steps, and uses a proximal-gradient algorithm to solve the problem at each step. The proximal-gradient algorithm has a linear rate of convergence given that the objective function is strongly convex and the gradient of the smooth component of the objective function is Lipschitz continuous. In many applications, the objective function in this type of problem is not strongly convex, especially when the problem is high-dimensional and regularizers are chosen that induce sparsity or low-dimensionality. We show that if the linear sampling matrix satisfies certain assumptions and the regularizing norm is decomposable, the proximal-gradient homotopy algorithm converges with a linear rate even though the objective function is not strongly convex. Our result generalizes results on the linear convergence of the homotopy algorithm for ℓ1-regularized least squares problems. Numerical experiments are presented that support the theoretical convergence rate analysis.

Keywords  Proximal-Gradient · Homotopy · Decomposable norm

This material is based upon work supported by the National Science Foundation under Grant No. ECCS-0847077, and in part by the Office of Naval Research under Grant No. N00014-12-1-1002.

Maryam Fazel
Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA
E-mail: [email protected]

Reza Eghbali
Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA
E-mail: [email protected]

1 Introduction

In signal processing and statistical regression, problems arise in which the goal is to recover a structured model from a few, often noisy, linear measurements. Well studied examples include recovery of sparse vectors and low rank matrices. These problems can be formulated as non-convex optimization programs, which are computationally intractable in general. One can relax these non-convex problems using appropriate convex penalty functions, for example ℓ1, ℓ1,2 and nuclear norms in sparse vector, group sparse and low rank matrix recovery problems. These relaxations perform very well in many practical applications. Following [10,6,8], there has been a flurry of publications that formalize the conditions for recovery of sparse vectors, e.g., [2,42], and low rank matrices, e.g., [34,4,12], from linear measurements by solving the appropriate relaxed convex optimization problems. Alongside results for sparse vector and low rank matrix recovery, several authors have proposed more general frameworks for structured model recovery problems with linear measurements [5,9,27].

In many problems of interest, to recover the model from noisy linear measurements, one can formulate the following optimization program:

    \text{minimize} \quad \|x\| \qquad \text{subject to} \quad \|Ax - b\|_2^2 \le \epsilon^2,        (1)

where b ∈ R^m is the measurement vector, A ∈ R^{m×n} is the linear measurement matrix, ε^2 is the noise energy, and ||·|| is a norm on R^n that promotes the desired structure in the solution. The regularized version of problem (1) has the following form:

    \text{minimize} \quad \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|,                                (2)

where λ > 0 is the regularization parameter. There has been extensive work on algorithms for solving problems (1) and (2) in the special cases of ℓ1 and nuclear norms.
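As a concrete illustration of problem (2) in the ℓ1 case, the following minimal sketch builds a small sparse-recovery instance and evaluates the regularized objective. It is an illustrative assumption, not code from the paper; the names A, b, x0 and lam are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 40, 100, 5                                   # fewer measurements than unknowns (m < n)
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    x0 = np.zeros(n)
    x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse structured model
    b = A @ x0 + 0.01 * rng.standard_normal(m)                     # noisy linear measurements

    def phi(x, lam):
        """Regularized objective of problem (2) with the l1 norm as the regularizer."""
        return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)

    print(phi(np.zeros(n), lam=0.1), phi(x0, lam=0.1))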
First order methods have been the methods of choice for large scale problems, since each iteration is computationally cheap. Of particular interest is the proximal-gradient method for minimization of composite functions, which are functions that can be written as the sum of a differentiable convex function and a closed convex function. The proximal-gradient method can be utilized for solving the regularized problem (2). When the smooth component of the objective function has a Lipschitz continuous gradient, the proximal-gradient algorithm has a convergence rate of O(1/t), where t is the iteration number. For the accelerated version of the proximal-gradient algorithm, the convergence rate improves to O(1/t^2). When the objective function is strongly convex as well, proximal-gradient has linear convergence, i.e., O(κ^t) with κ ∈ (0,1) [29]. However, in instances of problem (2) that are of interest, the number of samples is less than the dimension of the space, hence the matrix A has a non-zero null space, which results in an objective function that is not strongly convex. Several algorithms that combine homotopy continuation over λ with proximal-gradient steps have been proposed in the literature for problem (2) in the special cases of ℓ1 and nuclear norms [13,44,43,22,41]. Xiao and Zhang [45] have studied an algorithm with homotopy with respect to λ for solving the ℓ1-regularized least squares problem. Formulating their algorithm based on Nesterov's proximal-gradient method, they have demonstrated that this algorithm has an overall linear rate of convergence whenever A satisfies the restricted isometry property (RIP) and the final value of the regularization parameter λ is greater than a problem-dependent lower bound.

1.1 Our result

We generalize the linear convergence rate analysis of the homotopy algorithm studied in [45] to problem (2) when the regularizing norm is decomposable, where decomposability is a condition introduced in [5]. In particular, ℓ1, ℓ1,2 and nuclear norms satisfy this condition. We derive properties for this class of norms that are used directly in the convergence analysis. These properties can independently be of interest. Among these properties is the sublinearity of the function K : R^n → {0,1,...,n}, where K is a generalization of the notion of cardinality for decomposable norms.

The linear convergence result holds under an assumption on the RIP constants of A, which in turn holds with high probability for several classes of random matrices when the number of measurements m is large enough (orderwise the same as that required for recovery of the structured model).

1.2 Algorithms for structured model recovery

There has been extensive work on algorithms for solving problems (1) and (2) in the special cases of ℓ1 and nuclear norms. For a detailed review of first order methods we refer the reader to [30] and the references therein. In [45], the authors have reviewed sparse recovery and ℓ1 norm minimization algorithms that are related to the homotopy algorithm for the ℓ1 norm. Here we discuss related algorithms, mostly focusing on algorithms for other norms, including the nuclear norm.

The proximal-gradient method for ℓ1/nuclear norm minimization has a local linear convergence in a neighborhood of the optimal value [14,46,21]. The proximal operator for the nuclear norm is the soft-thresholding operator on singular values. Several authors have proposed algorithms for the low rank matrix recovery and matrix completion problems based on soft- or hard-thresholding operators; see, e.g., [15,3,23,22].
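The singular-value soft-thresholding operator mentioned above can be sketched as follows. This is a minimal illustration assuming the standard trace inner product, not the implementation used by any of the cited algorithms.

    import numpy as np

    def prox_nuclear(Y, tau):
        """Proximal operator of tau*||.||_*: soft-threshold the singular values of Y."""
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s_shrunk = np.maximum(s - tau, 0.0)          # soft-thresholding of the spectrum
        return U @ np.diag(s_shrunk) @ Vt

    # usage: shrinking the spectrum of a random matrix reduces its rank
    Y = np.random.default_rng(1).standard_normal((8, 6))
    print(np.linalg.matrix_rank(prox_nuclear(Y, tau=1.0)))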
The singular value projection (SVP) algorithm proposed by Jain et al. has a linear rate; however, to apply the hard-thresholding operator, one should know the rank of x_0. While the authors have introduced a heuristic for estimating the rank when it is not known a priori, their convergence results rely upon a known rank [15]. SVP is the generalization of the iterative hard thresholding algorithm (IHT) for sparse vector recovery. SVP and IHT belong to the family of greedy algorithms, which do not solve a convex relaxation problem. Other greedy algorithms proposed for sparse recovery, such as Compressive Sampling Matching Pursuit (CoSaMP) [26] and Fully Corrective Forward Greedy Selection (FCFGS) [39], have also been generalized for recovery of general structured models including low-rank matrices, and extended to more general loss functions [32,38].

For huge-scale problems with a separable regularizing norm such as ℓ1 and ℓ1,2, coordinate descent methods can reduce the computational cost of each iteration significantly. The convergence rate of the randomized proximal coordinate descent method in expectation is orderwise the same as full proximal gradient descent; however, it can yield an improvement in terms of the dependence of the convergence rate on n [28,35,20]. To the best of our knowledge, a linear convergence rate for any coordinate descent method applied to problem (1) or (2) has not been shown in the literature.

Continuation over λ for solving the regularized problem has been utilized in the fixed point continuation algorithm (FPC) proposed by Ma et al. [22] and the accelerated proximal-gradient algorithm with line search (APGL) by Toh et al. [41]. FPC and APGL both solve a series of regularized problems where in each outer iteration λ is reduced by a factor less than one; the former uses soft-thresholding and the latter uses accelerated proximal-gradient for solving each regularized problem.

Agarwal et al. [1] have proposed algorithms for solving problems (1) and (2) with an extra constraint of the form ||x|| ≤ ρ. They have introduced the assumption of decomposability of the norm and given a convergence analysis for norms that satisfy that assumption. They establish a linear rate of convergence for their algorithms up to a neighborhood of the optimal solutions. However, their algorithm uses the bound ρ, which should be selected based on the norm of the true solution. In many problems this quantity is not known beforehand. Jin et al. [16] have proposed an algorithm for ℓ1-regularized least squares that receives ρ as a parameter and has a linear rate of convergence. Their algorithm utilizes the proximal gradient method but, unlike the homotopy algorithm, reduces λ at each step.

By using the SDP formulation of the nuclear norm, interior point methods can be utilized to solve problems (1) and (2). Interior point methods do not scale as well as first order methods for large scale problems (for example, for a general SDP solver, when the dimension exceeds a few hundred). However, specialized SDP solvers for nuclear norm minimization can bring down the computational complexity of each iteration to O(n^3) [18].

2 Preliminaries

Let A ∈ R^{m×n}. We equip R^n with an inner product given by ⟨x,y⟩ = x^T B y for some positive definite matrix B. We equip R^m with the ordinary dot product ⟨v,u⟩ = v^T u. We denote the adjoint of A by A^* = B^{-1} A^T. Note that for all x ∈ R^n and u ∈ R^m,

    \langle Ax, u \rangle = \langle x, A^* u \rangle.                                                (3)

We use ||·||_2 to denote the norms induced by the inner products on R^n and R^m, that is:

    \forall x \in R^n : \|x\|_2 = \sqrt{x^T B x},
    \forall v \in R^m : \|v\|_2 = \sqrt{v^T v}.
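A quick numerical check of the adjoint identity (3) under a B-weighted inner product; the dimensions and the way B is generated are assumptions made only for this sketch.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 5, 7
    A = rng.standard_normal((m, n))
    C = rng.standard_normal((n, n))
    B = C @ C.T + n * np.eye(n)          # positive definite matrix defining <x,y> = x^T B y on R^n
    A_adj = np.linalg.solve(B, A.T)      # A* = B^{-1} A^T

    x, u = rng.standard_normal(n), rng.standard_normal(m)
    lhs = (A @ x) @ u                    # <Ax, u> with the ordinary dot product on R^m
    rhs = x @ (B @ (A_adj @ u))          # <x, A*u> with the B-inner product on R^n
    print(np.isclose(lhs, rhs))          # True, verifying (3)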
We use ||·|| and ||·||^* to denote a regularizing norm and its dual on R^n. The latter is defined as:

    \|y\|^* = \sup \{ \langle y, x \rangle \mid \|x\| \le 1 \}.

Given a convex function f : R^n → R, ∂f(x) denotes the set of subgradients of f at x, i.e., the set of all z ∈ R^n such that

    \forall y \in R^n : f(y) \ge f(x) + \langle z, y - x \rangle.

When f is differentiable, ∂f(x) = {∇f(x)}. Note that ξ ∈ ∂||x|| if and only if

    \langle \xi, x \rangle = \|x\|,                                                                  (4)
    \|\xi\|^* \le 1.                                                                                 (5)

We say f is strongly convex with strong convexity parameter μ_f when f(x) − (μ_f/2)||x||_2^2 is convex. For a differentiable function this implies that for all x, y ∈ R^n:

    f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_f}{2} \|x - y\|_2^2.              (6)

We call the gradient of a differentiable function Lipschitz continuous with Lipschitz constant L_f when for all x, y ∈ R^n:

    \|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|y - x\|_2.                                             (7)

For a convex function f, gradient Lipschitz continuity is equivalent to the following inequality [see [31], Lemma 1.2.3 and Theorem 2.1.5]:

    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L_f}{2} \|x - y\|_2^2,                (8)

for all x, y ∈ R^n.

3 Properties of the regularizing norm and A

In this section we introduce our assumptions on the regularizing norm ||·||, and derive the properties of the norm based on these assumptions. The homotopy algorithm of [45] for the ℓ1-regularized problem is designed so that the iterates maintain low cardinality throughout the algorithm; therefore one can use the restricted eigenvalue property of A when A acts on these iterates. Said another way, the squared loss term behaves like a strongly convex function over the algorithm iterates, which is why the algorithm can achieve a fast convergence rate. In the proof, [45] uses the structure of the subdifferential of the ℓ1 norm,

    \partial \|x\|_1 = \{ \mathrm{sgn}(x) + v \mid v_i = 0 \text{ when } x_i \ne 0, \ \|v\|_\infty \le 1 \},

as well as the following properties that hold for the cardinality function:

    \|x\|_1^2 \le \mathrm{card}(x) \, \|x\|_2^2,
    \mathrm{card}(x + y) \le \mathrm{card}(x) + \mathrm{card}(y) \quad \text{(sublinearity)}.

We first give our assumption on the structure of the subdifferential of a class of norms (which includes the ℓ1 and nuclear norms but is much more general), and then derive the rest of the properties needed for generalizing the results of [45].

Before stating our assumptions, we add some more definitions to our toolbox. Let S^{n−1} = {x ∈ R^n | ||x||_2 = 1}, and let G_{||·||} be the set of extreme points of the norm ball B_{||·||} := {x | ||x|| ≤ 1}. We impose two conditions on the regularizing norm.

Condition 1  For any x ∈ G_{||·||}, ||x||_2 = 1, i.e., all the extreme points of the norm ball have unit ℓ2-norm.

The second condition on the norm is the decomposability condition introduced in [5], which was inspired by the assumption introduced in [27].

Condition 2 (Decomposability)  For all x ∈ R^n, there exist a subspace T_x and a vector e_x ∈ T_x such that

    \partial \|x\| = \{ e_x + v \mid v \in T_x^\perp, \ \|v\|^* \le 1 \}.                            (9)

Note that x ∈ T_x for all x ∈ R^n, because if x ∉ T_x, then x = y + z with y ∈ T_x and z ∈ T_x^⊥ − {0}. Let z' = z/||z||^*. Since e_x + z' ∈ ∂||x||, we get ||x|| = ⟨e_x + z', y + z⟩ = ||x|| + ||z||_2^2/||z||^*, which is a contradiction.

The decomposability condition has been used in both [5] and [27] to give a simpler and unified proof for recovery of several structures such as sparse vectors and low-rank matrices.
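For the unweighted ℓ1 norm with B = I (an assumption made here only for illustration), the decomposition (9) can be checked numerically: e_x is the sign pattern on the support and T_x is the set of vectors sharing that support. A minimal sketch, not from the paper:

    import numpy as np

    def l1_decomposition(x):
        """Return e_x and a projector onto T_x for the l1 norm (B = I)."""
        support = x != 0
        e_x = np.sign(x)                                 # sign(x_i) on the support, 0 elsewhere
        proj_T = lambda y: np.where(support, y, 0.0)     # orthogonal projection onto T_x
        return e_x, proj_T

    x = np.array([1.5, 0.0, -2.0, 0.0])
    e_x, proj_T = l1_decomposition(x)
    print(np.isclose(np.dot(e_x, x), np.abs(x).sum()))   # property (4): <e_x, x> = ||x||_1
    print(np.abs(e_x).max() <= 1)                        # property (5): ||e_x||_inf <= 1
    print(np.allclose(proj_T(x), x))                     # x lies in T_x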
When attempting to extend this algorithm to general norms, several challenges arise. First, what is the appropriate generalization of cardinality for other structures and their corresponding norms? Essentially, we would need to count the number of nonzero coefficients in an appropriate representation and ensure there is a small number of nonzero coefficients in our iterates, to be able to apply a similar proof idea as in [45].

The next theorem captures one of our main results for any decomposable norm. This theorem provides a new set of conditions that is based on the geometry of the norm ball, and we show that they are equivalent to decomposability on R^n. As a result, one can find a decomposition for any vector in R^n in terms of an orthogonal subset of G_{||·||}.

Theorem 1 (Orthogonal representation)  Suppose G_{||·||} ⊂ S^{n−1}. Then ||·|| is decomposable if and only if for any x ∈ R^n − {0} and a_1 ∈ argmax_{a ∈ G_{||·||}} ⟨a, x⟩ there exist a_2, ..., a_k ∈ G_{||·||} such that {a_1, a_2, ..., a_k} is an orthogonal set that satisfies the following conditions:

I  There exist {γ_i > 0 | i = 1, ..., k} such that:

    x = \sum_{i=1}^{k} \gamma_i a_i, \qquad \|x\| = \sum_{i=1}^{k} \gamma_i.                        (10)

II  For any set {η_i | |η_i| ≤ 1, i = 1, ..., k}:

    \Big\| \sum_{i=1}^{k} \eta_i a_i \Big\|^* \le 1.                                                (11)

Moreover, if {a_1, a_2, ..., a_k} ⊂ G_{||·||} satisfy I and II, then e_x = \sum_{i=1}^{k} a_i.

The proof of Theorem 1 is presented in Appendix B.

We will see in Section 5 that we need an orthogonal representation for all vectors to be able to bound the number of nonzero coefficients throughout the algorithm. First, we define a quantity K(x) that bounds the ratio of the norm ||·|| to the Euclidean norm, and plays the same role in our analysis as cardinality played in [45]. Then we show that K(x) is a sublinear function, that is, K(x+y) ≤ K(x) + K(y) for all x, y. This is a key property that is needed in the convergence analysis. Define K : R^n → {0, 1, 2, ..., n},

    K(x) = \|e_x\|_2^2.

Note that for every x ∈ R^n,

    \|x\|^2 = \langle e_x, x \rangle^2 \le \|e_x\|_2^2 \, \|x\|_2^2 = K(x) \, \|x\|_2^2.             (12)

Here, the first equality follows from (4), and the inequality follows from the Cauchy-Schwarz inequality. In the analysis of the homotopy algorithm we utilize (12) alongside the structure of the subgradient given by (9). The ℓ1, ℓ1,2, and nuclear norms are three important examples that satisfy Conditions 1 and 2. Here we briefly discuss each one of these norms.

– Nuclear norm on R^{d1×d2} is defined as

    \|X\|_* = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X),

where σ_i(X) is the ith largest singular value of X given by the singular value decomposition X = \sum_{i=1}^{\min\{d_1,d_2\}} \sigma_i(X) u_i v_i^T. With the trace inner product ⟨X,Y⟩ = trace(X^T Y), the nuclear norm satisfies Conditions 1 and 2. In this case, K(X) = rank(X), γ_i = σ_i(X) and a_i = u_i v_i^T for i ∈ {1, 2, ..., rank(X)}. The subspace T_X is given by:

    T_X = \Big\{ \sum_{i=1}^{\mathrm{rank}(X)} \big(u_i z_i^T + z'_i v_i^T\big) \ \Big|\ z_i \in R^{d_2}, \ z'_i \in R^{d_1}, \text{ for all } i \Big\},

while e_X = \sum_{i=1}^{\mathrm{rank}(X)} u_i v_i^T.

– Weighted ℓ1 norm on R^n is defined as:

    \|x\|_1 = \sum_{i=1}^{n} w_i |x_i|,

where w is a vector of positive weights. With ⟨x,y⟩ = \sum_{i=1}^{n} w_i^2 x_i y_i, the ℓ1 norm satisfies Conditions 1 and 2. For the ℓ1 norm, K(x) = |{i | x_i ≠ 0}| and {γ_1, γ_2, ..., γ_k} = {w_i |x_i| : |x_i| > 0, i = 1, ..., n}. T_x is the support of x, which is defined as:

    T_x = \{ y \in R^n \mid y_i = 0 \text{ if } x_i = 0 \},

while the ith element of e_x is sign(x_i) w_i.

– ℓ1,2 norm on R^{d1×d2}: For a given inner product ⟨·,·⟩ : R^{d1} × R^{d1} → R and its induced norm ||·||_2 on R^{d1}, define

    \|X\|_{1,2} = \sum_{i=1}^{d_2} \|X_i\|_2,

where X_i denotes the ith column of X.
With the inner product ⟨X,Y⟩ = \sum_{i=1}^{d_2} ⟨X_i, Y_i⟩, the ℓ1,2 norm satisfies Conditions 1 and 2. For this norm, K(X) = |{i | X_i ≠ 0}| and {γ_1, γ_2, ..., γ_k} = {||X_i||_2 : ||X_i||_2 > 0, i = 1, ..., d_2}. T_X is the column support of X, which is defined as:

    T_X = \{ [Y_1, Y_2, ..., Y_{d_2}] \in R^{d_1 \times d_2} \mid Y_i = 0 \text{ if } X_i = 0 \},

while the ith column of e_X is equal to 0 if X_i = 0 and is equal to X_i/||X_i||_2 otherwise.

Our second result on the properties of decomposable norms is captured in the next theorem, which establishes sublinearity of K for decomposable norms.

Theorem 2  For all x, y ∈ R^n,

    K(x + y) \le K(x) + K(y).                                                                        (13)

Theorem 2 for the ℓ1, ℓ1,2 and nuclear norms is equivalent to sublinearity of the cardinality of vectors, the number of non-zero columns, and the rank of matrices, respectively. The proof of this theorem is included in Appendix B.

3.1 Properties of A

The Restricted Isometry Property was first discussed in [6] for sparse vectors. The generalization of that concept to low rank matrices was introduced in [34]. Note that if K(x) ≤ k, then ||x|| ≤ √k ||x||_2. Based on this observation we define restricted isometry constants of A ∈ R^{m×n} as follows.

Definition 1  The upper (lower) restricted isometry constant ρ_+(A,k) (ρ_−(A,k)) of a matrix A ∈ R^{m×n} is the smallest (largest) positive constant that satisfies the inequality

    \rho_-(A,k) \, \|x\|_2^2 \le \|Ax\|_2^2 \le \rho_+(A,k) \, \|x\|_2^2

whenever ||x||^2 ≤ k ||x||_2^2.

Proposition 1  Let A ∈ R^{m×n} and f(x) = (1/2)||Ax − b||_2^2. Suppose that ρ_+(A,k) and ρ_−(A,k) are the restricted isometry constants corresponding to A. Then:

    f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \rho_-(A,k) \|x - y\|_2^2,     (14)
    f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \rho_+(A,k) \|x - y\|_2^2,     (15)

for all x, y ∈ R^n such that ||x − y||^2 ≤ k ||x − y||_2^2.

Proposition 1 follows from the definition of the restricted isometry constants and the following equality:

    \tfrac{1}{2} \|A(x - y)\|_2^2 = f(y) - f(x) - \langle \nabla f(x), y - x \rangle.

4 Proximal-gradient method and homotopy algorithm

We state the proximal-gradient method and the homotopy algorithm for the following optimization problem:

    \text{minimize} \quad \phi_\lambda(x) = f(x) + \lambda \|x\|,

where f(x) = (1/2)||Ax − b||_2^2. While, for simplicity, we analyze the homotopy algorithm for the least squares loss function, the analysis can be extended to every function of the form f(x) = g(Ax), where g is a differentiable strongly convex function with Lipschitz continuous gradient. The key element in the proximal-gradient method is the proximal operator, which was developed by Moreau [25] and later extended to maximal monotone operators by Rockafellar [36]. Nesterov has proposed several variants of proximal-gradient methods [29]. In this section, we discuss the gradient method with adaptive line search. For any x, y ∈ R^n and positive L, we define:

    m_{\lambda,L}(y,x) = f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2} \|x - y\|_2^2 + \lambda \|x\|,
    \mathrm{Prox}_{\lambda,L}(y) = \operatorname*{argmin}_{x \in R^n} m_{\lambda,L}(y,x),
    \omega_\lambda(x) = \min_{\xi \in \partial \|x\|} \|\lambda \xi + \nabla f(x)\|^*.
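For the ℓ1 norm with B = I (an assumption made here only for illustration), Prox_{λ,L}(y) has the closed form of a soft-thresholded gradient step, and ω_λ(x) can be evaluated coordinate-wise by picking the minimizing subgradient. A minimal sketch, not the authors' implementation:

    import numpy as np

    def prox_grad_step(y, A, b, lam, L):
        """Prox_{lam,L}(y) for the l1 norm: soft-threshold a gradient step of length 1/L."""
        grad = A.T @ (A @ y - b)
        z = y - grad / L
        return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

    def omega(x, A, b, lam):
        """omega_lam(x) = min over subgradients xi of ||lam*xi + grad f(x)||_inf (dual of l1)."""
        g = A.T @ (A @ x - b)
        r = np.where(x != 0, lam * np.sign(x) + g,       # on the support: xi_i = sign(x_i)
                     np.clip(-g, -lam, lam) + g)         # off the support: best xi_i in [-1, 1]
        return np.abs(r).max()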
Xiao and Zhang [45] have considered the proximal-gradient homotopy algorithm for the ℓ1 norm. Here we state it for general norms. Algorithm 1 introduces the homotopy algorithm and contains the proximal-gradient method as a subroutine. The stopping criterion in the proximal-gradient method is based on the quantity

    \big\| M_t (x^{(t-1)} - x^{(t)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \big\|^*,

which is an upper bound on ω_λ(x^{(t)}). This follows from the fact that since x^{(t)} = argmin_{x ∈ R^n} m_{λ,M_t}(x^{(t−1)}, x), there exists ξ ∈ ∂||x^{(t)}|| such that ∇f(x^{(t−1)}) + λξ + M_t(x^{(t)} − x^{(t−1)}) = 0. Therefore,

    \omega_\lambda(x^{(t)}) \le \|\lambda \xi + \nabla f(x^{(t)})\|^*
                            = \|\lambda \xi + \nabla f(x^{(t-1)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)})\|^*
                            \le \big\| M_t (x^{(t-1)} - x^{(t)}) + \nabla f(x^{(t)}) - \nabla f(x^{(t-1)}) \big\|^*.      (16)

The homotopy algorithm reduces the value of λ in a series of steps, and in each step applies the proximal-gradient method. At step t, λ_t = λ_0 η^t and ε_t = δ'λ_t with η ∈ (0,1) and δ' ∈ (0,1). In the proximal-gradient method and the backtracking subroutine, the parameters γ_dec ≥ 1 and γ_inc > 1 should be initialized. Since the function f satisfies inequality (8), it is clear that L_min should be chosen less than L_f.

Algorithm 1 Homotopy
  Input: λ_tgt > 0, ε > 0
  Parameters: η ∈ (0,1), δ' ∈ (0,1), L_min > 0
  y^(0) ← 0, λ_0 ← ||A^* b||^*, M ← L_min, N ← ⌊log(λ_tgt/λ_0)/log(η)⌋
  for t = 0, 1, ..., N−1 do
    λ_{t+1} ← η λ_t
    ε_t ← δ' λ_t
    [y^(t+1), M] ← ProxGrad(φ_{λ_{t+1}}, y^(t), M, L_min, ε_t)
  end for
  [y, M] ← ProxGrad(φ_{λ_tgt}, y^(N), M, L_min, ε)

Subroutine 1  [x, M] = ProxGrad(φ_λ, x^(0), L_0, L_min, ε')
  Parameter: γ_dec ≥ 1
  t ← 0
  repeat
    [x^(t+1), M_{t+1}] ← Backtrack(φ_λ, x^(t), L_t)
    L_{t+1} ← max{L_min, M_{t+1}/γ_dec}
    t ← t + 1
  until ||M_t(x^(t−1) − x^(t)) + ∇f(x^(t)) − ∇f(x^(t−1))||^* ≤ ε'
  x ← x^(t), M ← M_t

Subroutine 2  [y, M] = Backtrack(φ_λ, x, L)
  Parameter: γ_inc > 1
  while φ_λ(Prox_{λ,L}(x)) > m_{λ,L}(x, Prox_{λ,L}(x)) do
    L ← γ_inc L
  end while
  y ← Prox_{λ,L}(x), M ← L

Theorem 5 in [29] states that the proximal-gradient method has a linear rate of convergence when f satisfies (6) and (8). In Proposition 2 we restate that theorem with minimal assumptions, namely that f satisfies (6) and (8) on a restricted set. The proof of this proposition is given in Appendix B.

Proposition 2  Let x^* ∈ argmin φ_λ. If for every t:

    f(x^{(t)}) \ge f(x^*) + \langle \nabla f(x^*), x^{(t)} - x^* \rangle + \frac{\mu_f}{2} \|x^{(t)} - x^*\|_2^2,                   (17)
    f(x^{(t+1)}) \ge f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)} \rangle + \frac{\mu_f}{2} \|x^{(t)} - x^{(t+1)}\|_2^2,   (18)
    f(x^{(t+1)}) \le f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)} \rangle + \frac{L_f}{2} \|x^{(t)} - x^{(t+1)}\|_2^2,     (19)

then

    \phi_\lambda(x^{(t)}) - \phi_\lambda(x^*) \le \Big(1 - \frac{\mu_f}{4 L_f \gamma_{inc}}\Big)^t \big(\phi_\lambda(x^{(0)}) - \phi_\lambda(x^*)\big).   (20)

In addition, if

    \|\nabla f(x^{(t)}) - \nabla f(x^{(t+1)})\|^* \le L'_f \|x^{(t)} - x^{(t+1)}\|_2                                               (21)

and

    \|x^{(t)} - x^{(t+1)}\|^* \le \theta \|x^{(t)} - x^{(t+1)}\|_2                                                                  (22)

for some constants θ and L'_f, then

    \omega_\lambda(x^{(t+1)}) \le \big\| M_{t+1}(x^{(t)} - x^{(t+1)}) + \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) \big\|^*
                              \le \theta \Big(1 + \frac{L'_f}{\mu_f}\Big) \sqrt{2 \gamma_{inc} L_f \big(\phi_\lambda(x^{(t)}) - \phi_\lambda(x^*)\big)}.   (23)
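Putting the pieces together, the following is a compact sketch of Algorithm 1 and its subroutines for the ℓ1 case with B = I. It is an illustration under those assumptions, not the authors' code; all parameter defaults (η, δ', γ_inc, γ_dec, L_min) are hypothetical choices.

    import numpy as np

    def backtrack(x, A, b, lam, L, gamma_inc=2.0):
        """Subroutine 2: increase L until the quadratic model m_{lam,L} majorizes f at the prox point."""
        g = A.T @ (A @ x - b)                                    # gradient of f at x
        fx = 0.5 * np.linalg.norm(A @ x - b) ** 2
        while True:
            z = x - g / L
            y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # Prox_{lam,L}(x) for the l1 norm
            if 0.5 * np.linalg.norm(A @ y - b) ** 2 <= fx + g @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2:
                return y, L
            L *= gamma_inc

    def prox_grad(x, A, b, lam, L, L_min, tol, gamma_dec=2.0, max_iter=1000):
        """Subroutine 1: proximal-gradient iterations until the bound (16) on omega_lam drops below tol."""
        for _ in range(max_iter):
            x_new, M = backtrack(x, A, b, lam, L)
            bound = np.abs(M * (x - x_new) + A.T @ (A @ (x_new - x))).max()   # left-hand side of (16)
            x, L = x_new, max(L_min, M / gamma_dec)
            if bound <= tol:
                break
        return x, L

    def homotopy(A, b, lam_tgt, eps, eta=0.7, delta_p=0.2, L_min=1e-3):
        """Algorithm 1: decrease lambda geometrically, warm-starting each subproblem at the previous solution."""
        y = np.zeros(A.shape[1])
        lam, M = np.abs(A.T @ b).max(), L_min        # lambda_0 = ||A^T b||_inf for the l1 norm with B = I
        while lam * eta > lam_tgt:
            lam *= eta
            y, M = prox_grad(y, A, b, lam, M, L_min, tol=delta_p * lam)
        return prox_grad(y, A, b, lam_tgt, M, L_min, tol=eps)[0]

Warm-starting each subproblem at the previous solution is what keeps the iterates low-dimensional, which is the property the analysis in Section 5 exploits.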
5 Convergence result

First note that, since the objective function is not strongly convex, if one applies the sublinear convergence rate of the proximal gradient method, the iteration complexity of the homotopy algorithm is O(1/ε + \sum_{t=1}^{N} 1/(δ'λ_t)), which can be simplified to O(1/ε + 1/(δ'(1−η)λ_tgt)). As was stated in the introduction, we use the structure of this problem to provide a linear rate of convergence when assumptions similar to those needed to derive recovery bounds hold.

Suppose b = Ax_0 + z for some x_0 ∈ R^n and z ∈ R^m. Here, z is the noise vector that is added to linear measurements from a structured model x_0. Also, we define k_0 := K(x_0) and the constant c:

    c := \max_{x \in T_{x_0} - \{0\}} \frac{\|x\|^2}{k_0 \|x\|_2^2}.

Note that c = 1 for the ℓ1 and ℓ1,2 norms, and c ≤ 2 for the nuclear norm. This follows from the fact that K(x) ≤ k_0 when x ∈ T_{x_0} for the ℓ1 and ℓ1,2 norms, while K(x) ≤ 2k_0 when x ∈ T_{x_0} in the case of the nuclear norm. Throughout this section, we assume the regularizing norm satisfies Conditions 1 and 2 introduced in Section 3. Before we state the convergence theorem, we introduce an assumption:

Assumption 1  λ_tgt is such that ||A^*z||^* ≤ λ_tgt/4. Furthermore, there exist constants r > 1 and δ ∈ (0, 1/4] such that:

    \frac{\rho_-\big(A, c k_0 (1+\gamma)^2\big)}{\rho_+\big(A, 72 r c k_0 (1+\gamma) \gamma_{inc}\big)} > \frac{c}{r},              (24)
    \rho_-\big(A, 72 r c k_0 (1+\gamma) \gamma_{inc}\big) > 0,                                                                       (25)

where

    \gamma := \frac{\lambda_{tgt}(1+\delta) + \|A^*z\|^*}{\lambda_{tgt}(1-\delta) - \|A^*z\|^*}.                                     (26)

We define k̃ = 36 r c k_0 (1+γ) γ_inc. In Appendix A, we provide an upper bound on the number of measurements needed for (24) to be satisfied with high probability whenever the rows of A are sub-Gaussian random vectors.

The next theorem establishes the linear convergence of the proximal gradient method when ω_λ(x^{(0)}) = min_{ξ ∈ ∂||x^{(0)}||} ||∇f(x^{(0)}) + λξ||^* is sufficiently small, while Theorem 4 establishes the overall linear rate of convergence of the homotopy algorithm.

Theorem 3  Let x^{(t)} denote the tth iterate of ProxGrad(φ_λ, x^{(0)}, L_0, L_min, ε'), and let x^* ∈ argmin φ_λ(x). Suppose Assumption 1 holds true for some r and δ, L_min ≤ γ_inc ρ_+(A, 2k̃), and λ ≥ λ_tgt. If x^{(0)} satisfies

    K(x^{(0)}) \le \tilde{k}, \qquad \omega_\lambda(x^{(0)}) \le \delta \lambda,

then:

    K(x^{(t)}) \le \tilde{k},                                                                                                        (27)
    \phi_\lambda(x^{(t)}) - \phi_\lambda(x^*) \le \Big(1 - \frac{1}{4\gamma_{inc}\kappa}\Big)^t \big(\phi_\lambda(x^{(0)}) - \phi_\lambda(x^*)\big),   (28)

and

    \omega_\lambda(x^{(t)}) \le \Bigg(1 + \frac{\sqrt{\rho_+(A,1)\,\rho_+(A,2\tilde{k})}}{\rho_-(A,2\tilde{k})}\Bigg) \sqrt{2\gamma_{inc}\,\rho_+(A,2\tilde{k}) \big(\phi_\lambda(x^{(t-1)}) - \phi_\lambda(x^*)\big)},   (29)

where κ = ρ_+(A,2k̃)/ρ_−(A,2k̃).

Theorem 4  Let y^{(t)} denote the tth iterate of the Homotopy algorithm, and let y^* ∈ argmin φ_{λ_tgt}(y). Suppose Assumption 1 holds true for some r and δ, L_min ≤ γ_inc ρ_+(A, 2k̃), and λ_0 ≥ λ_tgt. Furthermore, suppose that δ' and η in the algorithm satisfy:

    \frac{1 + \delta'}{1 + \delta} \le \eta.                                                                                         (30)

When t = 0, 1, ..., N−1, the number of proximal-gradient iterations for computing y^{(t)} is bounded by

    \frac{\log(C/\delta^2)}{-\log\big(1 - \frac{1}{4\gamma_{inc}\kappa}\big)}.                                                       (31)

The number of proximal-gradient iterations for computing y is bounded by

    \frac{\log(C\lambda_{tgt}/\epsilon^2)}{-\log\big(1 - \frac{1}{4\gamma_{inc}\kappa}\big)},                                        (32)

where

    C := \frac{6\gamma_{inc}\,\kappa\,\delta\, c\, k_0 (1+\gamma) \Big(\sqrt{\rho_+(A,2\tilde{k})} + \sqrt{\rho_+(A,1)}\,\kappa\Big)^2}{\rho_-\big(A, c(1+\gamma)^2 k_0\big)}

and κ = ρ_+(A,2k̃)/ρ_−(A,2k̃). The objective gap of the output y is bounded by

    \phi_{\lambda_{tgt}}(y) - \phi_{\lambda_{tgt}}(y^*) \le \frac{9 c k_0 \lambda_{tgt} (1+\gamma)\,\epsilon}{\rho_-\big(A, c(1+\gamma)^2 k_0\big)},

while the total number of iterations for computing y is bounded by:

    \frac{\log(C\lambda_{tgt}/\epsilon^2) + \big(\log(\lambda_{tgt}/\lambda_0)/\log(\eta)\big)\,\log(C/\delta^2)}{-\log\big(1 - \frac{1}{4\gamma_{inc}\kappa}\big)}.
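As a worked illustration of the arithmetic behind the iteration bounds (31), (32) and the total bound above, the following sketch plugs in assumed values of C, κ, γ_inc, δ and the λ's; all of these constants are hypothetical and chosen only to show how the bounds combine.

    import numpy as np

    # hypothetical constants, for illustration only
    C, kappa, gamma_inc = 10.0, 5.0, 2.0
    delta, eta = 1.0 / 12, 0.7
    lam_0, lam_tgt, eps = 4.0, 0.1, 1e-3

    rate = -np.log(1.0 - 1.0 / (4.0 * gamma_inc * kappa))     # per-iteration decrease factor in (28)
    N = int(np.floor(np.log(lam_tgt / lam_0) / np.log(eta)))  # number of homotopy stages
    per_stage = np.log(C / delta ** 2) / rate                 # bound (31) for each intermediate stage
    last_stage = np.log(C * lam_tgt / eps ** 2) / rate        # bound (32) for the final stage
    print(N, per_stage, N * per_stage + last_stage)           # total iteration bound of Theorem 4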
5.1 Parameter selection satisfying the assumptions

Four parameters, L_min, λ_tgt, δ' and η, should be set in the homotopy algorithm. The assumption on L_min is only for convenience: if L_min > γ_inc ρ_+(A, 2k̃), one can replace γ_inc ρ_+(A, 2k̃) with L_min in the analysis. Assumption 1 requires λ_tgt ≥ 4||A^*z||^*. This assumption on the regularization parameter is a standard assumption that is used in the literature to provide optimal bounds for the recovery error [4,7,27]. The lower bound on λ_tgt ensures γ ≤ (5+4δ)/(3−4δ). If we choose δ and η, we can set δ' = (1+δ)η − 1 to ensure that it satisfies (30). The parameter δ is directly related to the satisfiability of (24) in Assumption 1. For example, if δ = 1/12, then γ ≤ 2 and Assumption 1 is satisfied with r = 2c if:

    \frac{\rho_-(A, 9 c k_0)}{\rho_+(A, 432 c^2 k_0 \gamma_{inc})} > \frac{1}{2},
    \rho_-(A, 432 c^2 k_0 \gamma_{inc}) > 0.

Theoretically, the optimal choice of δ maximizes κ subject to the existence of an r > 1 that satisfies (24) and (25). In Appendix A, we provide an upper bound on the number of measurements needed for (24) and (25) to be satisfied with high probability for given δ and r > 1 whenever the rows of A are sub-Gaussian random vectors. The parameter η should be chosen to be greater than 1/2 for (30) to be satisfied.

5.2 Convergence proof

The main part of the proof of Theorems 3 and 4 is establishing the fact that K(x^{(t)}) ≤ k̃. Given that K(x^{(t)}) ≤ k̃ for all t, Proposition 1 ensures that the hypotheses of Proposition 2, i.e., strong convexity and gradient Lipschitz continuity over a restricted set, are satisfied. We adopt the same strategy as in [45] and prove that K(x^{(t)}) ≤ k̃ in a series of three lemmas. We state the lemmas here, while their proofs are given in Appendix B. Lemma 1 states that if ω_λ(x) does not exceed a small fraction of λ, then x is close to x_0.

Lemma 1  If ω_λ(x) ≤ δλ and ρ_−(A, c(1+γ)^2 k_0) > 0, then:

    \max\Big\{ \|x - x_0\|, \ \frac{1}{\delta\lambda}\big(\phi_\lambda(x) - \phi_\lambda(x_0)\big) \Big\} \le \frac{c k_0 (1+\gamma)\big((1+\delta)\lambda + \|A^*z\|^*\big)}{\rho_-\big(A, c(1+\gamma)^2 k_0\big)}.    (33)

Note that if λ ≥ 4||A^*z||^* and δ ≤ 1/4, we can simplify the conclusion of Lemma 1 as

    \max\Big\{ \|x - x_0\|, \ \frac{1}{\delta\lambda}\big(\phi_\lambda(x) - \phi_\lambda(x_0)\big) \Big\} \le \frac{3 c k_0 \lambda (1+\gamma)}{2\rho_-\big(A, c(1+\gamma)^2 k_0\big)}.
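For completeness, this simplification follows by substituting the bounds λ ≥ 4||A^*z||^* (so ||A^*z||^* ≤ λ/4) and δ ≤ 1/4 into the right-hand side of (33):

    \frac{c k_0 (1+\gamma)\big((1+\delta)\lambda + \|A^*z\|^*\big)}{\rho_-\big(A, c(1+\gamma)^2 k_0\big)}
    \le \frac{c k_0 (1+\gamma)\big(\tfrac{5}{4}\lambda + \tfrac{1}{4}\lambda\big)}{\rho_-\big(A, c(1+\gamma)^2 k_0\big)}
    = \frac{3 c k_0 \lambda (1+\gamma)}{2\rho_-\big(A, c(1+\gamma)^2 k_0\big)}.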
