Fitting Spectral Decay with the $k$-Support Norm

Andrew M. McDonald (1), Massimiliano Pontil (1,2), Dimitris Stamos (1)

(1) Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
email: {a.mcdonald,d.stamos.12}@ucl.ac.uk
(2) Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy

arXiv:1601.00449v1 [cs.LG] 4 Jan 2016

January 5, 2016

Abstract

The spectral $k$-support norm enjoys good estimation properties in low rank matrix learning problems, empirically outperforming the trace norm. Its unit ball is the convex hull of rank $k$ matrices with unit Frobenius norm. In this paper we generalize the norm to the spectral $(k,p)$-support norm, whose additional parameter $p$ can be used to tailor the norm to the decay of the spectrum of the underlying model. We characterize the unit ball and we explicitly compute the norm. We further provide a conditional gradient method to solve regularization problems with the norm, and we derive an efficient algorithm to compute the Euclidean projection on the unit ball in the case $p = \infty$. In numerical experiments, we show that allowing $p$ to vary significantly improves performance over the spectral $k$-support norm on various matrix completion benchmarks, and better captures the spectral decay of the underlying model.

Keywords. $k$-support norm, orthogonally invariant norms, matrix completion, multitask learning, proximal point algorithms.

1 Introduction

The problem of learning a sparse vector or a low rank matrix has generated much interest in recent years. A popular approach is to use convex regularizers which encourage sparsity, and a number of these have been studied with applications including image denoising, collaborative filtering and multitask learning; see, for example, [Buehlmann and van de Geer 2011, Wainwright 2014] and references therein.

Recently, the $k$-support norm was proposed by [Argyriou et al. 2012], motivated as a tight relaxation of the set of $k$-sparse vectors of unit Euclidean norm. The authors argue that, as a regularizer for sparse vector estimation, the norm empirically outperforms the Lasso [Tibshirani 1996] and Elastic Net [Zou and Hastie 2005] penalties. Statistical bounds on the Gaussian width of the $k$-support norm have been provided by [Chatterjee et al. 2014]. The $k$-support norm has also been extended to the matrix setting. By applying the norm to the vector of singular values of a matrix, [McDonald et al. 2014] obtain the orthogonally invariant spectral $k$-support norm, reporting state of the art performance on matrix completion benchmarks.

Motivated by the performance of the $k$-support norm in sparse vector and matrix learning problems, in this paper we study a natural generalization obtained by considering the $\ell_p$-norms (for $p \in [1,\infty]$) in place of the Euclidean norm. These allow a further degree of freedom when fitting a model to the underlying data. We call the ensuing norm the $(k,p)$-support norm. As we demonstrate in numerical experiments, using $p = 2$ is not necessarily the best choice in all instances. By tuning the value of $p$ the model can incorporate prior information regarding the singular values. When prior knowledge is lacking, the parameter can be chosen by validation, hence the model can adapt to a variety of decay patterns of the singular values. An interesting property of the norm is that it interpolates between the $\ell_1$-norm (for $k = 1$) and the $\ell_p$-norm (for $k = d$). It follows that by varying both $k$ and $p$ the norm allows one to learn sparse vectors which exhibit different patterns of decay in the non-zero elements. In particular, when $p = \infty$ the norm prefers vectors which are constant.
A main goal of the paper is to study the proposed norm in matrix learning problems. The $(k,p)$-support norm is a symmetric gauge function, hence it induces the orthogonally invariant spectral $(k,p)$-support norm. This interpolates between the trace norm (for $k = 1$) and the Schatten $p$-norms (for $k = d$), and its unit ball has a simple geometric interpretation as the convex hull of matrices of rank no greater than $k$ and Schatten $p$-norm no greater than one. This suggests that the new norm favors low rank structure, and the effect of varying $p$ allows for different patterns of decay in the spectrum. In the special case $p = \infty$, the $(k,p)$-support norm is the dual of the Ky-Fan $k$-norm [Bhatia 1997] and it encourages a flat spectrum when used as a regularizer.

The main contributions of the paper are: i) we propose the $(k,p)$-support norm as an extension of the $k$-support norm and we characterize in particular the unit ball of the induced orthogonally invariant matrix norm (Section 3); ii) we show that the norm can be computed efficiently and we discuss the role of the parameter $p$ (Section 4); iii) we outline a conditional gradient method to solve the associated regularization problem for both vector and matrix problems (Section 5), and in the special case $p = \infty$ we provide an $O(d \log d)$ computation of the projection operator (Section 5.1); finally, iv) we present numerical experiments on matrix completion benchmarks which demonstrate that the proposed norm offers significant improvement over previous methods, and we discuss the effect of the parameter $p$ (Section 6). The appendix contains derivations of results which are sketched in, or are omitted from, the main body of the paper.

Notation. We use $\mathbb{N}_n$ for the set of integers from $1$ up to and including $n$. We let $\mathbb{R}^d$ be the $d$-dimensional real vector space, whose elements are denoted by lower case letters. For any vector $w \in \mathbb{R}^d$, its support is defined as $\mathrm{supp}(w) = \{i \in \mathbb{N}_d : w_i \neq 0\}$ and its cardinality as $\mathrm{card}(w) = |\mathrm{supp}(w)|$. We let $\mathbb{R}^{d \times m}$ be the space of $d \times m$ real matrices and denote the rank of a matrix $W$ by $\mathrm{rank}(W)$. We let $\sigma(W) \in \mathbb{R}^r$ be the vector formed by the singular values of $W$, where $r = \min(d,m)$, and we assume that the singular values are ordered nonincreasingly, that is, $\sigma_1(W) \geq \cdots \geq \sigma_r(W) \geq 0$. For $p \in [1,\infty)$ the $\ell_p$-norm of a vector $w \in \mathbb{R}^d$ is defined as $\|w\|_p = (\sum_{i=1}^d |w_i|^p)^{1/p}$, and $\|w\|_\infty = \max_{i=1}^d |w_i|$. Given a norm $\|\cdot\|$ on $\mathbb{R}^d$ or $\mathbb{R}^{d \times m}$, $\|\cdot\|_*$ denotes the corresponding dual norm, defined by $\|u\|_* = \sup\{\langle u, w \rangle : \|w\| \leq 1\}$. The convex hull of a subset $S$ of a vector space is denoted $\mathrm{co}(S)$.

2 Background and Previous Work

For every $k \in \mathbb{N}_d$, the $k$-support norm $\|\cdot\|_{(k)}$ is defined as the norm whose unit ball is given by

$$\mathrm{co}\bigl\{ w \in \mathbb{R}^d : \mathrm{card}(w) \leq k,\ \|w\|_2 \leq 1 \bigr\}, \qquad (2.1)$$

that is, the convex hull of the set of vectors of cardinality at most $k$ and $\ell_2$-norm no greater than one [Argyriou et al. 2012]. We readily see that for $k = 1$ and $k = d$ we recover the unit ball of the $\ell_1$-norm and the $\ell_2$-norm, respectively.

The $k$-support norm of a vector $w \in \mathbb{R}^d$ can be expressed as an infimal convolution [Rockafellar 1970, p. 34],

$$\|w\|_{(k)} = \inf_{(v_g)} \Bigl\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_2 : \sum_{g \in \mathcal{G}_k} v_g = w \Bigr\}, \qquad (2.2)$$

where $\mathcal{G}_k$ is the collection of all subsets of $\mathbb{N}_d$ containing at most $k$ elements, and the infimum is over all vectors $v_g \in \mathbb{R}^d$ such that $\mathrm{supp}(v_g) \subseteq g$, for $g \in \mathcal{G}_k$.
Equation (2.2) highlights that the $k$-support norm is a special case of the group lasso with overlap [Jacob et al. 2009], where the cardinality of the support sets is at most $k$. This expression suggests that, when used as a regularizer, the norm encourages vectors $w$ to be a sum of a limited number of vectors with small support. Due to the variational form of (2.2), computing the norm is not straightforward; however, [Argyriou et al. 2012] note that the dual norm has a simple form, namely the $\ell_2$-norm of the $k$ largest components,

$$\|u\|_{(k),*} = \sqrt{\sum_{i=1}^k (|u|_i^\downarrow)^2}, \qquad u \in \mathbb{R}^d, \qquad (2.3)$$

where $|u|^\downarrow$ is the vector obtained from $u$ by reordering its components so that they are nonincreasing in absolute value. Note also from equation (2.3) that for $k = 1$ and $k = d$ the dual norm is equal to the $\ell_\infty$-norm and the $\ell_2$-norm, respectively, which agrees with our earlier observation regarding the primal norm.

A related problem which has been studied in recent years is learning a matrix from a set of linear measurements, in which the underlying matrix is assumed to have sparse spectrum (low rank). The trace norm, the $\ell_1$-norm of the singular values of a matrix, has been shown to perform well in this setting, see e.g. [Argyriou et al. 2008, Jaggi and Sulovsky 2010]. Recall that a norm $\|\cdot\|$ on $\mathbb{R}^{d \times m}$ is called orthogonally invariant if $\|W\| = \|UWV\|$ for any orthogonal matrices $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{m \times m}$. A classical result by von Neumann establishes that a norm is orthogonally invariant if and only if it is of the form $\|W\| = g(\sigma(W))$, where $\sigma(W)$ is the vector formed by the singular values of $W$ in nonincreasing order and $g$ is a symmetric gauge function [Von Neumann 1937]. In other words, $g$ is a norm which is invariant under permutations and sign changes of the vector components, that is, $g(w) = g(Pw) = g(Jw)$, where $P$ is any permutation matrix and $J$ is diagonal with entries equal to $\pm 1$ [Horn and Johnson 1991, p. 438].

Examples of symmetric gauge functions are the $\ell_p$-norms for $p \in [1,\infty]$, and the corresponding orthogonally invariant norms are called the Schatten $p$-norms [Horn and Johnson 1991, p. 441]; in particular, these include the trace norm and the Frobenius norm for $p = 1$ and $p = 2$, respectively. Regularization with Schatten $p$-norms has been previously studied by [Argyriou et al. 2007] and a statistical analysis has been performed by [Rohde and Tsybakov 2011]. As the set $\mathcal{G}_k$ includes all subsets of size $k$, expression (2.2) for the $k$-support norm reveals that it is a symmetric gauge function. [McDonald et al. 2014] use this fact to introduce the spectral $k$-support norm for matrices, by defining $\|W\|_{(k)} = \|\sigma(W)\|_{(k)}$ for $W \in \mathbb{R}^{d \times m}$, and report state of the art performance on matrix completion benchmarks.
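As an illustration of equation (2.3), the following minimal numpy sketch (ours, not part of the original paper; the function name is chosen for exposition) evaluates the dual $k$-support norm and checks the limiting cases $k = 1$ and $k = d$:

```python
import numpy as np

def k_support_dual_norm(u, k):
    # Eq. (2.3): l2-norm of the k largest components of u in absolute value.
    topk = np.sort(np.abs(u))[::-1][:k]
    return float(np.sqrt(np.sum(topk ** 2)))

u = np.array([3.0, -1.0, 2.0])
# k = 1 recovers the l-infinity norm; k = d recovers the l2-norm.
assert np.isclose(k_support_dual_norm(u, 1), np.abs(u).max())
assert np.isclose(k_support_dual_norm(u, u.size), np.linalg.norm(u))
```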
3 The $(k,p)$-Support Norm

In this section we introduce the $(k,p)$-support norm as a natural extension of the $k$-support norm. It is obtained by applying the $\ell_p$-norm, rather than the Euclidean norm, in the infimal convolution definition of the norm.

Definition 1. Let $k \in \mathbb{N}_d$ and $p \in [1,\infty]$. The $(k,p)$-support norm of a vector $w \in \mathbb{R}^d$ is defined as

$$\|w\|_{(k,p)} = \inf_{(v_g)} \Bigl\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_p : \sum_{g \in \mathcal{G}_k} v_g = w \Bigr\}, \qquad (3.1)$$

where the infimum is over all vectors $v_g \in \mathbb{R}^d$ such that $\mathrm{supp}(v_g) \subseteq g$, for $g \in \mathcal{G}_k$.

Let us note that the norm is well defined. Indeed, positivity, homogeneity and nondegeneracy are immediate. To prove the triangle inequality, let $w, w' \in \mathbb{R}^d$. For any $\epsilon > 0$ there exist $\{v_g\}$ and $\{v'_g\}$ such that $w = \sum_g v_g$, $w' = \sum_g v'_g$, $\sum_g \|v_g\|_p \leq \|w\|_{(k,p)} + \epsilon/2$, and $\sum_g \|v'_g\|_p \leq \|w'\|_{(k,p)} + \epsilon/2$. Since $\sum_g v_g + \sum_g v'_g = w + w'$, we have

$$\|w + w'\|_{(k,p)} \leq \sum_g \|v_g\|_p + \sum_g \|v'_g\|_p \leq \|w\|_{(k,p)} + \|w'\|_{(k,p)} + \epsilon,$$

and the result follows by letting $\epsilon$ tend to zero.

Note that, since a convex set is equivalent to the convex hull of its extreme points, Definition 1 implies that the unit ball of the $(k,p)$-support norm, denoted by $C_k^p$, is given by the convex hull of the set of vectors with cardinality no greater than $k$ and $\ell_p$-norm no greater than one, that is,

$$C_k^p = \mathrm{co}\bigl\{ w \in \mathbb{R}^d : \mathrm{card}(w) \leq k,\ \|w\|_p \leq 1 \bigr\}. \qquad (3.2)$$

Definition 1 gives the norm as the solution of a variational problem. Its explicit computation is not straightforward in the general case; however, for $p = 1$ the unit ball (3.2) does not depend on $k$ and is always equal to the $\ell_1$ unit ball. Thus, the $(k,1)$-support norm is always equal to the $\ell_1$-norm, and we do not consider this case further in this section. Similarly, for $k = 1$ we recover the $\ell_1$-norm for all values of $p$. For $p = \infty$, from the definition of the dual norm it is not difficult to show that $\|\cdot\|_{(k,\infty)} = \max\{\|\cdot\|_\infty, \|\cdot\|_1 / k\}$. We return to this in Section 4, where we describe how to compute the norm for all values of $p$.

Note further that in equation (3.1), as $p$ tends to $\infty$, the $\ell_p$-norm of each $v_g$ is increasingly dominated by the largest component of $v_g$. As the variational formulation tries to identify vectors $v_g$ with small aggregate $\ell_p$-norm, this suggests that higher values of $p$ encourage each $v_g$ to tend to a vector whose $k$ entries are equal. In this manner, varying $p$ allows us to adjust the degree to which the components of the vector $w$ can be clustered into (possibly overlapping) groups of size $k$.

As in the case of the $k$-support norm, the dual $(k,p)$-support norm has a simple expression. Recall that the dual norm of a vector $u \in \mathbb{R}^d$ is defined by the optimization problem

$$\|u\|_{(k,p),*} = \max\bigl\{ \langle u, w \rangle : \|w\|_{(k,p)} = 1 \bigr\}. \qquad (3.3)$$

Proposition 2. If $p \in (1,\infty]$ then the dual $(k,p)$-support norm is given by

$$\|u\|_{(k,p),*} = \Bigl( \sum_{i \in I_k} |u_i|^q \Bigr)^{\frac{1}{q}}, \qquad u \in \mathbb{R}^d,$$

where $q = p/(p-1)$ and $I_k \subset \mathbb{N}_d$ is the set of indices of the $k$ largest components of $u$ in absolute value. Furthermore, if $p \in (1,\infty)$ and $u \in \mathbb{R}^d \setminus \{0\}$ then the maximum in (3.3) is attained for

$$w_i = \begin{cases} \mathrm{sign}(u_i) \Bigl( \dfrac{|u_i|}{\|u\|_{(k,p),*}} \Bigr)^{\frac{1}{p-1}} & \text{if } i \in I_k, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.4)$$

If $p = \infty$ the maximum is attained for

$$w_i = \begin{cases} \mathrm{sign}(u_i) & \text{if } i \in I_k,\ u_i \neq 0, \\ \lambda \in [-1,1] & \text{if } i \in I_k,\ u_i = 0, \\ 0 & \text{otherwise.} \end{cases}$$

Note that for $p = 2$ we recover the dual of the $k$-support norm in (2.3).
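To make Proposition 2 concrete, here is a minimal numpy sketch (ours; the function names are illustrative) of the dual norm and the maximizer (3.4) for $p \in (1,\infty)$, together with a check that the maximum in (3.3) is attained:

```python
import numpy as np

def kp_dual_norm(u, k, p):
    # Proposition 2: l_q-norm, q = p/(p-1), of the k largest |u_i|.
    q = p / (p - 1.0)
    topk = np.sort(np.abs(u))[::-1][:k]
    return float(np.sum(topk ** q) ** (1.0 / q))

def kp_dual_maximizer(u, k, p):
    # Eq. (3.4): the w with ||w||_{(k,p)} = 1 attaining <u, w> = ||u||_{(k,p),*};
    # assumes u != 0, as in Proposition 2.
    w = np.zeros_like(u, dtype=float)
    Ik = np.argsort(-np.abs(u))[:k]            # indices of the k largest |u_i|
    scale = kp_dual_norm(u, k, p)
    w[Ik] = np.sign(u[Ik]) * (np.abs(u[Ik]) / scale) ** (1.0 / (p - 1.0))
    return w

u = np.array([3.0, -1.0, 2.0, 0.5])
k, p = 2, 3.0
w = kp_dual_maximizer(u, k, p)
assert np.isclose(u @ w, kp_dual_norm(u, k, p))   # the maximum in (3.3) is attained
```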
3.1 The Spectral $(k,p)$-Support Norm

From Definition 1 it is clear that the $(k,p)$-support norm is a symmetric gauge function. This follows since $\mathcal{G}_k$ contains all groups of cardinality $k$ and the $\ell_p$-norms only involve absolute values of the components. Hence we can define the spectral $(k,p)$-support norm as

$$\|W\|_{(k,p)} = \|\sigma(W)\|_{(k,p)}, \qquad W \in \mathbb{R}^{d \times m}.$$

Since the dual of any orthogonally invariant norm is given by $\|\cdot\|_* = \|\sigma(\cdot)\|_*$, see e.g. [Lewis 1995], we conclude that the dual spectral $(k,p)$-support norm is given by

$$\|Z\|_{(k,p),*} = \|\sigma(Z)\|_{(k,p),*}, \qquad Z \in \mathbb{R}^{d \times m}.$$

The next result characterizes the unit ball of the spectral $(k,p)$-support norm. Due to the relationship between an orthogonally invariant norm and its corresponding symmetric gauge function, we see that the cardinality constraint for vectors generalizes in a natural manner to the rank operator for matrices.

Proposition 3. The unit ball of the spectral $(k,p)$-support norm is the convex hull of the set of matrices of rank at most $k$ and Schatten $p$-norm no greater than one.

In particular, if $p = \infty$, the dual vector norm is given, for $u \in \mathbb{R}^d$, by $\|u\|_{(k,\infty),*} = \sum_{i=1}^k |u|_i^\downarrow$. Hence, for any $Z \in \mathbb{R}^{d \times m}$, the dual spectral norm is given by $\|Z\|_{(k,\infty),*} = \sum_{i=1}^k \sigma_i(Z)$, that is, the sum of the $k$ largest singular values, which is also known as the Ky-Fan $k$-norm, see e.g. [Bhatia 1997].

4 Computing the Norm

In this section we compute the norm, illustrating how it interpolates between the $\ell_1$-norm and the $\ell_p$-norm.

Theorem 4. Let $p \in (1,\infty)$. For every $w \in \mathbb{R}^d$ and $k \leq d$, it holds that

$$\|w\|_{(k,p)} = \Biggl[ \sum_{i=1}^{\ell} (|w|_i^\downarrow)^p + \Biggl( \frac{\sum_{i=\ell+1}^d |w|_i^\downarrow}{(k-\ell)^{\frac{1}{q}}} \Biggr)^p \Biggr]^{\frac{1}{p}}, \qquad (4.1)$$

where $\frac{1}{p} + \frac{1}{q} = 1$; for $k = d$ we set $\ell = d$, otherwise $\ell$ is the largest integer in $\{0,\ldots,k-1\}$ satisfying

$$(k - \ell)\, |w|_\ell^\downarrow \geq \sum_{i=\ell+1}^d |w|_i^\downarrow. \qquad (4.2)$$

Furthermore, the norm can be computed in $O(d \log d)$ time.
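Before turning to the proof, we note that Theorem 4 translates directly into a short procedure. The following numpy sketch (ours) evaluates (4.1)-(4.2) for $p \in (1,\infty)$; for clarity it finds the critical $\ell$ by a linear scan, whereas the $O(d \log d)$ bound in the theorem uses binary search over $\ell$ on top of the initial sort:

```python
import numpy as np

def kp_support_norm(w, k, p):
    # Theorem 4, p in (1, inf): sort |w| nonincreasingly, find the largest l
    # in {0,...,k-1} satisfying (4.2), then evaluate (4.1).
    z = np.sort(np.abs(w))[::-1]
    d = z.size
    q = p / (p - 1.0)
    if k == d:
        return float(np.sum(z ** p) ** (1.0 / p))   # the l_p-norm
    suffix = np.cumsum(z[::-1])[::-1]   # suffix[i] = z_{i+1} + ... + z_d (1-indexed)
    ell = 0                             # l = 0 is always admissible
    for l in range(k - 1, 0, -1):       # largest l satisfying (4.2)
        if (k - l) * z[l - 1] >= suffix[l]:
            ell = l
            break
    head = np.sum(z[:ell] ** p)
    tail = (suffix[ell] / (k - ell) ** (1.0 / q)) ** p
    return float((head + tail) ** (1.0 / p))

w = np.array([5.0, 3.0, 1.0, 0.5])
assert np.isclose(kp_support_norm(w, 1, 2.0), np.abs(w).sum())         # k = 1: l1-norm
assert np.isclose(kp_support_norm(w, w.size, 2.0), np.linalg.norm(w))  # k = d: lp-norm
```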
Proof. Note first that in (4.1), when $\ell = 0$ we understand the first term on the right hand side to be zero, and when $\ell = d$ we understand the second term to be zero. We need to compute

$$\|w\|_{(k,p)} = \max\Bigl\{ \sum_{i=1}^d u_i w_i : \|u\|_{(k,p),*} \leq 1 \Bigr\},$$

where the dual norm $\|\cdot\|_{(k,p),*}$ is described in Proposition 2. Let $z_i = |w|_i^\downarrow$. The problem is then equivalent to

$$\max\Bigl\{ \sum_{i=1}^d z_i u_i : \sum_{i=1}^k u_i^q \leq 1,\ u_1 \geq \cdots \geq u_d \Bigr\}. \qquad (4.3)$$

This further simplifies to the $k$-dimensional problem

$$\max\Bigl\{ \sum_{i=1}^{k-1} u_i z_i + u_k \sum_{i=k}^d z_i : \sum_{i=1}^k u_i^q \leq 1,\ u_1 \geq \cdots \geq u_k \Bigr\}.$$

Note that when $k = d$ the solution is given by the dual of the $\ell_q$-norm, that is, the $\ell_p$-norm. For the remainder of the proof we assume that $k < d$. We can now attempt to use Hölder's inequality, which states that for all vectors $x$ such that $\|x\|_q = 1$ we have $\langle x, y \rangle \leq \|y\|_p$, and the inequality is tight if and only if

$$x_i = \mathrm{sign}(y_i) \Bigl( \frac{|y_i|}{\|y\|_p} \Bigr)^{p-1}.$$

We use it for the vector $y = (z_1, \ldots, z_{k-1}, \sum_{i=k}^d z_i)$. The components of the maximizer $u$ satisfy $u_i = (z_i / M_{k-1})^{p-1}$ if $i \leq k-1$, and

$$u_k = \Bigl( \frac{\sum_{i=k}^d z_i}{M_{k-1}} \Bigr)^{p-1},$$

where, for every $\ell \in \{0, \ldots, k-1\}$, $M_\ell$ denotes the right hand side of equation (4.1). We then need to verify that the ordering constraints are satisfied. This requires that

$$(z_{k-1})^{p-1} \geq \Bigl( \sum_{i=k}^d z_i \Bigr)^{p-1},$$

which is equivalent to inequality (4.2) for $\ell = k-1$. If this inequality is true we are done; otherwise we set $u_k = u_{k-1}$ and solve the smaller problem

$$\max\Bigl\{ \sum_{i=1}^{k-2} u_i z_i + u_{k-1} \sum_{i=k-1}^d z_i : \sum_{i=1}^{k-2} u_i^q + 2 u_{k-1}^q \leq 1,\ u_1 \geq \cdots \geq u_{k-1} \Bigr\}.$$

We use again Hölder's inequality and keep the result if the ordering constraints are fulfilled. Continuing in this way, the generic problem we need to solve is

$$\max\Bigl\{ \sum_{i=1}^{\ell} u_i z_i + u_{\ell+1} \sum_{i=\ell+1}^d z_i : \sum_{i=1}^{\ell} u_i^q + (k-\ell)\, u_{\ell+1}^q \leq 1,\ u_1 \geq \cdots \geq u_{\ell+1} \Bigr\},$$

where $\ell \in \{0, \ldots, k-1\}$. Without the ordering constraints the maximum, $M_\ell$, is obtained by the change of variable $u_{\ell+1} \mapsto (k-\ell)^{\frac{1}{q}} u_{\ell+1}$ followed by applying Hölder's inequality. A direct computation provides that the maximizer is $u_i = (z_i / M_\ell)^{p-1}$ if $i \leq \ell$, and

$$(k-\ell)^{\frac{1}{q}}\, u_{\ell+1} = \Biggl( \frac{\sum_{i=\ell+1}^d z_i}{(k-\ell)^{\frac{1}{q}} M_\ell} \Biggr)^{p-1}.$$

Using the relationship $\frac{1}{p} + \frac{1}{q} = 1$, we can rewrite this as

$$u_{\ell+1} = \Biggl( \frac{\sum_{i=\ell+1}^d z_i}{(k-\ell)\, M_\ell} \Biggr)^{p-1}.$$

Hence, the ordering constraints are satisfied if

$$z_\ell^{p-1} \geq \Biggl( \frac{\sum_{i=\ell+1}^d z_i}{k-\ell} \Biggr)^{p-1},$$

which is equivalent to (4.2). Finally, note that $M_\ell$ is a nondecreasing function of $\ell$. This is because the problem with a smaller value of $\ell$ is more constrained; namely, it solves (4.3) with the additional constraints $u_{\ell+1} = \cdots = u_d$. Moreover, if the constraint (4.2) holds for some value $\ell \in \{0, \ldots, k-1\}$ then it also holds for any smaller value of $\ell$, hence we maximize the objective by choosing the largest such $\ell$.

The computational complexity stems from the monotonicity of $M_\ell$ with respect to $\ell$, which allows us to identify the critical value of $\ell$ using binary search.

Note that for $k = d$ we recover the $\ell_p$-norm, and for $p = 2$ we recover the result in [Argyriou et al. 2012, McDonald et al. 2014]; however, our proof technique is different from theirs.

Remark 5 (Computation of the norm for $p \in \{1,\infty\}$). Since the norm $\|\cdot\|_{(k,p)}$ computed above for $p \in (1,\infty)$ is continuous in $p$, the special cases $p = 1$ and $p = \infty$ can be derived by a limiting argument. We readily see that for $p = 1$ the norm does not depend on $k$ and is always equal to the $\ell_1$-norm, in agreement with our observation in the previous section. For $p = \infty$ we obtain that $\|w\|_{(k,\infty)} = \max(\|w\|_\infty, \|w\|_1 / k)$.
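The closed form in Remark 5 for $p = \infty$ is easy to check numerically against Theorem 4. The sketch below (ours; it reuses kp_support_norm from the sketch after Theorem 4) confirms that the value for a moderately large finite $p$ approaches the $p = \infty$ expression:

```python
import numpy as np

def kp_support_norm_inf(w, k):
    # Remark 5: the (k, infinity)-support norm, max(||w||_inf, ||w||_1 / k).
    return float(max(np.abs(w).max(), np.abs(w).sum() / k))

w = np.array([5.0, 3.0, 1.0, 0.5])
# As p grows, the (k,p)-support norm of Theorem 4 tends to the p = infinity value.
assert abs(kp_support_norm(w, 2, 50.0) - kp_support_norm_inf(w, 2)) < 0.1
```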
5 Optimization

In this section we describe how to solve regularization problems using the vector and matrix $(k,p)$-support norms. We consider the constrained optimization problem

$$\min\bigl\{ f(w) : \|w\|_{(k,p)} \leq \alpha \bigr\}, \qquad (5.1)$$

where $w$ is in $\mathbb{R}^d$ or $\mathbb{R}^{d \times m}$, $\alpha > 0$ is a regularization parameter, and the error function $f$ is assumed to be convex and continuously differentiable. For example, in linear regression a valid choice is the square error, $f(w) = \|Xw - y\|_2^2$, where $X$ is a matrix of observations and $y$ a vector of response variables. Constrained problems of the form (5.1) are also referred to as Ivanov regularization in the inverse problems literature [Ivanov et al. 1978].

A convenient tool to solve problem (5.1) is provided by the Frank-Wolfe method [Frank and Wolfe 1956]; see also [Jaggi 2013] for a recent account. The method is outlined in Algorithm 1, and it has worst case convergence rate $O(1/T)$.

Algorithm 1 Frank-Wolfe.
  Choose $w^{(0)}$ such that $\|w^{(0)}\|_{(k,p)} \leq \alpha$
  for $t = 0, \ldots, T$ do
    Compute $g := \nabla f(w^{(t)})$
    Compute $s := \mathrm{argmin}\bigl\{ \langle s, g \rangle : \|s\|_{(k,p)} \leq \alpha \bigr\}$
    Update $w^{(t+1)} := (1 - \gamma) w^{(t)} + \gamma s$, for $\gamma := \frac{2}{t+2}$
  end for

The key step of the algorithm is to solve the subproblem

$$\mathrm{argmin}\bigl\{ \langle s, g \rangle : \|s\|_{(k,p)} \leq \alpha \bigr\}, \qquad (5.2)$$

where $g = \nabla f(w^{(t)})$, that is, the gradient of the objective function at the $t$-th iteration. This problem involves computing a subgradient of the dual norm at $g$. It can be solved exactly and efficiently as a consequence of Proposition 2. We discuss here the vector case and postpone the discussion of the matrix case to Section 5.2. By symmetry of the $\ell_p$-norm, problem (5.2) can be solved in the same manner as the maximum in Proposition 2, and the solution is given by $s_i = -\alpha w_i$, where $w_i$ is given by (3.4). Specifically, letting $I_k \subset \mathbb{N}_d$ be the set of indices of the $k$ largest components of $g$ in absolute value, for $p \in (1,\infty)$ we have

$$s_i = \begin{cases} -\alpha\, \mathrm{sign}(g_i) \Bigl( \dfrac{|g_i|}{\|g\|_{(k,p),*}} \Bigr)^{\frac{1}{p-1}} & \text{if } i \in I_k, \\ 0 & \text{if } i \notin I_k, \end{cases} \qquad (5.3)$$

and for $p = \infty$ we choose the subgradient

$$s_i = \begin{cases} -\alpha\, \mathrm{sign}(g_i) & \text{if } i \in I_k,\ g_i \neq 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (5.4)$$

5.1 Projection Operator

An alternative method to solve (5.1) in the vector case is to consider the equivalent problem

$$\min\bigl\{ f(w) + \delta_{\{\|\cdot\|_{(k,p)} \leq \alpha\}}(w) : w \in \mathbb{R}^d \bigr\}, \qquad (5.5)$$

where $\delta_{\{\|\cdot\|_{(k,p)} \leq \alpha\}}$ denotes the indicator function of the constraint set.
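To tie the section together, here is a minimal Frank-Wolfe sketch (ours) for the vector case with $p \in (1,\infty)$. By (5.3), the linear oracle (5.2) is simply $-\alpha$ times the maximizer (3.4) evaluated at the gradient, so we reuse kp_dual_maximizer from the sketch after Proposition 2:

```python
import numpy as np

def frank_wolfe_kp(grad_f, w0, alpha, k, p, T=500):
    # Algorithm 1 for min f(w) s.t. ||w||_{(k,p)} <= alpha, with the linear
    # subproblem (5.2) solved in closed form via Eq. (5.3).
    w = w0.astype(float).copy()
    for t in range(T):
        g = grad_f(w)
        if not np.any(g):                          # zero gradient: w is optimal
            break
        s = -alpha * kp_dual_maximizer(g, k, p)    # Eq. (5.3)
        gamma = 2.0 / (t + 2.0)
        w = (1.0 - gamma) * w + gamma * s          # convex combination keeps w feasible
    return w

# Example: square error f(w) = ||Xw - y||_2^2, with gradient 2 X^T (Xw - y).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 8)), rng.standard_normal(20)
w_hat = frank_wolfe_kp(lambda w: 2 * X.T @ (X @ w - y),
                       np.zeros(8), alpha=1.0, k=3, p=1.5)
```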
