MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions

Elizabeth D. Schifano*, Robert L. Strawderman*, Martin T. Wells*

*Department of Statistical Science, 301 Malott Hall, Cornell University, Ithaca NY 14853 USA

Abstract

The use of regularization, or penalization, has become increasingly common in high-dimensional statistical analysis over the past decade, where a common goal is to simultaneously select important variables and estimate their effects. It has been shown by several authors that these goals can be achieved by minimizing some parameter-dependent "goodness of fit" function (e.g., a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that are nonsmooth (i.e., not differentiable) at the origin have received substantial attention, arguably beginning with LASSO (Tibshirani, 1996).

The current literature tends to focus on specific combinations of smooth data fidelity (i.e., goodness-of-fit) and nonsmooth penalty functions. One result of this combined specificity has been a proliferation in the number of computational algorithms designed to solve fairly narrow classes of optimization problems involving objective functions that are not everywhere continuously differentiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish a local convergence theory for this class of algorithms under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm's capabilities and versatility.

Keywords: Iterative Soft Thresholding, MIST, MM algorithm.

1 Introduction

Variable selection is an important and challenging issue in the rapidly growing realm of high-dimensional statistical modeling. In such cases, it is often of interest to identify a few important variables in a veritable sea of noise. Modern methods, increasingly based on the principle of penalized likelihood estimation applied to high-dimensional regression problems, attempt to achieve this goal through an adaptive variable selection process that simultaneously permits estimation of regression effects. Indeed, the literature on the penalization of a "goodness of fit" function (e.g., negative loglikelihood), with a penalty singular at the origin, is quickly becoming vast, proliferating in part due to the consideration of specific combinations of data fidelity (i.e., goodness-of-fit) and penalty functions, the associated statistical properties of resulting estimators, and the development of several combination-specific optimization algorithms (e.g., Tibshirani, 1996; Zou, 2006; Zou and Hastie, 2005; Zou and Zhang, 2009; Fan and Li, 2001; Park and Hastie, 2007; Friedman et al., 2008).

In this paper, we propose a unified optimization framework that appeals to the Majorization-Minimization (MM) algorithm (Lange, 2004) as the primary optimization tool. The resulting class of algorithms is referred to as MIST, an acronym for Minimization by Iterative Soft Thresholding.
The MM algorithm has been considered before for solving specific classes of singularly penalized likelihood estimation problems (e.g., Daubechies et al., 2004; Hunter and Li, 2005; Zou and Li, 2008); to a large extent, this work is motivated by these ideas. A distinct advantage of the proposed work is the exceptional versatility of the class of MIST algorithms, their associated ease of implementation and numerical stability, and the development of a fixed point convergence theory that permits weaker assumptions than existing papers in this area. We emphasize here that the focus of this paper is on the development of a stable and versatile class of algorithms applicable to a wide variety of singularly penalized estimation problems. In particular, the consideration of asymptotic and oracle properties of estimators derived from particular combinations of fidelity and penalty functions, as well as methods for effectively choosing associated penalty parameters, are not focal points of this paper. A comprehensive treatment of these results may be found in Johnson et al. (2008), where asymptotics and oracle properties for estimators derived from a general class of penalized estimating equations are developed in some detail.

The paper is organized as follows. In Section 2, we introduce notation and provide sufficient conditions for local convergence of the MM algorithm applied to a large class of data-fidelity and non-smooth penalty functions. In Section 3, we present a specialized version of this general algorithm, demonstrating in particular how the minimization step of the MM algorithm can be carried out using iterated soft-thresholding. In its most general form, iterated soft-thresholding is required at each minimization step; we further demonstrate how to carry out this minimization step in one iteration through a judicious choice of majorization function. As a consequence, we present a simplified class of iterative algorithms that are applicable to a wide class of singularly penalized, generalized linear regression models. Simulation results are provided in Section 4, while an application in survival analysis to Diffuse Large B Cell Lymphoma expression data (Rosenwald et al., 2002) is presented in Section 5. We conclude with a discussion in Section 6. Proofs and other relevant results are collected in the Appendix.

2 MM algorithms for nonsmooth objective functions

Let ξ(β) denote a real-valued objective function to be minimized for β = (β_1, ..., β_p)^T in some convex subset B of R^p. Let ξ^SUR(β, α) denote a real-valued "surrogate" objective function, where α ∈ B. Define the minimization map

\[
M(\alpha) = \operatorname*{arg\,min}_{\beta \in B} \xi^{\mathrm{SUR}}(\beta, \alpha). \tag{1}
\]

Then, if ξ^SUR(β, α) majorizes ξ(β) for each α, a generic MM algorithm for minimizing ξ(β) takes the following form (e.g., Lange, 2004):

1. Initialize β^(0).
2. For n ≥ 0, compute β^(n+1) = M(β^(n)), iterating until convergence.

Provided that the objective function, its surrogate, and the mapping M(·) satisfy certain regularity conditions, one can establish convergence of this algorithm to a local or global solution. Lange (2004, Ch. 10) develops such a theory assuming that the objective functions ξ(β) and ξ^SUR(β, α) are twice continuously differentiable. For problems that lack this degree of smoothness (e.g., all singularly penalized regression problems, including the lasso, Tibshirani (1996); the adaptive lasso, Zou (2006); and SCAD, Fan and Li (2001)), a more general theory of local convergence is required. One such theory is summarized in Appendix A.1; related results for the EM algorithm may be found in Wu (1983), Tseng (2004) and Chrétien and Hero (2008). Let ‖·‖ denote the usual Euclidean vector norm.
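As a concrete, minimal illustration of the generic MM scheme just described, the following Python sketch simply iterates the map M(·); it assumes the caller supplies a routine that returns the minimizer of ξ^SUR(·, α) at the current iterate. The function names and the stopping rule are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mm_minimize(beta0, minimize_surrogate, tol=1e-8, max_iter=500):
    """Generic MM loop: beta^(n+1) = M(beta^(n)), cf. (1).

    `minimize_surrogate(alpha)` returns argmin_beta xi_SUR(beta, alpha),
    i.e., the problem-specific map M(alpha); constructing that map for
    nonsmoothly penalized objectives is what Sections 2-3 develop.
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta_new = minimize_surrogate(beta)        # the minimization step M(alpha)
        if np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta)):
            return beta_new                        # iterates have stabilized
        beta = beta_new
    return beta
```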
Based on the theory summarized in Appendix A.1, we propose a new and general class of algorithms for minimizing penalized objective functions of the form

\[
\xi(\beta) = g(\beta) + p(\beta;\lambda) + \lambda\varepsilon\|\beta\|^2, \qquad \lambda > 0,\ \varepsilon \ge 0, \tag{2}
\]

where g(β) and p(β; λ) are respectively data fidelity (e.g., negative loglikelihood) and penalty functions that satisfy regularity conditions to be delineated below. As will be shown later, the class of problems represented by (2) contains all of the penalized regression problems commonly considered in the current literature. It also covers numerous other problems by expanding the class of permissible fidelity and penalty functions in a substantial way.

We assume throughout that g(β) is convex and coercive for β ∈ B, where B is an open convex subset of R^p. We further assume that

\[
p(\beta;\lambda) = \sum_{j=1}^{p} \tilde p(|\beta_j|;\lambda_j), \tag{3}
\]

where the vector λ = (λ_1^T, ..., λ_p^T)^T and λ_j denotes the block of λ associated with β_j. It is assumed that each λ_j has dimension greater than or equal to one, that all blocks have the same dimension, and that λ_{j1} = λ for each j ≥ 1. Evidently, the case where dim(λ_j) = 1 for j ≥ 1 simply corresponds to the setting in which each coefficient is penalized in exactly the same way; permitting the dimension of λ_j to exceed one allows the penalty to depend on additional parameters (e.g., weights, such as in the case of the adaptive lasso considered in Zou (2006)). We are interested in problems with penalization; therefore, λ is assumed bounded and strictly positive throughout this paper. Several specific examples will be discussed below. For any bounded θ with λ > 0 as the first element, and the remainder of θ collecting any additional parameters used to define the penalty, the scalar function p̃(r; θ) is assumed to satisfy the following condition:

(P1) p̃(r; θ) > 0 for r > 0; p̃(0; θ) = 0; p̃(r; θ) is a continuously differentiable concave function with p̃′(r; θ) ≥ 0 for r > 0; and p̃′(0+; θ) ∈ [M_θ^{-1}, M_θ] for some finite M_θ > 0.

Evidently, (P1) implies that p̃′(r; θ) > 0 for r ∈ (0, K_θ), where K_θ > 0 may be finite or infinite. The combination of the concavity and nonnegative derivative conditions thus implies that the penalty increases away from the origin, but with a decreasing rate of growth that may become zero. The case where (3) is identically zero for r > 0 is ruled out by the positivity of the right derivative at the origin imposed in (P1); similarly, the concavity assumption also rules out the possibility of a strictly convex penalty term. Neither of these restrictions is particularly problematic. Our specific interest lies in the development of algorithms for estimation problems subject to a penalty singular at the origin. Were (3) absent, or replaced by a strictly convex penalty term, the convexity of g(β) implies (2) can be minimized directly using any suitable convex optimization algorithm, such as that discussed in Theorem 3.2 below.

Theorem 2.1 establishes local convergence of the indicated class of MM algorithms for minimizing objective functions of the form (2). A proof is provided in Appendix A.2, where it is shown that conditions imposed in the statement of the theorem are sufficient conditions for the application of the general MM local convergence theory summarized in Appendix A.1.

Theorem 2.1. Let g(β) be convex and coercive and assume p(β; λ) satisfies both (3) and condition (P1). Let h(β, α) ≥ 0 be a real-valued, continuous function of β and α that is continuously differentiable in β for each α and satisfies h(β, α) = 0 when β = α. Let

\[
q(\beta,\alpha;\lambda) = \sum_{j=1}^{p} \tilde q(|\beta_j|,|\alpha_j|;\lambda_j), \tag{4}
\]

where q̃(r, s; θ) = p̃(s; θ) + p̃′(s; θ)(r − s) for r, s ≥ 0, and define

\[
\psi(\beta,\alpha) = h(\beta,\alpha) + q(\beta,\alpha;\lambda) - p(\beta;\lambda).
\]

Assume the set of stationary points S for ξ(β), β ∈ B, is finite and isolated.
Then:

(i) ξ(β) in (2) is locally Lipschitz continuous and coercive;

(ii) q(β, α; λ) − p(β; λ) is either identically zero or non-negative for all β ≠ α;

(iii) ξ^SUR(β, α) ≡ ξ(β) + ψ(β, α) majorizes ξ(β), and the MM algorithm derived from ξ^SUR(β, α) converges to a stationary point of ξ(β) if ξ^SUR(β, α) is uniquely minimized in β for each α and at least one of h(β, α) or q(β, α; λ) − p(β; λ) is strictly positive for each β ≠ α.

Condition (iii) of Theorem 2.1 establishes convergence under the assumption that ξ^SUR(β, α) strictly majorizes ξ(β) and has a unique minimizer in β for each α. Such a uniqueness condition is shown by Vaida (2005) to ensure convergence of the EM and MM algorithms to a stationary point under more restrictive differentiability conditions. Importantly, the assumption of globally strict majorization is only a sufficient condition for convergence; this condition is only important insofar as it guarantees a strict decrease in the objective function at every iteration. As can be seen from the proof, it is possible to relax this condition to locally strict majorization, in which ξ^SUR(β, α) majorizes ξ(β), with strict majorization being necessary only in an open neighborhood containing M(α).

The use of the MM algorithm and the selection of (4) are motivated by the results of Zou and Li (2008); we refer the reader to Remark 3.1 below for further comments in this direction. The assumptions on g(β) clearly cover the case of the linear and canonically parameterized generalized linear models upon setting g(β) = −ℓ(β), where ℓ(β) denotes the corresponding loglikelihood function. Estimation under the semiparametric Cox regression model (Cox, 1972) and accelerated failure time models are also covered upon setting g(β) to be either the negative logarithm of the partial likelihood function (e.g., Andersen et al., 1993, Thm VII.2.1) or the Gehan objective function (e.g., Fygenson and Ritov, 1994; Johnson and Strawderman, 2009).

The assumption (P1) on the penalty function covers a wide variety of popular and interesting examples; see Figure 1 for illustration. For example, the lasso (LAS; e.g., Tibshirani, 1996), adaptive lasso (ALAS; e.g., Zou, 2006), elastic net (EN; e.g., Zou and Hastie, 2005), and adaptive elastic net (AEN; e.g., Zou and Zhang, 2009) penalties are all recovered as special cases upon considering the combination of (3) and the ridge-type penalty λε‖β‖². Specifically, with λ_j = (λ, ω_j)^T for ω_j ≥ 0, taking p̃(r; λ_j) = λω_j r in (3) gives the LAS (ω_j = 1, ε = 0), ALAS (ω_j > 0, ε = 0), EN (ω_j = 1, ε > 0) and AEN (ω_j > 0, ε > 0) penalties. It is easy to see that selecting p̃(r; λ_j) = λω_j r also implies the equality of (3) and (4), a result relevant in both (ii) and (iii) of Theorem 2.1 above.

[Figure 1: Three examples of penalties satisfying (P1). Panels, left to right: LASSO (θ = λ = 1); SCAD (θ = (λ, a) = (1.25, 3.7)); Geman & Reynolds, 1992 (θ = (λ, δ) = (5, 1)); each panel plots p̃(r; θ) against r.]

The proposed penalty specification also covers the smoothly clipped absolute deviation (SCAD; e.g., Fan and Li, 2001) penalty upon setting p̃(r; λ_j) = p̃_S(r; λ, a) for each j ≥ 1, where p̃_S(r; λ, a) is defined as the definite integral of

\[
\tilde p_S'(u;\lambda,a) = \lambda\left[ I(u \le \lambda) + \frac{(a\lambda - u)_+}{(a-1)\lambda}\, I(u > \lambda) \right] \tag{5}
\]

on the interval 0 ≤ u ≤ r, for some fixed value of a > 2 (e.g., a = 3.7). The resulting penalty function is continuously differentiable and concave on r ∈ [0, ∞).
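As an illustration of the SCAD specification just given, the following Python sketch evaluates the derivative p̃′_S(u; λ, a) in (5); it is offered as a minimal reference implementation under the stated convention a > 2, and the function name is ours rather than the paper's.

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    """SCAD penalty derivative p~'_S(u; lambda, a) from (5), for u >= 0.

    Equals lambda on [0, lambda], decays as (a*lambda - u)_+ / (a - 1)
    on (lambda, a*lambda], and is exactly zero beyond a*lambda.
    """
    u = np.asarray(u, dtype=float)
    flat_part = lam * (u <= lam)
    clipped_part = np.maximum(a * lam - u, 0.0) / (a - 1.0) * (u > lam)
    return flat_part + clipped_part
```

Note that the derivative vanishes for u ≥ aλ; this is precisely the feature that makes SCAD non-strictly concave and motivates the strictly positive h(β, α) discussed in Remark 2.2 and Theorem 3.4 below.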
The concavity of p̃_S(·; λ, a) on [0, ∞), combined with p̃_S(0; λ, a) = 0 and the fact that p̃′_S(0+; λ, a) is finite, implies

\[
\tilde p_S(r;\lambda,a) \le \tilde p_S(s;\lambda,a) + \tilde p_S'(s;\lambda,a)(r - s) \tag{6}
\]

for each r, s ≥ 0, the boundary cases for r = 0 and/or s = 0 following from Hiriart-Urruty and Lemaréchal (1996, Remark 4.1.2, p. 21). In other words, p̃_S(r; λ, a) can be majorized by a linear function of r.

The lasso penalty, its variants, and SCAD have received the greatest attention in the literature. More recently, Zhang (2007) introduced the minimax concave penalty (MCP), which, similarly to SCAD, is defined in terms of its derivative. Specifically, one takes p̃(r; λ_j) = p̃_M(r; λ, a) for each j ≥ 1 in (3), where p̃_M(r; λ, a) is defined as the definite integral of

\[
\tilde p_M'(u;\lambda,a) = \left(\lambda - \frac{u}{a}\right)_+ \tag{7}
\]

on the interval 0 ≤ u ≤ r, for some fixed value of a > 2 (e.g., a = 3.7 as in Fan et al., 2009b). Further examples of differentiable concave penalties include p̃(r; λ_j) = p̃_G(r; λ, δ) for

\[
\tilde p_G(r;\lambda,\delta) = \lambda\,\frac{\delta r}{1+\delta r}, \qquad \delta > 0 \tag{8}
\]

(e.g., Geman and Reynolds, 1992; Nikolova, 2000), and p̃(r; λ_j) = p̃_Y(r; λ, δ) for

\[
\tilde p_Y(r;\lambda,\delta) = \lambda\log(\delta r + 1), \qquad \delta > 0 \tag{9}
\]

(e.g., Antoniadis et al., 2009). These penalties represent just a small sample of the set of possible penalties satisfying (P1) that one might reasonably consider.
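The additional concave penalties in (7)-(9) can likewise be handled through their derivatives, which is essentially all the MIST updates below require. The following Python helpers are an illustrative rendering under the parameterizations above, with function names of our own choosing.

```python
import numpy as np

def mcp_deriv(u, lam, a=3.7):
    """MCP derivative p~'_M(u; lambda, a) = (lambda - u/a)_+ from (7), for u >= 0."""
    return np.maximum(lam - np.asarray(u, dtype=float) / a, 0.0)

def geman_reynolds_deriv(u, lam, delta):
    """Derivative of p~_G(u; lambda, delta) = lambda*delta*u/(1 + delta*u) in (8):
    lambda*delta / (1 + delta*u)^2 for u >= 0."""
    return lam * delta / (1.0 + delta * np.asarray(u, dtype=float)) ** 2

def log_penalty_deriv(u, lam, delta):
    """Derivative of p~_Y(u; lambda, delta) = lambda*log(delta*u + 1) in (9):
    lambda*delta / (delta*u + 1) for u >= 0."""
    return lam * delta / (delta * np.asarray(u, dtype=float) + 1.0)
```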
Remark 2.2. The SCAD and MCP penalties are not strictly concave and lead to surrogate majorizers that fail to satisfy the globally strict majorization condition in (iii) of Theorem 2.1 unless h(β, α) is strictly positive whenever β ≠ α; see Remark 3.1 for further discussion and also Theorem 3.4 below.

3 MIST: Minimization by Iterated Soft Thresholding

3.1 The MIST algorithm

In general, the statistical literature on penalized estimation has proposed optimization algorithms tailored for specific combinations of fidelity and penalty functions. The class of MM algorithms suggested by Theorem 2.1 provides a very general and useful framework for proposing new algorithms, the key to which is a methodology for solving the minimization problem (1), a step repeated with each iteration of the MM algorithm. In this regard, it is helpful to note that the problem of minimizing ξ^SUR(β, α) for a given α is equivalent to minimizing

\[
g(\beta) + \lambda\varepsilon\|\beta\|^2 + h(\beta,\alpha) + \sum_{j=1}^{p} \tilde p'(|\alpha_j|;\lambda_j)\,|\beta_j| \tag{10}
\]

in β. In particular, if g(β) + λε‖β‖² + h(β, α) is strictly convex for each bounded α, which clearly occurs if both g(β) and h(β, α) are convex in β and at least one is strictly convex, then (10) is also strictly convex and the corresponding minimization problem has a unique solution.

Remark 3.1. For ε = h(β, α) = 0 and g(β) = −ℓ(β), with ℓ(β) = Σ_{i=1}^n ℓ_i(β) a twice continuously differentiable loglikelihood function, the majorizer used by the MM algorithm induced by the surrogate function (10) corresponds (up to sign) to the minorizer employed in the LLA algorithm of Zou and Li (2008), an improvement on the so-called LQA algorithm proposed in Hunter and Li (2005). Zou and Li (2008, Proposition 1) assert convergence of their LLA algorithm under imprecisely stated assumptions and are additionally unclear as to the nature of the convergence result actually established. For example, while Zou and Li (2008, Theorem 1) demonstrate that the LLA algorithm does indeed have an ascent property, their result appears to be insufficient to ensure that the proper analog of condition Z3(ii) of Theorem A.1 holds in the case of the SCAD penalty. As a consequence, it is unclear whether even weak "subsequence" convergence results (cf. Wu, 1983) hold with useful generality in the case of the LLA algorithm. In contrast, Theorem 2.1 shows that strict majorization, under a few precisely stated conditions, is sufficient to ensure local convergence of the resulting MM algorithm to a stationary point of (2). In Section 3.2, it is further demonstrated how a particular choice of h(β, α) yields a strict majorizer that permits both closed-form minimization and componentwise updating at each step of the MM algorithm, even in the case of penalties that fail to be strictly concave.

Numerous methods exist for minimizing a differentiable convex objective function (e.g., Boyd and Vandenberghe, 2004). However, because (10) is not differentiable, such methods do not apply in the current setting. Specialized methods exist for nonsmooth problems of the form (10) in settings where g(β) has a special structure; a well-known example here is LARS (Efron et al., 2004), which can be used to efficiently solve lasso-type problems in the case where g(β) is replaced by a least squares objective function. Recently, Combettes and Wajs (2005, Proposition 3.1; Theorem 3.4) proposed a very general class of fixed point algorithms for minimizing f_1(h) + f_2(h), where f_i(·), i = 1, 2, are each convex and h takes values in some real Hilbert space H. Hale et al. (2008, Theorem 4.5) specialize the results of Combettes and Wajs (2005) to the case where H is some subset of R^p and f_2(h) = Σ_{j=1}^p |h_j|. The collective application of these results to the problem of minimizing (10) generates an iterated soft-thresholding procedure with an appealingly simple structure. Theorem 3.2, given below, states the algorithm, along with conditions under which the algorithm is guaranteed to converge; a proof is provided in Appendix A.3. The resulting class of procedures, that is, MM algorithms in which the minimization of (10) is carried out via this iterated soft-thresholding procedure, is hereafter referred to as MIST, an acronym for (M)inimization by (I)terated (S)oft (T)hresholding. Two important and useful features of MIST include the absence of high-dimensional matrix inversion and the ability to update each individual parameter separately.

Theorem 3.2. Suppose the conditions of Theorem 2.1 hold. Let m(β) = g(β) + h(β, α) + λε‖β‖² be strictly convex with a Lipschitz continuous derivative of order L^{-1} > 0 for each bounded α. Then, for any such α and a constant ϖ ∈ (0, 2L), the unique minimizer of (10) can be obtained in a finite number of iterations using iterated soft-thresholding:

1. Set n = 1 and initialize b^(0).
2. Compute d^(n) = b^(n−1) − ϖ∇m(b^(n−1)).
3. Compute b^(n) = S(d^(n); ϖτ), where for any vectors u, v ∈ R^p,

\[
S(u;v) = \sum_{j=1}^{p} s(u_j,v_j)\,e_j, \tag{11}
\]

e_j denotes the jth unit vector for R^p,

\[
s(u_j,v_j) = \mathrm{sign}(u_j)\,(|u_j| - v_j)_+ \tag{12}
\]

is the univariate soft-thresholding operator, and τ = (p̃′(|α_1|; λ_1), ..., p̃′(|α_p|; λ_p))^T.

4. Stop if converged; else, set n = n + 1 and return to Step 2.

The proposed algorithm, as originally developed in Combettes and Wajs (2005), is suitable for minimizing the sum of a differentiable convex function m(·) and another convex function; hence, under similar conditions, one could employ this algorithm directly to minimize (2) in cases where the penalty (3) is derived from some convex function p̃(·; θ). Theorem 3.4 of Combettes and Wajs (2005) further shows that the update in Step 3 can be generalized to

\[
b^{(n)} = b^{(n-1)} + \delta_n\left[\, S\!\left(b^{(n-1)} - \varpi_n \nabla m(b^{(n-1)});\, \varpi_n \tau\right) - b^{(n-1)} \right],
\]

where, for every n, ϖ_n ∈ (0, 2L) and δ_n ∈ (0, 1] is a suitable sequence of relaxation constants. Judicious selection of these constants, possibly updated at each step, may improve the convergence rate of this algorithm.
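The inner loop of Theorem 3.2 is easy to state in code. The sketch below, in Python, assumes the smooth part m(·) and its gradient are supplied along with the threshold vector τ and a step size ϖ ∈ (0, 2L); the names and the stopping rule are illustrative rather than taken from the paper.

```python
import numpy as np

def soft_threshold(u, v):
    """Componentwise soft-thresholding S(u; v), cf. (11)-(12):
    s(u_j, v_j) = sign(u_j) * (|u_j| - v_j)_+ ."""
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)

def ist_minimize(b0, grad_m, tau, varpi, tol=1e-10, max_iter=5000):
    """Iterated soft-thresholding for the surrogate minimization in Theorem 3.2.

    grad_m : callable returning the gradient of the smooth, strictly convex part
             m(beta) = g(beta) + h(beta, alpha) + lam*eps*||beta||^2.
    tau    : vector of thresholds p~'(|alpha_j|; lambda_j).
    varpi  : step size in (0, 2L), where L^{-1} is the Lipschitz constant of grad_m.
    """
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        d = b - varpi * grad_m(b)                  # gradient step on m
        b_new = soft_threshold(d, varpi * tau)     # shrink components toward zero
        if np.linalg.norm(b_new - b) <= tol:
            return b_new
        b = b_new
    return b
```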
Theorem 3.2 imposes the relatively strong condition that the gradient of m(β) is L^{-1}-Lipschitz continuous. The role of this condition, also imposed in Combettes and Wajs (2005, Proposition 3.1; Theorem 3.4), is to ensure that the update at each step of the proposed algorithm is a contraction, thereby guaranteeing its convergence to a fixed point. To see this, note that the update from b^(n) to b^(n+1) in the algorithm of Theorem 3.2 involves the mapping S(b − ϖ∇m(b); ϖτ). For any bounded b and a, it is easily shown that

\[
\|S(b - \varpi\nabla m(b); \varpi\tau) - S(a - \varpi\nabla m(a); \varpi\tau)\| \le \|b - a - \varpi(\nabla m(b) - \nabla m(a))\|.
\]

When ∇m(b) = ∇m(a), the right-hand side reduces to ‖b − a‖, and the resulting mapping is only nonexpansive (i.e., not necessarily contractive). However, under strict convexity, this situation can occur only if b = a. In particular, suppose that b^(n) ≠ b^(n−1); then ∇m(b^(n)) ≠ ∇m(b^(n−1)) and, using the mean value theorem,

\[
\|b^{(n+1)} - b^{(n)}\| = \left\| S\!\left(b^{(n)} - \varpi\nabla m(b^{(n)}); \varpi\tau\right) - S\!\left(b^{(n-1)} - \varpi\nabla m(b^{(n-1)}); \varpi\tau\right) \right\|
\le \|I - \varpi H(b^{(n)}, b^{(n-1)})\|\, \|b^{(n)} - b^{(n-1)}\|,
\]

where H(b, a) = ∫_0^1 ∇²m(b + t(a − b)) dt. The assumption that the gradient of m(β) is L^{-1}-Lipschitz continuous now implies that choosing ϖ as indicated guarantees ‖I − ϖH(b^(n), b^(n−1))‖ < 1, thereby producing a contraction.

In view of the generality of the Contraction Mapping Theorem (e.g., Luenberger and Ye, 2008, Thm. 10.2.1), it is possible to relax the requirement that ∇m(β) is globally L^{-1}-Lipschitz continuous provided that one selects a suitable starting point. The relevant extension is summarized in the corollary below; one may prove this result in a manner similar to Theorem 4.5 of Hale et al. (2008).

Corollary 3.3. Let the conditions of Theorem 2.1 hold. Suppose α is a bounded vector and assume that m(β) = g(β) + h(β, α) + λε‖β‖² is strictly convex and twice continuously differentiable. Then, for a given bounded α, there exists a unique minimizer β*. Let Ω be a bounded convex set containing β* and define λ_max(β) to be the largest eigenvalue of ∇²m(β). Then, the algorithm of Theorem 3.2 converges to β* in a finite number of iterations provided that b^(0) ∈ Ω, λ* = max_{β∈Ω} λ_max(β) < ∞, and ϖ ∈ (0, 2/λ*).

Some useful insight into the form of the proposed thresholding algorithm can be gained by considering the behavior of the penalty derivative term p̃′(r; θ). As suggested earlier, (P1) implies that p̃′(r; θ) decreases from its maximum value towards zero as r moves away from the origin. For some penalties (e.g., SCAD, MCP), this derivative actually becomes zero at some finite value of r > 0, resulting in situations for which τ_j = p̃′(|α_j|; λ_j) = 0 for at least one j. If this occurs for component j, then the jth component of the vector S(b^(n) − ϖ∇m(b^(n)); ϖτ) simply reduces to the jth component of the argument vector b^(n) − ϖ∇m(b^(n)). In the extreme case where τ = 0, the proposed update reduces to b^(n+1) = b^(n) − ϖ∇m(b^(n)), an inexact Newton step in which the inverse hessian matrix is replaced by ϖI_p, with I_p denoting the p × p identity matrix, and with step size chosen to ensure that this update yields a contraction. Hence, if each of the components of b^(n) − ϖ∇m(b^(n)) is sufficiently large in magnitude, the proposed algorithm simply takes an inexact Newton step towards the solution; otherwise, one or more components of this Newton-like update are subject to soft-thresholding.
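Corollary 3.3 suggests a practical recipe for choosing the step size: bound the largest eigenvalue of ∇²m over a region containing the minimizer and take ϖ below 2/λ*. For a least-squares fidelity with ridge term and h = 0, the hessian ∇²m = XᵀX + 2λεI is constant, so the bound is exact. The short Python sketch below is our illustration of the corollary under that assumption, not a procedure spelled out in the paper.

```python
import numpy as np

def step_size_least_squares(X, lam, eps, safety=0.99):
    """Step size varpi < 2 / lambda*, cf. Corollary 3.3, for a least-squares fidelity.

    Assumes m(beta) = 0.5*||X beta - y||^2 + lam*eps*||beta||^2 with h = 0, whose
    hessian X'X + 2*lam*eps*I is constant; lambda* is its largest eigenvalue.
    """
    lam_star = np.linalg.norm(X, 2) ** 2 + 2.0 * lam * eps  # spectral norm squared + ridge
    return safety * 2.0 / lam_star
```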
3.2 Penalized estimation for generalized linear models

The combination of Theorems 2.1 and 3.2 and Corollary 3.3 leads to a useful and stable class of algorithms with the ability to deal with a wide range of penalized regression problems. In settings where g(β) is strictly convex and twice continuously differentiable, one can safely assume that h(β, α) = 0 for all choices of β and α provided that p̃′(r; θ) in (P1) is strictly positive for r > 0; important examples of statistical estimation problems here include many commonly used linear and generalized linear regression models, semiparametric Cox regression (Cox, 1972), and smoothed versions of the accelerated failure time regression model (cf. Johnson and Strawderman, 2009). The SCAD and MCP penalizations, as well as other penalties having p̃′(r; θ) ≥ 0 for r > 0, can also be used; however, additional care is required. In particular, and as pointed out in an earlier remark, if one sets h(β, α) = 0 for all β and α, then convergence of the resulting algorithm to a stationary point is no longer guaranteed by the above results due to the resulting failure of these penalties to induce strict majorization.

The need to use an iterative algorithm for repeatedly minimizing (10) is not unusual for the class of MM algorithms. However, it turns out that for certain choices of g(β), a suitable choice of h(β, α) in Theorem 3.2 guarantees strict majorization and permits one to minimize (10) in a single iteration, resulting in a single soft-thresholding update at each iteration. Below, we demonstrate how the MIST algorithm simplifies in settings where g(β) corresponds to the negative loglikelihood function of a canonically parameterized generalized linear regression model having a bounded hessian function. The result applies to all penalties satisfying condition (P1), including SCAD and MCP. A proof is provided in Appendix A.4.

Theorem 3.4. Let y be N × 1 and suppose the probability distribution of y follows a generalized linear model with a canonical link and linear predictor X̃β̃, where X̃ = [1_N, X] is N × (p + 1) and β̃ = [β_0, β^T]^T is (p + 1) × 1 with β_0 denoting an intercept. Assume that g(β̃) = −ℓ(β̃), where

\[
\ell(\tilde\beta) = \mathbf{1}^T\bigl(c(y) - b(\tilde\eta)\bigr) + y^T\tilde\eta
\]

denotes the corresponding loglikelihood; here, η̃ = X̃β̃ and E[y_i] = b′(η̃_i) for i = 1, ..., N, with b(·) strictly convex and twice continuously differentiable. Let the penalty function be defined as in (3) and satisfy (P1); note that β_0 remains unpenalized. Define

\[
h(\tilde\beta,\tilde\alpha) = \ell(\tilde\beta) - \ell(\tilde\alpha) - \nabla\ell(\tilde\alpha)^T(\tilde\beta - \tilde\alpha) + \varpi^{-1}(\tilde\beta - \tilde\alpha)^T(\tilde\beta - \tilde\alpha), \tag{13}
\]

where α̃ ≡ [α_0, α^T]^T is (p + 1) × 1 and ϖ is defined as in Corollary 3.3. Then:

1. The objective function (2), say ξ_glm(β̃), is majorized by

\[
\xi^{\mathrm{SUR}}_{\mathrm{glm}}(\tilde\beta,\tilde\alpha) = -\ell(\tilde\alpha) - \nabla\ell(\tilde\alpha)^T(\tilde\beta - \tilde\alpha)
+ \varpi^{-1}(\tilde\beta - \tilde\alpha)^T(\tilde\beta - \tilde\alpha) + \sum_{j=1}^{p}\bigl(\tau_j|\beta_j| + \gamma_j + \lambda\varepsilon\beta_j^2\bigr), \tag{14}
\]

where τ_j = p̃′(|α_j|; λ_j) and γ_j = p̃(|α_j|; λ_j) − p̃′(|α_j|; λ_j)|α_j| are bounded, nonnegative, and functionally independent of β̃.

2. The functions g(β̃) = −ℓ(β̃) and h(β̃, α̃) satisfy the regularity conditions of Theorems 2.1 and 3.2; hence, the corresponding MM algorithm converges to a stationary point of (2).

3. For each bounded α̃,

(a) the minimizer β̃* of ξ^SUR_glm(β̃, α̃) is unique and satisfies

\[
\beta^*_A = \frac{1}{1+\varpi\lambda\varepsilon}\, S\!\left(\alpha + \frac{\varpi}{2}\,[\nabla\ell(\tilde\alpha)]_A;\ \frac{\varpi}{2}\,\tau\right),
\qquad
\beta^*_0 = \alpha_0 + \frac{\varpi}{2}\,[\nabla\ell(\tilde\alpha)]_0, \tag{15}
\]

where S(·; ·) is the soft-thresholding operator defined in (11) and A = {1, ..., p};

(b) for each κ̃ ≡ [κ_0, κ^T]^T ∈ R^{p+1},

\[
\xi^{\mathrm{SUR}}_{\mathrm{glm}}(\tilde\beta^* + \tilde\kappa, \tilde\alpha) \ge \xi^{\mathrm{SUR}}_{\mathrm{glm}}(\tilde\beta^*, \tilde\alpha) + \varpi^{-1}\|\tilde\kappa\|^2. \tag{16}
\]

In view of previous results, part 3 of Theorem 3.4 shows that the resulting MM algorithm takes a very simple form: given the current iterate β̃^(n),

1. update the unpenalized intercept β_0^(n):

\[
\beta_0^{(n+1)} = \beta_0^{(n)} + \frac{\varpi}{2}\left[\nabla\ell(\tilde\beta^{(n)})\right]_0;
\]

2. update the remaining parameters β^(n):

\[
\beta^{(n+1)} = \frac{1}{1+\varpi\lambda\varepsilon}\, S\!\left(\beta^{(n)} + \frac{\varpi}{2}\left[\nabla\ell(\tilde\beta^{(n)})\right]_A;\ \frac{\varpi}{2}\,\tau^{(n)}\right), \tag{17}
\]

where τ^(n) = (p̃′(|β_1^(n)|; λ_1), ..., p̃′(|β_p^(n)|; λ_p))^T.
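To make the two-step update above concrete, the following Python sketch performs one MIST iteration for a canonical GLM, using logistic regression (so that ∇ℓ(β̃) = X̃ᵀ(y − μ) with μ the fitted means) as a plausible instance; the step size ϖ is assumed chosen as in Corollary 3.3, and all function and variable names are ours rather than the paper's.

```python
import numpy as np

def mist_glm_step(beta_tilde, X_tilde, y, lam, eps, varpi, penalty_deriv):
    """One MIST update (intercept plus penalized coefficients), cf. (15) and (17),
    illustrated for a logistic (canonical Bernoulli) GLM.

    beta_tilde    : current iterate [beta_0, beta_1, ..., beta_p].
    X_tilde       : N x (p+1) design with a leading column of ones.
    penalty_deriv : callable r -> p~'(r; lambda), applied componentwise.
    varpi         : step size satisfying Corollary 3.3 for this model.
    """
    beta_tilde = np.asarray(beta_tilde, dtype=float)
    eta = X_tilde @ beta_tilde
    mu = 1.0 / (1.0 + np.exp(-eta))            # b'(eta) for the logistic model
    grad_ell = X_tilde.T @ (y - mu)            # gradient of the loglikelihood

    beta_new = np.empty_like(beta_tilde)
    beta_new[0] = beta_tilde[0] + 0.5 * varpi * grad_ell[0]        # intercept step

    tau = penalty_deriv(np.abs(beta_tilde[1:]))                    # thresholds tau^(n)
    z = beta_tilde[1:] + 0.5 * varpi * grad_ell[1:]
    shrunk = np.sign(z) * np.maximum(np.abs(z) - 0.5 * varpi * tau, 0.0)
    beta_new[1:] = shrunk / (1.0 + varpi * lam * eps)              # ridge rescaling
    return beta_new
```

For the lasso one would pass `penalty_deriv = lambda r: lam * np.ones_like(r)`, while for SCAD one could substitute the `scad_deriv` helper sketched earlier.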
The specific choice of the function h(β̃, α̃) clearly serves two useful purposes: (i) it leads to componentwise soft-thresholding; and (ii) it leads to strict majorization, as is required in condition (iii) of Theorem 2.1, allowing one to establish the convergence of MIST for SCAD and other penalties having p̃′(r; θ) = 0 at some finite r > 0.

Evidently, the algorithm above immediately covers the setting of penalized linear regression. For example, suppose that y has been centered to remove β_0 from consideration and that the problem has also been rescaled so that X, which is now N × p, satisfies the indicated conditions. Then, the results of Theorem 3.4 apply directly with

\[
-\ell(\beta) = \tfrac{1}{2}\|X\beta - y\|^2, \qquad
\nabla\ell(\beta) = X^T(y - X\beta), \qquad
h(\beta,\alpha) = \varpi^{-1}\|\beta-\alpha\|^2 - \tfrac{1}{2}\|X\beta - X\alpha\|^2,
\]

where ϖ is defined as in Corollary 3.3. For the class of adaptive elastic net penalties (i.e., p̃(r; λ_j) = λω_j r in (3)), the resulting iterative scheme is exactly that proposed in De Mol et al. (2008, pg. 17), specialized to the setting of a Euclidean parameter. In particular, we have τ_j = λω_j and γ_j = 0 in Theorem 3.4, and the proposed update reduces to

\[
\beta^{(k+1)} = \frac{1}{\nu + 2\lambda\varepsilon}\, S\!\left((\nu I - X^T X)\beta^{(k)} + X^T y;\ \tau\right),
\]

where ν = 2ϖ^{-1}. Setting ν = 1 and ε = 0 yields the iterative procedure proposed in Daubechies et al. (2004), provided that X^T X is scaled such that I − X^T X is positive definite.
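A minimal Python rendering of this linear-regression special case (with adaptive elastic net thresholds τ_j = λω_j) is given below; the scaling of X, the choice of ν no smaller than the largest eigenvalue of XᵀX, and the function name are our own assumptions for the sketch.

```python
import numpy as np

def mist_linear_regression(X, y, lam, eps=0.0, weights=None, nu=None,
                           tol=1e-8, max_iter=10000):
    """MIST / iterative soft-thresholding for penalized linear regression:

        beta^(k+1) = S((nu*I - X'X) beta^(k) + X'y; tau) / (nu + 2*lam*eps),

    with tau_j = lam * w_j (adaptive weights w_j; all ones gives the lasso).
    Setting nu = 1 and eps = 0 corresponds to the Daubechies et al. (2004)
    iteration when X'X is scaled so that I - X'X is positive definite.
    """
    p = X.shape[1]
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    if nu is None:
        nu = np.linalg.norm(X, 2) ** 2      # choose nu at least as large as lambda_max(X'X)
    tau = lam * w
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.zeros(p)
    for _ in range(max_iter):
        z = (nu * np.eye(p) - XtX) @ beta + Xty
        beta_new = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0) / (nu + 2.0 * lam * eps)
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new
        beta = beta_new
    return beta
```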
