ebook img

A Selective Review of Group Selection in High-Dimensional Models PDF

0.48 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview A Selective Review of Group Selection in High-Dimensional Models

StatisticalScience 2012,Vol.27,No.4,481–499 DOI:10.1214/12-STS392 (cid:13)c InstituteofMathematicalStatistics,2012 A Selective Review of Group Selection in High-Dimensional Models Jian Huang, Patrick Breheny and Shuangge Ma 3 1 Abstract. Groupingstructuresarisenaturallyinmanystatisticalmod- 0 eling problems. Several methods have been proposed for variable se- 2 lection that respect grouping structure in variables. Examples include n the group LASSO and several concave group selection methods.In this a J article,wegiveaselective reviewofgroupselection concerningmethod- 4 ological developments, theoretical properties and computational algo- rithms. We pay particular attention to group selection methods involv- ] T ing concave penalties. We address both group selection and bi-level se- S lection methods. We describe several applications of these methods in . nonparametric additive models, semiparametric regression, seemingly h t unrelated regressions, genomic data analysis and genome wide associ- a ation studies. We also highlight some issues that require further study. m [ Key words and phrases: Bi-level selection, group LASSO, concave group selection, penalized regression, sparsity, oracle property. 2 v 1 9 1. INTRODUCTION vector of regression coefficients of the jth group and 4 ε is the error vector. Without loss of generality, we 6 Consider a linear regression model with p predic- take both the predictors and response to be cen- . tors. Suppose the predictors can be naturally di- 4 tered around the mean. It is desirable to treat each 0 vided into J nonoverlapping groups, and the model 2 is written as group of variables as a unit and take advantage of 1 thegroupingstructurepresentinthesemodelswhen J : v (1.1) y= X β +ε, estimating regression coefficients and selecting im- i j j portant variables. X j=1 X Manyauthorshaveconsideredtheproblemofgroup r whereyisann 1vectorofresponsevariables,X is a × j selection in various statistical modeling problems. the n d design matrix of the d predictors in the × j j Bakin(1999)proposedthegroupLASSOandacom- jth group, β =(β ,...,β )′ Rdj is the d 1 j j1 jdj ∈ j × putationalalgorithm.Thismethodandrelatedgroup selection methods and algorithms were further de- Jian Huang is Professor, Department of Statistics and veloped by Yuan andLin (2006). ThegroupLASSO Actuarial Science, 241 SH, University of Iowa, Iowa uses an ℓ norm of the coefficients associated with a 2 City, Iowa 52242, USA e-mail: [email protected]. group of variables in the penalty function and is a Patrick Breheny is Assistant Professor, Department of natural extension of the LASSO (Tibshirani, 1996). Statistics, University of Kentucky, Lexington, Kentucky Antoniadis and Fan (2001) studied a class of block- 40506, USA e-mail: [email protected]. Shuangge Ma is Associate Professor, Division of Biostatistics, wiseshrinkageapproachesforregularizedwaveletes- School of Public Health, Yale University, New Haven, timationinnonparametricregressionproblems.They Connecticut 06520, USA e-mail: [email protected]. discussed several ways to shrink wavelet coefficients in their natural blocks, which include the blockwise This is an electronic reprint of the original article hard- and soft-threshold rules. Meier, van de Geer published by the Institute of Mathematical Statistics in Statistical Science, 2012, Vol. 27, No. 4, 481–499. This and Bu¨hlmann (2008) studied the group LASSOfor reprint differs from the original in pagination and logistic regression. Zhao, Rocha and Yu (2009) pro- typographic detail. posedaquitegeneralcompositeabsolutepenaltyfor 1 2 J. HUANG,P. BREHENY ANDS. MA groupselection, whichincludesthegroupLASSOas We give a selective review of group selection con- a special case. Huang, Ma, Xie and Zhang (2009) cerning methodological developments, theoretical considered the problem of simultaneous group and properties and computational algorithms. We de- individual variable selection, or bi-level selection, scribeseveral important applications of group selec- and proposed a group bridge method. Breheny and tion andbi-level selection innonparametricadditive Huang (2009) proposed a general framework for bi- models, semiparametric regression, seemingly unre- level selection in generalized linear models and de- latedregressions,genomicdataanalysisandgenome rived a local coordinate descent algorithm. wide association studies. We also highlight some is- Grouping structures can arise for many reasons, sues that require further study. For the purposes andgive risetoquitedifferentmodelinggoals. Com- of simplicity, we focus on penalized versions of least mon examples include the representation of multi- squaresregressioninthisreview.Manyauthorshave level categorical covariates in a regression model by extended these models to other loss functions, in a group of indicator variables, and the representa- particular thoseof generalized linear models.We at- tion of the effect of a continuous variable by a set tempt to point out these efforts when relevant. of basis functions. Grouping can also be introduced into a model in the hopes of taking advantage of 2. GROUP SELECTION METHODS priorknowledgethatisscientificallymeaningful.For 2.1 Group LASSO example, in gene expression analysis, genes belong- ing to the same biological pathway can be consid- For a column vector v Rd with d 1 and a ∈ ≥ ered a group. In genetic association studies, genetic positive definite matrix R, denote v =(v′v)1/2 2 markers from the same gene can be considered a and v =(v′Rv)1/2. Let β=(β′,k...k,β′ )′, where k kR 1 J group.Itisdesirabletotake into account thegroup- β Rdj.ThegroupLASSOsolutionβˆ(λ)isdefined j ∈ ing structure in the analysis of such data. as a minimizer of Depending on the situation, the individual vari- J 2 J ables in the groups may or may not be meaningful 1 (2.1) y X β +λ c β , scientifically. If they are not, we are typically not 2n(cid:13) − j j(cid:13) jk jkRj interested in selecting individual variables; our in- (cid:13)(cid:13) Xj=1 (cid:13)(cid:13)2 Xj=1 terest is entirely in group selection. However, if in- where λ≥(cid:13)(cid:13)0 is the penal(cid:13)(cid:13)ty parameter and Rj’s are dividual variables are meaningful, then we are usu- dj dj positivedefinitematrices.Herethecj’sinthe × ally interested in selecting important variables as penaltyareusedtoadjustforthegroupsizes.Area- well as important groups; we refer to this as bi- sonable choice is cj = dj. Because (2.1) is convex, level selection. For example, if we represent a con- any local minimizer of (2.1) is also a global min- p tinuous factor by a set of basis functions, the indi- imizer and is characterized by the Karush–Kuhn– vidual variables are an artificial construct, and se- Tucker conditions as given in Yuan and Lin (2006). lecting the important members of the group is typ- It is possible, however, for multiple solutions to ex- ically not of interest. In the gene expression and ist, as (2.1) may not be strictly convex in situations genetic marker examples, however, selection of in- where the ordinary least squares estimator is not dividual genes/markers is just as important as se- uniquely defined. lecting important groups. In other examples, such An important question in the definition of group as a group of indicator functions for a categorical LASSOisthechoiceofRj.FororthonormalXj with variable, whether we are interested in selecting indi- Xj′Xj/n = Idj, j = 1,...,J, Yuan and Lin (2006) vidualmembersdependsonthecontextofthestudy. suggested taking Rj =Idj. However, using Rj =Idj We address both group selection and bi-level se- may not be appropriate, since the scales of the pre- lection in this review. Thedistinction between these dictors may not be the same. In general, a reason- two goals is crucial for several reasons. Not only are able choice of Rj is to take the Gram matrix based different statistical methods used for each type of on Xj, that is, Rj =Xj′Xj/n, so that the penalty is problem,butaswewillsee,thepredictorsinagroup proportional to kXjβjk2. This is equivalent to per- can be made orthonormal in settings where bi-level forming standardization at the group level, which selection is not a concern. This has a number of canbeseen asfollows. Write Rj =Uj′Uj fora dj×dj ramifications for deriving theoretical results and de- upper triangular matrix Uj via Cholesky decompo- veloping algorithms to fit these models. sition. Let X =X U−1 and b = U β . Criterion j j j j j j e 3 GROUPSELECTION (2.1) becomes Anaturalquestion aboutthegroupLASSOis un- der what conditions it will perform better than the J 2 J 1 standard LASSO. This question was addressed by (2.2) y X b +λ c b . j j j j 2 2n(cid:13) − (cid:13) k k Huang and Zhang (2010), who introduced the con- (cid:13)(cid:13) Xj=1 (cid:13)(cid:13)2 Xj=1 cept of strong group sparsity. They showed that the e Thesolutio(cid:13)ntotheorigina(cid:13)lproblem(2.1)canbeob- (cid:13) (cid:13) group LASSO is superior to the standard LASSO tained by using the transformation β =U−1b . By j j j under the strong group sparsity and certain other thedefinition of Uj, wehave n−1Xj′Xj =Idj.There- conditions, including a group sparse eigenvalue con- fore, by using this choice of Rj, without loss of gen- dition.Morerecently,Lounicietal.(2011)conducted erality, we can assume that Xj saetisefies n−1Xj′Xj = a detailed analysis of the group LASSO. They es- Idj,1≤j≤J. Note that we do not assume Xj and tablished oracle inequalities for the prediction and Xk, j6=k, are orthogonal. ℓ2 estimation errors of group LASSO under a re- The above choice of Rj is easily justified in the stricted eigenvalue condition on the design matrix. specialcasewheredj =1,1 j J.Inthiscase, the They also showed that the rate of convergence of ≤ ≤ groupLASSOsimplifiestothestandardLASSOand their upper bounds is optimal in a minimax sense, R = X 2/n is proportional to the sample vari- j j up to a logarithmic factor, for all estimators over a k k ance of the jth predictor. Thus, taking R to be j class of group sparse vectors. Furthermore, by de- the Gram matrix is the same as standardizing the riving lower bounds for the prediction and ℓ esti- 2 predictors before the analysis, which is often recom- mation errors of the standard LASSO they demon- mended when applying LASSO for variable selec- stratedthatthegroupLASSOcanhavesmallerpre- tion. diction and estimation errors than the LASSO. Several authors have studied thetheoretical prop- While the group LASSO enjoys excellent proper- erties of the group LASSO, building on the ideas ties in terms of prediction and ℓ estimation errors, 2 and approaches for studying the behavior of the its selection consistency hinges on the assumption LASSO, on which there is an extensive literature; that the design matrix satisfies the irrepresentable see Bu¨hlmann and van de Geer (2011) and the ref- condition. This condition is, in general, difficult to erences therein. Bach (2008) showed that the group satisfy, especially in p n models (Zhang, 2010a). LASSOisgroupselectionconsistentinarandomde- ≫ Fan and Li (2001) pointed out that the standard sign model for fixed p under a variant of the irrep- LASSOover-shrinks large coefficients duetothena- resentable condition (Meinshausen and Bu¨hlmann, ture of ℓ penalty. As a result, the LASSO tends 2006; Zhao and Yu, 2006; Zou,2006). Nardi and Ri- 1 to recruit unimportant variables into the model in naldo (2008) considered selection consistency of the order to compensate for its overshrinkage of large group LASSO under an irrepresentable condition coefficients, and consequently, it may not be able to andtheboundsontheprediction andestimation er- distinguish variables with small to moderate coeffi- rors under a restricted eigenvalue condition (Bickel, cients from unimportant ones. This can lead to rel- Ritov and Tsybokov, 2009; Koltchinskii, 2009), as- sumingthattheGrammatrices X′X /narepropor- atively high false positive selection rates. Leng, Lin j j andWahba(2006)showedthattheLASSOdoesnot tional to the identity matrix. Wei andHuang (2010) achieve selection consistency if the penalty param- consideredthesparsityandℓ boundsontheestima- 2 eter is selected by minimizing the prediction error. tion and prediction errors of the group LASSO un- der the sparse Riesz condition (Zhang and Huang, The group LASSO is likely to behave similarly. In 2008). They also studied the selection property of particular,thegroupLASSOmayalsotendtoselect the adaptive group LASSO using the group LASSO a model that is larger than the underlying model as the initial estimate. The adaptive group LASSO with relatively high false positive group selection can be formulated in a way similar to the stan- rate. Further work is needed to better understand dard adaptive LASSO (Zou, 2006). Recently, there the properties of the group LASSO in terms of false has been considerable progress in the studies of the positive and false negative selection rates. LASSO based on sharper versions of the restricted 2.2 Concave 2-Norm Group Selection eigenvalue condition (van de Geer and Bu¨hlmann, 2009; Zhang, 2009; Ye and Zhang, 2010). It would ThegroupLASSOcanbeconstructedbyapplying be interesting to extend these results to the group the ℓ penalty to the norms of the groups. Specif- 1 LASSO. ically, for ρ(t;λ)=λ t , the group LASSO penalty | | 4 J. HUANG,P. BREHENY ANDS. MA canbewrittenasλc β =ρ( β ;c λ).Other lent to the ℓ penalty when γ = , the ℓ penalty jk jkRj k jkRj j 1 ∞ 1 penaltyfunctionscouldbeusedinstead.Thusamore also satisfies (2.4). However, many other penalties, generalclassofgroupselectionmethodscanbebased including the SCAD and ℓ penalties with q=1, do q 6 on the criterion not satisfy (2.4). An interesting question that has not received ad- J 2 J 1 equate attention is how to determine the value of γ. (2.3) y X β + ρ( β ;c λ,γ), 2n(cid:13) − j j(cid:13) k jkRj j In linear regression models with standardized pre- (cid:13)(cid:13) Xj=1 (cid:13)(cid:13)2 Xj=1 dictors, Fan and Li (2001) suggested using γ 3.7 where ρ(t;(cid:13)(cid:13)cjλ,γ) is conc(cid:13)(cid:13)ave in t. Here γ is an addi- in the SCAD penalty, and Zhang (2010a) sugg≈ested tional tuning parameter that may be used to mod- usingγ 2.7 intheMCP.Note, however, thatwhen ≈ ify ρ. As in the definition of the group LASSO, we γ , the group MCP converges to the group → ∞ assumewithoutloss of generality thateach Xj is or- LASSO, and when γ 1, it converges to the group thonormal with X′X /n=I and β = β . hard threshold penalt→y (Antoniadis, 1996) j j dj k jkRj k jk2 Itisreasonabletousepenalty functionsthatwork ρ(t;λ)=λ2 1(t λ)21 . well for individual variable selection. Some possi- − 2 | |− {|t|≤λ} Clearly, the choice of γ has a big impact on the es- ble choices include: (a) the bridge penalty with ρ(x; λ,γ)=λ x γ,0<γ 1 (Frank andFriedman, 1993); timate. See Mazumder, Friedman and Hastie (2011) | | ≤ and Breheny and Huang (2011) for further discus- |x| (b)theSCADpenaltywithρ(x;λ,γ)=λ min 1, 0 { sion on the choice of γ. (γ t/λ) /(γ 1) dt, γ>2 (Fan and Li, 2001; Fan and−Peng+,2004−),w}hereforanya R,a Rdenotesits To illustrate this point in the grouped variable ∈ + setting, we consider a simple example with J =20 positivepart,thatis, a =a1 ;(c)theminimax + {a≥0} groups, in which only the firsttwo groups have non- concave penalty (MCP) with ρ(x;λ,γ)=λ |x|(1 zero coefficients with β =( √2,√2)′,β =(0.5,1, 0 − 1 − 2 t/(γλ)) dt,γ >1 (Zhang, 2010a). All these penal- 0.5)′, so β = 2 and β 1.22. The sizes + R − k 1k2 k 2k2 ≈ tieshavetheoraclepropertyforindividualvariables, of the groups with zero coefficients are 3. The top meaning that the corresponding penalized estima- panel in Figure 1 shows the paths of the estimated tors are equal to the least squares estimator assum- norms βˆ and βˆ for γ=1.2,2.5 and , where k 1k k 2k ∞ ing the model is known with high probability under γ = corresponds to the group LASSO. The bot- ∞ appropriate conditions. See Huang, Horowitz and tompanelshowsthesolution pathsoftheindividual Ma (2008) for the bridge penalty, Fan and Li (2001) coefficients. It can be seen that the characteristics and Fan and Peng (2004) for the SCAD penalty of the solution paths are quite different for different and Zhang (2010) for the MC penalty. By applying valuesof γ.For the2-normgroupMCPwithγ=1.2 these penalties to (2.3), we obtain the2-norm group or 2.5, there is a region in the paths where the esti- bridge,2-normgroupSCADand2-normgroupMCP, mates are close to the true parameter values. How- respectively. Another interesting concave penalty is ever, for the group LASSO (γ = ), the estimates ∞ the capped-ℓ penalty ρ(t;λ,γ) = min(γλ2/2,λ t ) are always biased toward zero except when λ=0. 1 | | withγ>1(Zhang,2010b;Shen,ZhuandPan,2011). 2.3 Orthogonal Groups However, this penalty has not been applied to the To have some understanding of the basic charac- group selection problems. teristics of the group LASSO and nonconvex group For c = d ,thegroupMCPandcapped-ℓ pen- j j 1 selectionmethods,weconsiderthespecialcasewhere alty satisfy the invariance property p the groups are orthonormal with Xj′Xk = 0,j 6= k (2.4) ρ(kβjk2; djλ,γ)=ρ( djkβjk2;λ,djγ). and Xj′Xj/n=Idj. In this case, the problem sim- plifies to that of estimation in J single-group mod- Thustherescalinpgofλcanalsopbeinterpretedbased els of the form y=X θ+ε. Let z=X′y/n be the on the expression on the right-hand side of (2.4). j j least squares estimator of θ. Without loss of gen- Themultiplier d of β standardizes thegroup size. This ensuresjthatk sjmka2ller groups will not be erality, let cj = 1 below in this section. We have overwhelmed byplarger groups. Themultiplier dj for Xn−′1Xky/n−=XIjθk.22T=hukszt−heθpk22en+alniz−e1dkyleka22st−sqkuzka22ressincrcie- γ makestheamountofregularizationpergrouppro- j j dj terion is 2−1 z θ 2+ρ( θ ;λ,γ). Denote portional to its size. Thus the interpretation of γ k − k2 k k2 remains the same as that in the case where group t (2.5) S(z;t)= 1 z. sizes are equal to one. Because the MCP is equiva- − z (cid:18) k k2(cid:19)+ 5 GROUPSELECTION Fig. 1. Thesolution paths ofthe 2-norm group MCPfor γ=1.2,2.7 and ∞,where γ=∞ corresponds to thegroup LASSO. The top panel shows the paths of the ℓ norms of β ; the bottom shows the paths of the individual coefficients. The solid lines 2 j and dashed linesin the plots indicate the paths of the coefficients inthe nonzero groups 1 and 2, respectively. The dotted lines represent the zero groups. This expression is used in Yuan and Lin (2006) for 2-norm group SCAD: for γ>2, • computing the group LASSO solutions via a group θ (z;λ,γ) gSCAD coordinate descent algorithm. It is a multivariate (2.8) version of the soft-threshold operator (Donoho and b S(z;λ), if z 2 2λ, Johnstone, 1994) in which the soft-thresholding is = γ−1S(z; γλ ), if 2kλk<≤z γλ, applied to the length of the vector, while leaving γ−2 γ−1 k k2≤ its direction unchanged. By taking ρ to be the ℓ1, z, if kzk2>γλ. MCP and SCAD penalties, it can be verified that ThegroupLASSOsolutionhereissimplythemul- the group LASSO, group MCP and group SCAD tivariate soft-threshold operator. For the 2-norm solutions in a single group model have the following group MCP solution, in the region z 2 >γλ, it is k k equal to the unbiased estimator z, and in the re- expressions: maining region, it is a scaled-up soft threshold op- Group LASSO: erator. The 2-norm group SCAD is similar to the 2- • norm group MCP in that it is equal to the unbiased (2.6) θ (z;λ)=S(z,λ). gLASSO estimator z in the region z >γλ. In the region 2 k k 2-norm group MCP: for γ>1, z 2 γλ,the2-normgroupSCADisalsorelatedto • b kthekso≤ft threshold operator, but takes a more com- θgMCP(z;λ,γ) plicated form than the 2-norm group MCP. (2.7) For the 2-norm group MCP, θˆ (;λ,γ) b = γ−γ1S(z,λ), if kzk2≤γλ, θˆgLASSO(;λ) as γ and θˆgMCP(;gλM,CγP) · H(;→λ) (cid:26)z, if kzk2>γλ. as γ →1·for any →giv∞en λ>0, wher·e H(·;→λ) is·the 6 J. HUANG,P. BREHENY ANDS. MA hard-threshold operator defined as algorithms are particularly suitable for fitting group LASSO,groupSCADandgroupMCPmodels,since 0, if z λ, (2.9) H(z;λ) k k2≤ all three have simple closed-form expressions for a ≡(cid:26)z, if kzk2>λ. single-group model (2.6)–(2.8). Therefore,for agiven λ>0, θˆ (;λ,γ):1<γ A group coordinate descent step consists of par- gMCP { · ≤ is a family of threshold operators with the mul- tially optimizing the penalized least squares crite- ∞} tivariate hard and soft threshold operators at the rion (2.1) or (2.3) with respect to the coefficients in extremes γ=1 and . group j. Define For the2-normgro∞upSCAD,wehave θˆgSCAD(;λ, 2 · 1 Hγ)∗→(;λθˆg)LaAsSSγO(·;λ2,)wahseγre→∞ and θˆgSCAD(·;λ,γ)→ Lj(βj;λ,γ)= 2n(cid:13)y− Xkβ˜k−Xjβj(cid:13) · → (cid:13) Xk6=j (cid:13)2 (cid:13) (cid:13) S(z;λ), if z 2λ, +ρ(cid:13)( β ;c λ,γ), (cid:13) (2.10) H∗(z;λ) k k2≤ (cid:13)k jk2 j (cid:13) ≡(cid:26)z, if kzk2>2λ. where β˜ denotes the most recently updated value This is different from the hard threshold operator ofβ.Denotey˜ = X β˜ andz˜ =X′(y y˜ )/n. (2.9). For a given λ > 0, {θˆgSCAD(·;λ,γ):2 < γ ≤ Note that y˜j rjepresekn6=tjs thkekfitted vjaluesjexc−ludjing is a family of threshold operators with H∗ and thecontribution frPomgroupj,andz˜ representsthe ∞} j soft threshold operators at the extremes γ =2 and corresponding partial residuals. Just as in ordinary . Note that the hard threshold operator is not least squares regression, thevalue β that optimizes ∞ j included in the group SCAD family. L (β ;λ,γ) is equal to the value we obtain from re- j j Theclosed-formexpressionsgivenaboveillustrate gressing β on the partial residuals. In other words, j some important differences of the three group selec- the minimizer of L (β ;λ,γ) is given by F(z˜ ;λ,γ), j j j tion methods. They also provide building blocks of where F is one of the solutions in (2.6) to (2.8), the group coordinate descent algorithm for comput- depending on the penalty used. ing these solutions described below. Let β˜(0) = (β˜(0)′,...,β˜(0)′)′ be the initial value, 1 J 2.4 Computation via Group Coordinate Descent and let s denote the iteration. The GCD algorithm consists of the following steps: Group coordinate descent (GCD) is an efficient approach for fitting models with grouped penalties. Step 1. Set s=0. Initialize vector of residuals r= The first algorithm of this kind was proposed by y y˜, where y˜= J X β˜(0). Yuan and Lin (2006) as a way to compute the so- − j=1 j j Step 2. For j =1,...,J, carry out the following lutions to the group LASSO. Because the solution P calculations: paths of the group LASSO are not piecewise linear, (a) calculate ˜z =n−1X′r+β˜(s); j j j theycannotbecomputedusingtheLARSalgorithm (Efron et al., 2004). (b) update β˜(js+1)=F(z˜j;λ,γ), Coordinate descent algorithms (Fu, 1998; Fried- (c) update r r X (β˜(s+1) β˜(s)). man et al., 2007; Wu and Lange, 2008) have be- Step 3. Upda←te s− sj+1j. − j come widely used in the field of penalized regres- Step 4. Repeat ste←ps 2–3 until convergence. sion. These algorithms were originally proposed for The update in Step 2(c) ensures that r always optimization inproblemswithconvexpenaltiessuch holdsthecurrentvaluesoftheresiduals,andisthere- as the LASSO, but have also been used in calculat- fore ready for Step 2(a) of the next cycle. By tak- ingSCADandMCPestimates(BrehenyandHuang, 2011).Wediscussheretheideabehindthealgorithm ing F(;λ,γ) to be θgLASSO(;λ), θgMCP(;λ,γ) and · · · and its extension to the grouped variable case. θ (;λ,γ) in (2.6) to (2.8), we obtain the solu- gSCAD · Coordinate descent algorithms optimize an objec- tions to the group bLASSO, groupbMCP and group tive function with respect to a single parameter at SbCAD, respectively. The algorithm has two attrac- a time, iteratively cycling through the parameters tive features. First, each step is very fast, as it in- until convergence is reached; similarly, group coor- volves only relatively simple calculations. Second, dinate descent algorithms optimize the target func- the algorithm is stable, as each step is guaranteed tion with respect to a single group at a time, and to decrease the objective function (or leave it un- cycles through the groups until convergence. These changed). 7 GROUPSELECTION Theabovealgorithmcomputesβˆ foragiven(λ,γ) ture. However, it is not necessarily the case that pair; to obtain pathwise solutions, we can use the all variants within that gene are associated with algorithm repeatedly over a grid of (λ,γ) values. the disease. In such a study, the goal is to iden- For a given value of γ, we can start at λ = tify important individual variants, but to increase max max n−1X y /c , for which βˆ has the solution the power of the search by incorporating grouping j j 2 j {k k } 0,andproceedalong thegridusingthevalueof βˆ at information. the previous point in the λ-grid as the initial value In this section, we discuss bi-level selection meth- for the current point in the algorithm. An alterna- ods,whicharecapableofselectingimportantgroups tive approach is to use the group LASSO solution aswellasimportantindividualvariableswithinthose (corresponding to γ= ) as the initial value as we groups.Theunderlyingassumptionisthatthemodel ∞ decreaseγ foreachvalueofλ.SeeMazumder,Fried- is sparse at both the group and individual variable man and Hastie (2011) for a detailed description of levels. Thatis, the nonzero group coefficients β are j the latter approach in the nongrouped case. also sparse. It should be noted, however, that less The results of Tseng (2001) establish that the work has been done on bi-level selection than on algorithm converges to a minimum. For the group group LASSO, and there are still many unanswered LASSO, which has a convex objective function, the questions. algorithm therefore converges to the global mini- 3.1 Concave 1-Norm Group Penalties mum. For group SCAD and group MCP, conver- gence to a local minimum is possible. See also The- As one might suspect, based on analogy with orem 4 of Mazumder, Friedman and Hastie (2011) LASSO and ridge regression, it is possible to con- for the nongrouped case. structpenaltiesforbi-levelselectionbystartingwith The availability of the explicit expression in step the ℓ norm instead of the ℓ norm. This substitu- 1 2 2(b) of the algorithm dependson the choice of Rj = tion is not trivial, however: a na¨ıve application of Xj′Xj/n in (2.1) or (2.3). If a different norm is used, the LASSO penalty to the ℓ1 norm of a group re- then the groups are not orthonormal, and there are sults in the original LASSO,which obviously has no noexplicitsolutionstotheproblem.Withoutclosed- grouping properties. form solutions, step 2(b) must be solved using nu- Applying a concave penalty to the ℓ norm of 1 merical optimization. Algorithms proposed for com- a group, however, does produce an estimator with puting the group LASSO solutions without using grouping properties, as suggested by Huang et al. R =X′X /nincludeFriedmanetal.(2007),Jacob, j j j (2009), whoproposedthegroup bridgepenalty. The Obozinski and Vert (2009) and Liu and Ye (2010). 1-norm group bridge applies a bridge penalty to the For generalized linear models, the group coordinate ℓ norm of a group, resulting in the criterion 1 descent can be applied based on quadratic approxi- mations to the log-likelihood in the objective func- 1 J 2 J (3.1) y X β +λ c β γ, tion (Meier, van de Geer and Bu¨hlmann (2008)). 2n(cid:13) − j j(cid:13) jk jk1 (cid:13) Xj=1 (cid:13)2 Xj=1 (cid:13) (cid:13) 3. BI-LEVEL SELECTION where λ >(cid:13)0 is the regu(cid:13)larization parameter, γ (cid:13) (cid:13) ∈ The methods described in Section 2 produce es- (0,1) is the bridge index and cj are constants { } timates that are sparse at the group level and not that adjust for the dimension of group j. For mod- at the level of individual variables. Within a group, els with standardized variables, a reasonable choice there are only two possibilities for the selection re- is cj = dj γ. When dj =1,1 j J, (3.1) simplifies | | ≤ ≤ sults based on these methods: either all of the vari- to the standard bridge criterion. The method pro- ables are selected, or none of them are. This is not posed by Zhou and Zhu (2010) can be considered a always appropriate for the data. special case of group bridge with γ=0.5. A general For example, consider a genetic association study composite absolute penalty based on ℓq norms was in which the predictors are indicators for the pres- proposed by Zhao, Rocha and Yu (2009). ence of genetic variation at different markers. If a Huang et al. (2009) showed that the global group genetic variant located in a gene is associated with bridge solution is group selection consistent under thedisease, then itis morelikely thatother variants certainregularityconditions.Theirresultsallowp → located inthesamegenewillalsobeassociated with as n but require p<n. In contrast to the ∞ →∞ the disease—the predictors have a grouping struc- group LASSO, the selection consistency of group 8 J. HUANG,P. BREHENY ANDS. MA bridge does not require an irrepresentable-type con- strong signals, or may be excluded if the rest of its dition.However,noresultsareavailableforthegroup group displays little association with the outcome. bridge in the J n settings. An interesting special case of the composite pen- ≫ In principle, we could apply other concave penal- alty is using the MCP as both the outer and in- ties to the group ℓ norm as well, leading to the ner penalties, which we refer to as the composite 1 more general penalized criterion MCP (this penalty was referred to as “group MCP” in Breheny and Huang (2009); we use “composite J 2 J 1 MCP”bothtobetterreflecttheframeworkandavoid (3.2) y X β + ρ( β ;c λ,γ). 2n(cid:13) − j j(cid:13) k jk1 j confusionwiththe2-normgroupMCPofSection2.2). (cid:13)(cid:13) Xj=1 (cid:13)(cid:13)2 Xj=1 The composite MCP uses the criterion Choosing ρ(cid:13) to be the S(cid:13)CAD or MCP penalty in (cid:13) (cid:13) (3.2) would seem particularly promising, but to our 1 J 2 y X β knowledge, these estimators have not been studied. 2n(cid:13) − j j(cid:13) 3.2 Composite Penalties (3.5) (cid:13)(cid:13) Xj=1 (cid:13)(cid:13)2 (cid:13) (cid:13) (cid:13) J dj(cid:13) An alternative way of thinking about concave + ρ ρ (β ) , 1-norm group penalties is that they represent the λ,γO λ,γI | jk| ! j=1 k=1 composition of two penalties: a concave group-level X X penalty and an individual variable-level 1-norm where ρ is the MCP penalty and γO, the tuning pa- penalty.Itisnatural,then,toalsoconsiderthecom- rameteroftheouterpenalty,ischosentobedjγIλ/2 position of concave group-level penalties with other in order to ensure that the group level penalty at- individual variable-level penalties. This framework tains its maximum if and only if each of its com- was proposed in Breheny and Huang (2009), who ponents are at their maximum. In other words, the describedgroupedpenaltiesasconsistingofanouter derivative of the outer penalty reaches 0 if and only penalty ρO applied to a sum of inner penalties ρI. if |βjk|≥γIλ ∀k∈{1,...,dj}. Thepenaltyappliedtoagroupofpredictorsisthere- Figure 2 shows the group LASSO, 2-norm group fore written as MCP, 1-norm group Bridge and composite MCP penaltiesforatwo-predictor group.Notethatwhere dj the penalty comes to a point or edge, there is the (3.3) ρ ρ (β ) , O I jk | | ! possibility that the solution will take on a sparse k=1 X value; all penalties come to a point at 0, encour- where β is the kth member of the jth group, and jk aging group-level sparsity, but only group bridge the partial derivative with respect to the jkth co- and composite MCP allow for bi-level selection. In variate is addition, one can see that the MCP penalties are dj capped, while the group LASSO and group bridge (3.4) ρ′O ρI(|βjk|) ρ′I(|βjk|). penalties are not. Furthermore, note that the indi- ! k=1 vidualvariable-level penaltyforthecompositeMCP X Note that the group bridge fits into this framework iscappedatalevelbelowthatofthegroup;thislim- with an outer bridge penalty and an inner LASSO its the extent to which one variable can dominate penalty, as does the group LASSO with an outer the penalty of the entire group. The 2-norm group bridge penalty and an inner ridge penalty. MCP does not have this property. This illustrates From (3.3), we can view group penalization as ap- the two rationales of composite MCP: (1) to avoid plyingarate ofpenalization to apredictorthat con- overshrinkage by allowing covariates to grow large, sists of two terms: the first carries information re- and (2) to allow groups to remain sparse internally. garding the group; the second carries information The 1-norm group bridge allows the presence of a about the individual predictor. Whether or not a single large predictor to continually lower the entry variableentersthemodelisaffectedbothbyitsindi- threshold of the other variables in its group. This vidualsignalandbythecollectivesignalofthegroup property, whereby a single strong predictor draws that it belongs to. Thus, a variable with a moderate others into the model, prevents the group bridge individual signal may be included in a model if it from achieving consistency for the selection of in- belongs to a group containing other members with dividual variables. 9 GROUPSELECTION Fig. 2. The group LASSO, group bridge and composite mcp penalties fora two-predictor group. Note that where the penalty comes to a point or edge, there is the possibility that the solution will take on a sparse value; all penalties come to a point at 0, encouraging group-level sparsity, but only group bridge and composite MCP allow for bi-level selection. Figure 3 shows the coefficient paths from λ oracle model. (4) In general, the coefficient paths max down to 0 for group LASSO, 1-norm group bridge, for these group penalization methods are continu- andcomposite MCPfor asimulated dataset featur- ous, but are not piecewise linear, unlike those for ing two groups, each with three covariates. In the the LASSO. underlying model, the group represented by solid Although composite penalties do not, in general, lines has two covariates with coefficients equal to 1 have closed-form solutions in single-group models and the other equal to 0; the group represented by like the penalties in Section 2, the idea of group co- dashed lines has two coefficients equal to 0 and the ordinatedescent can stillbeused.Themaincompli- otherequalto 1.Thefigurerevealsmuchaboutthe cation is in step 2(b) for the algorithm described in − behaviorofgroupedpenalties.Inparticular,wenote Section2.4,wherethesingle-groupsolutionsneedto the following: (1) Even though each of the nonzero besolvednumerically.Anotherapproachisbasedon coefficientsisofthesamemagnitude,thecoefficients a local coordinate descent algorithm (Breheny and fromthemoresignificantsolidgroupenterthemodel Huang, 2009). This algorithm first uses a local lin- more easily than the lone nonzero coefficient from ear approximation to the penalty function (Zou and the dashed group. (2) This phenomenon is less pro- Li, 2008). After applyingthis approximation, in any nounced for composite MCP, which makes weaker given coordinate direction the optimization prob- assumptionsaboutgrouping.(3)ForcompositeMCP lem is equivalent to the one-dimensional LASSO, at λ 0.3, all of the variables with true zero coef- which has the soft-threshold operator as its solu- ≈ ficients have been eliminated while the remaining tion. The thresholding parameter λ in each update coefficients are unpenalized.Inthis region, thecom- is given by expression (3.4). Because the penalties posite MCP approach is performing as well as the involved are concave on [0, ), the linear approxi- ∞ Fig. 3. Coefficient paths from 0 to λ for group LASSO, 2-norm group MCP, 1-norm group bridge, and composite MCP max for a simulated data set featuring two groups, each with three covariates. In the underlying data-generating mechanism, the group represented by solid lines has two covariates with coefficients equal to 1 and the other equal to 0; the group represented by dashed lines has two coefficients equal to 0 and the other equal to −1. 10 J. HUANG,P. BREHENY ANDS. MA mation is a majorizing function, and the algorithm Table 1 Application of the three group penalization methods and thus enjoys the descent property of MM algorithms a one-at-a-time method to a genetic association data set. (Lange, Hunter and Yang (2000)) whereby the ob- CV error is the average number of misclassification jective function is guaranteed to decrease at every errors over the ten validation folds iteration. Further details may be found in Breheny Genes Markers Cross-validation and Huang (2009). These algorithms have been im- selected selected error plemented in the R package grpreg, available at http://cran.r-project.org. The package com- One-at-a-time 19 49 0.441 putes the group LASSO, group bridge and compos- Group LASSO 17 435 0.390 ite MCP solutions for linear regression and logistic Group bridge 3 20 0.400 regression models. Composite MCP 8 11 0.391 3.3 Additive Penalties ers that previous biological studies have suggested Another approach to achieving bi-level selection may be related to the disease. is to add an ℓ penalty to the group LASSO (Wu 1 We analyze the data with the group LASSO, and Lange, 2008; Friedman, Hastie and Tibshirani, 1-norm group bridge and composite MCP methods 2010). by considering markers to be grouped by the gene J 2 J they belong to. Penalized logistic regression models 1 (3.6) y X β +λ β +λ β , were fit assuming an additive effect for all markers 2n(cid:13)(cid:13) −Xj=1 j j(cid:13)(cid:13)2 1k k1 2Xj=1k jk2 (homozygous dominant = 2, heterozygous = 1, ho- (cid:13) (cid:13) mozygous recessive = 0). In addition to the group where λ (cid:13) 0 and λ 0(cid:13)are regularization parame- 1≥(cid:13) 2≥ (cid:13) penalization methods, we analyzed these data using ters. Theabove objective function has thebenefitof a traditional one-at-a-time approach (single-marker being convex, eliminating the possibility of conver- analysis),inwhichunivariatelogisticregressionmod- gence to a local minimum during model fitting. The els were fit and marker effects screened using a p< group coordinate descent algorithm can no longer 0.05 cutoff. Ten-fold cross-validation was usedto se- be applied, however, as the orthonormalization pro- lect λ, and to assess accuracy (for the one-at-a-time cedure described in Section 2 will not preserve the approach, predictions were made from an unpenal- sparsityachieved bytheℓ penaltyoncethesolution 1 ized logistic regression modelfittothetrainingdata is transformed back to the original variables. Puig, usingall themarkersselected byindividualtesting). Wiesel and Hero (2011), Friedman, Hastie and Tib- The results are presented in Table 1. shirani (2010) and Zhou et al. (2010) have proposed Table 1 suggests the benefits of using group pe- algorithms for solving this problem without requir- nalization regression approaches as opposed to one- ing orthonormalization. at-a-time approaches: the three group penalization Inprinciple,thegroupLASSOportionof thepen- methods achieve lower test error rates and do so alty could be replaced with any of the convex 2- while selecting fewer genes (groups). Although the norm group penalties of Section 2.2; likewise the ℓ 1 error rates of 40% indicate that these 30 genes penalty could be replaced by, say, MCP or SCAD. ≈ likely do not include SNPs that exert a large ef- These possibilities, to the best of our knowledge, fect on an individual’s chances of developing age- have not been explored. Further work is needed to related macular degeneration, the fact that they are study the properties of this class of estimators and well below the 50% that would be expected by ran- compare their performance with other methods. dom chance demonstrates that these genes do con- tain SNPs related to the disease. The very differ- 3.4 Example: Genetic Association ent nature of the selection properties of the three We now give an example from a genetic associa- grouppenalizationmethodsarealsoclearlyseen.Al- tion study where bi-level selection is an important though group LASSO achieves low misclassification goal of the study. The example involves data from error, it selects 17 genes out of 30 and 435 markers acase-control studyof age-related macular degener- out of 532, failing to shed light on the most impor- ation consisting of 400 cases and 400 controls, and tantgenetic markers.Thebi-level selection methods was analyzed in Breheny and Huang (2009). The achieve comparable error rates with a much more analysisisconfinedto30genescontaining532mark- sparse set of predictors: group bridge identifies 3

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.