Statistical Science 2012, Vol. 27, No. 4, 447–449
DOI: 10.1214/12-STS409
© Institute of Mathematical Statistics, 2012

Introduction to the Special Issue on Sparsity and Regularization Methods

Jon Wellner and Tong Zhang

Jon Wellner is Professor of Statistics and Biostatistics, Department of Statistics, University of Washington, Seattle, Washington 98112-2801, USA (e-mail: [email protected]). Tong Zhang is Professor of Statistics, Department of Statistics, Rutgers University, New Jersey, USA (e-mail: [email protected]).

1. INTRODUCTION

Traditional statistical inference considers relatively small data sets, and the corresponding theoretical analysis focuses on the asymptotic behavior of a statistical estimator as the number of samples approaches infinity. However, many data sets encountered in modern applications have dimensionality significantly larger than the number of training data available, and for such problems the classical statistical tools become inadequate. In order to analyze high-dimensional data, new statistical methodology and the corresponding theory have to be developed.

In the past decade, sparse modeling and the corresponding use of sparse regularization methods have emerged as a major technique for handling high-dimensional data. While the data dimensionality is high, the basic assumption in this approach is that the actual estimator is sparse, in the sense that only a small number of its components are nonzero. On the practical side, the sparsity phenomenon has been observed ubiquitously in applications, including signal recovery, genomics and computer vision. On the theoretical side, this assumption makes it possible to overcome the problem of estimating more parameters than the number of observations, which is impossible to deal with in the classical setting.

There are a number of challenges, including developing new theories for high-dimensional statistical estimation as well as new formulations and computational procedures. Related problems have received a lot of attention in various research fields, including applied mathematics, signal processing, machine learning, statistics and optimization, and rapid advances have been made in recent years. In view of the growing research activity and its practical importance, we have organized this special issue of Statistical Science with the goal of providing overviews of several topics in modern sparsity analysis and the associated regularization methods. Our hope is that general readers will get a broad idea of the field as well as of current research directions.

2. SPARSE MODELING AND REGULARIZATION

One of the central problems in statistics is linear regression, where we consider an $n \times p$ design matrix $X$ and an $n$-dimensional response vector $Y \in \mathbb{R}^n$ so that

(1) $Y = X\bar\beta + \varepsilon$,

where $\bar\beta \in \mathbb{R}^p$ is the true regression coefficient vector and $\varepsilon \in \mathbb{R}^n$ is a noise vector. In the case $n < p$, this problem is ill-posed because the number of parameters exceeds the number of observations. The ill-posedness can be resolved by imposing a sparsity constraint: that is, by assuming that $\|\bar\beta\|_0 \le s$ for some $s$, where the $\ell_0$-norm of $\bar\beta$ is defined as $\|\bar\beta\|_0 = |\operatorname{supp}(\bar\beta)|$ and the support set of $\bar\beta$ is defined as $\operatorname{supp}(\bar\beta) := \{j : \bar\beta_j \ne 0\}$. If $s \ll n$, then the effective number of parameters in (1) is smaller than the number of observations.

The sparsity assumption may be viewed as the classical model selection problem, where models are indexed by the set of nonzero coefficients. Classical model selection criteria such as AIC, BIC or $C_p$ [1, 7, 11] naturally lead to the so-called $\ell_0$ regularization estimator:

(2) $\hat\beta(\ell_0) = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \bigl[ \tfrac{1}{n}\|X\beta - Y\|_2^2 + \lambda \|\beta\|_0 \bigr]$.

The main difference between modern $\ell_0$ analysis in high-dimensional statistics and the classical model selection methods is that the choice of $\lambda$ differs: the modern analysis requires a larger $\lambda$ than the classical model selection setting, because it is necessary to compensate for the effect of considering many models in the high-dimensional setting. The analysis of $\ell_0$ regularization in the high-dimensional setting (e.g., [15] in this issue) employs different techniques, and the results obtained also differ from those in the classical literature.
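To make (2) concrete, here is a minimal numpy sketch (our illustration, not from the article; all function and variable names are ours) that computes the $\ell_0$-penalized estimator by brute-force enumeration of all supports. This is feasible only for very small $p$, which is precisely the computational difficulty discussed next.

```python
import itertools
import numpy as np

def l0_regression(X, Y, lam):
    """Brute-force search for the l0-penalized least squares estimator (2):
    argmin_beta (1/n) * ||X beta - Y||_2^2 + lam * ||beta||_0.
    Enumerates all 2^p supports, so this is a toy illustration only."""
    n, p = X.shape
    best_obj, best_beta = np.inf, np.zeros(p)
    for k in range(p + 1):
        for support in itertools.combinations(range(p), k):
            beta = np.zeros(p)
            if k > 0:
                # Ordinary least squares restricted to the candidate support.
                cols = list(support)
                beta[cols] = np.linalg.lstsq(X[:, cols], Y, rcond=None)[0]
            obj = np.sum((X @ beta - Y) ** 2) / n + lam * k
            if obj < best_obj:
                best_obj, best_beta = obj, beta
    return best_beta

# Toy data: p = 8 features, only the first 2 truly active.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
beta_bar = np.zeros(p)
beta_bar[:2] = [3.0, -2.0]
Y = X @ beta_bar + 0.5 * rng.standard_normal(n)
print(np.round(l0_regression(X, Y, lam=0.5), 2))  # should recover roughly [3, -2, 0, ...]
```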
The $\ell_0$ regularization formulation leads to a nonconvex optimization problem that is difficult to solve computationally. On the other hand, an important requirement for modern high-dimensional problems is to design computationally efficient and statistically effective algorithms. Therefore, the main focus of the existing literature is on convex relaxation methods that use $\ell_1$ regularization (the Lasso) in place of sparsity constraints:

(3) $\hat\beta(\ell_1) = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \bigl[ \tfrac{1}{n}\|X\beta - Y\|_2^2 + \lambda \|\beta\|_1 \bigr]$.

This method is referred to as the Lasso [12] in the literature, and its theoretical properties have been studied intensively. Since the formulation is regarded as an approximation of (2), a key question is how good this approximation is, and how good the estimator $\hat\beta(\ell_1)$ is for estimating $\bar\beta$.
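As an illustration of how (3) can actually be solved, here is a hedged sketch of one standard algorithm, proximal gradient descent (ISTA) with soft-thresholding. The article does not prescribe any particular algorithm, and practical implementations typically use coordinate descent or packaged solvers; the function names and step-size choice below are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox operator of t * ||.||_1: shrink each coordinate toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Proximal gradient (ISTA) for the Lasso objective (3):
    (1/n) * ||X beta - Y||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # of the smooth part: (2/n) times the squared spectral norm of X.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ beta - Y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

# Usage (with X, Y as in the previous sketch):
#   beta_hat = lasso_ista(X, Y, lam=0.1)
```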
Many extensions of the Lasso have appeared in the literature for more complex problems. One example is the group Lasso [14], which assumes that variables are selected in groups. Another extension is the estimation of graphical models, where one can employ the Lasso to estimate unknown graphical model structures [3, 8]. A third example is matrix regularization, where the concept of sparsity is replaced by the concept of low-rankness, and sparsity constraints become low-rank constraints. Of special interest is the so-called matrix completion problem, where we want to recover a matrix from a few observations of the matrix entries. This problem is encountered in recommender system applications (e.g., a person who buys a book at amazon.com will be recommended other books purchased by users with similar interests), and low-rank matrix factorization is one of the main techniques for this problem. Similarly to sparsity regularization, using low-rank regularization leads to nonconvex formulations, and thus it is natural to consider its convex relaxation, which is referred to as trace-norm (or nuclear-norm) regularization. The theoretical properties of, and numerical algorithms for, trace-norm regularization methods have received considerable attention.
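To make the trace-norm relaxation concrete, the following sketch (our illustration; the article does not specify an algorithm, and we assume a soft-impute-style iteration) shows the key computational primitive, singular value soft-thresholding (the proximal operator of the nuclear norm), inside a simple matrix completion loop.

```python
import numpy as np

def svt(Z, tau):
    """Proximal operator of tau * ||.||_* (trace/nuclear norm):
    soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_matrix(M, observed, tau, n_iter=200):
    """Soft-impute-style iteration: fill the missing entries with the
    current estimate, then apply the trace-norm prox (SVT).
    `observed` is a boolean mask of the entries we actually see."""
    Z = np.where(observed, M, 0.0)
    for _ in range(n_iter):
        Z = svt(np.where(observed, M, Z), tau)
    return Z

# Toy example: a rank-1 "ratings" matrix with about half the entries missing.
rng = np.random.default_rng(1)
M = np.outer(rng.standard_normal(30), rng.standard_normal(20))
observed = rng.random(M.shape) < 0.5
M_hat = complete_matrix(M, observed, tau=0.1)
# Mean absolute error on the held-out entries; typically small here.
print(round(float(np.abs(M_hat - M)[~observed].mean()), 3))
```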
3. ARTICLES IN THIS ISSUE

The eight articles in this issue present general overviews of the state of the art on a number of different topics concerning sparsity analysis and regularization methods. Moreover, many of the articles go beyond the current state of the art in various ways. These articles therefore not only give high-level ideas about the current topics, but will also be valuable for experts working in the field.

• Bach, Jenatton, Mairal and Obozinski (Structured sparsity through convex optimization, [2]) study convex relaxations based on structured norms that incorporate further structural prior knowledge. An extension of the standard $\ell_0$ sparsity model that has received a lot of attention in recent years is structured sparsity. The basic idea is that not all sparsity patterns for $\operatorname{supp}(\bar\beta)$ are equally likely. A simple example is group sparsity, where nonzero coefficients occur together in predefined groups. More complex structured sparsity models have been investigated in recent years. Although the paper by Bach et al. focuses on the convex optimization approach, they also give an extensive survey of recent developments, including the use of submodular set functions.

• van de Geer and Müller (Quasi-likelihood and/or robust estimation in high dimensions, [13]) extend $\ell_1$ regularization methods to generalized linear models. This involves consideration of loss functions beyond the usual least-squares loss and, in particular, loss functions arising via quasi-likelihoods.

• Huang, Breheny and Ma (A selective review of group selection in high-dimensional models, [5]) provide a detailed review of the most important special case of structured sparsity, namely, group sparsity. Their review covers both convex relaxation (the group Lasso) and approaches based on nonconvex group penalties.

• Giraud, Huet and Verzelen (High-dimensional regression with unknown variance, [4]) address issues in high-dimensional regression estimation connected with lack of knowledge of the error variance. In the standard Lasso formulation (3), the regularization parameter $\lambda$ is a tuning parameter that needs to be chosen proportionally to the standard deviation $\sigma$ of the noise vector. A natural question is whether it is possible to estimate $\sigma$ automatically instead of leaving $\lambda$ as a tuning parameter. This problem has received much attention, and a number of developments have been made in recent years. The paper reviews and compares several approaches to this problem.

• Lafferty, Liu and Wasserman (Sparse nonparametric graphical models, [6]) discuss another important topic in sparsity analysis, the graphical model estimation problem. While much of the current work assumes that the data come from a multivariate Gaussian distribution, this paper goes beyond the standard practice. The authors outline a number of possible approaches, introduce more flexible models for the problem, describe some of their recent work and point to future research directions.

• Negahban, Ravikumar, Wainwright and Yu (A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, [9]) provide a unified treatment of existing approaches to sparse regularization. The paper extends the standard sparse recovery analysis of $\ell_1$-regularized least squares regression by introducing a general concept of restricted strong convexity. This allows the authors to study more general formulations with different convex loss functions and a class of "decomposable" regularizers.

• Rigollet and Tsybakov (Sparse estimation by exponential weighting, [10]) present a thorough analysis of oracle inequalities in the context of model averaging procedures, a class of methods which has its origin in the Bayesian literature. Model averaging is in general more stable than model selection. For example, when two models are very similar and only one is correct, model selection forces us to choose one of the models even if we are not certain which is true. A model averaging procedure, on the other hand, does not force us to choose one of the two models, but only to take an average of them. This is beneficial when several of the models are similar and we cannot tell which is the correct one. The modern analysis of model averaging procedures leads to oracle inequalities that are sharper than the corresponding oracle inequalities for model selection methods such as the Lasso. The authors give an extensive discussion of such oracle inequalities using an exponentially weighted model averaging procedure. Such procedures have advantages over model selection when the underlying models are correlated and when the model class is misspecified.

• Zhang and Zhang (A general theory of concave regularization for high-dimensional sparse estimation problems, [15]) focus on nonconvex penalties and study a variety of issues related to such penalties (a sketch of one well-known concave penalty appears after this list). Although the natural formulation of a sparsity constraint is $\ell_0$ regularization, due to its computational difficulty most of the recent literature focuses on the simpler $\ell_1$ regularization method (the Lasso), which approximates $\ell_0$ regularization. However, it is also known that $\ell_1$ regularization is not a very good approximation to $\ell_0$ regularization. This leads to the study of nonconvex penalties. The nonconvex formulations are both harder to analyze statistically and harder to handle computationally, and some fundamental understanding of high-dimensional nonconvex procedures has only recently started to emerge. Nevertheless, some basic questions have remained unanswered: for example, the properties of the global solution of nonconvex formulations, and whether it is possible to compute the globally optimal solution efficiently under suitable conditions. The authors go a considerable distance toward providing a general theory that answers some of these fundamental questions.
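As a concrete example of the concave penalties discussed in the last bullet, here is a hedged sketch of the minimax concave penalty (MCP) and its scalar thresholding rule. MCP is one well-known member of this family (the Zhang and Zhang paper treats a general class); the formulas below are the standard ones, and the parameter names are common usage rather than taken from the article.

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty (MCP): the l1 penalty tapers quadratically
    to a constant, so large coefficients are not penalized further."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2.0 * gamma),
                    gamma * lam ** 2 / 2.0)

def mcp_threshold(z, lam, gamma):
    """Scalar prox: argmin_b 0.5 * (b - z)^2 + MCP(b), for gamma > 1.
    Unlike soft-thresholding, inputs with |z| > gamma*lam pass through
    unshrunk, which reduces the bias of the Lasso on large coefficients."""
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return np.where(np.abs(z) <= gamma * lam, soft / (1.0 - 1.0 / gamma), z)

z = np.linspace(-4, 4, 9)
print(np.round(mcp_threshold(z, lam=1.0, gamma=3.0), 2))
# Small |z| are set to 0; moderate |z| are shrunk; large |z| are untouched.
```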
REFERENCES

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. Akadémiai Kiadó, Budapest. MR0483125
[2] Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Structured sparsity through convex optimization. Statist. Sci. 27 450–468.
[3] Banerjee, O., El Ghaoui, L. and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516. MR2417243
[4] Giraud, C., Huet, S. and Verzelen, N. (2012). High-dimensional regression with unknown variance. Statist. Sci. 27 500–518.
[5] Huang, J., Breheny, P. and Ma, S. (2012). A selective review of group selection in high dimensional models. Statist. Sci. 27 481–499.
[6] Lafferty, J., Liu, H. and Wasserman, L. (2012). Sparse nonparametric graphical models. Statist. Sci. 27 519–537.
[7] Mallows, C. L. (1973). Some comments on Cp. Technometrics 15 661–675.
[8] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[9] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 538–557.
[10] Rigollet, P. and Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statist. Sci. 27 558–575.
[11] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
[12] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288. MR1379242
[13] van de Geer, S. and Müller, P. (2012). Quasi-likelihood and/or robust estimation in high dimensions. Statist. Sci. 27 469–480.
[14] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[15] Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high dimensional sparse estimation problems. Statist. Sci. 27 576–593.