
Stat: The ISI's Journal for the Rapid Dissemination of Statistics Research (wileyonlinelibrary.com) DOI: 10.100X/sta.0000

Variable Selection for Nonparametric Quantile Regression via Smoothing Spline ANOVA

Chen-Yen Lin(a), Howard Bondell(b,*), Hao Helen Zhang(c) and Hui Zou(d)

Received 00 Month 2013; Accepted 00 Month 2013

Quantile regression provides a more thorough view of the effect of covariates on a response. Nonparametric quantile regression has become a viable alternative that avoids restrictive parametric assumptions. Variable selection for quantile regression is challenging, since important variables can influence various quantiles in different ways. We tackle the problem via regularization in the context of smoothing spline ANOVA models. The proposed sparse nonparametric quantile regression (SNQR) can identify important variables and provide flexible estimates for quantiles. Our numerical study suggests the promising performance of the new procedure in variable selection and function estimation. Supplementary materials for this article are available online. Copyright (c) 2013 John Wiley & Sons, Ltd.

Keywords: Model Selection, COSSO, Reproducing Kernel Hilbert Space, Kernel Quantile Regression

(a) Eli Lilly and Company, IN 46285; (b) Department of Statistics, North Carolina State University, NC 27695-8203; (c) Department of Mathematics, University of Arizona, AZ 85721-0089; (d) School of Statistics, University of Minnesota, MN 55455. *Email: [email protected]

1. Introduction

Quantile regression, as a complement to classical least squares regression, provides a more comprehensive framework to study how covariates influence not only the location but the entire conditional distribution (Koenker, 2005). In quantile regression problems, the primary interest is to establish a regression function that reveals how the $100\tau\%$ quantile of the response $y$ depends on a set of covariates $x = \{x^{(1)}, \ldots, x^{(d)}\}$. A parametric form of the regression function is often assumed for convenience of interpretation and lower computational cost. While a linear regression function is studied in Koenker & Bassett (1978) and numerous follow-up studies, Procházka (1988) explored nonlinear regression; see Koenker & Hallock (2001) and Koenker (2005) for a comprehensive overview.

As much as the parametric assumption enjoys a simple model structure and lower implementation cost, it is not flexible enough and hence carries the risk of model misidentification for complex problems. For a single-predictor model, Koenker et al. (1994) pioneered nonparametric quantile regression in spline models, in which the quantile function can be found by solving the minimization problem

  \min_{f \in \mathcal{F}} \sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \lambda V(f'),   (1)

where $\rho_\tau(t) = t[\tau - I(t < 0)]$, $\tau \in (0,1)$, is the so-called "check function" of Koenker & Bassett (1978), $\lambda$ is a smoothing parameter and $V(f')$ is the total variation of the derivative of $f$.
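For illustration, the check function $\rho_\tau$ above translates into a few lines of code. This sketch is mine, not part of the paper; the function name is my own choice.

```python
import numpy as np

def check_loss(residuals, tau):
    """Koenker-Bassett check function rho_tau(t) = t * (tau - I(t < 0)),
    applied elementwise and summed over a residual vector."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(r * (tau - (r < 0).astype(float)))

# Example: at tau = 0.5 the check loss is half the sum of absolute residuals.
r = np.array([1.0, -2.0, 0.5])
print(check_loss(r, 0.5), 0.5 * np.abs(r).sum())  # both equal 1.75
```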
Koenker et al. (1994) showed that the minimizer is a linear spline with knots at the design points $x_i, i = 1, \ldots, n$, provided that the space $\mathcal{F}$ is an expanded second-order Sobolev space defined as

  \mathcal{F} = \left\{ f : f(x) = a_0 + a_1 x + \int_0^1 (x - y)_+ \, d\mu(y), \; V(\mu) < \infty, \; a_i \in \mathbb{R}, \; i = 0, 1 \right\},   (2)

where $\mu$ is a measure with finite total variation. Bloomfield & Steiger (1983) and Nychka et al. (1995) considered a problem similar to (1), but with a different roughness penalty on the function:

  \min_{f \in \mathcal{F}} \sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \lambda \int [f''(x)]^2 \, dx.   (3)

The minimizer of (3) over a second-order Sobolev space is a natural cubic spline with all design points as knots. Bosch et al. (1995) proposed an interior point algorithm which is guaranteed to converge to the minimizer.

For a multi-dimensional feature space, He et al. (1998) proposed a bivariate quantile smoothing spline and He & Ng (1999) generalized the idea to multiple covariates using an ANOVA-type decomposition. Li et al. (2007) proposed a more general kernel quantile regression (KQR) method that penalizes the roughness of the function estimator using its squared functional norm in a reproducing kernel Hilbert space (RKHS). More specifically, the KQR solves the regularization problem

  \min_{f \in \mathcal{H}_K} \sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}_K}^2,   (4)

where $\mathcal{H}_K$ is an RKHS and $\|\cdot\|_{\mathcal{H}_K}$ is its functional norm. More recently, Fenske et al. (2011) proposed a boosting method to select and estimate additive quantile functions. Although not primarily designed for variable selection, their method naturally achieves variable selection if the number of boosting iterations is small enough.

Despite several existing nonparametric quantile function estimators, selecting relevant predictors in multi-dimensional data is an important yet challenging topic that has not been addressed in depth. Variable selection in quantile regression is much more difficult than in least squares regression. The variable selection is carried out at various levels of quantiles, which amounts to identifying variables that are important for the entire conditional distribution, rather than only for the mean function as in the least squares case. This has important applications in handling heteroscedastic data. Several regularization methods have been proposed for linear quantile regression (Zou & Yuan, 2008a,b; Wu & Liu, 2009).

In the presence of multiple predictors, many nonparametric estimation procedures may suffer from the curse of dimensionality. The smoothing spline analysis of variance (SS-ANOVA) model (Wahba, 1990) provides a flexible and effective estimation framework to tackle the problem. Since some of the predictors may be uninformative or redundant for prediction, variable selection is important in nonparametric regression. In the context of least squares regression, the COmponent Selection and Shrinkage Operator (COSSO) (Lin & Zhang, 2006) was proposed to perform continuous function shrinkage and estimation by penalizing the sum of RKHS norms of the functional components. In this paper, we adopt the COSSO-type penalty to develop a new penalized framework for joint quantile estimation and variable selection.
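As a point of reference, the KQR problem (4) is a convex program that can be prototyped directly in a modeling language. The sketch below is mine (the authors' implementation is the R package cosso) and uses CVXPY with a generic Gaussian kernel in place of the SS-ANOVA kernel introduced in Section 2.

```python
import numpy as np
import cvxpy as cp

def kqr_fit(X, y, tau, lam, gamma=1.0):
    """Prototype of kernel quantile regression (4): minimize
    sum_i rho_tau(y_i - f(x_i)) + (lam/2) * ||f||^2_{H_K},
    with f(x) = b + sum_i c_i K(x_i, x)."""
    n = X.shape[0]
    # Gaussian kernel matrix (illustrative choice; the paper uses a Sobolev kernel).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-gamma * sq)
    b, c = cp.Variable(), cp.Variable(n)
    r = y - b - K @ c
    check = cp.sum(cp.maximum(tau * r, (tau - 1) * r))            # check loss
    penalty = 0.5 * lam * cp.quad_form(c, K + 1e-8 * np.eye(n))   # ||f||^2 = c'Kc
    cp.Problem(cp.Minimize(check + penalty)).solve()
    return b.value, c.value, K
```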
For additive models, some nonparametric procedures for joint estimation and selection have been proposed in a basis expansion framework (Meier et al., 2009; Huang et al., 2010). The choice between the reproducing kernel Hilbert space (RKHS) approach and basis expansions mirrors that between smoothing splines and regression splines in the least squares setup. Both are popular in practice and have their own desirable properties. The RKHS approach provides a flexible and rigorous mathematical framework for multivariate function estimation, and it has wide applications in statistics and machine learning. Furthermore, the COSSO and adaptive COSSO methods offer a new regularization framework for nonparametric variable selection and have promising performance in the standard least squares context. It is therefore interesting to study them in the context of more complex problems such as quantile regression.

The remainder of the article is organized as follows. Section 2 reviews SS-ANOVA models and introduces the new estimator. An iterative computational algorithm is given in Section 3, along with the parameter tuning procedure. Extensive empirical studies, covering both homogeneous and heterogeneous errors, are given in Section 4. Two real data analyses are presented in Section 5. We conclude our findings in Section 6.

2. Formulation

2.1. Smoothing Spline ANOVA

In the framework of smoothing spline ANOVA (SS-ANOVA), it is assumed that a multivariate function $f(x) = f(x^{(1)}, \ldots, x^{(d)})$ has the ANOVA decomposition

  f(x) = b + \sum_{j=1}^{d} f_j(x^{(j)}) + \sum_{j<k} f_{j,k}(x^{(j)}, x^{(k)}) + \cdots,   (5)

where $b$ is a constant, the $f_j$'s are the main effects, the $f_{j,k}$'s are the two-way interactions, and so on. We estimate each of the main effects in an RKHS denoted by $\mathcal{H}_j = \{1\} \oplus \bar{\mathcal{H}}_j$, whereas the interactions are estimated in tensor product spaces of the corresponding univariate function spaces. A popular choice of $\mathcal{H}_j$ is the second-order Sobolev space $\mathcal{S}^2[0,1] = \{g : g, g' \text{ are absolutely continuous and } g'' \in L^2[0,1]\}$. When endowed with the norm

  \|g\|^2 = \left\{\int_0^1 g(x)\,dx\right\}^2 + \left\{\int_0^1 g'(x)\,dx\right\}^2 + \int_0^1 \{g''(x)\}^2\,dx,   (6)

the second-order Sobolev space is an RKHS with reproducing kernel

  R(x, y) = 1 + k_1(x)k_1(y) + k_2(x)k_2(y) - k_4(|x - y|),   (7)

where $k_1(x) = x - \tfrac{1}{2}$, $k_2(x) = \tfrac{1}{2}\left[k_1^2(x) - \tfrac{1}{12}\right]$ and $k_4(x) = \tfrac{1}{24}\left[k_1^4(x) - \tfrac{1}{2}k_1^2(x) + \tfrac{7}{240}\right]$. When $x^{(j)}$ is a categorical variable that takes only finitely many distinct values, $\{1, \ldots, L\}$, we define the reproducing kernel as

  R(s, t) = L \cdot I(s = t) - 1, \quad s, t \in \{1, \ldots, L\}.   (8)

See Wahba (1990) and Gu (2002) for more details. The entire tensor-product space for estimating $f(x)$ is given by

  \mathcal{F} = \bigotimes_{j=1}^{d} \mathcal{H}_j = \{1\} \oplus \sum_{j=1}^{d} \bar{\mathcal{H}}_j \oplus \sum_{j<k} \left(\bar{\mathcal{H}}_j \otimes \bar{\mathcal{H}}_k\right) \oplus \cdots.   (9)

Note that $\mathcal{F} = \bigotimes_{j=1}^{d} \mathcal{H}_j$ is also an RKHS, and its reproducing kernel is the sum of the individual reproducing kernels of its component spaces. In practice, the higher-order interactions in (9) are usually truncated for ease of interpretation and to avoid the curse of dimensionality.
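To make (7) concrete, here is a small NumPy sketch of my own (not from the paper) of the second-order Sobolev reproducing kernel, together with the per-coordinate Gram matrices used by the additive model; it assumes the covariates have been scaled to [0, 1].

```python
import numpy as np

def k1(t): return t - 0.5
def k2(t): return 0.5 * (k1(t) ** 2 - 1.0 / 12.0)
def k4(t): return (k1(t) ** 4 - 0.5 * k1(t) ** 2 + 7.0 / 240.0) / 24.0

def sobolev_kernel(x, y):
    """Reproducing kernel (7) of the second-order Sobolev space on [0, 1];
    x and y broadcast elementwise (e.g., column vs. row vectors)."""
    return 1.0 + k1(x) * k1(y) + k2(x) * k2(y) - k4(np.abs(x - y))

def additive_gram_matrices(X):
    """List of per-coordinate n x n Gram matrices R_j for the additive SS-ANOVA model."""
    n, d = X.shape
    return [sobolev_kernel(X[:, j][:, None], X[:, j][None, :]) for j in range(d)]
```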
A general expression for a truncated space can be written as

  \mathcal{F} = \{1\} \oplus \left\{\bigoplus_{j=1}^{q} \mathcal{F}_j\right\},   (10)

where $\mathcal{F}_1, \ldots, \mathcal{F}_q$ are $q$ orthogonal subspaces of $\mathcal{F}$. A special case is the well-known additive model (Hastie & Tibshirani, 1990) with $q = d$, in which only the main effects are kept in the model, say $f(x) = b + \sum_{j=1}^{d} f_j(x^{(j)})$. When both main effects and two-way interaction effects are retained, the truncated space has $q = d(d+1)/2$. The goal of this paper is variable selection, so we mainly focus on the additive model. However, the proposed method can be extended to the general model (5) if one is interested in identifying both important main effects and higher-order interaction terms.

A typical method for estimating a nonparametric quantile function is to solve the regularization problem

  \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \lambda J(f),   (11)

where $J(\cdot)$ is a penalty functional. A smoothing spline estimate uses the penalty $J(f) = \sum_{j=1}^{d} \theta_j^{-1} \|P^j f\|^2$, with the $\theta_j$'s as smoothing parameters. The estimation in (11) involves multiple tuning parameters $\theta_1, \ldots, \theta_d$, which need to be selected properly for good estimation results. The parameter $\lambda$ is usually included and fixed at some convenient value for computational stability in practice.

2.2. New Methodology: Sparse Nonparametric Quantile Regression (SNQR)

To achieve joint variable selection and function estimation in nonparametric quantile regression, we consider the following regularization problem

  \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \lambda \sum_{j=1}^{d} w_j \|P^j f\|,   (12)

where the $P^j$'s are orthogonal projection operators that project $f$ onto $\mathcal{F}_j$, and the $w_j$'s are known weights. We will refer to the solution of (12) as the sparse nonparametric quantile regression (SNQR) estimator.

The problem in (12) provides a flexible modeling framework that includes existing methods as special cases. For instance, it reduces to the $L_1$-norm quantile regression (Li & Zhu, 2008) in linear models. Let $f(x) = b + \sum_{j=1}^{d} \beta_j x^{(j)}$ and consider a linear function space $\mathcal{F} = \{1\} \oplus \left\{\oplus_{j=1}^{d} \{x^{(j)} - 1/2\}\right\}$ with inner product $\langle f, g\rangle = \int fg$; then the RKHS norm penalty $\|P^j f\|$ is proportional to $|\beta_j|$. Moreover, we allow each functional component to be penalized differently depending on its associated weight $w_j$. In principle, smaller weights are assigned to important functional components while larger weights are assigned to less important components. This is in the same spirit as the adaptive LASSO (Zou, 2006) and the adaptive COSSO (Storlie et al., 2011).

We propose to construct the weights from the data adaptively. For each component $f_j$, its $L_2$-norm $\|f_j(x^{(j)})\|_{L_2} = \{\int [f_j(x^{(j)})]^2 \, dF_{X^{(j)}}(x)\}^{1/2}$, where $F_{X^{(j)}}(\cdot)$ is the distribution function of $x^{(j)}$, is a natural measure of the importance of the $j$-th functional component. In practice, given a reasonable initial estimator $\tilde{f}$, we propose to construct the weights from its inverse empirical $L_2$-norms, i.e.

  w_j^{-1} = \|P^j \tilde{f}\|_{n, L_2} = \left\{ n^{-1} \sum_{i=1}^{n} [P^j \tilde{f}(x_i^{(j)})]^2 \right\}^{1/2}, \quad j = 1, \ldots, d.   (13)

We recommend using the solution of the KQR as the initial estimate $\tilde{f}$.
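A direct transcription of (13): given the fitted component values of an initial estimate (for example, a KQR fit decomposed into its additive components), the adaptive weights are the reciprocals of the empirical L2 norms. The sketch below is mine; the layout of the component matrix is an assumption.

```python
import numpy as np

def adaptive_weights(components, eps=1e-12):
    """Weights from (13): w_j = 1 / empirical L2-norm of the j-th fitted component.
    `components` is an n x d array whose (i, j) entry is f_tilde_j(x_i^(j))."""
    norms = np.sqrt(np.mean(components ** 2, axis=0))   # ||P^j f_tilde||_{n, L2}
    return 1.0 / np.maximum(norms, eps)                 # guard against zero components
```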
Since both the check loss and the penalty functional $J(f)$ are continuous and convex in $f$, the existence of a minimizer of (12) is guaranteed, as stated in the following theorem.

Theorem 1. Let $\mathcal{F}$ be an RKHS of functions with the decomposition (10); then there exists a minimizer of (12) in $\mathcal{F}$.

Directly minimizing (12) can be a daunting task, as searching over the infinite-dimensional space $\mathcal{F}$ for a minimizer is practically infeasible. Analogous to smoothing spline models, the following theorem shows that the minimizer of (12) lies in a finite-dimensional space. This important result ensures the feasibility of computation.

Theorem 2 (Representer Theorem). Let the minimizer of (12) be $\hat{f} = \hat{b} + \sum_{j=1}^{d} \hat{f}_j$ with $\hat{f}_j \in \mathcal{F}_j$; then $\hat{f}_j \in \mathrm{span}\{R_{\mathcal{F}_j}(x_i^{(j)}, \cdot),\; i = 1, \ldots, n\}$, where $R_{\mathcal{F}_j}(\cdot, \cdot)$ is the reproducing kernel of $\mathcal{F}_j$.

Due to space restrictions, we relegate the technical proofs of both theorems to the Appendix, which is given in the online supplementary materials.

3. Computational Algorithm

To further facilitate the computation, we present an equivalent formulation of (12). By introducing non-negative slack variables $\theta_j, j = 1, \ldots, d$, and using Lemma 2 in Lin & Zhang (2006), it is easy to show that minimizing (12) is equivalent to solving the following optimization problem

  \min_{f \in \mathcal{F},\, \theta} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau(y_i - f(x_i)) + \lambda_0 \sum_{j=1}^{d} w_j^2 \theta_j^{-1} \|P^j f\|^2, \quad \text{s.t. } \sum_{j=1}^{d} \theta_j \le M, \; \theta_j \ge 0 \; \forall j,   (14)

where $\lambda_0$ and $M$ are both smoothing parameters. The roles of the slack variables $\theta_j$ are different from those in the standard smoothing spline model. The slack variables $\theta_j$ allow us to recover the sparse structure, since $\theta_j = 0$ if and only if $\|P^j f\| = 0$ (Lin & Zhang, 2006). Moreover, when the $\theta_j$'s are given, the penalty function in (14) reduces to that of the traditional smoothing spline, and thus, by the representer theorem of Kimeldorf & Wahba (1971), the minimizer of (14) has the finite representation

  f(x) = b + \sum_{i=1}^{n} \left[ \sum_{j=1}^{d} \theta_j w_j^{-2} R_{\mathcal{F}_j}(x_i^{(j)}, x^{(j)}) \right] c_i,   (15)

where $c = (c_1, \ldots, c_n) \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

Let $R_j = \{R_{\mathcal{F}_j}(x_i^{(j)}, x_{i'}^{(j)})\}_{i,i'=1}^{n}$ be an $n \times n$ matrix and $1_n$ be a column vector of $n$ ones. When the minimizer is evaluated at the design points, we write $f = (f(x_1), \ldots, f(x_n))^T$ as $f = b 1_n + (\sum_{j=1}^{d} w_j^{-2}\theta_j R_j) c$, and define $\|v\|_{C_\tau} = n^{-1}\sum_{i=1}^{n} \rho_\tau(v_i)$ for any vector $v$ of length $n$. With this notation, the objective function in (14) can be expressed as

  \min_{b, c, \theta} \left\| y - b 1_n - \sum_{j=1}^{d} \theta_j w_j^{-2} R_j c \right\|_{C_\tau} + \lambda_0 c^T \left[ \sum_{j=1}^{d} \theta_j w_j^{-2} R_j \right] c, \quad \text{s.t. } \sum_{j=1}^{d} \theta_j \le M, \; \theta_j \ge 0 \; \forall j.   (16)

3.1. Iterative Optimization Algorithm

It is possible to minimize the objective function in (16) simultaneously with respect to all unknown parameters $(b, c^T, \theta^T)^T$, but the programming effort can be daunting.
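Evaluating the representer (15) at the design points and assembling the objective of (16) takes only a few lines. The helper below is a sketch of mine built on the per-coordinate Gram matrices shown earlier; the constraints on theta are left to the solver.

```python
import numpy as np

def combined_kernel(R_list, theta, w):
    """Sum_j theta_j * w_j^{-2} * R_j, the n x n kernel behind (15) and (16)."""
    return sum(t * Rj / (wj ** 2) for t, wj, Rj in zip(theta, w, R_list))

def snqr_objective(y, b, c, theta, w, R_list, lam0, tau):
    """Objective value of (16) for given (b, c, theta)."""
    K = combined_kernel(R_list, theta, w)
    resid = y - b - K @ c
    check_norm = np.mean(resid * (tau - (resid < 0)))   # ||.||_{C_tau}
    return check_norm + lam0 * c @ K @ c
```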
Alternatively, we can separate the parameters into two parts, $\theta$ and $(b, c^T)^T$, and then iteratively solve two optimization problems in turn, with respect to $\theta$ and $(b, c^T)^T$. Consequently, we suggest the following iterative algorithm:

1. Fix $\theta$ and solve for $(b, c^T)^T$:

  \min_{b, c} \left\| y - b 1_n - \sum_{j=1}^{d} \theta_j w_j^{-2} R_j c \right\|_{C_\tau} + \lambda_0 c^T \left[ \sum_{j=1}^{d} \theta_j w_j^{-2} R_j \right] c.   (17)

2. Fix $(b, c^T)^T$ and solve for $\theta$:

  \min_{\theta} \|y^* - G\theta\|_{C_\tau} + \lambda_0 c^T G \theta, \quad \text{s.t. } \sum_{j=1}^{d} \theta_j \le M, \; \theta_j \ge 0 \; \forall j,   (18)

where $y^* = y - b 1_n$ and $G = (g_1, \ldots, g_d)$ is an $n \times d$ matrix with columns $g_j = w_j^{-2} R_j c$.

The optimization problems in (17) and (18) can be cast as quadratic and linear programming problems, respectively. We defer the derivations to the Appendix, available from the online supplementary materials. Based on our experience, the algorithm converges quickly in a few steps. We have noted that the one-step solution often provides a satisfactory approximation to the full solution. As a result, we advocate the use of the one-step update in practice.

An important connection between the SNQR and the KQR can be unraveled by noting that the objective function in (17) is exactly the same as that of the KQR. This connection shows that the SNQR method reduces to the KQR when $\theta$ is known. The optimization problem for estimating $\theta$ essentially imposes a non-negative garrote (Breiman, 1995) type of shrinkage on the $\theta_j$'s, and hence achieves variable selection by shrinking some of the $\theta_j$'s to zero.

3.2. Parameter Tuning

Like any other penalized regression procedure, the performance of the SNQR depends critically on properly tuned smoothing parameters. A reasonable parameter value is usually one that minimizes some generalized error or information criterion. In the quantile regression literature, one commonly used criterion is the Schwarz information criterion (SIC) (Schwarz, 1978; Koenker et al., 1994)

  \mathrm{SIC} = \log\left( \frac{1}{n}\sum_{i=1}^{n} \rho_\tau(y_i - \hat{f}(x_i)) \right) + \frac{\log n}{2n}\, df,   (19)

where $df$ is a measure of the complexity of the fitted model. Various authors (Koenker, 2005; Yuan, 2006; Li et al., 2007) have suggested using the number of zero residuals as an estimate of the effective degrees of freedom (EDF). In our numerical experiments, we observed that the estimated EDF fluctuates substantially as the smoothing parameter changes, which may make the tuning process unstable. Instead of using the number of zero residuals as an EDF estimate, we propose a bootstrap method to estimate the EDF and defer the technical details to the appendix.

In addition to the SIC, another popular criterion for choosing the smoothing parameter is k-fold cross validation (CV), which has been widely applied to various regression and classification problems and usually gives competitive performance. Based on our numerous experiments, tuning with CV overall produces better prediction performance than tuning with SIC for the proposed nonparametric quantile estimator. Therefore, we recommend the use of k-fold CV for parameter tuning in practice. In this paper, k = 5 is used in both the simulated and real data analyses.
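For completeness, here is a sketch of the theta-step (18) written as a linear program in CVXPY. This is my own prototype, not the authors' implementation (which is the R package cosso); the (b, c)-step (17) has the same form as the KQR sketch shown in Section 1.

```python
import numpy as np
import cvxpy as cp

def theta_step(y, b, c, w, R_list, lam0, M, tau):
    """One theta-update, problem (18): check loss plus a linear penalty,
    subject to sum(theta) <= M and theta >= 0 (a linear program)."""
    n, d = len(y), len(R_list)
    G = np.column_stack([Rj @ c / (wj ** 2) for Rj, wj in zip(R_list, w)])
    y_star = y - b
    theta = cp.Variable(d, nonneg=True)
    r = y_star - G @ theta
    check = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n
    objective = check + lam0 * (c @ G) @ theta
    cp.Problem(cp.Minimize(objective), [cp.sum(theta) <= M]).solve()
    return theta.value
```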
In the following, we summarize the complete algorithm for the SNQR method, including both the model fitting and parameter tuning steps.

Step 1. Initialization. Set $\theta_j = 1$ for all $j$.

Step 2. For each grid point of $\lambda_0$, solve (17) for $(b, c^T)^T$ and record the SIC score or CV error. Choose the $\lambda_0$ that minimizes the SIC or CV error, then fix it in later steps.

Step 3. For each grid point of $M$, solve (16) for $(\theta^T, b, c^T)^T$ using the aforementioned iterative optimization algorithm. Record the SIC score or CV error at each grid point and choose the $M$ that minimizes either the SIC score or the CV error.

Step 4. Solve (16) using the chosen $(\lambda_0, M)$ pair on the full data. Note that this is already done if the tuning was based on SIC.

Since the tuning procedure described above does not cover all possible pairs of $(\lambda_0, M)$, we suggest the following refinement. After Step 3 we obtain an optimal pair $(\lambda_0^*, M^*)$. We then focus on a narrower, more focused region, the neighborhood of $(\lambda_0^*, M^*)$, and apply Steps 2 and 3 again. The optimal parameters determined in this refined step, say $(\lambda_0^{**}, M^{**})$, are used as the final selection. Our simulation study also confirms that this refined tuning procedure can improve the prediction and selection performance substantially.

4. Numerical Results

In this section we present the empirical performance of the SNQR procedure using simulated data. For the experimental design, we use the following functions as building blocks: $g_1(t) = t$; $g_2(t) = (2t-1)^2$; $g_3(t) = \frac{\sin(2\pi t)}{2 - \sin(2\pi t)}$; and $g_4(t) = 0.1\sin(2\pi t) + 0.2\cos(2\pi t) + 0.3\sin^2(2\pi t) + 0.4\cos^3(2\pi t) + 0.5\sin^3(2\pi t)$.

We evaluate performance from two aspects: prediction accuracy and model selection. The integrated absolute error (IAE), defined as $\mathrm{IAE} = E|\hat{f}(X) - f(X)|$, is used to assess prediction accuracy, where the expectation is evaluated by Monte Carlo integration with 10,000 test points generated from the same distribution as the training data. In terms of model selection, we first denote by $\hat{\mathcal{M}} = \{j : \hat{\theta}_j \ne 0\}$ and $\mathcal{M}_0 = \{j : \|P^j f\| > 0\}$ the selected model and the true model, respectively, and by $|\mathcal{M}|$ the cardinality of a set $\mathcal{M}$. We then compute four statistics for assessing selection accuracy: correct selection, $I(\hat{\mathcal{M}} = \mathcal{M}_0)$; type I error rate, $|\hat{\mathcal{M}} \cap \mathcal{M}_0^c| / (d - |\mathcal{M}_0|)$; power, $|\hat{\mathcal{M}} \cap \mathcal{M}_0| / |\mathcal{M}_0|$; and model size, $|\hat{\mathcal{M}}|$.

For the purpose of comparison, we also include the solution of the KQR fitted with only the relevant predictors, tuned by 5-fold cross validation. This method will later be referred to as the Oracle estimator. The Oracle estimator provides a benchmark for the best possible estimation risk if the important variables were known. We also include the KQR and boosting QR (Fenske et al., 2011) for comparison, to illustrate the superior performance of our method over the existing methods.

Another property that we would like to study is the role of the adaptive weights in the performance of the SNQR procedure. Without any a priori knowledge of the importance of each predictor, we can set all $w_j = 1$ in (16) and proceed to solve the objective function. This method will be referred to as the unweighted method. For the weighted procedure, we use the KQR as the initial estimate $\tilde{f}$ to produce the adaptive weights. Two quantile values, $\tau \in \{0.2, 0.5\}$, are used throughout the simulation. For each of the following examples, we repeat the experiment 100 times and report the average summary statistics and their associated standard errors.
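The four building-block functions and the Monte Carlo IAE criterion translate directly into code. This sketch is mine and assumes `f_hat` and `f_true` are callables returning fitted and true quantile values at a matrix of test points.

```python
import numpy as np

def g1(t): return t
def g2(t): return (2 * t - 1) ** 2
def g3(t): return np.sin(2 * np.pi * t) / (2 - np.sin(2 * np.pi * t))
def g4(t):
    s, c = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
    return 0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3

def integrated_absolute_error(f_hat, f_true, d, n_test=10_000, rng=None):
    """IAE = E|f_hat(X) - f_true(X)|, approximated with uniform test points
    drawn from the same design distribution as the training data."""
    rng = np.random.default_rng(rng)
    X_test = rng.uniform(size=(n_test, d))
    return np.mean(np.abs(f_hat(X_test) - f_true(X_test)))
```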
4.1. Computation Cost

Before comparing different methods, we first study the computation cost of the SNQR method. In particular, we assess the computation cost using the average elapsed CPU time for solving (16) with a fixed $(\lambda_0, M)$ over 200 replicates. We first generate the predictors $x^{(j)}, j = 1, \ldots, d$, independently from $U(0,1)$ and then take $n$ observations from the model

  y_i = 5g_1(x_i^{(1)}) + 3g_2(x_i^{(2)}) + 4g_3(x_i^{(7)}) + 6g_4(x_i^{(10)}) + \varepsilon_i, \quad i = 1, \ldots, n,   (20)

where the $\varepsilon_i$ are independently drawn from $t(3)$. We consider multiple combinations of the sample size $n$ and the dimension $d$. All computations are done on a desktop PC with an Intel i7-2600K CPU and 12GB of memory. The average CPU times are presented in Table 1. Table 1 shows that the proposed algorithm is fast in general. For example, when $n = 200$ and $d = 40$, it takes on average about 0.3 seconds to solve the optimization problem for a fixed tuning parameter. The sample size $n$ is a major factor influencing the computation speed. We observe that the computing time varies little when $d$ increases from 10 to 40 for a fixed $n$, but increases substantially when $n$ increases from 100 to 300 for a fixed $d$.

4.2. Homoskedastic Error Model

We first consider a model with the response generated from the location family in (20), with homoskedastic errors. The $100\tau\%$ quantile function is then given by

  Q_y(\tau \mid x) = 5g_1(x^{(1)}) + 3g_2(x^{(2)}) + 4g_3(x^{(7)}) + 6g_4(x^{(10)}) + F_\varepsilon^{-1}(\tau),   (21)

where $F_\varepsilon(\cdot)$ is the distribution function of $\varepsilon$. The covariates $x^{(j)}, j = 1, \ldots, 40$, follow the $U(0,1)$ distribution and there is an autoregressive (AR) type of dependency among them, i.e., the pairwise correlations are $\mathrm{cor}(x^{(j)}, x^{(k)}) = \rho^{|j-k|}$ for all $j \ne k$. We consider two levels of dependency, $\rho \in \{0, 0.7\}$. The sample size is $n = 200$.

Table 2 summarizes the performance of five competing procedures: KQR, boosting QR, SNQR, adaptive SNQR, and the Oracle estimator. Another desirable property of the SNQR procedures is their robustness. Although least squares regression and quantile regression are not generally comparable, the conditional median and mean functions coincide in this example, making the comparison between them justifiable. Therefore, we also include two sparse least squares nonparametric regression models, COSSO (Lin & Zhang, 2006) and adaptive COSSO (Storlie et al., 2011), to estimate the conditional mean function and compare with the corresponding SNQR estimates.

From Table 2, we observe that the adaptive SNQR outperforms the existing procedures under both the independent and dependent cases. In terms of prediction accuracy, the adaptive SNQR has the smallest IAE, which is also quite close to that of the Oracle; it is followed by the SNQR and boosting QR, and the KQR is the worst. Furthermore, tuning with 5-fold CV produces better results than tuning with SIC most of the time. It is clear that the KQR suffers considerably from the presence of noise variables. Although the boosting QR has a model size consistently greater than 20, during the boosting process the four important predictors are selected to update the solution most frequently, whereas the noise variables are selected with much smaller frequency.
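A data-generating sketch for (20)-(21), reusing the $g_j$ helpers above. The code is mine; the paper does not state how the AR-correlated uniform covariates are constructed, so a Gaussian copula is used here as one plausible choice.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def simulate_homoskedastic(n, d=40, rho=0.7, seed=None):
    """Covariates that are marginally U(0,1) with AR-style dependence
    (via a Gaussian copula; an assumption, not spelled out in the paper),
    and a response from model (20) with t(3) errors."""
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    X = norm.cdf(Z)                                    # uniform marginals
    signal = (5 * g1(X[:, 0]) + 3 * g2(X[:, 1])
              + 4 * g3(X[:, 6]) + 6 * g4(X[:, 9]))     # active predictors 1, 2, 7, 10
    y = signal + t_dist.rvs(df=3, size=n, random_state=rng)
    return X, y
```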
Fenske et al. (2011) found similar results and recommended using the selection frequency as guidance for variable selection. With regard to variable selection, the SNQR procedures effectively identify the important variables, particularly when using SIC as the tuning criterion. Throughout all simulation scenarios, the SNQR has a well-controlled type I error, and the power is no less than 90%. With regard to estimation performance, relative to the Oracle estimator, the proposed estimator is more than 90% and about 70-80% as efficient in terms of IAE for the independent and dependent cases, respectively.

The conditional mean function estimators, COSSO and adaptive COSSO, give comparable model selection. However, their estimation performance is seriously affected by the heavy tail of the error distribution. Even with adaptive weights, the COSSO still performs worse than the SNQR method. In addition, their standard errors are almost 10 times larger than those of the other median estimators, implying that our SNQR method enjoys robustness for median estimation.

Figure 1 gives a graphical illustration of the pointwise confidence bands for the fitted curves using $\tau = 0.2$. For comparison, the estimated functions obtained by the Oracle are also depicted. Since we apply each procedure to 100 simulated datasets, the pointwise confidence band is constructed from the 5% and 95% percentiles. From Figure 1, we see that the SNQR estimates the true functions very well, and it performs very close to the Oracle estimator.

4.3. Heteroskedastic Error Model

To further examine the finite sample performance of the new methods, we consider generating the response from a location-scale family

  y_i = 5g_1(x_i^{(1)}) + 3g_2(x_i^{(2)}) + 4g_3(x_i^{(7)}) + 6g_4(x_i^{(10)}) + \exp\left[2g_3(x_i^{(12)})\right]\varepsilon_i, \quad i = 1, \ldots, n,   (22)

where $\varepsilon_i \overset{i.i.d.}{\sim} N(0,1)$. In the heteroskedastic model, the $100\tau\%$ quantile function is given by

  Q_y(\tau \mid x) = 5g_1(x^{(1)}) + 3g_2(x^{(2)}) + 4g_3(x^{(7)}) + 6g_4(x^{(10)}) + \exp\left[2g_3(x^{(12)})\right]\Phi^{-1}(\tau),   (23)

where $\Phi(\cdot)$ is the distribution function of the standard normal. The covariates are generated in the same fashion as in the previous example and we use a sample size of $n = 300$ in this case. With this example, we aim to evaluate the performance of the SNQR when some covariates are influential only over a certain range of quantiles. More specifically, as in the homoskedastic example, the median function in this example depends on the 1st, 2nd, 7th and 10th predictors. Away from the median, however, $x^{(12)}$ becomes influential as well, and its effect grows larger toward the tails.

Table 3 summarizes the performance of all the competing methods under comparison. We first notice that, when $\tau = 0.5$, the average model size is close to 4, as expected. However, when $\tau$ is away from 0.5, the SNQR procedures successfully identify the additional informative predictor, $x^{(12)}$, demonstrating the new method's capability to identify all the relevant predictors that influence the distribution of the response. As for function estimation, again we observe that the adaptive SNQR performs the best and the KQR the worst.
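The true quantile surface (23) used to score the estimators can be evaluated directly. This is a short sketch of mine, reusing the $g_j$ helpers defined above.

```python
import numpy as np
from scipy.stats import norm

def true_quantile_heteroskedastic(X, tau):
    """Q_y(tau | x) from (23): location part plus exp[2 g3(x^(12))] * Phi^{-1}(tau)."""
    loc = (5 * g1(X[:, 0]) + 3 * g2(X[:, 1])
           + 4 * g3(X[:, 6]) + 6 * g4(X[:, 9]))
    scale = np.exp(2 * g3(X[:, 11]))                   # x^(12) drives the spread
    return loc + scale * norm.ppf(tau)
```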
However, the boosting QR provides estimation that is as efficient as, and sometimes more efficient than, the SNQR. For variable selection, tuning with SIC usually selects a smaller model and identifies the correct model with high frequency.

5. Real Data Analysis

We apply the SNQR method to two real datasets: the prostate cancer data and the ozone data. The prostate data is from Stamey et al. (1989) and consists of 97 patients who were about to receive a radical prostatectomy. This data was used by Tibshirani (1996) to model the mean function of the level of prostate-specific antigen on 8 clinical covariates and to select relevant variables. The ozone data contains 330 observations collected in Los Angeles in 1976, and the purpose of the study is to model the relationship between the daily ozone concentration and 8 meteorological covariates. The data has been used in various studies (Buja et al., 1989; Breiman, 1995; Lin & Zhang, 2006). These two datasets are publicly available in the R packages ElemStatLearn and cosso, respectively.

We assess model performance using the prediction risk, $E\rho_\tau(Y - f(X))$, where the expectation is evaluated by randomly reserving 10% of the data as a testing set. We select the smoothing parameters and estimate the function using only the training set. The estimated function is then applied to the testing set to compute the prediction risk. The entire procedure is repeated 100 times and the results are averaged. Based on the results summarized in Table 4, the adaptive weights are not always beneficial in real applications. The advantage of the adaptive weights is clear for the prostate data, but the differences between the weighted and the unweighted methods are usually within a reasonable error margin. Overall, we still observe that the SNQR method provides competitive and sometimes superior prediction relative to the existing methods.

Apart from comparing prediction errors, we also apply the SNQR to the complete prostate data and summarize the variable selection. An interesting comparison is that, in the study of the mean function, Tibshirani (1996) selected three prominent predictors: log-cancer volume, log-weight and seminal vesicle invasion. These three predictors are also selected by the SNQR when we consider the median. In the 20% quantile, however, gleason score shows up as an additional predictor, while in the 80% quantile only log-cancer volume and seminal vesicle invasion are chosen.

6. Conclusions

We propose a new regularization method that simultaneously selects important predictors and estimates the conditional quantile function nonparametrically. Our method is implemented in the R package cosso, which is available in the online supplementary materials and on the Comprehensive R Archive Network. The SNQR method overcomes the limitation of least squares regression, which selects only predictors that influence the conditional mean, and thus facilitates analysis of the full conditional distribution. The proposed method also includes the $L_1$-norm quantile regression and the KQR as special cases. In our simulation study and real data analysis, the method provides satisfactory model fitting and great potential for selecting important predictors.

The sample size and the number of predictors we consider in both the simulation and the real data are moderate.
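The held-out prediction-risk evaluation described above can be scripted as follows. This is a sketch of mine in which `fit` and `predict` are hypothetical placeholders standing in for any quantile estimator (for example, a tuned SNQR fit).

```python
import numpy as np

def prediction_risk(X, y, tau, fit, predict, n_reps=100, test_frac=0.10, seed=0):
    """Average held-out check loss E rho_tau(Y - f(X)) over repeated 90/10 splits.
    `fit(X_train, y_train, tau)` returns a model; `predict(model, X_test)` returns
    fitted quantiles (both callables are hypothetical placeholders)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    risks = []
    for _ in range(n_reps):
        idx = rng.permutation(n)
        n_test = int(round(test_frac * n))
        test, train = idx[:n_test], idx[n_test:]
        model = fit(X[train], y[train], tau)
        r = y[test] - predict(model, X[test])
        risks.append(np.mean(r * (tau - (r < 0))))
    return float(np.mean(risks))
```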
As demonstrated in Section 4.1, the computational bottleneck of our algorithm is the sample size, which affects both the linear programming and the quadratic programming steps. In our experience, the computational effort becomes formidable when the sample size is greater than 500. On the other hand, the number of predictors only affects the linear programming step, and hence has a relatively minor impact on the computational difficulty. In addition, the linear programming can be scaled up to a much higher dimension by using efficient numerical optimizers. For ultra-high dimensional feature spaces, Fan et al. (2011) recently proposed a screening procedure for nonparametric regression models. Further work could incorporate a suitable screening procedure as a first step and then apply our proposed method in a second step.

Supporting Information

Technical Derivations: The proofs of Theorems 1 and 2 are given in Appendix 1. We provide more detailed quadratic and linear programming derivations in Appendices 2 and 3, respectively. The bootstrapped degrees of freedom estimation is given in Appendix 4 (Supplement.pdf).

R-package for SNQR: The R package cosso contains the computation code and the ozone dataset used in the real data analysis. (GNU zipped tar file)

References

Bloomfield, P & Steiger, W (1983), Least Absolute Deviations: Theory, Applications and Algorithms, Boston: Birkhäuser-Verlag.
